Chapter 15. The classic search engine example

What better way to close out the book than the way we started it: with a classic search engine example?

You’re in for a treat. We interviewed Ken Krugler and his team from Bixo Labs about their recent Public Terabyte Dataset Project (http://mng.bz/gYOt), in which Tika was a core component of a series of large-scale tests that shed light on variations in language, charset, and other characteristics of content available on the internet.

This chapter shows you even more of Tika in action, especially how you can use Tika inside a workflow system such as Cascading, which is built on top of Hadoop, to analyze a representative (by today’s standards) data set that many other internet researchers are also exploring. The tests run by Bixo Labs, described in the rest of the chapter, should identify areas where Tika needs further refinement, particularly charset detection and language identification (recall chapter 7). Heck, they may even motivate you to get involved in improving Tika and working within the community.

Let’s hear more about it!

This chapter covers

15.1. The Public Terabyte Dataset Project

15.2. The Bixo web crawler

15.3. Summary