Chapter 3. The information landscape

Now that you’ve gotten started with Tika, you probably feel ready to attack the information content that’s out there. The interfaces you know so far let you grab content from the command line, from the GUI, or from Java, and feed that content into Tika for further analysis. In upcoming chapters, you’ll learn advanced techniques for performing those analyses and for extending the powerful Java API on which Tika is built to classify your content, parse it, and represent its metadata.
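If you’d like a quick reminder of what the Java route looks like before we go further, the following minimal sketch uses Tika’s facade class to detect a file’s type and pull out its text. The class name TikaQuickExtract and the command-line argument handling are our own illustrative choices, not part of Tika itself.

import java.io.File;
import org.apache.tika.Tika;

public class TikaQuickExtract {
    public static void main(String[] args) throws Exception {
        // The Tika facade bundles type detection and text extraction
        Tika tika = new Tika();
        File doc = new File(args[0]); // path to any document you want to analyze

        // Detect the MIME type of the file
        String mimeType = tika.detect(doc);
        System.out.println("Detected type: " + mimeType);

        // Extract the textual content for further analysis
        String text = tika.parseToString(doc);
        System.out.println(text);
    }
}

Run it against any file on disk (a PDF, an Office document, an HTML page) and Tika will report the detected type and print the extracted text, which is exactly the raw material the rest of this chapter is concerned with.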

Before diving too deep into Tika’s guts, as we’ll do in the next few chapters, we’d like you to take a step back and consider this: where does all the information that you feed to your Babel Fish come from? How is it stored? What’s the information’s ultimate utility, and where can Tika help deliver that utility in the way that you (or others) expect?

3.1. Measuring information overload

3.2. I’m feeling lucky—searching the information landscape

3.3. Beyond lucky: machine learning

3.4. Summary