Chapter 5. Content extraction
This chapter covers
- Full-text extraction
- Working with the Parser interface
- Reading data from a stream
- Exporting in XHTML format
Armed with Tika, you can be confident that you know each document’s pedigree, so sorting and organizing documents will be a snap. But what do you plan on doing with those documents once they’re organized?
Interactively, you’d likely pull the documents into your favorite editing application and start reading and updating their text. Programmatically, you’ll want to do much the same thing: once you know each document’s type and the applications associated with it (as we showed you in chapter 4), you can pick the right parser toolkits and libraries to read and modify each document’s text automatically from your own software.
But there are literally scores of those parsing toolkits and libraries, and each one extracts the underlying text and information from documents differently. It would help to have software in your toolbelt that chooses the right parsing library for each document and then normalizes the extracted text and information. This is where Tika comes in. The original and most important use case for Tika is extracting textual content from digital documents to build a full-text search index—a task that requires dealing with all of the different parsing toolkits out there and representing the extracted text in a uniform way.
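To give a taste of what this looks like in practice, here is a minimal sketch of full-text extraction using Tika’s `Tika` facade class, which hides the type detection and parser selection described above behind a single call. The class name `SimpleTextExtractor` is our own invention for illustration, and the sketch assumes the Tika application jar (or `tika-core` plus `tika-parsers`) is on the classpath; the underlying `Parser` interface is covered later in this chapter.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical example class: prints the plain-text content of each
// file named on the command line, whatever its document format.
public class SimpleTextExtractor {
    public static void main(String[] args) throws IOException, TikaException {
        // The Tika facade wraps type detection and parser selection
        Tika tika = new Tika();
        for (String path : args) {
            // parseToString() detects the document type, picks a parser,
            // and returns the extracted text in one uniform representation
            String text = tika.parseToString(new File(path));
            System.out.println(text);
        }
    }
}
```

The same one-line `parseToString()` call works whether the input is a PDF, a Word document, or an HTML page, which is precisely the normalization this chapter is about.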