Chapter 7. Extracting text with Tika

This chapter covers

Understanding Tika’s logical design
Using Tika’s built-in tool and APIs for text extraction
Parsing XML
Handling known Tika limitations

One of the more mundane yet vital steps when building a search application is extracting text from the documents you need to index. You might be lucky enough to have an application whose content is already in textual format or whose documents always share a single format, such as XML files or regular rows in a database. If you’re unlucky, you must instead accept the surprisingly wide variety of document formats that are popular today, such as Outlook, Word, Excel, PowerPoint, Visio, Flash, PDF, OpenOffice, Rich Text Format (RTF), and even archive file formats like TAR, ZIP, and BZIP2. Seemingly textual formats, like XML or HTML, present challenges of their own because you must take care not to accidentally include any tags or JavaScript sources. The plain text format might seem the simplest of all, yet determining its character set may not be easy.
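To see why the character set of a plain text file matters, consider the following minimal sketch. It uses only the standard Java library (not Tika), and the class and method names are hypothetical, chosen for illustration. The same four bytes decode to different strings depending on which encoding you assume, and nothing in the bytes themselves says which assumption is right.

```java
import java.nio.charset.StandardCharsets;

public class CharsetAmbiguity {

    // Decode raw bytes assuming they are ISO-8859-1 (Latin-1).
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Decode the same raw bytes assuming they are UTF-8.
    // Malformed input is replaced with U+FFFD rather than throwing.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" as written by an ISO-8859-1 editor: 0xE9 is the single byte for 'é'.
        byte[] raw = {0x63, 0x61, 0x66, (byte) 0xE9};

        String asLatin1 = decodeLatin1(raw);
        String asUtf8 = decodeUtf8(raw);

        System.out.println("ISO-8859-1: " + asLatin1);
        System.out.println("UTF-8:      " + asUtf8);
        System.out.println("Same text?  " + asLatin1.equals(asUtf8));
    }
}
```

If an indexer guesses the wrong encoding, the garbled characters end up in the index as terms, and searches for the original word will silently fail to match.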

In the past it was necessary to “go it alone”: track down your own document filters, one by one, and interact with their unique and interesting APIs in order to extract the text you need. You’d also need to detect the document type and character encoding yourself. Fortunately, there’s now an open source framework called Tika, under the Apache Lucene top-level project, that handles most of this work for you.

7.1. What is Tika?

7.2. Tika’s logical design and API

7.3. Installing Tika

7.4. Tika’s built-in text extraction tool

7.5. Extracting text programmatically

7.6. Tika’s limitations

7.7. Indexing custom XML

7.8. Alternatives

7.9. Summary
