Chapter 7. Extracting text with Tika
This chapter covers
- Understanding Tika’s logical design
- Using Tika’s built-in tool and APIs for text extraction
- Parsing XML
- Handling known Tika limitations
One of the more mundane yet vital steps when building a search application is extracting text from the documents you need to index. You might be lucky to have an application whose content is already in textual format or whose documents are always the same format, such as XML files or regular rows in a database. If you’re unlucky, you must instead accept the surprisingly wide plethora of document formats that are popular today, such as Outlook, Word, Excel, PowerPoint, Visio, Flash, PDF, Open Office, Rich Text Format (RTF), and even archive file formats like TAR, ZIP, and BZIP2. Seemingly textual formats, like XML or HTML, present challenges because you must take care not to accidentally include any tags or JavaScript sources. The plain text format might seem simplest of all, yet determining its character set may not be easy.
In the past it was necessary to “go it alone”: track down your own document filters, one by one, and interact with their unique and interesting APIs in order to extract the text you need. You’d also need to detect the document type and character encoding yourself. Fortunately, there’s now an open source framework called Tika, under the Apache Lucene top-level project, that handles most of this work for you.