Appendix A. Tika quick reference

All the key interfaces in Tika were described in detail earlier in this book and their Javadocs are all available online, but it’s often useful to have a quick reference for looking up some of the more commonly used functionality. This appendix answers that need by providing a summary of the key parts of the Tika API.

A.1. Tika facade

As discussed in chapter 2 and later in this book, the org.apache.tika.Tika facade class is designed to make simple Tika use cases as easy as possible. The facade class supports the methods shown in table A.1.

Table A.1. Key methods of the Tika facade class

Method                    Description

detect(...)               Returns the automatically detected media type of the given document. The return value is a string like application/pdf.
parse(...)                Parses the given document and returns the extracted plain text content. The return value is a java.io.Reader instance and the parsing happens in a background thread while the text stream is read.
parseToString(...)        Parses the given document and returns the extracted plain text content. The return value is a string whose length is limited by default to avoid memory issues with large documents.
setMaxStringLength(int)   Sets the maximum length of the parseToString return value.
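
As a quick illustration of the four methods in table A.1, here is a minimal sketch rather than a definitive recipe. It assumes that tika-core is on the classpath, along with tika-parsers for the text extraction calls to return anything useful. The class name TikaFacadeSketch and its helper methods are hypothetical names made up for this example; only the calls on the Tika object come from the facade itself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaFacadeSketch {

    // Hypothetical helper: an in-memory stream standing in for a real document.
    static InputStream sample(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // detect(...): returns the detected media type as a string.
    static String detectType(Tika tika, String text) throws IOException {
        try (InputStream in = sample(text)) {
            return tika.detect(in);
        }
    }

    // parse(...): returns a Reader; the text is produced as the reader is consumed.
    static String firstLine(Tika tika, String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(tika.parse(sample(text)))) {
            return reader.readLine();
        }
    }

    // setMaxStringLength(int) caps the length of what parseToString(...) returns.
    static String extractLimited(Tika tika, String text, int maxLength)
            throws IOException, TikaException {
        tika.setMaxStringLength(maxLength);
        try (InputStream in = sample(text)) {
            return tika.parseToString(in);
        }
    }

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        String text = "Hello, Tika!";
        System.out.println("type:      " + detectType(tika, text));
        System.out.println("firstLine: " + firstLine(tika, text));
        System.out.println("limited:   " + extractLimited(tika, text, 5));
    }
}
```

Keeping each facade call in its own small method makes it easy to see which checked exceptions each one declares: detect(...) and parse(...) only throw IOException here, while parseToString(...) can also throw TikaException.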

The type detection and text extraction methods accept the document to be processed in several ways. Table A.2 lists the most common ways of specifying a document.
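
To make the idea of specifying a document in different ways more concrete, the following sketch calls detect(...) with a file name, a java.io.File, a java.net.URL, and a java.io.InputStream. It assumes tika-core is on the classpath. The class TikaInputsSketch, its helper methods, and the sample file are hypothetical and exist only for this illustration; the overloads shown are not necessarily the complete set.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.tika.Tika;

public class TikaInputsSketch {

    // By name only: no content is read, so detection relies on the file extension.
    static String byName(Tika tika, String name) {
        return tika.detect(name);
    }

    // By file: Tika can use both the file name and the file content.
    static String byFile(Tika tika, File file) throws IOException {
        return tika.detect(file);
    }

    // By URL: here a file: URL pointing at the same document.
    static String byUrl(Tika tika, File file) throws IOException {
        URL url = file.toURI().toURL();
        return tika.detect(url);
    }

    // By stream: only the content is available, so the caller closes the stream.
    static String byStream(Tika tika, File file) throws IOException {
        try (InputStream in = Files.newInputStream(file.toPath())) {
            return tika.detect(in);
        }
    }

    public static void main(String[] args) throws IOException {
        Tika tika = new Tika();

        // Hypothetical sample document created just for this run.
        File file = File.createTempFile("sample", ".txt");
        try {
            Files.write(file.toPath(), "Hello, Tika!".getBytes(StandardCharsets.UTF_8));
            System.out.println("name:   " + byName(tika, "report.pdf"));
            System.out.println("file:   " + byFile(tika, file));
            System.out.println("url:    " + byUrl(tika, file));
            System.out.println("stream: " + byStream(tika, file));
        } finally {
            file.delete();
        }
    }
}
```

The name-only form is the cheapest because it never touches the document, but it is also the easiest to fool with a misleading extension. Passing a file, URL, or stream lets Tika inspect the content itself.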

A.2. Command-line options

A.3. ContentHandler utilities