Chapter 11. Extending Tika

This chapter covers adding type information, custom type detection, and customized parsing.

Thousands of document formats exist and new ones are constantly being introduced, so it’s impossible for a library like Tika to support all of them out of the box. Even though each Tika release adds support for new formats, there will be times when Tika can’t extract content from a document you’re working with, or can’t even detect its type. This chapter is about what you can do in that situation.

Imagine that you’re working with a new XML-based file format for medical prescriptions. Each file describes a single prescription and consists of a set of fixed and free-form fields. Optionally, a prescription document can be digitally signed and encrypted for better security and privacy. Figure 11.1 shows how such digital prescriptions can be used in practice.

Figure 11.1. Illustration of how a digital prescription document can be used to securely transfer accurate prescription information from a doctor to a pharmacy. A digital signature ensures that the document came from someone authorized to make prescriptions, and encryption is used to ensure the privacy of the patient.

11.1. Adding type information

11.2. Custom type detection

11.3. Customized parsing

11.4. Summary