Chapter 4. Document type detection


This chapter covers

  • Introduction to MIME types
  • Working with MIME types in Tika
  • Identifying file formats

Let’s talk about taxonomy, the science of classification. Taxonomies identify and classify concepts so that we can understand them better and share a vocabulary for describing them. For example, the Linnaean taxonomy[1] is the classical system of naming biological organisms using two-part Latin names that identify both the genus, or category, and the specific species within that genus. The name Homo sapiens identifies the modern human species as a member of the genus Homo, which also includes extinct human species such as Homo neanderthalensis. A similar taxonomy, called the internet media type system, is used to identify digital document formats.

4.1. Internet media types

4.2. Media types in Tika

4.3. File format diagnostics

4.4. Tika, the type inspector

4.5. Summary