Chapter 4. Document type detection
This chapter covers
- Introduction to MIME types
- Working with MIME types in Tika
- Identifying file formats
Let’s talk about taxonomy. Taxonomy is the science of classification. Taxonomies identify and classify concepts so that we can understand them better and share a common vocabulary for describing them. For example, the Linnaean taxonomy[1] is the classical system for naming biological organisms with two-part Latin names that identify both the genus, or category, and the specific species within that category. The name Homo sapiens identifies the modern human species as a member of the genus Homo, which also includes earlier human-like species such as the extinct Homo neanderthalensis. A similar taxonomy, called the internet media type system, is used to identify digital document formats.