Chapter 1. The case for the digital Babel fish
This chapter covers
The Babel fish in Douglas Adams’ book The Hitchhiker’s Guide to the Galaxy is a universal translator that allows you to understand all the languages in the world. It feeds on data that would otherwise be incomprehensible and produces an understandable translation. This is essentially what Apache Tika, a nascent technology available from the Apache Software Foundation, does for digital documents. Just as the protagonist Arthur Dent could understand Vogon poetry after inserting a Babel fish in his ear, a computer program that uses Tika can extract text and objects from Microsoft Word documents and all sorts of other files. Our goal in this book is to equip you with enough understanding of Tika’s architecture, implementation, extension points, and philosophy that making your programs file-agnostic becomes as simple as slipping a Babel fish into your ear.