Copyright
Brief Table of Contents
Table of Contents
Foreword
Preface
Acknowledgments
About this Book
About the Authors
About the Cover Illustration
1. Getting started
Chapter 1. The case for the digital Babel fish
1.1. Understanding digital documents
1.1.1. A taxonomy of file formats
1.1.2. Parser libraries
1.1.3. Structured text as the universal language
1.1.4. Universal metadata
1.1.5. The program that understands everything
1.2. What is Apache Tika?
1.2.1. A bit of history
1.2.2. Key design goals
1.2.3. When and where to use Tika
1.3. Summary
Chapter 2. Getting started with Tika
2.1. Working with Tika source code
2.1.1. Getting the source code
2.1.2. The Maven build
2.1.3. Including Tika in Ant projects
2.2. The Tika application
2.2.1. Drag-and-drop text extraction: the Tika GUI
2.2.2. Tika on the command line
2.3. Tika as an embedded library
2.3.1. Using the Tika facade
2.3.2. Managing dependencies
2.4. Summary
Chapter 3. The information landscape
3.1. Measuring information overload
3.1.1. Scale and growth
3.1.2. Complexity
3.2. I'm feeling lucky: searching the information landscape
3.2.1. Just click it: the modern search engine
3.2.2. Tika's role in search
3.3. Beyond lucky: machine learning
3.3.1. Your likes and dislikes
3.3.2. Real-world machine learning
3.4. Summary
2. Tika in detail
Chapter 4. Document type detection