Table of Contents

Brief Table of Contents

Table of Contents

Acknowledgments

About this Book

About the Authors

About the Cover Illustration

1. Getting started

Chapter 1. The case for the digital Babel fish

1.1. Understanding digital documents

1.1.1. A taxonomy of file formats

1.1.2. Parser libraries

1.1.3. Structured text as the universal language

1.1.4. Universal metadata

1.1.5. The program that understands everything

1.2. What is Apache Tika?

1.2.1. A bit of history

1.2.2. Key design goals

1.2.3. When and where to use Tika

Chapter 2. Getting started with Tika

2.1. Working with Tika source code

2.1.1. Getting the source code

2.1.2. The Maven build

2.1.3. Including Tika in Ant projects

2.2. The Tika application

2.2.1. Drag-and-drop text extraction: the Tika GUI

2.2.2. Tika on the command line

2.3. Tika as an embedded library

2.3.1. Using the Tika facade

2.3.2. Managing dependencies

Chapter 3. The information landscape

3.1. Measuring information overload

3.1.1. Scale and growth

3.1.2. Complexity

3.2. I'm feeling lucky: searching the information landscape

3.2.1. Just click it: the modern search engine

3.2.2. Tika's role in search

3.3. Beyond lucky: machine learning

3.3.1. Your likes and dislikes

3.3.2. Real-world machine learning

2. Tika in detail

Chapter 4. Document type detection
