Chapter 2. Foundations of taming text

 

In this chapter

  • Understanding text processing building blocks like tokenizing, chunking, parsing, and part-of-speech tagging
  • Extracting text from common file formats using the Apache Tika open source project

Naturally, before we can get started with the hard-core text-taming processes, we need a little warm-up first. We’ll start by laying the groundwork with a short high school English refresher, where we’ll delve into topics such as tokenization, stemming, parts of speech, and phrases and clauses. Each of these steps can play an important role in the quality of results you’ll see when building applications that use text. For instance, the seemingly simple act of splitting text into words can be difficult, especially in languages like Chinese. Even in English, dealing with punctuation appropriately can make tokenization hard. Likewise, identifying parts of speech and phrases in text can be difficult due to the ambiguity inherent in language.
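To make the punctuation problem concrete, here is a minimal sketch in Python. The sample sentence and both tokenization strategies are our own illustrative assumptions, not the approach used later in the chapter. It compares a naive whitespace split with a simple regular expression that separates punctuation from words.

```python
import re

# Hypothetical sample sentence, chosen to include abbreviations,
# a contraction, a price, and trailing punctuation.
sentence = "Dr. Smith didn't pay $3.50 for the U.S. edition, did he?"

# Naive approach: split on whitespace. Punctuation stays glued to
# the neighboring word ("edition," and "he?").
whitespace_tokens = sentence.split()

# Second attempt: runs of word characters become tokens, and every
# other non-space character becomes its own token. This frees the
# comma and question mark, but it also breaks apart "Dr.", "didn't",
# "$3.50", and "U.S.".
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Neither result is what most applications want: the first leaves punctuation attached to words, and the second splits abbreviations, contractions, and numbers into fragments. Handling such cases well is what makes tokenization harder than it first appears.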

We’ll follow up the discussion on language foundations by looking at how to extract text from the many different file formats encountered in the wild. Though many books and papers wave their hands at content extraction, assuming users have plain text ready to go, we feel it’s important to investigate some of the issues involved with content extraction for several reasons:

2.1. Foundations of language

2.2. Common tools for text processing

2.3. Preprocessing and extracting content from common file formats

2.4. Summary

2.5. Resources