Chapter 8. Building a text analysis toolkit


This chapter covers

  • A brief introduction to Lucene
  • Understanding tokenizers, TokenStream, and analyzers
  • Building an analyzer to detect phrases and inject synonyms
  • Use cases for leveraging the infrastructure

It’s now common for applications to leverage user-generated content (UGC). Users may generate content in many ways: writing blog entries, sending messages to others, answering or posing questions on message boards, keeping journal entries, or creating lists of related items. In chapter 3, we looked at the use of tagging to represent metadata associated with content. We mentioned that tags can also be detected by automated algorithms.

In this chapter, we build a toolkit to analyze content. This toolkit will enable us to extract tags and their associated weights to build a term-vector representation of the text. This term-vector representation can be used to

  • Build metadata about the user as described in chapter 2
  • Create tag clouds as shown in chapter 3
  • Mine the data to create clusters of similar documents as shown in chapter 9
  • Build predictive models as shown in chapter 10
  • Form a basis for understanding search as used in chapter 11
  • Form a basis for developing a content-based recommendation engine as shown in chapter 12
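To make the term-vector idea concrete before we dive into Lucene, here is a minimal sketch of computing a term-frequency vector from raw text using only the Java standard library. It is not Lucene's API (Lucene's analyzers handle this far more robustly, as we'll see in section 8.2); the class name `TermVectorSketch` and the naive lowercase/split tokenization are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermVectorSketch {

    // Build a simple term-frequency vector: lowercase the text,
    // split on runs of non-letter characters, and count each term.
    // (A real analyzer would also remove stop words, stem terms,
    // detect phrases, and inject synonyms.)
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;           // skip leading split artifacts
            vector.merge(token, 1, Integer::sum);    // increment term count
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> v =
                termVector("Users tag content; tags describe content.");
        System.out.println(v); // {users=1, tag=1, content=2, tags=1, describe=1}
    }
}
```

The map of term-to-weight pairs produced here is the essence of the representation the toolkit builds; the rest of the chapter is about producing better terms (phrases, synonyms, normalized tokens) and better weights.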

8.1. Building the text analyzers

8.2. Building the text analysis infrastructure

8.3. Use cases for applying the framework

8.4. Summary

8.5. Resources