Chapter 8. Building a text analysis toolkit


This chapter covers

  • A brief introduction to Lucene
  • Understanding tokenizers, TokenStream, and analyzers
  • Building an analyzer to detect phrases and inject synonyms
  • Use cases for leveraging the infrastructure

It’s now common for applications to leverage user-generated content (UGC). Users may generate content in many ways: writing blog entries, sending messages to others, answering or posing questions on message boards, keeping journal entries, or creating lists of related items. In chapter 3, we looked at the use of tagging to represent metadata associated with content. We mentioned that tags can also be detected by automated algorithms.

In this chapter, we build a toolkit to analyze content. This toolkit will enable us to extract tags and their associated weights to build a term-vector representation of the text. This term-vector representation can be used to

  • Build metadata about the user as described in chapter 2
  • Create tag clouds as shown in chapter 3
  • Mine the data to create clusters of similar documents as shown in chapter 9
  • Build predictive models as shown in chapter 10
  • Form a basis for understanding search as used in chapter 11
  • Form a basis for developing a content-based recommendation engine as shown in chapter 12
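To make the term-vector idea concrete before we dive into Lucene, here is a minimal sketch of computing a term-frequency vector from raw text using only the Java standard library. It is not Lucene's API (Lucene's analyzers handle this far more robustly, as we'll see in section 8.2); the class name `TermVectorSketch` and the naive lowercase/split tokenization are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermVectorSketch {

    // Build a simple term-frequency vector: lowercase the text,
    // split on runs of non-letter characters, and count each term.
    // (A real analyzer would also remove stop words, stem terms,
    // detect phrases, and inject synonyms.)
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> vector = new LinkedHashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;           // skip leading split artifacts
            vector.merge(token, 1, Integer::sum);    // increment term count
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> v =
                termVector("Users tag content; tags describe content.");
        System.out.println(v); // {users=1, tag=1, content=2, tags=1, describe=1}
    }
}
```

The map of term-to-weight pairs produced here is the essence of the representation the toolkit builds; the rest of the chapter is about producing better terms (phrases, synonyms, normalized tokens) and better weights.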

8.1. Building the text analyzers

8.2. Building the text analysis infrastructure

8.3. Use cases for applying the framework

8.4. Summary

8.5. Resources