Chapter 6. Clustering text

In this chapter

  • Basic concepts behind common text clustering algorithms
  • Examples of how clustering can help improve text applications
  • How to cluster words to identify topics of interest
  • Clustering whole document collections using Apache Mahout and clustering search results using Carrot2

How often have you browsed content online and clicked through on an article with an interesting title, only to find that the underlying story was basically the same as the one you just finished? Or perhaps you’re tasked with briefing your boss on the day’s news but don’t have time to read all the content involved when all you need is a summary and a few key points. Alternatively, maybe your users routinely enter ambiguous or generic query terms, or your data covers many different topics, and you want to group search results to spare users from sifting through unrelated ones. A text processing tool that automatically groups similar items and presents the results with summarizing labels is a good way to work through large amounts of text or search results without having to read all, or even most, of the content.

6.1. Google News document clustering

6.2. Clustering foundations

6.3. Setting up a simple clustering application

6.4. Clustering search results using Carrot2

6.5. Clustering document collections with Apache Mahout

6.6. Topic modeling using Apache Mahout

6.7. Examining clustering performance

6.8. Acknowledgments

6.9. Summary

6.10. References