Chapter 9. Discovering patterns with clustering
This chapter covers
- k-means, hierarchical clustering, and probabilistic clustering
- Clustering blog entries
- Clustering using WEKA
- Clustering using the JDM APIs
It’s fascinating to analyze results found by machine learning algorithms. One of the most commonly used methods for discovering groups of related users or content is the process of clustering, which we discussed briefly in chapter 7. Clustering algorithms run in an automated manner and can create pockets or clusters of related items. Results from clustering can be leveraged to build classifiers and predictors, or used in collaborative filtering. These unsupervised learning algorithms can provide insight into how your data is distributed.
In the last few chapters, we built a lot of infrastructure. It’s now time to have some fun and leverage this infrastructure to analyze some real-world data. In this chapter, we focus on understanding and applying some of the key clustering algorithms. K-means, hierarchical clustering, and expectation maximization (EM) are three of the most commonly used clustering algorithms.
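To make the idea behind k-means concrete before we dive into the implementations, here is a minimal sketch of the algorithm on one-dimensional points. The class and method names are illustrative, not the code we build later in the chapter: each point is assigned to its nearest centroid, each centroid is then moved to the mean of its assigned points, and the two steps repeat until the assignments stop changing.

```java
import java.util.*;

// A minimal k-means sketch on 1-D points (illustrative names, not the
// book's implementation). Hierarchical clustering and EM, covered later,
// follow different strategies but pursue the same goal: grouping items.
public class KMeansSketch {

    // Runs k-means: assignment step, then update step, until convergence
    // or maxIter iterations. Returns the final centroids.
    public static double[] cluster(double[] points, double[] centroids, int maxIter) {
        int k = centroids.length;
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: each point joins its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(points[i] - centroids[c])
                            < Math.abs(points[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            // Update step: each centroid moves to the mean of its points.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < points.length; i++) {
                sum[assignment[i]] += points[i];
                count[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
            if (!changed) {
                break; // assignments are stable; we have converged
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Two obvious groups: values near 1 and values near 9.
        double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 8.8};
        double[] centroids = cluster(points, new double[]{0.0, 10.0}, 100);
        Arrays.sort(centroids);
        System.out.println(Arrays.toString(centroids));
    }
}
```

Real implementations, such as the WEKA clusterer we use later, work on multi-dimensional instances and choose the initial centroids more carefully, but the assign-then-update loop is the same.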
As discussed in section 2.2.6, there are two main representations for data. The first is the low-dimension, densely populated dataset; the second is the high-dimension, sparsely populated dataset, which we use for text term vectors and for representing user click-throughs. In this chapter, we look at clustering techniques for both kinds of datasets.
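The contrast between the two representations can be sketched as follows. This is an illustrative example, not the book's term-vector classes: a dense vector simply stores every attribute in an array, while a sparse term vector stores only the nonzero term weights in a map keyed by term id, which is what makes high-dimension text data tractable.

```java
import java.util.*;

// Illustrative sketch (not the book's code) contrasting the two dataset
// representations discussed in section 2.2.6.
public class VectorRepresentations {

    // Dense, low-dimension: e.g. a user profile with a few numeric
    // attributes; every position is stored, including zeros.
    static double[] denseVector = {0.5, 1.2, 0.0, 3.4};

    // Sparse, high-dimension: e.g. a text term vector over a large
    // vocabulary, keyed by term id, storing only nonzero weights.
    static Map<Integer, Double> sparseVector(int[] termIds, double[] weights) {
        Map<Integer, Double> v = new HashMap<>();
        for (int i = 0; i < termIds.length; i++) {
            if (weights[i] != 0.0) {
                v.put(termIds[i], weights[i]);
            }
        }
        return v;
    }

    // A dot product between sparse vectors touches only the terms they
    // share, no matter how large the vocabulary is.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                sum += e.getValue() * w;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two documents sharing terms 101 and 20000 out of a huge vocabulary.
        Map<Integer, Double> d1 =
                sparseVector(new int[]{3, 101, 20000}, new double[]{0.5, 1.0, 2.0});
        Map<Integer, Double> d2 =
                sparseVector(new int[]{101, 20000}, new double[]{2.0, 0.5});
        System.out.println(dot(d1, d2)); // 1.0*2.0 + 2.0*0.5
    }
}
```

The clustering algorithms in this chapter need only a distance or similarity measure between items, so they apply to either representation once such a measure (like the dot product above) is defined.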