Chapter 10. Evaluating and improving clustering quality

 

This chapter covers

  • Inspecting clustering output
  • Evaluating the quality of clustering
  • Improving clustering quality

We saw many types of clustering algorithms in the last chapter: k-means, canopy, fuzzy k-means, Dirichlet, and latent Dirichlet allocation (LDA). Each performed well on certain types of data and poorly on others. The most natural question that comes to mind after every clustering job is, “How well did the algorithm perform on the data?”

Analyzing the output of clustering is an important exercise. It can be done with simple command-line tools or richer GUI-based visualizations. Once the clusters are visualized and problem areas are identified, these results can be formalized into quality measures, which give numeric values that indicate how good the clusters are. In this chapter, we look at several ways to inspect, evaluate, and improve our clustering algorithms.
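To make the idea of a numeric quality measure concrete, here is a minimal sketch (plain Python, not Mahout's API; the function name and data layout are invented for illustration) of one of the simplest such measures: the average distance from each point to its cluster centroid. Lower values indicate tighter clusters under the chosen distance measure.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def avg_intra_cluster_distance(clusters):
    """Mean distance from each point to its own cluster's centroid.

    `clusters` maps a cluster id to a list of points (tuples of floats).
    This is a toy stand-in for the richer measures discussed later.
    """
    total, count = 0.0, 0
    for points in clusters.values():
        n = len(points)
        # Centroid: coordinate-wise mean of the cluster's points.
        centroid = tuple(sum(coord) / n for coord in zip(*points))
        for p in points:
            total += dist(p, centroid)
            count += 1
    return total / count

# Two tight, well-separated clusters: every point is 1.0 from its centroid.
clusters = {
    0: [(0.0, 0.0), (0.0, 2.0)],
    1: [(10.0, 0.0), (10.0, 2.0)],
}
print(avg_intra_cluster_distance(clusters))  # 1.0
```

A measure like this lets you compare two clusterings of the same data numerically, rather than eyeballing visualizations alone.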

Tuning a clustering problem typically involves creating a custom similarity metric and choosing the right algorithm. An evaluation measure shows how changes in the distance measure affect clustering quality. First, you need to understand what the clusters look like and which features are representative of each centroid. You also need to see the distribution of data points among the clusters. You don’t want to end up in a situation where k–1 clusters have one point each and the kth cluster has the rest of the points.
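The degenerate case described above is easy to detect mechanically. As a rough sketch (plain Python with a hypothetical helper name, not part of Mahout), the following tallies cluster sizes, lists empty clusters, and reports what fraction of points the largest cluster absorbed:

```python
from collections import Counter

def size_report(assignments, k):
    """Summarize how points are spread across k clusters.

    `assignments` is a list of cluster ids (0..k-1), one per point.
    Returns (sizes, empty_clusters, fraction_in_largest).
    """
    sizes = Counter(assignments)
    empty = [c for c in range(k) if c not in sizes]
    frac_largest = max(sizes.values()) / len(assignments)
    return sizes, empty, frac_largest

# A badly skewed result: k = 3, but one cluster holds 8 of 10 points.
assignments = [0] * 8 + [1, 2]
sizes, empty, frac = size_report(assignments, 3)
print(dict(sizes), empty, frac)  # {0: 8, 1: 1, 2: 1} [] 0.8
```

A fraction near 1.0, or a long list of empty clusters, is a strong hint that the distance measure or the choice of k needs revisiting before any finer-grained evaluation.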

10.1. Inspecting clustering output

10.2. Analyzing clustering output

10.3. Improving clustering quality

10.4. Summary