Chapter 10. Evaluating and improving clustering quality

 

This chapter covers

  • Inspecting clustering output
  • Evaluating the quality of clustering
  • Improving clustering quality

We saw many types of clustering algorithms in the last chapter: k-means, canopy, fuzzy k-means, Dirichlet, and latent Dirichlet allocation (LDA). Each performed well on certain types of data and poorly on others. The most natural question that comes to mind after every clustering job is, “How well did the algorithm perform on the data?”

Analyzing the output of clustering is an important exercise. It can be done with simple command-line tools or richer GUI-based visualizations. Once the clusters are visualized and problem areas are identified, these results can be formalized into quality measures, which give numeric values that indicate how good the clusters are. In this chapter, we look at several ways to inspect, evaluate, and improve our clustering algorithms.
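To make the idea of a numeric quality measure concrete, here is a minimal sketch (plain Python, not Mahout's API; the function name and data layout are invented for illustration) of one of the simplest such measures: the average distance from each point to its cluster centroid. Lower values indicate tighter clusters under the chosen distance measure.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def avg_intra_cluster_distance(clusters):
    """Mean distance from each point to its own cluster's centroid.

    `clusters` maps a cluster id to a list of points (tuples of floats).
    This is a toy stand-in for the richer measures discussed later.
    """
    total, count = 0.0, 0
    for points in clusters.values():
        n = len(points)
        # Centroid: coordinate-wise mean of the cluster's points.
        centroid = tuple(sum(coord) / n for coord in zip(*points))
        for p in points:
            total += dist(p, centroid)
            count += 1
    return total / count

# Two tight, well-separated clusters: every point is 1.0 from its centroid.
clusters = {
    0: [(0.0, 0.0), (0.0, 2.0)],
    1: [(10.0, 0.0), (10.0, 2.0)],
}
print(avg_intra_cluster_distance(clusters))  # 1.0
```

A measure like this lets you compare two clusterings of the same data numerically, rather than eyeballing visualizations alone.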

Tuning a clustering problem typically involves creating a custom similarity metric and choosing the right algorithm. An evaluation measure shows how changes in the distance measure affect clustering quality. First, you need to understand what the clusters look like and which features are representative of each centroid. You also need to see the distribution of data points among the clusters. You don’t want to end up in a situation where k–1 clusters have one point each and the kth cluster has the rest of the points.
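The degenerate case described above is easy to detect mechanically. As a rough sketch (plain Python with a hypothetical helper name, not part of Mahout), the following tallies cluster sizes, lists empty clusters, and reports what fraction of points the largest cluster absorbed:

```python
from collections import Counter

def size_report(assignments, k):
    """Summarize how points are spread across k clusters.

    `assignments` is a list of cluster ids (0..k-1), one per point.
    Returns (sizes, empty_clusters, fraction_in_largest).
    """
    sizes = Counter(assignments)
    empty = [c for c in range(k) if c not in sizes]
    frac_largest = max(sizes.values()) / len(assignments)
    return sizes, empty, frac_largest

# A badly skewed result: k = 3, but one cluster holds 8 of 10 points.
assignments = [0] * 8 + [1, 2]
sizes, empty, frac = size_report(assignments, 3)
print(dict(sizes), empty, frac)  # {0: 8, 1: 1, 2: 1} [] 0.8
```

A fraction near 1.0, or a long list of empty clusters, is a strong hint that the distance measure or the choice of k needs revisiting before any finer-grained evaluation.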

10.1. Inspecting clustering output

10.2. Analyzing clustering output

10.3. Improving clustering quality

10.4. Summary