This chapter covers
- Understanding hierarchical clustering
- Using linkage methods
- Measuring the stability of a clustering result
In the previous chapter, we saw how k-means clustering places k centroids in the feature space and iteratively updates their positions to partition the data into clusters. Hierarchical clustering takes a different approach and, as its name suggests, learns a hierarchy of clusters in a dataset. Instead of getting a “flat” output of clusters, hierarchical clustering gives us a tree of clusters within clusters. As a result, hierarchical clustering can provide more insight into complex grouping structures than flat clustering methods like k-means.
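To get a feel for what that tree buys us, here's a minimal sketch using R's built-in iris measurements as a stand-in dataset (we haven't loaded the GvHD data yet). A single hierarchical tree can be cut at any level to recover flat clusterings of different granularities, something a single k-means run can't do:

```r
# Build one hierarchical tree from pairwise distances
# (iris stands in for a real dataset here)
tree <- hclust(dist(scale(iris[, 1:4])))

# Cut the same tree at different levels: the finer clustering
# is nested inside the coarser one
cutree(tree, k = 2)  # two broad clusters
cutree(tree, k = 5)  # five finer clusters within them
```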
The tree of clusters is built iteratively: at each step, the algorithm calculates the distance between each case or cluster and every other case or cluster in the dataset. Depending on the algorithm, either the pair of cases/clusters that are most similar to each other is merged into a single cluster, or the set of cases/clusters that are most dissimilar from each other is split into separate clusters. I’ll introduce both approaches to you later in the chapter.
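As a rough preview of those two approaches (we'll meet them properly later under the names agglomerative and divisive clustering), here's a hedged sketch using the cluster package that ships with R. The iris data again stand in for a real dataset; nothing here depends on the GvHD data:

```r
library(cluster)

# Pairwise distances between all cases
d <- dist(scale(iris[, 1:4]))

# Agglomerative (bottom-up): repeatedly merge the most similar
# pair of cases/clusters
aggTree <- agnes(d, method = "average")

# Divisive (top-down): start with one cluster and repeatedly
# split off the most dissimilar cases
divTree <- diana(d)

# Plot the dendrogram of the agglomerative merge order
plot(aggTree, which.plots = 2)
```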
By the end of this chapter, I hope you’ll understand how hierarchical clustering works. We’ll apply this method to the GvHD data from the last chapter to help you understand how hierarchical clustering differs from k-means. If you no longer have the gvhdScaled object defined in your global environment, just rerun listings 16.1 and 16.2.
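If you'd rather not dig out those listings, the following sketch should produce an equivalent object. It assumes, as in the last chapter, that the GvHD data come from the mclust package; treat listings 16.1 and 16.2 as the authoritative version:

```r
library(tibble)

# Assumption: the previous chapter used the GvHD flow cytometry
# data from the mclust package (loads GvHD.control and GvHD.pos)
data(GvHD, package = "mclust")

gvhdTib <- as_tibble(GvHD.control)  # the control patient's cells
gvhdScaled <- scale(gvhdTib)        # center and scale each variable
```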