
16 Clustering by finding centers: k-means


This chapter covers:

  • Why do we need clustering?
  • What do over- and underfitting look like for clustering?
  • What is k-means clustering?
  • How can we validate the performance of a clustering algorithm?

Our first stop in clustering brings us to a very commonly used technique: k-means clustering.

I’ve used the word "technique" here rather than "algorithm" because k-means describes a particular approach to clustering that multiple algorithms follow. I’ll discuss these individual algorithms later in the chapter.

Important

Don’t confuse k-means with k-nearest neighbors! k-means is an unsupervised algorithm for clustering, whereas k-nearest neighbors is a supervised algorithm for classification.

K-means clustering attempts to learn a grouping structure in a dataset. The k-means approach starts with us defining how many clusters we believe there are in the dataset. This is what the k stands for: if we set k to three, we will identify three clusters (whether or not these represent a real grouping structure in the data). Arguably, this is a weakness of k-means, because we may have no prior knowledge of how many clusters to search for, but I’ll show you ways of selecting a sensible value of k later in the chapter.
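To make this concrete, below is a minimal sketch of the k-means approach using base R’s kmeans() function (the stats package implementation, which offers the Lloyd, MacQueen, and Hartigan-Wong algorithms discussed in section 16.1). The choice of the built-in iris data and of k = 3 here is purely for illustration, not the chapter’s GvHD example.

# A minimal sketch of the k-means approach, using base R's kmeans().
# The iris data and k = 3 are illustrative assumptions only.
data(iris)

irisScaled <- scale(iris[, 1:4])  # scale features so none dominates the distance calculation

set.seed(123)  # k-means starts from random centers, so fix the seed for reproducibility

irisKmeans <- kmeans(irisScaled,
                     centers = 3,  # this is k: the number of clusters we specify up front
                     nstart = 25)  # try 25 random starts and keep the best solution

irisKmeans$centers  # the final cluster centers
irisKmeans$cluster  # the cluster each case was assigned to

Because the result depends on the randomly chosen starting centers, it’s common practice to run multiple starts (the nstart argument) and keep the solution with the lowest total within-cluster sum of squares.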

16.1  What is k-means clustering?

16.1.1  How does Lloyd’s algorithm work?

16.1.2  How does MacQueen’s algorithm work?

16.1.3  How does the Hartigan-Wong algorithm work?

16.2  Building our first k-means model

16.2.1  Loading and exploring the GvHD dataset

16.2.2  Defining our task and learner

16.2.3  How do I choose the number of clusters?

16.2.4  Tuning k and the algorithm choice for our k-means model

16.2.5  Training the final, tuned k-means model

16.2.6  Using our model to predict clusters of new data

16.3  Strengths and weaknesses of k-means clustering

16.4  Summary

16.5  Solutions to exercises