
16 Clustering by finding centers: k-means


This chapter covers:

  • Why do we need clustering?
  • What do over- and underfitting look like for clustering?
  • What is k-means clustering?
  • How can we validate the performance of a clustering algorithm?

Our first stop in clustering brings us to a very commonly used technique: k-means clustering.

I’ve used the word "technique" here rather than "algorithm" because k-means describes a particular approach to clustering that multiple algorithms follow. I’ll discuss these individual algorithms later in the chapter.

Important

Don’t confuse k-means with k-nearest neighbors! k-means is an unsupervised algorithm for clustering, whereas k-nearest neighbors is a supervised algorithm for classification.

K-means clustering attempts to learn a grouping structure in a dataset. The k-means approach starts with us defining how many clusters we believe there are in the dataset. This is what the k stands for: if we set k to three, we will identify three clusters (whether or not these represent a real grouping structure in the data). Arguably, this is a weakness of k-means, because we may have no prior knowledge of how many clusters to search for, but I’ll show you ways of selecting a sensible value of k later in the chapter.
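To make this concrete, below is a minimal sketch of the k-means approach using base R’s kmeans() function (the stats package implementation, which offers the Lloyd, MacQueen, and Hartigan-Wong algorithms discussed in section 16.1). The choice of the built-in iris data and of k = 3 here is purely for illustration, not the chapter’s GvHD example.

# A minimal sketch of the k-means approach, using base R's kmeans().
# The iris data and k = 3 are illustrative assumptions only.
data(iris)

irisScaled <- scale(iris[, 1:4])  # scale features so none dominates the distance calculation

set.seed(123)  # k-means starts from random centers, so fix the seed for reproducibility

irisKmeans <- kmeans(irisScaled,
                     centers = 3,  # this is k: the number of clusters we specify up front
                     nstart = 25)  # try 25 random starts and keep the best solution

irisKmeans$centers  # the final cluster centers
irisKmeans$cluster  # the cluster each case was assigned to

Because the result depends on the randomly chosen starting centers, it’s common practice to run multiple starts (the nstart argument) and keep the solution with the lowest total within-cluster sum of squares.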

16.1  What is k-means clustering?

16.1.1  How does Lloyd’s algorithm work?

16.1.2  How does MacQueen’s algorithm work?

16.1.3  How does the Hartigan-Wong algorithm work?

16.2  Building our first k-means model

16.2.1  Loading and exploring the GvHD dataset

16.2.2  Defining our task and learner

16.2.3  How do I choose the number of clusters?

16.2.4  Tuning k and the algorithm choice for our k-means model

16.2.5  Training the final, tuned k-means model

16.2.6  Using our model to predict clusters of new data

16.3  Strengths and weaknesses of k-means clustering

16.4  Summary

16.5  Solutions to exercises