Chapter 16. Clustering by finding centers with k-means


This chapter covers

  • Understanding the need for clustering
  • Understanding over- and underfitting for clustering
  • Validating the performance of a clustering algorithm

Our first stop in clustering brings us to a very commonly used technique: k-means clustering. I’ve used the word technique here rather than algorithm because k-means describes a particular approach to clustering that multiple algorithms follow. I’ll talk about these individual algorithms later in the chapter.

Note

Don’t confuse k-means with k-nearest neighbors! K-means is an unsupervised learning technique for clustering, whereas k-nearest neighbors is a supervised algorithm for classification.

K-means clustering attempts to learn a grouping structure in a dataset. The k-means approach starts with us defining how many clusters we believe there are in the dataset. This is what the k stands for: if we set k to 3, we will identify three clusters (whether these represent a real grouping structure or not). Arguably, this is a weakness of k-means, because we may not have any prior knowledge as to how many clusters to search for, but I’ll show you ways to select a sensible value of k.
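To make the idea concrete before we dive into the chapter, here is a minimal sketch of the most common k-means procedure (Lloyd's algorithm) in Python with NumPy. The function name `kmeans`, the toy data, and the fixed iteration cap are my own illustrative choices, not part of this book's code; the point is simply that we choose k up front, then alternate between assigning points to their nearest center and moving each center to the mean of its points.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """A sketch of Lloyd's algorithm for k-means; k is chosen by the user."""
    rng = np.random.default_rng(seed)
    # Initialize by picking k distinct data points as the starting centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (keeping the old center if a cluster happens to end up empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # centers stopped moving, so we have converged
        centroids = new
    return labels, centroids

# Toy data: three blobs of 30 two-dimensional points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])
labels, centroids = kmeans(X, k=3)
```

Note that because we set k = 3 here, the sketch will always return exactly three clusters, and a poor random initialization can still land two centers in the same blob; this is exactly the kind of weakness, and the remedies for it, that the rest of the chapter discusses.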

16.1. What is k-means clustering?

16.2. Building your first k-means model

16.3. Strengths and weaknesses of k-means clustering

Summary

Solutions to exercises
