This chapter covers
- Understanding the need for clustering
- Understanding over- and underfitting for clustering
- Validating the performance of a clustering algorithm
Our first stop in clustering brings us to a very commonly used technique: k-means clustering. I’ve used the word technique here rather than algorithm because k-means describes a particular approach to clustering that multiple algorithms follow. I’ll talk about these individual algorithms later in the chapter.
Note
Don’t confuse k-means with k-nearest neighbors! K-means is an unsupervised learning technique, whereas k-nearest neighbors is a supervised classification algorithm.
K-means clustering attempts to learn a grouping structure in a dataset. The k-means approach starts with us defining how many clusters we believe there are in the dataset. This is what the k stands for: if we set k to 3, we will identify three clusters (whether these represent a real grouping structure or not). Arguably, this is a weakness of k-means, because we may have no prior knowledge of how many clusters to search for, but I’ll show you ways to select a sensible value of k.
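To make the approach concrete, here is a minimal sketch of the classic k-means procedure (Lloyd's algorithm) in plain Python. The two-blob dataset, the choice of k = 2, and the simple random initialization are my own illustrative assumptions, not examples from this chapter:

```python
# A minimal k-means sketch in pure Python (no external libraries).
# The dataset, k = 2, and the initialization scheme are illustrative
# assumptions for this sketch only.
import math
import random

def kmeans(points, k, n_iter=100, seed=42):
    """Cluster points by repeating two steps until stable:
    1. assign each point to its nearest centroid;
    2. move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k data points as starting centroids
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its cluster
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members)
                                           for dim in zip(*members)))
            else:  # keep the old centroid if a cluster emptied out
                new_centroids.append(centroids[c])
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs; with k = 2 the algorithm recovers them.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
```

Note that the sketch asks for k up front, exactly as described above: the algorithm will always return k clusters, sensible or not.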