This chapter covers:
- Why do we need clustering?
- What do overfitting and underfitting look like for clustering?
- What is k-means clustering?
- How can we validate the performance of a clustering algorithm?
Our first stop in clustering brings us to a very commonly used technique: k-means clustering.
I’ve used the word "technique" here rather than "algorithm" because k-means describes a particular approach to clustering that multiple algorithms follow. I’ll talk about these individual algorithms later in the chapter.
Important
Don’t confuse k-means with k-nearest neighbors! k-means is an unsupervised learning technique for clustering, whereas k-nearest neighbors is a supervised learning algorithm for classification.
K-means clustering attempts to learn a grouping structure in a dataset. The k-means approach starts with us defining how many clusters we believe there are in the dataset. This is what the k stands for: if we set k to three, we will identify three clusters (whether these represent a real grouping structure or not). Arguably, this is a weakness of k-means, as we may not have any prior knowledge as to how many clusters to search for, but later in the chapter I’ll show you ways of selecting a sensible value of k.
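To make the idea concrete, here is a minimal sketch of the most common k-means procedure (Lloyd's algorithm) in plain Python. This is an illustrative toy, not the book's code: the `kmeans` function, its parameters, and the example data are all hypothetical. Notice that we must supply k up front, exactly as described above.

```python
import random

def kmeans(points, k, n_iter=100, seed=42):
    """A toy sketch of Lloyd's algorithm (hypothetical helper, not from the book):
    alternate between assigning points to their nearest centroid and moving
    each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid
        labels = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an emptied cluster's centroid
        if new_centroids == centroids:  # no movement means we have converged
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs of 2D points; with k = 2 the algorithm recovers them,
# but note that k = 3 would dutifully report three clusters anyway.
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
        (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(data, k=2)
```

The key point the sketch illustrates is that k is an input, not an output: the algorithm will always return exactly k clusters, which is why choosing a sensible value of k matters.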