6 K-Means Clustering
In this chapter we look at a clustering problem in which we try to find distinct groups of a wholesale distributor's customers that share common behavior, so that we can devise targeted marketing campaigns to attract more customers and grow the business. This type of clustering problem is usually called customer segmentation, and it's one of the most common clustering problems that one may encounter. Many approaches can be used to solve such a problem; here we're going to look at one that stems from the similarity-based approach we started this part of the book with. The method we're going to work with is called k-means. As the name suggests, k-means tries to find k mean points (central points, or centroids) around which groups of the data points cluster.
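To make the idea of centroids concrete before we get to the real data, here is a minimal sketch using scikit-learn's KMeans. The points are synthetic and made up purely for illustration (they are not the wholesale distributor's data), and the choices of k = 2, the random seed, and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points (illustrative only, not the wholesale data):
# one loose group around (0, 0) and another around (10, 10).
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=1.0, size=(50, 2))
points = np.vstack([group_a, group_b])

# Ask for k = 2 centroids; n_init and random_state are fixed
# so that repeated runs give the same result.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)

# Each row of cluster_centers_ is one centroid: the mean of the
# points assigned to that cluster.
centroids = model.cluster_centers_
print(centroids)
```

With groups this well separated, we would expect one centroid to land near (0, 0) and the other near (10, 10), and each point to be labeled with the group it was drawn from. Real customer data is rarely this tidy, which is why the rest of the chapter spends time on preprocessing and on choosing k.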
We're going to learn how k-means works in detail, how we can use scikit-learn to apply it to our problem, and how we can figure out the right value of k. Down the road, we're going to take a closer look at why k-means works, its limitations, and the hidden assumptions that lie behind it. But first, before we can do any of that, we'll define the problem we have, understand the data associated with it, and do the necessary processing to get everything ready for k-means to work its magic.