chapter eleven

11 K-means clustering

This chapter covers

Developing a K-means clustering algorithm
Computing and visualizing optimal cluster counts
Understanding standard deviations and computing z-scores
Creating Cleveland dot plots

Our primary purpose in this chapter is to demonstrate how to develop a K-means clustering algorithm. K-means clustering is a popular unsupervised learning method and multivariate analysis technique that partitions observations into a chosen number of clusters, or groups, so that strategies can be tailored to each group rather than applied uniformly across the data. Unsupervised learning is a learning method where the goal is to find patterns, structures, or relationships in data using only input variables and therefore no target, or output, data. By contrast, supervised learning methods use both input and output variables, usually to make predictions. In the former, you don't know in advance what you are looking for; in the latter, you've already figured that out. Multivariate analysis refers to statistical techniques and methods used to analyze and understand relationships among two or more variables simultaneously.
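To make the idea of unsupervised learning concrete, here is a minimal sketch of the textbook K-means procedure (often called Lloyd's algorithm), written in plain Python. It is an illustration only, not the implementation developed later in the chapter. The function name `kmeans`, the toy data points, and the choice to seed the centroids with the first k observations are all assumptions made for this example; library implementations typically use random or smarter initialization.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster a list of equal-length numeric tuples into k groups."""
    # Assumption for this sketch: seed the centroids with the first k points
    # so the result is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return labels, centroids


# Made-up two-variable observations: only inputs, no target variable.
points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Notice that no output variable is supplied anywhere: the algorithm discovers the two groups from the input variables alone, which is what distinguishes it from a supervised method.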

11.1 Loading packages

11.2 Importing data

11.3 A primer on standard deviations and z-scores

11.4 Analysis

11.4.1 Wrangling data

11.4.2 Evaluating payrolls and wins

11.5 K-means clustering

11.5.1 More data wrangling

11.5.2 K-means clustering

Summary