Our next stop in unsupervised learning is clustering. Clustering covers a range of techniques used to identify groups, or clusters, of cases in a dataset. A cluster is a set of cases that are more similar to each other than they are to cases in other clusters.
Conceptually, clustering can be considered similar to classification, in that we are trying to assign a discrete value to each case. The difference is that classification uses labeled cases to learn the patterns in the data that separate known classes, whereas we use clustering when we don’t have any prior knowledge about class membership, or even about whether there are distinct classes in the data. Clustering therefore describes a set of algorithms that try to identify a grouping structure within a dataset.
In chapters 16 through 19, I’ll arm you with different clustering techniques that can handle a range of clustering problems. Validating the performance of a clustering algorithm can be a challenge, and there may not always be an obvious or even a “correct” answer, but I’ll teach you skills to help maximize the information you get from these approaches.