10 Clustering data into groups

This section covers

  • Clustering data by centrality
  • Clustering data by density
  • Trade-offs between clustering algorithms
  • Executing clustering using the scikit-learn library
  • Iterating over clusters using Pandas

Clustering is the process of organizing data points into conceptually meaningful groups. What makes a given group “conceptually meaningful”? There is no easy answer to that question. The usefulness of any clustering output depends on the task we’ve been assigned.

Imagine that we’re asked to cluster a collection of pet photos. Do we cluster fish and lizards in one group and fluffy pets (such as hamsters, cats, and dogs) in another? Or should hamsters, cats, and dogs each be assigned a separate cluster? If so, perhaps we should go further and cluster pets by breed, so that Chihuahuas and Great Danes fall into different clusters. Differentiating between dog breeds will not be easy. However, we can easily distinguish Chihuahuas from Great Danes based on size. Maybe we should compromise: we’ll cluster on both fluffiness and size, thus bypassing the distinction between the Cairn Terrier and the similar-looking Norwich Terrier.

10.1 Using centrality to discover clusters

10.2 K-means: A clustering algorithm for grouping data into K central groups

10.2.1 K-means clustering using scikit-learn

10.2.2 Selecting the optimal K using the elbow method

10.3 Using density to discover clusters

10.4 DBSCAN: A clustering algorithm for grouping data based on spatial density

10.4.1 Comparing DBSCAN and K-means

10.4.2 Clustering based on non-Euclidean distance

10.5 Analyzing clusters using Pandas

Summary