chapter twelve

12 Clustering

 

This chapter covers

  • Classification of different types of clustering
  • Partitioning Clustering
  • Understanding and implementing k-means
  • Density-based clustering
  • Understanding and implementing DBSCAN
  • OPTICS: refining DBSCAN as a hierarchical clustering
  • Evaluating clustering results

In the previous chapters we have described, implemented and applied three data structures designed to efficiently solve nearest neighbor search; when we moved to their applications, we mentioned that clustering was one of the main areas where an efficient nearest search could make a difference. So far we had to delay this discussion, but now it’s finally time to put the icing on the cake, and get the best of our hard work. In this chapter, we will first briefly introduce clustering, explaining what it is and where it stands with respect to machine learning and AI. We’ll see that there are different types of clustering, with radically different approaches, and then we will present and discuss in detail 3 algorithms that uses different approaches; by going through the whole chapter readers will be exposed to the theoretical foundations for this topic, learn about algorithms that can be implemented or just applied to break down datasets into smaller homogeneous groups, and also, in the process, get a deeper understanding of nearest neighbor search and multi-dimensional indexing.

12.1  Intro to Clustering

12.1.1    Types of Learning

12.1.2    Types of Clustering

12.2  K-means

12.2.1    Issues With K-means

12.2.2    The Curse of Dimensionality Strikes Again

12.2.3    K-means Performance Analysis

12.2.4    Boosting k-means with k-d Trees

12.2.5    Final Remarks on k-means

12.3  DBSCAN

12.3.1    Directly vs Density-Reachable

12.3.2    From Definitions to an Algorithm

12.3.3    And Finally, an Implementation

12.3.4    PROs and CONs With DBSCAN

12.4  OPTICS

12.4.1    Definitions

12.4.2    OPTICS Algorithm

12.4.3    From Reachability Distance to Clustering

12.4.4    Hierarchical clustering

12.4.5    Performance Analysis and Final Considerations

12.5  Evaluating Clustering Results: Evaluation Metrics

12.5.1    Interpreting the Results

12.6  Summary