12 Clustering


This chapter covers

  • Classifying different types of clustering
  • Partitioning clustering
  • Understanding and implementing k-means
  • Density-based clustering
  • Understanding and implementing DBSCAN
  • OPTICS: Refining DBSCAN as hierarchical clustering
  • Evaluating clustering results

In the previous chapters, we described, implemented, and applied three data structures designed to solve nearest neighbor search efficiently. When we moved on to their applications, we mentioned that clustering is one of the main areas where an efficient nearest neighbor search can make a difference. We had to delay that discussion, but now it's finally time to put the icing on the cake and get the most out of our hard work.

In this chapter, we will first briefly introduce clustering, explaining what it is and how it relates to machine learning and AI. We'll see that there are different types of clustering, with radically different approaches, and then we will present and discuss in detail three algorithms that take different approaches. By the end of the chapter, readers will have been exposed to the theoretical foundations of the topic, will know algorithms that can be implemented (or simply applied) to break a dataset down into smaller, homogeneous groups, and, in the process, will have gained a deeper understanding of nearest neighbor search and multidimensional indexing.
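As a preview of what "breaking a dataset down into smaller, homogeneous groups" looks like in practice, here is a minimal, naive sketch of k-means, the first algorithm we will study. This is not the implementation developed later in the chapter (which will use k-d trees to speed up the nearest-centroid search); it is a plain, brute-force version written only to illustrate the assign-then-update loop, with function and parameter names chosen for this sketch.

```python
import random

def k_means(points, k, iterations=20, seed=42):
    """Naive k-means on 2D points: repeatedly assign each point to its
    nearest centroid, then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    # Initialize centroids by sampling k distinct input points.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: brute-force nearest centroid by
            # squared Euclidean distance (this is the part that
            # nearest neighbor data structures can accelerate).
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                            + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster's mean
        # (keeping the old centroid if the cluster is empty).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated blobs: k-means recovers them as two clusters.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = k_means(data, k=2)
```

Even this toy version exposes the cost structure discussed later: every iteration compares every point against every centroid, which is exactly where an efficient nearest neighbor index pays off.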

12.1 Intro to clustering

12.1.1 Types of learning

12.1.2 Types of clustering

12.2 K-means

12.2.1 Issues with k-means

12.2.2 The curse of dimensionality strikes again

12.2.3 K-means performance analysis

12.2.4 Boosting k-means with k-d trees

12.2.5 Final remarks on k-means

12.3 DBSCAN

12.3.1 Directly vs density-reachable

12.3.2 From definitions to an algorithm

12.3.3 And finally, an implementation

12.3.4 Pros and cons of DBSCAN

12.4 OPTICS

12.4.1 Definitions

12.4.2 OPTICS algorithm
