Chapter 2

2 Clustering techniques


“Simplicity is the ultimate sophistication.” – Leonardo da Vinci

Nature loves simplicity, and it teaches us to follow the same path. Most of the time, our decisions are simple choices. Simple solutions are easier to comprehend, less time-consuming, and painless to maintain. The machine learning world is no different. An elegant machine learning solution is not the one that uses the most complicated algorithm available, but the one that solves the business problem. A robust machine learning solution is easy to decipher and pragmatic enough to implement. A fully functional machine learning solution cracks the business challenge effectively and efficiently and is deployable in a production environment. As data scientists, we always strive to attain a mature, effective, and scalable machine learning solution.

Recall from Chapter 1, where we discussed data and its types, the nuts and bolts of machine learning, and the different types of algorithms available. We started by defining unsupervised learning and studied the steps followed in an unsupervised learning solution. Continuing on the same path, in this second chapter we begin our study of unsupervised clustering algorithms.

2.1 Technical toolkit

2.2 Clustering

2.2.1 Clustering techniques

2.3 Centroid based clustering

2.3.1 k-means clustering

2.3.2 Measuring the accuracy of clustering

2.3.3 Finding the optimum value of “k”

2.3.4 Pros and cons of k-means clustering

2.3.5 k-means clustering implementation using Python

2.4 Connectivity based clustering

2.4.1 Types of hierarchical clustering

2.4.2 Linkage criterion for distance measurement

2.4.3 Optimal number of clusters

2.4.4 Pros and cons of hierarchical clustering

2.4.5 Hierarchical clustering case study using Python

2.5 Density based clustering

2.5.1 Neighborhood and density

2.5.2 DBSCAN clustering

2.6 Case study using clustering

2.7 Common challenges faced in clustering

2.8 Summary