Chapter 9. Clustering algorithms in Mahout

This chapter covers

K-means clustering
Centroid generation using canopy clustering
Fuzzy k-means clustering and Dirichlet clustering
Topic modeling using latent Dirichlet allocation as a variant of clustering

Now that you know how input data is represented as Vectors and how SequenceFiles are created as input for the clustering algorithms, you’re ready to explore the clustering algorithms that Mahout provides. Mahout includes many of them, and each works well on some data sets and poorly on others. K-means is a generic clustering algorithm that can be adapted to fit almost any situation. It’s also simple to understand and can easily be executed in parallel across many machines.

Therefore, before going into the details of various clustering algorithms, it’s best to get some hands-on experience using the k-means algorithm. Then it becomes easier to understand the shortcomings and pitfalls of other less generic techniques, and see how they can achieve better clustering of data in particular situations. You’ll use the k-means algorithm to cluster news articles and then improve the clustering quality using other techniques. You’ll then learn how the value of k in k-means can be inferred using canopy clustering. With this knowledge, you’ll create a clustering pipeline for a news aggregation website to get a better feel for real-world problems in clustering.

9.1. K-means clustering

9.2. Beyond k-means: an overview of clustering techniques

9.3. Fuzzy k-means clustering

9.4. Model-based clustering

9.5. Topic modeling using latent Dirichlet allocation (LDA)

9.6. Summary
