Chapter 9. Clustering algorithms in Mahout

 

This chapter covers

  • K-means clustering
  • Centroid generation using canopy clustering
  • Fuzzy k-means clustering and Dirichlet clustering
  • Topic modeling using latent Dirichlet allocation as a variant of clustering

Now that you know how input data is represented as Vectors and how SequenceFiles are created as input for the clustering algorithms, you’re ready to explore the clustering algorithms that Mahout provides. Mahout offers many of them, and any given algorithm works well on some data sets and poorly on others. K-means is a general-purpose clustering algorithm that can be adapted to fit most situations. It’s also simple to understand and easy to run in parallel across many machines.

For that reason, before going into the details of the other clustering algorithms, it’s best to get some hands-on experience with k-means. That makes it easier to understand the shortcomings and pitfalls of the less general techniques, and to see how they can cluster data better in particular situations. You’ll use the k-means algorithm to cluster news articles and then improve the clustering quality using other techniques. You’ll then learn how the value of k in k-means can be inferred using canopy clustering. With this knowledge, you’ll create a clustering pipeline for a news aggregation website to get a better feel for real-world problems in clustering.

9.1. K-means clustering

 

9.2. Beyond k-means: an overview of clustering techniques

 
 

9.3. Fuzzy k-means clustering

 
 
 

9.4. Model-based clustering

 

9.5. Topic modeling using latent Dirichlet allocation (LDA)

 
 
 

9.6. Summary

 
 
 