13 Parallel Clustering
MapReduce and Canopy Clustering


This chapter covers

  • Understanding parallel and distributed computing
  • Canopy clustering
  • Parallelizing k-means by leveraging canopy clustering
  • MapReduce computational model
  • Using MapReduce to write a distributed version of k-means
  • MapReduce canopy clustering
  • MR-DBSCAN

In the previous chapter we introduced clustering and described three different approaches to data partitioning: k-means, DBSCAN, and OPTICS.

All these algorithms use a single-threaded approach, where every operation is executed sequentially in the same thread[1]. This is the point where we should question our design: is it really necessary to run these algorithms sequentially?

Over the course of this chapter, we will answer this question and present alternatives, design patterns, and examples that will give you the tools to spot opportunities for parallelization and to apply industry best practices to achieve major speedups.

After going through this chapter, you will understand the difference between parallel and distributed computing, discover canopy clustering, learn about MapReduce, a computational model for distributed computing, and finally be able to rewrite the clustering algorithms from the previous chapter to operate in a distributed environment.
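To give a first, concrete taste of why parallelization is possible at all, consider the assignment step of k-means: each point is labeled with its nearest centroid independently of every other point, so the dataset can be split among several workers. The snippet below is a minimal sketch of this idea (not the implementation developed later in the chapter); the function names assign_point, sequential_assign, and parallel_assign, and the use of Python's multiprocessing module, are illustrative choices.

from math import dist          # Euclidean distance (Python 3.8+)
from multiprocessing import Pool

# Fixed centroids, just for illustration; in real k-means these are updated
# at every iteration.
CENTROIDS = [(0.0, 0.0), (10.0, 10.0)]

def assign_point(point):
    """Return the index of the centroid closest to this point."""
    return min(range(len(CENTROIDS)), key=lambda i: dist(point, CENTROIDS[i]))

def sequential_assign(points):
    # Single thread: points are processed one after another.
    return [assign_point(p) for p in points]

def parallel_assign(points, workers=4):
    # Same computation, but the points are distributed among worker processes.
    with Pool(workers) as pool:
        return pool.map(assign_point, points)

if __name__ == "__main__":
    data = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.1), (11.0, 12.0)]
    # Both versions produce the same labels; only the execution strategy differs.
    assert sequential_assign(data) == parallel_assign(data)

The rest of the chapter generalizes this observation: first to canopy clustering, which cheaply pre-partitions the data, and then to MapReduce, which lets us run the same kind of independent work across many machines rather than many processes.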

13.1  Parallelization

13.1.1    Parallel vs Distributed

13.1.2    Parallelizing k-means

13.1.3    Canopy Clustering

13.1.4    Applying Canopy Clustering

13.2  MapReduce

13.2.1    Imagine You Are Donald Duck…

13.2.2    First Map, Then Reduce

13.2.3    There is More, Under the Hood

13.3  MapReduce k-means

13.3.1    Parallelizing Canopy Clustering

13.3.2    Centroid Initialization with Canopy Clustering

13.3.3    MapReduce Canopy Clustering

13.4  MapReduce DBSCAN

13.5  Summary