13 Parallel Clustering
MapReduce and Canopy Clustering


This chapter covers

  • Understanding parallel and distributed computing
  • Canopy clustering
  • Parallelizing k-means by leveraging canopy clustering
  • MapReduce computational model
  • Using MapReduce to write a distributed version of k-means
  • MapReduce canopy clustering
  • MR-DBSCAN

In the previous chapter we introduced clustering and described three different approaches to data partitioning: k-means, DBSCAN, and OPTICS.

All these algorithms use a single-threaded approach, where every operation is executed sequentially in the same thread[1]. This is the point where we should question our design: is it really necessary to run these algorithms sequentially?

Over the course of this chapter, we will answer this question and present alternatives, design patterns, and examples that will give you the tools to spot opportunities for parallelization and to apply industry best practices to achieve major speedups.

After going through this chapter, you will understand the difference between parallel and distributed computing, discover canopy clustering, learn about MapReduce, a computational model for distributed computing, and finally be able to rewrite the clustering algorithms from the previous chapter to operate in a distributed environment.
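To give a first, concrete taste of why parallelization is possible at all, consider the assignment step of k-means: each point is labeled with its nearest centroid independently of every other point, so the dataset can be split among several workers. The snippet below is a minimal sketch of this idea (not the implementation developed later in the chapter); the function names assign_point, sequential_assign, and parallel_assign, and the use of Python's multiprocessing module, are illustrative choices.

from math import dist          # Euclidean distance (Python 3.8+)
from multiprocessing import Pool

# Fixed centroids, just for illustration; in real k-means these are updated
# at every iteration.
CENTROIDS = [(0.0, 0.0), (10.0, 10.0)]

def assign_point(point):
    """Return the index of the centroid closest to this point."""
    return min(range(len(CENTROIDS)), key=lambda i: dist(point, CENTROIDS[i]))

def sequential_assign(points):
    # Single thread: points are processed one after another.
    return [assign_point(p) for p in points]

def parallel_assign(points, workers=4):
    # Same computation, but the points are distributed among worker processes.
    with Pool(workers) as pool:
        return pool.map(assign_point, points)

if __name__ == "__main__":
    data = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.1), (11.0, 12.0)]
    # Both versions produce the same labels; only the execution strategy differs.
    assert sequential_assign(data) == parallel_assign(data)

The rest of the chapter generalizes this observation: first to canopy clustering, which cheaply pre-partitions the data, and then to MapReduce, which lets us run the same kind of independent work across many machines rather than many processes.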

13.1  Parallelization

13.1.1    Parallel vs Distributed

13.1.2    Parallelizing k-means

13.1.3    Canopy Clustering

13.1.4    Applying Canopy Clustering

13.2  MapReduce

13.2.1    Imagine You Are Donald Duck…

13.2.2    First Map, Then Reduce

13.2.3    There is More, Under the Hood

13.3  MapReduce k-means

13.3.1    Parallelizing Canopy Clustering

13.3.2    Centroid Initialization with Canopy Clustering

13.3.3    MapReduce Canopy Clustering

13.4  MapReduce DBSCAN

13.5  Summary