13 Parallel Clustering
MapReduce and Canopy Clustering
This chapter covers
- Understanding parallel and distributed computing
- Canopy clustering
- Parallelizing k-means by leveraging canopy clustering
- MapReduce computational model
- Using MapReduce to write a distributed version of k-means
- MapReduce canopy clustering
- MR-DBSCAN
In the previous chapter we introduced clustering and described three different approaches to data partitioning: k-means, DBSCAN, and OPTICS.
All these algorithms use a single-threaded approach, in which all operations are executed sequentially in the same thread[1]. This is the point where we should question our design: is it really necessary to run these algorithms sequentially?
Over the course of this chapter, we will answer this question and present alternatives, design patterns, and examples that will give you the tools to spot opportunities for code parallelization and to use industry best practices to easily achieve major speedups.
After going through this chapter, readers will understand the difference between parallel and distributed computing, discover canopy clustering, learn about MapReduce, a computational model for distributed computing, and finally be able to rewrite the clustering algorithms we saw in the previous chapter to operate in a distributed environment.