13 Parallel clustering: MapReduce and canopy clustering

 

This chapter covers

  • Understanding parallel and distributed computing
  • Canopy clustering
  • Parallelizing k-means by leveraging canopy clustering
  • Using the MapReduce computational model
  • Using MapReduce to write a distributed version of k-means
  • Leveraging MapReduce canopy clustering
  • Working with MR-DBSCAN

In the previous chapter we introduced clustering and described three different approaches to data partitioning: k-means, DBSCAN, and OPTICS.

All these algorithms use a single-threaded approach, in which all operations are executed sequentially in the same thread.1 This is the point where we should question our design: is it really necessary to run these algorithms sequentially?

1. Multi-processor machines can, however, apply optimizations in which some operations are executed in parallel across different cores. This level of parallelization, though, is limited by the number of cores on a chip: currently on the order of a hundred at most, even for the most powerful servers.

In this chapter we will answer this question and present alternatives, design patterns, and examples that will give you the tools to spot opportunities for code parallelization and apply industry best practices to achieve major speedups.
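To get an intuition for where such opportunities hide, consider k-means' assignment step: each point's nearest centroid is computed independently of every other point, so the work splits naturally across workers. The following sketch (the function names are our own, for illustration; this is not code from the chapter) uses Python's standard-library process pool to parallelize that step:

```python
from concurrent.futures import ProcessPoolExecutor

def nearest_centroid(args):
    """Return the index of the centroid closest to a point
    (squared Euclidean distance; args is a (point, centroids) pair
    so it can be shipped to a worker process as one picklable unit)."""
    point, centroids = args
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i]))
    )

def assign_parallel(points, centroids, workers=4):
    """Assignment step of k-means: each point is independent, so the
    loop is 'embarrassingly parallel' and can be fanned out to a pool."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(nearest_centroid, ((p, centroids) for p in points)))
```

Note that this only parallelizes within one machine; the rest of the chapter deals with distributing the same idea across many machines via MapReduce.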

13.1 Parallelization

13.1.1 Parallel vs distributed

13.1.2 Parallelizing k-means

13.1.3 Canopy clustering

13.1.4 Applying canopy clustering

13.2 MapReduce

13.2.1 Imagine you are Donald Duck . . .

13.2.2 First map, then reduce

13.2.3 There is more under the hood

13.3 MapReduce k-means

13.3.1 Parallelizing canopy clustering

13.3.2 Centroid initialization with canopy clustering

13.3.3 MapReduce canopy clustering

13.4 MapReduce DBSCAN