4 The basics of processing big data: data parallelism, part 1

This chapter covers

  • The importance of data parallelism in a world of big data
  • Applying the Fork/Join pattern
  • Writing declarative parallel programs
  • Understanding the limitations of parallel for loops
  • Increasing performance with data parallelism

Imagine you’re cooking spaghetti for a dinner for four, and let’s say it takes 10 minutes to prepare and serve the pasta. You begin the preparation by filling a medium-sized pot with water to boil. Then two more friends show up at your house for dinner. Clearly, you need to make more pasta. You can switch to a bigger pot with more water and more spaghetti, which will take longer to cook, or you can use a second pot in tandem with the first, so that both pots of pasta finish cooking at the same time. Data parallelism works in much the same way: massive amounts of data can be processed quickly if “cooked” in parallel.

4.1 What is data parallelism?

4.1.1 Data and task parallelism

4.1.2 The “embarrassingly parallel” concept

4.1.3 Data parallelism support in .NET

4.2 The Fork/Join pattern: parallel Mandelbrot

4.2.1 When the GC is the bottleneck: structs vs. class objects

4.2.2 The downside of parallel loops

4.3 Measuring performance speed

4.3.1 Amdahl’s Law defines the limit of performance improvement

4.3.2 Gustafson’s Law: a step further to measure performance improvement

4.3.3 The limitations of parallel loops: the sum of prime numbers

4.3.4 What can possibly go wrong with a simple loop?

4.3.5 The declarative parallel programming model

Summary