Part 3. Big data patterns

 

Now that you’ve gotten to know Hadoop and know how to best organize, move, and store your data in Hadoop, you’re ready to explore part 3 of this book, which examines the techniques you need to know to streamline your big data computations.

In chapter 6 we’ll examine techniques for optimizing MapReduce operations, such as joining and sorting on large datasets. These techniques make jobs run faster and allow for more efficient use of computational resources.

Chapter 7 examines how graphs can be represented and utilized in Map-Reduce to solve algorithms such as friends-of-friends and PageRank. It also covers how data structures such as Bloom filters and HyperLogLog can be used when regular data structures can’t scale to the data sizes that you’re working with.

Chapter 8 looks at how to measure, collect, and profile your MapReduce jobs and identify areas in your code and hardware that could be causing jobs to run longer than they should. It also tames MapReduce code by presenting different approaches to unit testing. Finally, it looks at how you can debug any Map-Reduce job, and offers some anti-patterns you’d best avoid.