Chapter 1. Introduction

 

This chapter covers

  • Introducing the map and reduce style of programming
  • Understanding the benefits of parallel programming
  • Extending parallel programming to a distributed environment
  • Running parallel programs in the cloud

This book teaches a set of programming techniques, tools, and frameworks for mastering large datasets. Throughout, I’ll refer to the style of programming you’re learning as the map and reduce style: a style in which we can easily write parallel programs (programs that can do multiple things at the same time) by organizing our code around two functions, map and reduce. To get a better sense of why we’ll want to use a map and reduce style, consider this scenario:

Scenario

Two young programmers have come up with an idea for how to rank pages on the internet. They want to rank pages based on the importance of the other sites on the internet that link to them. They think the internet should be just like high school: the more the cool kids talk about you, the more important you are. The two young programmers love the idea, but how can they possibly analyze the entire internet?

A reader well versed in Silicon Valley history will recognize this scenario as the Google origin story. In its early years, Google popularized a style of programming called MapReduce to process and rank the entire internet effectively. This style was a natural fit for Google because
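
To make the two functions named in the introduction concrete, here is a minimal sketch in Python using the built-in map and functools.reduce. The word list and the helper names word_length and add are hypothetical, chosen only for illustration; this is not Google's MapReduce system, just the general shape of the style.

```python
from functools import reduce


def word_length(word):
    """Map step: transform one item without looking at any other item."""
    return len(word)


def add(total, value):
    """Reduce step: fold one more value into the running result."""
    return total + value


# Hypothetical sample data, for illustration only.
words = ["map", "and", "reduce", "style"]

# map applies word_length to every item independently.
lengths = list(map(word_length, words))

# reduce combines the mapped values into a single result, starting from 0.
total_length = reduce(add, lengths, 0)

print(lengths)
print(total_length)
```

Because each call to word_length depends only on its own input, the map step is the part that could be handed to a parallel map (for example, multiprocessing.Pool.map) without changing the shape of the program.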

1.1. What you’ll learn in this book

1.2. Why large datasets?

Dask—A different type of distributed computing

1.3. What is parallel computing?

1.3.1. Understanding parallel computing

1.3.2. Scalable computing with the map and reduce style

1.3.3. When to program in a map and reduce style

1.4. The map and reduce style

1.4.1. The map function for transforming data

1.4.2. The reduce function for advanced transformations

1.4.3. Map and reduce for data transformation pipelines

1.5. Distributed computing for speed and scale

1.6. Hadoop: A distributed framework for map and reduce

1.7. Spark for high-powered map, reduce, and more

1.8. AWS Elastic MapReduce—Large datasets in the cloud

Summary
