Chapter 2. Accelerating large dataset work: Map and parallel computing


This chapter covers

  • Using map to transform lots of data
  • Using parallel programming to transform lots of data
  • Scraping data from the web in parallel with map

In this chapter, we’ll look at map and how to use it for parallel programming, and we’ll apply those concepts to complete two web scraping exercises. With map, we’ll focus on three primary capabilities:

  1. We can use it to replace for loops.
  2. We can use it to transform data.
  3. We can rely on its laziness: map evaluates only when its results are needed, not when it's called.

These core ideas about map are also why it's so useful for parallel programming. In parallel programming, we use multiple processing units to each do part of a task, then combine that partial work later. Transforming lots of data from one type to another is an easy task to break into pieces, and the instructions for doing so are generally easy to transfer between processes. Making code parallel with map can be as easy as adding four lines of code to a program.
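To make this concrete, here's a minimal sketch of swapping Python's built-in map for a parallel map from the standard-library multiprocessing module. The squaring function is a hypothetical stand-in for any transformation; the parallel version differs from the sequential one by only a few lines.

```python
from multiprocessing import Pool

def square(n):
    # A stand-in transformation; any picklable function works here.
    return n * n

numbers = range(10)

# Sequential: map returns a lazy iterator -- nothing is computed
# until we consume it, e.g. by calling list() on it.
lazy_result = map(square, numbers)
print(list(lazy_result))  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# Parallel: create a Pool and call its .map, which splits the work
# across worker processes and returns the combined results in order.
if __name__ == "__main__":
    with Pool() as pool:
        parallel_result = pool.map(square, numbers)
    print(parallel_result)  # same values, computed across processes
```

Note that `Pool.map` returns an ordinary list rather than a lazy iterator, and the function it applies must be defined at module level so it can be pickled and sent to the workers.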

2.1. An introduction to map

In chapter 1, we talked a little bit about map, which is a function for transforming sequences of data. Specifically, we looked at the example of applying the mathematical function n+7 to a list of integers: [-1, 0, 1, 2]. And we looked at the graphic in figure 2.1, which shows a series of numbers being mapped to their outputs.
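That chapter 1 example translates directly into code. Here's a short sketch, where `add_seven` is an assumed name for the n+7 function:

```python
def add_seven(n):
    # The n + 7 function from the chapter 1 example.
    return n + 7

# map applies add_seven to each element of the list in turn.
print(list(map(add_seven, [-1, 0, 1, 2])))  # [6, 7, 8, 9]
```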

2.1.1. Retrieving URLs with map

2.1.2. The power of lazy functions (like map) for large datasets

2.2. Parallel processing

2.2.1. Processors and processing

2.2.2. Parallelization and pickling

2.2.3. Order and parallelization

2.2.4. State and parallelization

2.3. Putting it all together: Scraping a Wikipedia network

2.3.1. Visualizing our graph

2.3.2. Returning to map

2.4. Exercises

2.4.1. Problems of parallelization

2.4.2. Map function

2.4.3. Parallelization and speed

2.4.4. Pickling storage

2.4.5. Web scraping data

2.4.6. Heterogeneous map transformations

Summary
