
This is an excerpt from Manning's book Mastering Large Datasets with Python: Parallelize and Distribute Your Python Code.

This function does exactly what the .foo method does, but it relies on the value of an external variable n instead of the internal state self.n. We can then apply it to the numbers generated by range with a parallel map and get our results back, just as we expect.
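As a minimal sketch of that refactoring, the following assumes a hypothetical .foo method that added self.n to its argument; the names foo and n are illustrative stand-ins, not the book's code.

from multiprocessing import Pool

n = 1                  # module-level stand-in for the old self.n

def foo(x):
    # Behaves like the hypothetical .foo method, but reads the
    # external variable n instead of instance state
    return x + n

with Pool() as P:
    results = P.map(foo, range(5))

print(results)
# [1, 2, 3, 4, 5]

Because foo is a plain module-level function, it can be pickled and shipped to worker processes, which is what lets Pool.map apply it in parallel.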

Parallel map, which I introduced in section 2.2, is a great technique for quickly transforming a large amount of data. However, we glossed over some nuances when we were learning the basics, and we'll dig into those nuances in this chapter. Parallel reduce is parallelization that occurs at the reduce step of our map and reduce pattern: we've already called map, and now we're ready to accumulate the results of all those transformations. With parallel reduce, we use parallelization in the accumulation process instead of the transformation process.
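To make the idea concrete, here is a minimal sketch of a parallel reduce, using summation as the accumulation and a hypothetical chunked helper; it illustrates the pattern, not the implementation developed later in the chapter.

from functools import reduce
from multiprocessing import Pool
from operator import add

def chunked(xs, n_chunks):
    # Split a list into n_chunks roughly equal pieces
    size = (len(xs) + n_chunks - 1) // n_chunks
    return [xs[i:i + size] for i in range(0, len(xs), size)]

def reduce_chunk(chunk):
    # Accumulate one chunk sequentially
    return reduce(add, chunk)

data = list(range(1000))
with Pool(4) as P:
    partials = P.map(reduce_chunk, chunked(data, 4))  # parallel accumulation
total = reduce(add, partials)                         # combine the partial results
print(total)
# 499500

The key design choice is that each worker reduces its own chunk independently, and only the small list of partial results has to be combined sequentially at the end.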

Figure 6.1 shows how lazy map returns a lazy map object without performing any iteration, whereas parallel map iterates through the entire sequence. We'll look at solving this problem using parallel reduce in section 6.2.

Figure 6.1. Lazy map can be faster than parallel map when we follow up our map statement by iterating over the results.

We discussed the benefits of laziness in chapter 4, and when working in parallel there's no reason we have to give them up. If we want to be lazy and parallel, we can use the .imap and .imap_unordered methods of Pool(). Both methods return iterators instead of lists, as shown in the following listing. Other than that, .imap behaves just like parallel map, preserving the input order of its results, while .imap_unordered yields results in whatever order the workers complete them.

Listing 6.3. Variations of parallel map
from multiprocessing import Pool

def increase(x):
    return x + 1

with Pool() as P:
    a = P.map(increase, range(100))             # eager: returns a fully evaluated list

with Pool() as P:
    b = P.imap(increase, range(100))            # lazy: ordered iterator

with Pool() as P:
    c = P.imap_unordered(increase, range(100))  # lazy: yields results as workers finish

# Note: lazy results must be consumed while the pool is still open;
# here we only print the iterator objects themselves.

print(a)
# [1, 2, 3, ... 100]
print(b)
# <multiprocessing.pool.IMapIterator object at 0x7f53207b3be0>
print(c)
# <multiprocessing.pool.IMapUnorderedIterator object at 0x7fbe36ed2828>

Now that we know more about chunk size and differences in the behavior of parallel map and lazy map, let’s look at some code. We’ll start by seeing how lazy and parallel map behave over different-sized sequences, and how, for simple operations on small data, there’s really no benefit to parallelization. Then we’ll test out parallel map with a few different chunk sizes and see how that impacts our performance.
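As a rough sketch of that experiment, the following compares lazy map against parallel map over a large sequence and tries a few chunk sizes; the sequence length, the chunk sizes, and the increase function are illustrative choices, not the book's benchmark.

import time
from multiprocessing import Pool

def increase(x):
    return x + 1

def time_it(label, fn):
    # Run fn once and report the elapsed wall-clock time
    start = time.time()
    fn()
    print("{}: {:.3f}s".format(label, time.time() - start))

N = 1000000

# Consume the lazy map with list() so we time the actual work
time_it("lazy map", lambda: list(map(increase, range(N))))

# Parallel map with a few different chunk sizes
for chunksize in (1, 100, 10000):
    with Pool() as P:
        time_it("parallel map, chunksize={}".format(chunksize),
                lambda: P.map(increase, range(N), chunksize=chunksize))

For a function this cheap, we'd expect the lazy version to win outright, and the parallel timings to improve sharply as the chunk size grows, because larger chunks spread the cost of inter-process communication over more work.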
