10 Analyzing big data with Dask

 

This chapter covers

  • Scaling computation over extremely large datasets across many machines
  • Introducing Dask’s execution model
  • Executing code using the dask.distributed scheduler

Processing large amounts of data sometimes requires more than a single computer: either the dataset is too large to fit and process on one machine, or the algorithms demand more computing power than one machine can provide. At this stage in the book, we know how to devise more efficient computational processes and how to store and structure our data more intelligently for processing. This final chapter is about how to scale out, that is, how to use more than one computer to perform computations.

To scale out, we will use Dask, a library for parallel computing in analytics. Dask integrates very well with other libraries in the Python ecosystem, such as NumPy and pandas. Dask will serve our purpose of scaling out (using more than one computer), but it can also be used to scale up (making more efficient use of the computational resources of a single computer). In that sense, it can be an alternative to the material on parallelism presented in chapter 3.
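As a quick preview of that integration, here is a minimal sketch of Dask's pandas-like dask.dataframe interface. This is not code from the chapter: the file path and the column names key and value are hypothetical, chosen only for illustration. The point is that operations look like pandas but are lazy, building a task graph that runs only when compute is called.

import dask.dataframe as dd

# Hypothetical input: a set of CSV files, which Dask reads as
# partitions of a single logical data frame
ddf = dd.read_csv("data/*.csv")

# The API mirrors pandas, but this line is lazy: it only builds
# a task graph; nothing has been computed yet
mean_per_key = ddf.groupby("key")["value"].mean()

# compute() executes the graph, by default on a local scheduler;
# the same code can later run on the dask.distributed scheduler
print(mean_per_key.compute())

This lazy execution model is the key difference from pandas, and it is where we start in section 10.1.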

There are alternatives to Dask, Spark being the most common. Spark, coming from the Java world, is less well integrated with other Python libraries than Dask is, so I prefer a Python-native solution, which simplifies interaction with the rest of the Python ecosystem. That said, many of the concepts exercised here carry over to other frameworks.

10.1 Understanding Dask’s execution model

10.1.1 A pandas baseline for comparison

10.1.2 Developing a Dask-based data frame solution

10.2 The computational cost of Dask operations

10.2.1 Partitioning data for processing

10.2.2 Persisting intermediate computations

10.2.3 Algorithm implementations over distributed data frames

10.2.4 Repartitioning the data

10.2.5 Persisting distributed data frames

10.3 Using Dask’s distributed scheduler

10.3.1 The dask.distributed architecture

10.3.2 Running code using dask.distributed

10.3.3 Dealing with datasets larger than memory

Summary