Processing large amounts of data sometimes requires more than a single computer, either because the data is too large to process on one machine or because the algorithms demand too much computing power. At this stage in the book, we know how to devise more efficient computational processes and how to store and structure our data more intelligently for processing. This final chapter is about scaling out: using more than one computer to perform computations.
To scale out, we will use Dask, a library for parallel computing aimed at analytics. Dask integrates very well with other libraries in the Python ecosystem, such as NumPy and pandas. Dask will serve our purpose of scaling out (i.e., using more than one computer), but it can also be used to scale up (i.e., to use the computational resources of a single computer more efficiently). In that sense, it can be an alternative to the material on parallelism presented in chapter 3.
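To make this concrete, here is a minimal sketch of the pandas-like interface that Dask exposes. The file pattern and column names (measurements-*.csv, station, temperature) are made up for illustration and are not part of this book's datasets:

```python
import dask.dataframe as dd

# Read CSV files that may not fit in memory; Dask splits them into
# pandas partitions and builds a lazy task graph instead of loading
# everything at once. The file pattern here is hypothetical.
df = dd.read_csv("measurements-*.csv")

# The familiar pandas API works on the partitioned collection.
mean_per_station = df.groupby("station")["temperature"].mean()

# Nothing has run yet; compute() triggers execution of the task
# graph, in parallel on one machine or across a cluster.
print(mean_per_station.compute())
```

The key point this sketch illustrates is that the code looks like ordinary pandas: Dask defers execution until compute() is called, so the same program can run on a laptop or be scaled out across many machines.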
There are alternatives to Dask, Spark being the most common. Spark, which comes from the Java world, is less well integrated with Python libraries than Dask is. I therefore prefer a Python-native solution, which simplifies interaction with the rest of the Python ecosystem. Many of the concepts exercised here carry over to other frameworks.