Part 3 Extending and deploying Dask

In part 3, we round out our exploration of Dask by covering some advanced topics: unstructured data, machine learning, and deploying Dask to the cloud. These are good topics to end on, because you should be fairly comfortable with the Dask paradigm by now. Once again, all the chapters are anchored on real-world datasets and common tasks you may encounter in any data science project.

Chapter 9 discusses how to use Dask Bags (a parallelized implementation of standard Python lists) and Dask Arrays (a parallelized implementation of NumPy arrays) to work with more complicated, unstructured datasets. We’ll cover some advanced collections topics, such as mapping, folding, and reducing, by parsing text data stored in JSON format.

Chapter 10 demonstrates how to use the Dask ML API to build parallelized scikit-learn models. This is especially useful for building models from huge datasets, where training time on a single machine may be prohibitive and scaling the work out across many machines can speed up training.

Last but not least, chapter 11 covers two things: how to run Dask in the cloud using Docker and AWS, and how to run Dask in cluster mode. The chapter walks through a step-by-step configuration of an AWS environment, and then demonstrates how easy it is to run code introduced in previous chapters on the cluster and monitor it there.
