11 Scaling and deploying Dask
This chapter covers
- Creating a Dask Distributed cluster on Amazon AWS using Docker and Elastic Container Service
- Using a Jupyter Notebook server and Elastic File System to store and access data science notebooks and shared datasets in Amazon AWS
- Using the Distributed client object to submit jobs to a Dask cluster
- Monitoring execution of jobs on the cluster using the Distributed monitoring dashboard
Up to this point, we’ve been working with Dask in local mode. This means that everything we’ve asked Dask to do has all been executed on a single computer. Running Dask in local mode is very useful for prototyping, development, and ad-hoc exploration, but we can still quickly reach the performance limits of a single computer. Just as our hypothetical chef in chapter 1 needed to call in reinforcements to get her kitchen prepped in time for dinner service, we too can configure Dask to spread the work out across many computers to process large jobs more quickly. This becomes especially important in production systems when time constraints apply. Therefore, it’s typical to scale out and run Dask in cluster mode in production.
Figure 11.1 This chapter will cover the last elements of the workflow: deployment and monitoring.