11 Scaling and deploying Dask

 

This chapter covers

  • Creating a Dask Distributed cluster on Amazon AWS using Docker and Elastic Container Service
  • Using a Jupyter Notebook server and Elastic File System to store and access data science notebooks and shared datasets in Amazon AWS
  • Using the Distributed client object to submit jobs to a Dask cluster
  • Monitoring execution of jobs on the cluster using the Distributed monitoring dashboard

Up to this point, we’ve been working with Dask in local mode. This means that everything we’ve asked Dask to do has been executed on a single computer. Running Dask in local mode is very useful for prototyping, development, and ad hoc exploration, but we can quickly reach the performance limits of a single computer. Just as our hypothetical chef in chapter 1 needed to call in reinforcements to get her kitchen prepped in time for dinner service, we too can configure Dask to spread the work out across many computers to process large jobs more quickly. This becomes especially important in production systems where time constraints apply. Therefore, it’s typical to scale out and run Dask in cluster mode in production.
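
Before we dive into the AWS setup, here’s a minimal sketch of what that shift looks like from the code’s perspective. The scheduler address shown is a placeholder, not a real endpoint; later in this chapter we’ll stand up an actual scheduler on Amazon ECS and connect to it in the same way.

from dask.distributed import Client
import dask.array as da

# Local mode: with no arguments, Client starts a local cluster of workers
# on this machine (effectively what we've been doing so far).
client = Client()

# Cluster mode: point the Client at a running Dask scheduler instead.
# The address below is a placeholder; later in this chapter the scheduler
# will be a Docker container running on Amazon ECS.
# client = Client('tcp://scheduler-address:8786')

# Either way, the same Dask code is submitted to whatever cluster
# the Client is connected to.
measurements = da.random.random((10000, 10000), chunks=(1000, 1000))
print(measurements.mean().compute())

# Link to the Distributed monitoring dashboard covered later in the chapter
print(client.dashboard_link)

Notice that none of the computation code changes between local mode and cluster mode; only the Client construction does. That’s what makes the move from prototyping on a laptop to running on a cluster relatively painless.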

Figure 11.1 This chapter will cover the last elements of the workflow: deployment and monitoring.


11.1 Building a Dask cluster on Amazon AWS with Docker

11.1.1 Getting started

11.1.2 Creating a security key

11.1.3 Creating the ECS cluster

11.1.4 Configuring the cluster’s networking

11.1.5 Creating a shared data drive in Elastic File System

11.1.6 Allocating space for Docker images in Elastic Container Repository

11.1.7 Building and deploying images for scheduler, worker, and notebook

11.1.8 Connecting to the cluster

11.2 Running and monitoring Dask jobs on a cluster

11.3 Cleaning up the Dask cluster on AWS

Summary
