2 Introducing Dask

This chapter covers

Warming up with a short example of data cleaning using Dask DataFrames
Visualizing DAGs generated by Dask workloads with graphviz
Exploring how the Dask task scheduler applies the concept of DAGs to coordinate execution of code

Now that you have a basic understanding of how DAGs work, let’s take a look at how Dask uses DAGs to create robust, scalable workloads. To do this, we’ll use the NYC Parking Ticket data you downloaded at the end of the previous chapter. This will help us accomplish two things at once: you’ll get your first taste of using Dask’s DataFrame API to analyze a structured dataset, and you’ll start to get familiar with some of the quirks in the dataset that we’ll address throughout the next few chapters. We’ll also take a look at a few useful diagnostic tools and use the low-level Delayed API to create a simple custom task graph.

2.1 Hello Dask: A first look at the DataFrame API

2.1.1 Examining the metadata of Dask objects

2.1.2 Running computations with the compute method

2.1.3 Making complex computations more efficient with persist

2.2 Visualizing DAGs

2.2.1 Visualizing a simple DAG using Dask Delayed objects

2.2.2 Visualizing more complex DAGs with loops and collections

2.2.3 Reducing DAG complexity with persist

2.3 Task scheduling

2.3.1 Lazy computations

2.3.2 Data locality

Summary