2 Introducing Dask
This chapter covers
- Warming up with a short example of data cleaning using Dask DataFrames
- Visualizing DAGs generated by Dask workloads with graphviz
- Exploring how the Dask task scheduler applies the concept of DAGs to coordinate execution of code
Now that you have a basic understanding of how DAGs work, let’s look at how Dask uses DAGs to create robust, scalable workloads. To do this, we’ll use the NYC Parking Ticket data you downloaded at the end of the previous chapter. Working with this dataset accomplishes two things at once: you’ll get your first taste of using Dask’s DataFrame API to analyze a structured dataset, and you’ll start to get familiar with some of the quirks in the dataset that we’ll address throughout the next few chapters. We’ll also explore a few useful diagnostic tools and use the low-level Delayed API to create a simple custom task graph.