chapter three

3 Introducing Dask DataFrames

This chapter covers

Defining structured data and determining when to use Dask DataFrames
Exploring how Dask DataFrames are organized
Inspecting DataFrames to see how they are partitioned
Dealing with some limitations of DataFrames

In the previous chapter, we started exploring how Dask uses DAGs to coordinate and manage complex tasks across many machines. However, we only looked at some simple examples using the Delayed API to help illustrate how Dask code relates to elements of a DAG. In this chapter, we’ll begin to take a closer look at the DataFrame API. We’ll also start working through the NYC Parking Ticket data following a fairly typical data science workflow. This workflow and the chapters that correspond to each of its steps are shown in figure 3.1.

Figure 3.1 The Data Science with Python and Dask workflow

3.1 Why use DataFrames?

3.2 Dask and Pandas

3.2.1 Managing DataFrame partitioning

3.2.2 What is the shuffle?

3.3 Limitations of Dask DataFrames

Summary