6 Summarizing and analyzing DataFrames

This chapter covers

Producing descriptive statistics for a Dask Series
Aggregating/grouping data using Dask’s built-in aggregate functions
Creating your own custom aggregation functions
Analyzing time series data with rolling window functions

At the end of the previous chapter, we arrived at a dataset that’s ready for us to dig into and analyze. However, we didn’t perform an exhaustive search for every possible issue with the data. In practice, data cleaning and preparation can take far longer. It’s a common adage among data scientists that data cleaning can take 80% or more of the total time spent on a project. With the skills you learned in the previous chapter, you have a good foundation for addressing the most common data-quality issues you’ll come across in the wild. As a friendly reminder, figure 6.1 shows how we’re progressing through our workflow: we’re almost at the halfway point!

Figure 6.1 The Data Science with Python and Dask workflow

6.1 Descriptive statistics

6.1.1 What are descriptive statistics?

6.1.2 Calculating descriptive statistics with Dask

6.1.3 Using the describe method for descriptive statistics

6.2 Built-in aggregate functions

6.2.1 What is correlation?

6.2.2 Calculating correlations with Dask DataFrames

6.3 Custom aggregate functions

6.3.1 Testing categorical variables with the t-test

6.3.2 Using custom aggregates to implement the Brown-Forsythe test

6.4 Rolling (window) functions

6.4.1 Preparing data for a rolling function

6.4.2 Using the rolling method to apply a window function

Summary