5 Cleaning and transforming DataFrames

 

This chapter covers

  • Selecting and filtering data
  • Creating and dropping columns
  • Finding and fixing columns with missing values
  • Indexing and sorting DataFrames
  • Combining DataFrames using join and union operations
  • Writing DataFrames to delimited text files and Parquet

In the previous chapter, we created a schema for the NYC Parking Ticket dataset and successfully loaded the data into Dask. Now we're ready to clean up the data so we can begin analyzing and visualizing it! As a friendly reminder, figure 5.1 shows what we've done so far and where we're headed next in our data science workflow.
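As a quick refresher, here is a minimal sketch of where the previous chapter left off: defining an explicit schema as a dictionary of dtypes and passing it to dd.read_csv. The file path and the handful of columns shown here are placeholder assumptions, not the book's full schema.

import dask.dataframe as dd

# Placeholder schema: a few example columns and dtypes standing in
# for the full NYC Parking Ticket schema built in chapter 4
dtypes = {
    'Summons Number': 'object',
    'Plate ID': 'object',
    'Vehicle Year': 'float64',
    'Violation Code': 'float64',
}

# Read the raw CSV files into a Dask DataFrame using the explicit schema,
# so Dask doesn't have to infer column types from a sample of the data
nyc_data_raw = dd.read_csv(
    'nyc-parking-tickets/*.csv',   # assumed location of the raw data files
    dtype=dtypes,
    usecols=list(dtypes.keys())
)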

Figure 5.1 The Data Science with Python and Dask workflow


Data cleaning is an important part of any data science project because anomalies and outliers in the data can negatively influence many statistical analyses. These problems could lead us to draw incorrect conclusions about the data and build machine learning models that don't hold up over time. Therefore, it's important to clean up the data as much as possible before moving on to exploratory analysis.

5.1 Working with indexes and axes

5.1.1 Selecting columns from a DataFrame

5.1.2 Dropping columns from a DataFrame

5.1.3 Renaming columns in a DataFrame

5.1.4 Selecting rows from a DataFrame

5.2 Dealing with missing values

5.2.1 Counting missing values in a DataFrame

5.2.2 Dropping columns with missing values

5.2.3 Imputing missing values

5.2.4 Dropping rows with missing data

5.2.5 Imputing multiple columns with missing values

5.3 Recoding data

5.4 Elementwise operations

5.5 Filtering and reindexing DataFrames

5.6 Joining and concatenating DataFrames

5.6.1 Joining two DataFrames

5.6.2 Unioning two DataFrames

5.7 Writing data to text files and Parquet files

5.7.1 Writing to delimited text files

5.7.2 Writing to Parquet files

Summary
