6 Clean and prepare


This chapter covers

  • Understanding the types of errors that you might find in your data
  • Identifying problems in your data
  • Implementing strategies for fixing or working around bad data
  • Preparing your data for effective use in production

When we’re working with data, it’s crucial that we can trust our data and work with it effectively. Almost every data-wrangling project is front-loaded with an effort to fix problems and prepare the data for use.

You may have heard that cleanup and preparation account for 80% of the work! I'm not sure the figure is that precise, but preparation is certainly often a large proportion of the total effort.

Time invested at this stage saves us from later discovering that we've been working with unreliable or problematic data. When that happens, much of your work, understanding, and decisions are likely to be based on faulty input. This isn't a good situation: you must now backtrack and fix those mistakes, which is an expensive process. We can mitigate this risk by paying attention early, in the cleanup phase.

In this chapter, we'll learn how to identify and fix bad data. Data can go wrong in so many different ways that we can't hope to cover them all. Instead, we'll look at general strategies for addressing bad data and apply them to specific examples.
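To preview the kind of work ahead, here is a minimal plain-JavaScript sketch of the identify-then-fix pattern. The records and the bad-value markers (a `null` and a `-9999` sentinel) are hypothetical examples, not data from this chapter:

```javascript
// Hypothetical sample records: reef temperature readings with some bad data.
const records = [
    { reef: "north", temperature: 25.6 },
    { reef: "north", temperature: null },  // missing value
    { reef: "south", temperature: -9999 }, // sentinel meaning "no reading"
    { reef: "south", temperature: 26.1 },
];

// Step 1: identify bad rows — here, missing or sentinel temperature values.
const isBad = row => row.temperature === null || row.temperature === -9999;
const badRows = records.filter(isBad);
console.log(badRows.length); // prints 2

// Step 2: respond to the bad data — one simple response is to filter it out.
const cleaned = records.filter(row => !isBad(row));
console.log(cleaned.length); // prints 2
```

Later sections work through more nuanced responses than simply dropping rows, such as rewriting bad values in place and filtering whole columns.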

6.1 Expanding our toolkit

6.2 Preparing the reef data

6.3 Getting the code and data

6.4 The need for data cleanup and preparation

6.5 Where does broken data come from?

6.6 How does data cleanup fit into the pipeline?

6.7 Identifying bad data

6.8 Kinds of problems

6.9 Responses to bad data

6.10 Techniques for fixing bad data

6.11 Cleaning our data set

6.11.1 Rewriting bad rows

6.11.2 Filtering rows of data

6.11.3 Filtering columns of data

6.12 Preparing our data for effective use

6.12.1 Aggregating rows of data

6.12.2 Combining data from different files using globby

6.12.3 Splitting data into separate files

6.13 Building a data processing pipeline with Data-Forge

Summary