6 Clean and prepare
This chapter covers
- Understanding the types of errors that you might find in your data
- Identifying problems in your data
- Implementing strategies for fixing or working around bad data
- Preparing your data for effective use in production
When we’re working with data, it’s crucial that we can trust our data and work with it effectively. Almost every data-wrangling project is front-loaded with an effort to fix problems and prepare the data for use.
You may have heard that cleanup and preparation make up 80% of the work! I’m not sure about that, but preparation is certainly often a large proportion of the total work.
Time invested at this stage helps save us from discovering later that we’ve been working with unreliable or problematic data. If this happens to you, then much of your work, understanding, and decisions are likely to be based on faulty input. This isn’t a good situation: you must now backtrack and fix those mistakes. Backtracking is expensive, but we can mitigate this risk by paying attention early in the cleanup phase.
In this chapter, we’ll learn how to identify and fix bad data. Data can go wrong in so many different ways that we can’t hope to look at them all. Instead, we’ll look at general strategies for addressing bad data and apply these to specific examples.