chapter five

5 Cleaning data

In the late 1980s, my employer wanted to know how much rain had fallen in various places. Their solution? They gave me a list of cities and phone numbers and asked me to call each in sequence, recording the previous day’s rainfall in an Excel spreadsheet. Nowadays, getting that sort of information—and many other types—is pretty easy. Many governments provide data sets for free, and numerous companies make data available for a price. No matter what topic you’re researching, data is almost certainly available. The only questions are where you can get it, how much it costs, and what format it comes in.

You should ask another question, too: how accurate is the data you’re using? It’s easy to assume that a CSV file from an official-looking website contains good data. But all too often, it will have problems. That shouldn’t surprise us, given that the data comes from people (who can make mistakes) and machines (which make different types of mistakes). Maybe someone accidentally misnamed a file or entered data into the wrong field. Maybe the automatic sensors that collected the data were broken or offline. Maybe the servers were down for a day, or someone misconfigured the XML feed-reading system, or the routers were being rebooted, or a backhoe cut the internet line.

All this assumes there was data to begin with. Often, we’ll have missing data because there wasn’t any data to record.
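To make the idea of missing data concrete, here is a minimal sketch that counts the empty fields in each column of a small CSV file, using only Python's standard library. The sample data, the column names, and the `count_missing` helper are all invented for illustration; treat this as one simple way to take a first look at a file, not as the method used in the exercises that follow.

```python
import csv
import io

# Hypothetical sample data: the cities and rainfall values are invented
# for illustration. Empty fields stand in for readings nobody recorded.
SAMPLE_CSV = """city,date,rainfall_mm
Springfield,2024-01-01,3.2
Springfield,2024-01-02,
Shelbyville,2024-01-01,0.0
Shelbyville,2024-01-02,
Ogdenville,2024-01-01,1.5
"""


def count_missing(csv_text):
    """Return a dict mapping each column name to its number of empty fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row.get(name)
            # A field counts as missing if the row was too short (None)
            # or if the field contains only whitespace.
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE_CSV)
print(missing)
```

Notice that a rainfall of `0.0` is a real measurement and isn't counted as missing; only the two empty fields are. Telling "nothing happened" apart from "nothing was recorded" is a distinction worth keeping in mind whenever you clean data.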

Exercise 25 • Parking cleanup

Working it out

Solution

Beyond the exercise

Exercise 26 • Celebrity deaths

Working it out

Solution

Beyond the exercise

Exercise 27 • Titanic interpolation

Working it out

Solution

Beyond the exercise

Exercise 28 • Inconsistent data

Working it out

Solution

Beyond the exercise

Summary