This chapter covers:
- The KISS principle
- Multiple quality controls for a single attribute
- Reusing software development practices
Collecting data is expensive, in terms of both money and time. You may have to pay the salaries of data collectors, or pay the owners of the data for access and use of it. You may have to spend a lot of time cleaning up the data, or driving to a library to copy it from an old book. Many hours can go into just getting the data before you can even start analyzing it. If you’ve worked with data before, chances are you’ve spent a lot of effort and money to get your hands on the data you need.
Like all good data wranglers, you always keep source files around, just in case. That means that over time you’ve accumulated files of varying quality, files you have to manipulate in different ways to extract the information you’re after. If you’ve been in this position, ask yourself this question: Has your ETL process (the process where you extract, transform, and load data into your database) evolved alongside the improvements in the source data? Can you still load the old data without having to wade through your version-control repository? If you can’t process every version of your source data with the same process, you’ve just added more cost to data collection whenever you need to reload all of the data.
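One way to keep a single ETL process working across every version of your source files is to detect the version on load and dispatch to a version-specific transform that emits one canonical record shape. The sketch below illustrates the idea with hypothetical weather files whose newer version switched a temperature column from Celsius to Fahrenheit; the file layouts, field names, and `detect_version` heuristic are all invented for illustration.

```python
import csv
import io

def detect_version(header):
    """Guess the source-file version from its column names."""
    if "temp_f" in header:
        return 2  # newer files report Fahrenheit
    return 1      # original files reported Celsius

def transform_v1(row):
    return {"station": row["station"], "temp_c": float(row["temp_c"])}

def transform_v2(row):
    # Normalize Fahrenheit back to the canonical Celsius field.
    return {"station": row["station"],
            "temp_c": (float(row["temp_f"]) - 32) * 5 / 9}

TRANSFORMS = {1: transform_v1, 2: transform_v2}

def load(text):
    """Extract and transform any supported version of the source file."""
    reader = csv.DictReader(io.StringIO(text))
    version = detect_version(reader.fieldnames)
    return [TRANSFORMS[version](row) for row in reader]

# Old and new file formats both load through the same entry point:
old_file = "station,temp_c\nKSEA,12.5\n"
new_file = "station,temp_f\nKSEA,54.5\n"
print(load(old_file))  # [{'station': 'KSEA', 'temp_c': 12.5}]
print(load(new_file))  # [{'station': 'KSEA', 'temp_c': 12.5}]
```

The point isn’t this particular dispatch table; it’s that the version-handling logic lives in the ETL process itself, so reloading a five-year-old file never requires digging an old script out of version control.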