This chapter covers:
- The KISS principle
- Multiple quality controls for a single attribute
- Reusing software development practices
Collecting data is expensive, in terms of both money and time. You may have to pay the salaries of data collectors, or pay the owners of the data for access and use of it. You may have to spend a lot of time cleaning up the data, or driving to a library to copy it from an old book. Many hours can go into just getting the data before you can even start analyzing it. If you’ve worked with data before, chances are you’ve spent a lot of effort and money to get your hands on the data you need.
Like all good data wranglers, you always keep source files around, just in case. That means that over time you’ve accumulated files of varying quality, files you have to manipulate in different ways to extract the information you’re after. If you’ve been in this position, ask yourself this question: Has your ETL process (the process where you extract, transform, and load data into your database) evolved alongside the improvements in the source data? Can you still load the old data without having to wade through your version-control repository? If you can’t process every version of your source data with the same process, you’ve just added more cost to data collection whenever you need to reload all of the data.
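One way to keep a single ETL process working across every version of your source files is to detect the version on load and dispatch to a version-specific transform that emits one canonical record shape. The sketch below illustrates the idea with hypothetical weather files whose newer version switched a temperature column from Celsius to Fahrenheit; the file layouts, field names, and `detect_version` heuristic are all invented for illustration.

```python
import csv
import io

def detect_version(header):
    """Guess the source-file version from its column names."""
    if "temp_f" in header:
        return 2  # newer files report Fahrenheit
    return 1      # original files reported Celsius

def transform_v1(row):
    return {"station": row["station"], "temp_c": float(row["temp_c"])}

def transform_v2(row):
    # Normalize Fahrenheit back to the canonical Celsius field.
    return {"station": row["station"],
            "temp_c": (float(row["temp_f"]) - 32) * 5 / 9}

TRANSFORMS = {1: transform_v1, 2: transform_v2}

def load(text):
    """Extract and transform any supported version of the source file."""
    reader = csv.DictReader(io.StringIO(text))
    version = detect_version(reader.fieldnames)
    return [TRANSFORMS[version](row) for row in reader]

# Old and new file formats both load through the same entry point:
old_file = "station,temp_c\nKSEA,12.5\n"
new_file = "station,temp_f\nKSEA,54.5\n"
print(load(old_file))  # [{'station': 'KSEA', 'temp_c': 12.5}]
print(load(new_file))  # [{'station': 'KSEA', 'temp_c': 12.5}]
```

The point isn’t this particular dispatch table; it’s that the version-handling logic lives in the ETL process itself, so reloading a five-year-old file never requires digging an old script out of version control.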