The insights generated by a data platform are only as good as the quality of the underlying data, so a good data platform needs to provide some guarantees around data quality. In this chapter, we'll focus on how to test for it.
At the time of writing, data quality testing isn’t yet offered “as a service” by all major cloud providers. Unlike some of the previous topics we’ve covered in this book—such as storage, data processing, or machine learning (ML)—we don’t have an out-of-the-box PaaS (platform as a service) solution, so we’ll have to stitch something together ourselves.
We'll start by looking at what it means to test data and at a few common types of data tests. Software engineering has a mature discipline of testing code, and we'll draw an analogy between testing code and testing data. Next, we'll look at what a data quality testing framework might look like and sketch a simple solution for our data platform. Finally, we'll see when we should run these data quality tests and how we can handle their execution.