9 Data quality

This chapter covers

  • Testing data to ensure quality
  • Different types of data quality checks
  • Executing data tests
  • Considerations for scaling out data testing

The insights generated by a data platform are only as good as the quality of the underlying data, so a good data platform needs to provide some guarantees around data quality. This chapter focuses on how to test data to back up those guarantees.

At the time of writing, not all major cloud providers offer data quality testing “as a service.” Unlike some of the previous topics we’ve covered in this book, such as storage, data processing, or machine learning (ML), we don’t have an out-of-the-box PaaS (platform as a service) solution, so we’ll have to stitch something together ourselves.

We’ll start by looking at what it means to test data and at a few common types of data tests. Software engineering has a mature discipline of testing code, and we’ll draw an analogy between it and testing data in data engineering. Next, we’ll look at what a data quality testing framework might look like and sketch a simple solution for our data platform. Finally, we’ll see when we should run these data quality tests and how we can handle their execution.

9.1 Testing data

9.1.1 Availability tests

9.1.2 Correctness tests

9.1.3 Completeness tests

9.1.4 Detecting anomalies

9.1.5 Testing data recap

9.2 Running data quality checks

9.2.1 Testing using Azure Data Factory

9.2.2 Executing tests

9.2.3 Creating and using a template

9.2.4 Running data quality checks recap

9.3 Scaling out data testing

9.3.1 Supporting multiple data fabrics

9.3.2 Testing at rest and during movement