chapter nine

9 Data Quality

 

In this chapter:

  • Testing data to ensure quality.
  • Different types of data quality checks.
  • Executing data tests.
  • Considerations for scaling out data testing.

The insights generated by a data platform are only as good as the quality of the underlying data, so a good data platform needs to provide some guarantees around data quality. This chapter focuses on how to test data to provide those guarantees.

At the time of writing, data quality testing isn’t yet offered “as a service” by all major cloud providers. Unlike some of the topics covered earlier in this book, such as storage, data processing, or machine learning, there is no out-of-the-box PaaS solution, so we’ll have to stitch something together ourselves.

We’ll start by looking at what it means to test data and at a few common types of data tests. Software engineering has a mature discipline of testing code, and we’ll draw an analogy between testing code in software engineering and testing data in data engineering.

Next, we’ll look at what a data quality testing framework might look like and sketch a simple solution for our data platform. We’ll see when these data quality tests should run and how we handle execution.

Finally, we will talk about some scaling considerations and how we would go about data quality testing in a real-world production system. This is a deep topic, so we won’t be able to implement everything in this chapter, but we’ll cover the necessary patterns and best practices.

Let’s start with the fundamentals of testing data.

9.1      Testing data

9.1.1   Availability tests

9.1.2   Correctness tests

9.1.3   Completeness tests

9.1.4   Detecting anomalies

9.1.5   Testing data recap

9.2      Running data quality checks

9.2.1   Testing using Azure Data Factory

9.2.2   Executing tests

9.2.3   Creating and using a template

9.2.4   Running data quality checks recap

9.3      Scaling out data testing

9.3.1   Supporting multiple data fabrics

9.3.2   Testing at rest and during movement

9.3.3   Authoring tests

9.3.4   Storing tests and results

9.3.5   Scaling out data testing recap

9.4      Summary