10 Design a database batch auditing service

 

This chapter covers

  • Auditing database tables to find invalid data
  • Designing a scalable and accurate solution to audit database tables
  • Exploring possible features to answer an unusual question

Let’s design a shared service for manually defined validations. This is an unusually open-ended system design interview question, even by the usual standards of system design interviews, and the approach discussed in this chapter is just one of many possibilities.

We begin this chapter by introducing the concept of data quality. There are many definitions of data quality. In general, data quality can refer to how suitable a dataset is to serve its purpose and may also refer to activities that improve the dataset’s suitability for said purpose. There are many dimensions of data quality. We can adopt the dimensions from https://www.heavy.ai/technical-glossary/data-quality:

  • Accuracy—How close a measurement is to the true value.
  • Completeness—Data has all the required values for our purpose.
  • Consistency—Data in different locations has the same values, and the different locations start serving the same data changes at the same time.
  • Validity—Data is correctly formatted, and values are within an appropriate range.
  • Uniqueness—No duplicate or overlapping data.
  • Timeliness—Data is available when it is required.

10.1 Why is auditing necessary?

10.2 Defining a validation with a conditional statement on a SQL query’s result

10.3 A simple SQL batch auditing service

10.3.1 An audit script

10.3.2 An audit service

10.4 Requirements

10.5 High-level architecture

10.5.1 Running a batch auditing job

10.5.2 Handling alerts

10.6 Constraints on database queries

10.6.1 Limit query execution time

10.6.2 Check the query strings before submission

10.6.3 Users should be trained early

10.7 Prevent too many simultaneous queries