chapter ten

10 Design a database batch auditing service

This chapter covers

Auditing database tables to find invalid data
Designing a scalable and accurate solution to audit database tables
Exploring possible features to answer an unusual question

Let’s design a shared service for manually defined validations. This is an unusually open-ended system design interview question, even by the usual standards of system design interviews, and the approach discussed in this chapter is just one of many possibilities.

We begin this chapter by introducing the concept of data quality. There are many definitions of data quality. In general, data quality can refer to how suitable a dataset is to serve its purpose and may also refer to activities that improve the dataset’s suitability for said purpose. There are many dimensions of data quality. We can adopt the dimensions from https://www.heavy.ai/technical-glossary/data-quality:

Accuracy—How close a measurement is to the true value.
Completeness—Data has all the required values for our purpose.
Consistency—Data in different locations has the same values, and the different locations start serving the same data changes at the same time.
Validity—Data is correctly formatted, and values are within an appropriate range.
Uniqueness—No duplicate or overlapping data.
Timeliness—Data is available when it is required.

10.1 Why is auditing necessary?

10.2 Defining a validation with a conditional statement on a SQL query’s result

10.3 A simple SQL batch auditing service

10.3.1 An audit script

10 Design a database batch auditing service

This chapter covers

10.1 Why is auditing necessary?

10.2 Defining a validation with a conditional statement on a SQL query’s result

10.3 A simple SQL batch auditing service

10.3.1 An audit script

10.3.2 An audit service

10.4 Requirements

10.5 High-level architecture

10.5.1 Running a batch auditing job

10.5.2 Handling alerts

10.6 Constraints on database queries

10.6.1 Limit query execution time

10.6.2 Check the query strings before submission

10.6.3 Users should be trained early

10.7 Prevent too many simultaneous queries