10 Design a database batch auditing service
This chapter covers
- Auditing database tables to find invalid data
- Designing a scalable and accurate solution to audit database tables
- Exploring possible features to answer an unusual question
Let’s design a shared service that runs manually defined validations on database tables. This question is unusually open-ended, even by the standards of system design interviews, and the approach discussed in this chapter is just one of many possibilities.
We begin this chapter by introducing the concept of data quality, which has many definitions. In general, data quality can refer to how suitable a dataset is for its purpose, and it may also refer to the activities that improve that suitability. Data quality also has many dimensions. We can adopt the dimensions from https://www.heavy.ai/technical-glossary/data-quality:
- Accuracy—How close a measurement is to the true value.
- Completeness—Data has all the required values for our purpose.
- Consistency—Data in different locations has the same values, and the different locations start serving the same data changes at the same time.
- Validity—Data is correctly formatted, and values are within an appropriate range.
- Uniqueness—No duplicate or overlapping data.
- Timeliness—Data is available when it is required.
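To make some of these dimensions concrete, here is a minimal sketch of what manually defined validations might look like for completeness, validity, and uniqueness. The sample rows, column names, and function names are all hypothetical and chosen only for illustration. A real audit would run against database tables rather than an in-memory list.

```python
from collections import Counter

# Hypothetical sample rows; the column names are assumptions for illustration.
rows = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "c@example.com", "age": -5},
]

def check_completeness(rows, required):
    """Return indices of rows where any required column is missing or None."""
    return [
        i for i, row in enumerate(rows)
        if any(row.get(col) is None for col in required)
    ]

def check_validity(rows, column, low, high):
    """Return indices of rows whose non-null value lies outside [low, high]."""
    return [
        i for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]

def check_uniqueness(rows, key):
    """Return the key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

# Run each validation and collect the findings into one report.
report = {
    "incomplete_rows": check_completeness(rows, ["id", "email", "age"]),
    "invalid_age_rows": check_validity(rows, "age", 0, 150),
    "duplicate_ids": check_uniqueness(rows, "id"),
}
print(report)
```

Each check returns the offending rows or values rather than a single pass/fail flag, so a user can see what to fix. Accuracy, consistency, and timeliness are harder to check this way, because they depend on information outside a single table, such as the true value, other data locations, or the time at which the data is needed.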