2 Using generative AI to ensure sufficient data quality

 

This chapter covers

  • Best practices for ensuring high data quality
  • Using generative AI to prepare a data cleaning protocol
  • Evaluating data content quality
  • Dealing with data errors
  • Investigating unclear data

In Microsoft Excel, you can calculate a trend line and the standard deviation of a sample from just two data points. Clearly, such “data analysis” is meaningless. This chapter will help you focus your efforts on what you should do with data, rather than on everything you could do with it. It lays the groundwork for any analysis you may wish to perform. You will learn best practices and non-negotiable rules that ensure your conclusions reflect the business activities you’re analyzing, rather than flaws in the underlying data.
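
To make that opening claim concrete, here is a minimal Python sketch (not from the chapter; the two observations are invented). Any straight line fits two points perfectly, and a two-point sample standard deviation, while computable, says nothing reliable about the data:

import numpy as np

# Two hypothetical observations -- enough for Excel (or NumPy) to "analyze"
x = np.array([1.0, 2.0])
y = np.array([100.0, 180.0])

# Fit a degree-1 "trend line" through the two points
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

print(f"Trend line: y = {slope:.1f}x + {intercept:.1f}")  # y = 80.0x + 20.0
print(f"Residuals: {residuals}")                          # ~[0. 0.] -- a perfect fit, always
print(f"Sample std dev: {y.std(ddof=1):.2f}")             # 56.57 -- computable, but meaningless

With only two observations, the residuals are zero by construction, so the quality of the fit tells you nothing. Catching this kind of flaw before drawing conclusions is exactly what this chapter is about.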

You’ll develop a structured approach to quality assessment and assurance, purge your data of artifacts, identify its blind spots, and learn to weigh the benefits and risks of guesstimating missing pieces. Finally, you’ll learn to look at the collected data from a new perspective: its usefulness for the analysis ahead.

2.1 On a whim of fortune

2.2 A note on best practices

2.3 Getting started

2.4 Quality assessment structure

2.4.1 Data cleaning steps

2.4.2 Exploratory data analysis elements

2.5 Data cleaning

2.5.1 Removing duplicates

2.5.2 Handling missing values

2.5.3 Correcting data entry errors

2.5.4 Data validation

2.6 Exploratory data analysis

2.6.1 Reviewing score distribution

2.6.2 Time series exploration

2.6.3 Mysterious variable investigation

2.6.4 Harmonizing data

Summary