8 Quality control for data annotation

 

This chapter covers

  • Calculating the accuracy of an annotator compared with ground truth data
  • Calculating the overall agreement and reliability of a dataset
  • Generating a confidence score for each training data label
  • Incorporating subject-matter experts into the annotation workflow
  • Breaking a task into simpler subtasks to improve annotation

You have your machine learning model ready to go, and you have people lined up to annotate your data, so you are almost ready to deploy! But you know that your model is going to be only as accurate as the data that it is trained on, so if you can’t get high-quality annotations, you won’t have an accurate model. You need to give the same task to multiple people and take the majority vote, right?
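
That intuition, giving the same task to several people and keeping the most common answer, is where many teams start, and it takes only a few lines of code. The sketch below is illustrative only; the function, item IDs, and labels are made-up examples, not code from this chapter. It aggregates each item by majority vote and uses the vote share as a naive confidence score.

from collections import Counter

def majority_vote(annotations):
    """Aggregate one item's annotations by majority vote.

    annotations: the labels given by different annotators for one item,
    for example ["pedestrian", "pedestrian", "cyclist"].
    Returns the winning label and its vote share, which doubles as a
    naive confidence score for that label.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]   # on a tie, the label seen first wins
    return label, votes / len(annotations)

# Made-up annotations: three annotators labeling two items
dataset = {
    "item_1": ["pedestrian", "pedestrian", "cyclist"],
    "item_2": ["cyclist", "pedestrian", "cyclist"],
}

for item_id, labels in dataset.items():
    label, share = majority_vote(labels)
    print(item_id, label, round(share, 2))    # for example: item_1 pedestrian 0.67

Even this toy example hints at the problems with stopping there: a 2-to-1 vote and a unanimous vote deserve very different confidence, annotators can agree purely by chance, and some annotators are more reliable than others. The rest of this chapter addresses exactly those gaps, comparing annotators with ground truth data, measuring agreement with metrics such as Krippendorff's alpha, and generating a confidence score for every training data label.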

8.1 Comparing annotations with ground truth answers

8.1.1 Annotator agreement with ground truth data

8.1.2 Which baseline should you use for expected accuracy?

8.2 Interannotator agreement

8.2.1 Introduction to interannotator agreement

8.2.2 Benefits from calculating interannotator agreement

8.2.3 Dataset-level agreement with Krippendorff’s alpha

8.2.4 Calculating Krippendorff’s alpha beyond labeling

8.2.5 Individual annotator agreement

8.2.6 Per-label and per-demographic agreement

8.2.7 Extending accuracy with agreement for real-world diversity

8.3 Aggregating multiple annotations to create training data