8 Quality Control for Data Annotation
This chapter covers
- Calculating the accuracy of an annotator compared to ground-truth data.
- Calculating the overall agreement and reliability of a dataset as a whole.
- Measuring inter-annotator agreement on a per-task basis to generate a confidence score for each training data label.
- Designing architectures that incorporate subject-matter experts into the annotation workflow.
- Breaking up a task into simpler subtasks to improve accuracy, efficiency, and quality control.
You have your machine learning model ready to go and people lined up to annotate your data, so you are almost ready to deploy! But your model will only be as accurate as the data it is trained on: if you can't get high-quality annotations, you won't get an accurate model. You just need to give the same task to multiple people and take the majority vote, right?
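As a starting point, the majority-vote idea can be sketched in a few lines. This is a minimal illustration, not the full quality-control approach this chapter develops; the `annotations` data and the `majority_vote` helper are invented for the example.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label and the fraction of annotators who chose it.

    The fraction is a naive per-task agreement score: 1.0 means all
    annotators agreed, lower values mean the label is less certain.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical annotations: task ID -> labels from three different annotators.
annotations = {
    "task_1": ["cat", "cat", "dog"],
    "task_2": ["dog", "dog", "dog"],
}

for task_id, labels in annotations.items():
    label, agreement = majority_vote(labels)
    print(f"{task_id}: {label} (agreement {agreement:.2f})")
```

The naive agreement score already hints at the problem with simple voting: it ignores how skilled each annotator is and how likely they are to agree by chance, which is exactly what the rest of this chapter addresses.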