This chapter covers
- Using outlier detection to sample data that is unknown to your current model
- Using clustering to sample more diverse data before annotation starts
- Using representative sampling to target data most like where your model is deployed
- Improving real-world diversity with stratified sampling and active learning
- Using diversity sampling with different types of machine learning architectures
- Evaluating the success of diversity sampling
In chapter 3, you learned how to identify where your model is uncertain: what your model “knows it doesn’t know.” In this chapter, you will learn how to identify what’s missing from your model: what your model “doesn’t know that it doesn’t know,” or the “unknown unknowns.” This problem is hard, and it is made harder because what your model needs to know is often a moving target in a constantly changing world. Just as humans learn new words, new objects, and new behaviors every day in response to a changing environment, most machine learning algorithms are deployed in a changing environment.