4 Diversity sampling

This chapter covers

Using outlier detection to sample data that is unknown to your current model
Using clustering to sample more diverse data before annotation starts
Using representative sampling to target data most like where your model is deployed
Improving real-world diversity with stratified sampling and active learning
Using diversity sampling with different types of machine learning architectures
Evaluating the success of diversity sampling

In chapter 3, you learned how to identify where your model is uncertain: what your model “knows it doesn’t know.” In this chapter, you will learn how to identify what’s missing from your model: what your model “doesn’t know that it doesn’t know,” or the “unknown unknowns.” This problem is a hard one, made even harder because what your model needs to know is often a moving target in a constantly changing world. Just as humans learn new words, new objects, and new behaviors every day in response to a changing environment, most machine learning algorithms are deployed in environments that keep changing.

4.1 Knowing what you don’t know: Identifying gaps in your model’s knowledge

4.1.1 Example data for diversity sampling

4.1.2 Interpreting neural models for diversity sampling

4.1.3 Getting information from hidden layers in PyTorch

4.2 Model-based outlier sampling

4.2.1 Use validation data to rank activations

4.2.2 Which layers should I use to calculate model-based outliers?

4.2.3 The limitations of model-based outliers

4.3 Cluster-based sampling

4.3.1 Cluster members, centroids, and outliers

4.3.2 Any clustering algorithm in the universe