Diversity Sampling

This is an excerpt from Manning's book Human-in-the-Loop Machine Learning MEAP V09.
Diversity Sampling is a strategy for identifying unlabeled items that are unknown to the Machine Learning model in its current state. This will typically mean items that contain combinations of feature values that are rare or unseen in the training data. The goal of diversity sampling is to target these new, unusual or outlier items for more labels in order to give the Machine Learning algorithm a more complete picture of the problem space.
While “Uncertainty Sampling” is a widely used term, “Diversity Sampling” goes by different names in different fields, often tackling only one part of the problem. Names given to types of Diversity Sampling include “Outlier Detection” and “Anomaly Detection”. For some use cases, like identifying new phenomena in astronomical databases or detecting strange network activity for security, identifying the outlier or anomaly is the goal of the task itself, but we can adapt these techniques here as sampling strategies for Active Learning.
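To make the idea of outlier detection as a sampling strategy concrete, here is a minimal sketch in Python. It scores unlabeled items by how far their features fall from the training distribution using simple z-scores; the function name and the use of raw feature z-scores are illustrative assumptions, not the book's implementation — real systems often use model-internal activations or density estimates instead.

```python
import numpy as np

def outlier_sampling(train_X, unlabeled_X, k=2):
    """Rank unlabeled items by how far they fall from the training
    feature distribution, and return the indices of the top-k outliers.
    (A minimal sketch, not the book's method.)"""
    mean = train_X.mean(axis=0)
    std = train_X.std(axis=0) + 1e-8          # avoid divide-by-zero
    z = np.abs((unlabeled_X - mean) / std)    # per-feature z-scores
    scores = z.max(axis=1)                    # an item's score = its most unusual feature
    return np.argsort(scores)[::-1][:k]       # most unusual first

# Toy data: training items cluster near the origin; unlabeled item 2
# is far outside that distribution, so it is selected for labeling.
train = np.array([[0.0, 0.1], [0.1, 0.0], [-0.1, 0.1], [0.0, -0.1]])
unlabeled = np.array([[0.05, 0.0], [0.1, 0.1], [5.0, 5.0]])
print(outlier_sampling(train, unlabeled, k=1))  # → [2]
```

The key shift from ordinary anomaly detection is the output: instead of flagging outliers as the end result, we route them to human annotators for labels.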
Other types of Diversity Sampling, like Representative Sampling, explicitly try to find the unlabeled items that look most like the unlabeled data as a whole, compared to the current training data. For example, Representative Sampling might find unlabeled text documents containing words that are common in the unlabeled data but not yet present in the training data. For this reason, it is a good method to implement when you know that the data is changing over time.
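The word-frequency example above can be sketched in a few lines. This is a hypothetical illustration, not the book's code: it scores each word by its smoothed frequency ratio between the unlabeled and training data, surfacing vocabulary that is common in new data but rare in training data (real implementations would typically score whole documents by aggregating such ratios).

```python
from collections import Counter

def representative_words(training_texts, unlabeled_texts, top_n=3):
    """Find words that are common in the unlabeled data but rare in
    the training data -- a rough proxy for 'representative' content.
    (A minimal sketch with illustrative names, not the book's code.)"""
    train_counts = Counter(w for t in training_texts for w in t.lower().split())
    unlab_counts = Counter(w for t in unlabeled_texts for w in t.lower().split())
    # Ratio of unlabeled frequency to (add-one smoothed) training frequency.
    scores = {w: c / (train_counts[w] + 1) for w, c in unlab_counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy data: "bitcoin" never appears in training but dominates the
# unlabeled stream, so it ranks as the most representative word.
train = ["the weather is sunny", "the weather is rainy"]
unlab = ["bitcoin price rises", "bitcoin falls", "bitcoin news the"]
print(representative_words(train, unlab, top_n=1))  # → ['bitcoin']
```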
Diversity Sampling can mean using intrinsic properties of the dataset, like the distribution of labels. For example, you might want to deliberately try to get an equal number of human annotations for each label, even though some labels are much rarer than others. Diversity Sampling can also mean ensuring that the data is representative of important external properties of the data, like ensuring that data comes from a wide variety of demographics of the people represented in the data, in order to overcome real-world bias in the data. We will cover all these variations in depth in the chapter on Diversity Sampling.
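The intra-dataset variant described above — deliberately balancing annotations across labels or demographics — can be sketched as a stratified sampler. The function below is an illustrative assumption, not the book's implementation: it groups items by any key (predicted label, demographic, etc.) and draws up to a fixed number from each group so rare groups are not crowded out.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each group (e.g. predicted
    label, or a demographic attribute), so rare groups get annotated
    as often as common ones. (A minimal sketch, not the book's code.)"""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    rng = random.Random(seed)       # seeded for reproducible sampling
    sample = []
    for members in groups.values():
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample

# Toy data: label "B" is rare, but still contributes to the sample.
items = [("A", 1), ("A", 2), ("A", 3), ("A", 4), ("B", 5)]
picked = stratified_sample(items, key=lambda x: x[0], per_stratum=2)
print(len(picked))  # → 3  (two "A" items, one "B" item)
```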
There are shortcomings to both Uncertainty Sampling and Diversity Sampling in isolation. Examples can be seen in Figure 1.2. Uncertainty Sampling might just focus on one part of the decision boundary, and Diversity Sampling might just focus on outliers that are a long distance from the boundary. So the strategies are often used together to find a selection of unlabeled items that will maximize both Uncertainty and Diversity.
Figure 1.2: Pros and Cons of Different Active Learning Strategies
Top Left shows the decision boundary from a Machine Learning algorithm between items, where some items have been labeled as “A” and some have been labeled as “B”.
Top Right shows one possible result from Uncertainty Sampling. This Active Learning strategy is effective at selecting unlabeled items near the decision boundary. These are the items most likely to be wrongly predicted, and therefore the most likely to receive a label that will move the decision boundary. However, if all the uncertainty is concentrated in one part of the problem space, labeling those items will not have a broad effect on the model.
Bottom Left shows one possible result from Diversity Sampling. This Active Learning strategy is effective at selecting unlabeled items from very different parts of the problem space. However, if the sampled items are far from the decision boundary, they are unlikely to be wrongly predicted, so they will have little effect on the model when a human gives them the same label that the model already predicted.
Bottom Right shows one possible result from combining Uncertainty Sampling and Diversity Sampling. By combining the strategies, items are selected near diverse sections of the decision boundary, maximizing the chance of finding items that are likely to result in a changed decision boundary.
The boundary from a Machine Learning model, which would predict Label A to the left and Label B to the right.
Uncertainty Sampling: selecting unlabeled items near the decision boundary.
Diversity Sampling: selecting unlabeled items in very different parts of the problem space.
Combined Uncertainty & Diversity Sampling: finding a diverse selection of items that are also near the boundary.
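The combined strategy can be sketched as a two-stage filter. This is a hypothetical illustration of one common approach, not the book's implementation: first keep the items the model is least confident about (near the boundary), then greedily pick a spread-out subset of those via farthest-point selection, so we do not spend human labels on near-duplicates.

```python
import numpy as np

def uncertainty_then_diversity(probs, X, n_uncertain=50, n_select=5):
    """Stage 1: keep the n_uncertain items with the lowest model
    confidence. Stage 2: greedily select n_select of them that are
    far apart in feature space. (A minimal sketch, not the book's code.)"""
    uncertainty = 1.0 - probs.max(axis=1)                 # least-confidence score
    candidates = np.argsort(uncertainty)[::-1][:n_uncertain]
    chosen = [candidates[0]]                              # start from the most uncertain
    while len(chosen) < min(n_select, len(candidates)):
        # For each candidate, distance to its nearest already-chosen item.
        dists = np.array([min(np.linalg.norm(X[c] - X[s]) for s in chosen)
                          for c in candidates])
        chosen.append(candidates[int(dists.argmax())])    # farthest from chosen set
    return chosen

# Toy data: items 0, 5, 3, 2 are the most uncertain; of those,
# item 3 lies farthest from item 0 in feature space.
probs = np.array([[0.50, 0.50], [0.90, 0.10], [0.55, 0.45],
                  [0.52, 0.48], [0.95, 0.05], [0.51, 0.49]])
X = np.array([[0.0, 0.0], [9.0, 9.0], [0.2, 0.0],
              [5.0, 5.0], [9.0, 0.0], [0.1, 0.0]])
picked = uncertainty_then_diversity(probs, X, n_uncertain=4, n_select=2)
```

Filtering by uncertainty first keeps the selection near the boundary; the farthest-point pass then spreads it across diverse regions, matching the Bottom Right panel of Figure 1.2.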
Figure 4.7: An example of the problems that Diversity Sampling tries to address. Here, we have items mapped to three real-world demographics that we’re calling X, O & Z.
You can apply Diversity Sampling to any type of Machine Learning architecture. As you learned with Uncertainty Sampling in the last chapter, some techniques apply to Neural Models without modification, while others are unique to a given type of Machine Learning model.