11 Synthetic data for outlier detection
This chapter covers
- Creating realistic data to better understand detectors
- Creating more effective synthetic data to tune and test detectors
- Using histograms and GMMs to generate data
- Using simulations to generate data
- Using synthetic data to train detectors
We’ve already seen that synthetic data is very useful in outlier detection for at least a few purposes. First, it helps us experiment with detectors to better understand their behavior. We saw examples of this in chapters 6 and 7 when introducing new detectors and in chapter 8 when examining techniques to visualize how detectors work. In these cases, we worked primarily with small, 2D datasets, which were limited but quite useful. In this chapter, we will look at more realistic synthetic datasets, which can provide further value.
A second purpose is tuning and testing models; we saw examples of this in chapter 8 using doped data. Synthetic data is especially important for this, as there are often few other good options available to evaluate detectors, at least until a large body of well-labelled data is collected. We’ll look here at other ways to generate doped data, as well as other forms of synthetic data that may be used for this purpose.