11 Synthetic data for outlier detection

This chapter covers

Creating realistic data to better understand detectors
Creating more effective synthetic data to tune and test detectors
Using histograms and GMMs to generate data
Using simulations to generate data
Using synthetic data to train detectors

We’ve already seen that synthetic data is useful in outlier detection for at least a few purposes. First, it helps us experiment with detectors to better understand their behavior. We saw examples of this in chapters 6 and 7 when introducing new detectors and in chapter 8 when examining techniques to visualize how detectors work. In these cases, we worked primarily with small, 2D datasets, which were limited but quite useful. In this chapter, we’ll look at more realistic synthetic datasets, which can provide further value.

A second purpose is tuning and testing models; we saw examples of this in chapter 8 using doped data. Synthetic data is especially important for this, as there are often few other good options available to evaluate detectors, at least until a large body of well-labeled data is collected. We’ll look here at other ways to generate doped data, as well as other forms of synthetic data that may be used for this purpose.
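As a brief reminder of the idea, the following is a minimal sketch of doping. It assumes doping means copying real rows and altering a small number of values so that the copies can serve as known outliers. The function name dope_rows, the array X_real, and the rule for choosing replacement values are all hypothetical and chosen only for illustration; they are not the specific method used in chapter 8.

```python
import numpy as np

def dope_rows(X, n_doped, rng):
    """Copy n_doped random rows of X and alter one cell in each copy."""
    X = np.asarray(X, dtype=float)
    row_idx = rng.choice(len(X), size=n_doped, replace=False)
    col_idx = rng.integers(0, X.shape[1], size=n_doped)
    doped = X[row_idx].copy()
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    for i, c in enumerate(col_idx):
        midpoint = (col_min[c] + col_max[c]) / 2
        # Illustrative rule (an assumption): redraw the value from the
        # opposite half of the column's observed range. The new value stays
        # within that range, but it will often no longer fit the row's
        # other features.
        if doped[i, c] > midpoint:
            doped[i, c] = rng.uniform(col_min[c], midpoint)
        else:
            doped[i, c] = rng.uniform(midpoint, col_max[c])
    return doped, row_idx, col_idx

# Stand-in for real data: the second feature depends strongly on the first.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X_real = np.column_stack([
    a,
    2 * a + rng.normal(scale=0.1, size=500),
    rng.normal(size=500),
])

X_doped, row_idx, col_idx = dope_rows(X_real, n_doped=20, rng=rng)

# Evaluation set: the real rows (label 0) followed by the doped rows (label 1).
X_eval = np.vstack([X_real, X_doped])
y_eval = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_doped))])
```

A detector's scores on X_eval can then be compared against y_eval to estimate how well it ranks the doped rows above the real ones.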

11.1 Creating synthetic data to represent inliers

11.1.1 Testing with realistic inlier data

11.1.2 Using realistic synthetic inliers for training

11.2 Generating new synthetic data

11.2.1 Libraries to generate new synthetic data

11.2.2 Using patterns between features

11.2.3 Using GMMs

11.3 Doping

11.4 Simulations

11.5 Training classifiers to distinguish real from fake data

11.6 Summary