We’ve already seen that synthetic data is very useful in outlier detection for at least a few purposes. First, it helps us experiment with detectors to better understand their behavior. We saw examples of this in chapters 6 and 7 when introducing new detectors and in chapter 8 when examining techniques to visualize how detectors work. In these cases, we worked primarily with small, 2D datasets, which were limited but quite useful. In this chapter, we will look at more realistic synthetic datasets, which can provide further value.
A second purpose is tuning and testing models; we saw examples of this in chapter 8 using doped data. Synthetic data is especially important for this, as there are often few other good options available to evaluate detectors, at least until a large body of well-labeled data is collected. We’ll look here at other ways to generate doped data, as well as other forms of synthetic data that may be used for this purpose.
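As a brief reminder of the idea, the following is a minimal sketch of doping, not the exact code from chapter 8. It assumes purely numeric data, and the function name `dope` and the median-based rule for picking replacement values are illustrative choices made here. Each doped row is a copy of a real row with a single cell moved to the opposite side of that column's median, which gives us rows that are known to be modified and can be labeled as such when evaluating a detector.

```python
import numpy as np

def dope(X, n_doped, rng):
    """Copy n_doped random rows of X and alter one random cell in each copy.

    Hypothetical helper for illustration: the altered cell is redrawn
    uniformly from the half of the column's range on the other side of
    the column median from the original value.
    """
    X = np.asarray(X, dtype=float)
    lo, med, hi = X.min(axis=0), np.median(X, axis=0), X.max(axis=0)

    # Pick distinct source rows, and one column to modify per row
    rows = rng.choice(len(X), size=n_doped, replace=False)
    cols = rng.integers(0, X.shape[1], size=n_doped)

    doped = X[rows].copy()
    for i, j in enumerate(cols):
        if doped[i, j] < med[j]:
            doped[i, j] = rng.uniform(med[j], hi[j])  # move into upper half
        else:
            doped[i, j] = rng.uniform(lo[j], med[j])  # move into lower half
    return doped, rows, cols

rng = np.random.default_rng(0)
X_real = rng.normal(size=(200, 4))  # stand-in for a real dataset
X_doped, src_rows, changed_cols = dope(X_real, n_doped=20, rng=rng)

# Evaluation set: real rows labeled 0, doped rows labeled 1
X_eval = np.vstack([X_real, X_doped])
y_eval = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_doped))])
```

The labels in `y_eval` can then be compared against the scores a detector assigns to `X_eval`. Note that the real rows are only assumed to be inliers here; some may be genuine outliers, so the labels are an approximation.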