This chapter covers
- Working with datasets with many rows and many features
- Dimensionality reduction
- Finding useful subsets of features
- Training on samples of data
- Tools to support outlier detection at scale
- Working with very small datasets
The cases we’ve looked at so far assume the data is of a manageable size, both in the number of rows and the number of features, but you may encounter datasets that are more challenging to work with. We saw in chapter 8 that different detectors can have very different training and prediction times on large datasets. Generally, the best option when faced with very large datasets is to use faster model types, though these may not be accurate enough to find the types of outliers your project needs. For example, univariate outlier tests tend to be very fast but will miss rare combinations of values. It may be that the detector, or set of detectors, that best produces the outlier scores suitable for your project struggles with the volume of data it must assess.
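To make this concrete, the following is a minimal sketch of that trade-off, using synthetic data and scikit-learn’s IsolationForest purely for illustration. Two features are individually unremarkable but strongly correlated; one inserted row is in-range on each feature yet breaks the relationship between them. A per-feature z-score test misses it, while a multivariate detector can flag it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic data: features A and B almost always move together.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)
df = pd.DataFrame({"A": x, "B": x + rng.normal(0, 0.1, 1000)})
df.loc[999] = [1.5, -1.5]  # in-range on each feature, rare as a pair

# Univariate test: z-score per feature. Fast, but the inserted row
# passes easily because neither value is extreme on its own.
z = (df - df.mean()) / df.std()
print((z.abs() > 3.0).any(axis=1).loc[999])  # False: not flagged

# Multivariate detector: scores combinations of values, so the row
# that violates the A/B relationship receives a low (anomalous) score.
clf = IsolationForest(random_state=0)
clf.fit(df)
scores = clf.decision_function(df)  # lower means more anomalous
print(scores[999] < np.quantile(scores, 0.01))  # likely True
```

The univariate test runs in time linear in the data size and is trivially parallelizable, which is why it scales so well; the cost is exactly the blind spot shown here.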
In this chapter, we look at several ways to work with large datasets, including dimensionality reduction, searching for relevant subsets of features, training on samples of the data, running outlier detection in parallel, and using tools designed for larger data volumes. We end by looking at the opposite problem: very small datasets.