10 Handling very large and very small datasets


This chapter covers

  • Working with datasets with many rows and many features
  • Dimensionality reduction
  • Finding useful subsets of features
  • Training on samples of data
  • Tools to support outlier detection at scale
  • Working with very small datasets

The cases we’ve looked at so far assume the data is of a manageable size, both in the number of rows and the number of features, but you may encounter datasets that are more challenging to work with. We saw in chapter 8 that different detectors can have very different training and prediction times on large datasets. Generally, the best option when faced with very large datasets is to work with faster model types, though these may not be accurate enough to find the types of outliers your project needs. For example, univariate outlier tests tend to be very fast but will miss rare combinations of values: a point can look unremarkable in each feature on its own yet be a strong outlier when the features are considered together. It may also be that the detector, or set of detectors, that best provides the outlier scores for your project simply struggles with the volume of data it must assess.
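To make this concrete, the following is a minimal sketch (the dataset and the 3-standard-deviation threshold are invented for illustration) in which a planted point passes a per-feature z-score test but is caught by a multivariate detector, here scikit-learn's IsolationForest:

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data invented for illustration: two strongly correlated
# features, plus one planted point that is within the normal range of
# each feature on its own but breaks the correlation between them.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)
X = np.column_stack([x, x + rng.normal(0.0, 0.1, 1000)])
X = np.vstack([X, [[2.0, -2.0]]])  # ordinary per column, jointly rare

# Univariate test: flag values more than 3 standard deviations from
# the column mean. The planted point passes in both columns.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("z-scores of planted point:", z[-1])  # both roughly 2.0, under 3.0

# A multivariate detector considers the features jointly, so the
# planted point should receive one of the strongest outlier scores.
clf = IsolationForest(random_state=0).fit(X)
scores = clf.score_samples(X)  # lower means more anomalous
print("Rank of planted point:", list(np.argsort(scores)).index(len(X) - 1))

The trade-off is that multivariate detectors such as IsolationForest are generally slower to train and to predict with than simple per-feature tests, which is exactly the tension this chapter addresses.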

In this chapter we look at some ways to work with large datasets, including dimensionality reduction, searching for relevant subsets of features, using samples, running outlier detection in parallel, and using tools designed for larger data volumes. We end by looking at the opposite problem: very small datasets.

10.1 Data with many features

10.1.1 Dimensionality reduction

10.1.2 Feature subspaces

10.2 Data with many rows

10.2.1 Training and predicting on data samples

10.2.2 Testing models for stability

10.2.3 Segmenting data

10.2.4 Running in parallel vs. in sequence

10.2.5 Tools for working with large datasets

10.3 Working with very small datasets

Summary