10 Handling very large and very small datasets

 

This chapter covers

  • Working with datasets that have many rows or many features
  • Dimensionality reduction
  • Finding useful subsets of features
  • Training on samples of data
  • Tools to support outlier detection at scale
  • Working with very small datasets

The cases we’ve looked at so far assume the data is of a manageable size, both in terms of the number of rows and the number of features, but you may encounter datasets that are more challenging to work with. We saw in chapter 8 that different detectors can have very different training and prediction times on large datasets. Generally, the best option when faced with very large datasets is to use faster model types, though these may not be accurate enough to find the types of outliers needed for your project. For example, univariate outlier tests tend to be very fast but will miss rare combinations of values. It may be that the detector, or set of detectors, that best provides the outlier scores suitable for your project struggles with the volume of data it is required to assess.

In this chapter we look at some ways to work with large datasets, including dimensionality reduction, searching for relevant subsets of features, using samples, running outlier detection in parallel, and using tools designed for larger data volumes. We end by looking at the opposite problem: very small datasets.

10.1 Data with many features

 
 
 
 

10.1.1 Dimensionality reduction
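
One way to handle data with many features is to project the data down to a smaller number of dimensions before fitting a detector. The following is a minimal sketch of this idea; PCA and IsolationForest from scikit-learn are illustrative choices here, as are the synthetic data and the 95% variance threshold, and any reducer and detector pair could be substituted.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide dataset: 10,000 rows, 200 features
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 200))

# Scale first, as PCA is sensitive to the magnitudes of the features
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

# Fit the detector in the reduced space and score every row
det = IsolationForest(random_state=0)
det.fit(X_reduced)
scores = -det.score_samples(X_reduced)  # higher scores = more anomalous

One caveat is that outliers flagged in the reduced space can be harder to interpret, since each component is a combination of the original features.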

 
 

10.1.2 Feature subspaces
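
Another option is to run the detector on a number of small subsets of the features, known as subspaces, and combine the resulting scores. The sketch below uses randomly chosen subspaces and averages the per-subspace scores; the subspace size, the number of subspaces, and the use of the mean (rather than, say, the maximum) are all illustrative choices.

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the data: 5,000 rows, 50 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 50))

n_subspaces, subspace_size = 10, 8
all_scores = np.zeros((n_subspaces, len(X)))

for i in range(n_subspaces):
    # Pick a random subset of the columns and fit a detector on it alone
    cols = rng.choice(X.shape[1], size=subspace_size, replace=False)
    det = IsolationForest(random_state=i)
    det.fit(X[:, cols])
    all_scores[i] = -det.score_samples(X[:, cols])

# Aggregate across subspaces; averaging is one common choice
scores = all_scores.mean(axis=0)

Each detector works with only a fraction of the features, which both speeds up training and can expose outliers that are visible only within certain combinations of features.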

 
 

10.2 Data with many rows

 
 
 

10.2.1 Training and predicting on data samples
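
Most detectors need far fewer rows to model what normal records look like than a large dataset provides, so a common pattern is to fit on a random sample and then score the full dataset. A minimal sketch, assuming the full data still fits in memory for prediction (the sample size of 10,000 rows is an illustrative choice):

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for a large dataset: one million rows
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 10))

# Train on a random sample only
sample_idx = rng.choice(len(X), size=10_000, replace=False)
det = IsolationForest(random_state=0)
det.fit(X[sample_idx])

# Predict on every row; prediction is typically much cheaper than training
scores = -det.score_samples(X)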

 
 
 

10.2.2 Testing models for stability
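
If we train on a sample, we should check that the sample is large enough that the resulting model does not depend heavily on which rows happened to be drawn. One simple check, sketched below, is to fit the same detector on several different samples, score a common set of rows with each model, and compare the rankings; high rank correlations suggest the sample size is adequate. The sizes used and the choice of Spearman correlation are illustrative.

import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
holdout = X[:5_000]                      # common set scored by every model

score_sets = []
for i in range(3):
    # Fit the same detector type on a fresh random sample each time
    sample = X[rng.choice(len(X), size=10_000, replace=False)]
    det = IsolationForest(random_state=i).fit(sample)
    score_sets.append(-det.score_samples(holdout))

# Compare each pair of models on the common holdout set
for i in range(len(score_sets)):
    for j in range(i + 1, len(score_sets)):
        corr, _ = spearmanr(score_sets[i], score_sets[j])
        print(f"models {i} and {j}: rank correlation {corr:.3f}")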

 
 
 
 

10.2.3 Segmenting data
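
Where the data divides naturally into groups, another option is to fit a separate detector on each segment, so that each row is compared only to similar rows and each model sees just a fraction of the data. A minimal sketch, using a hypothetical categorical region column to define the segments:

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for segmented data
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "region": rng.choice(["north", "south", "east"], size=30_000),
    "a": rng.normal(size=30_000),
    "b": rng.normal(size=30_000),
})

# Fit one detector per segment and score only that segment's rows
df["score"] = 0.0
for region, segment in df.groupby("region"):
    det = IsolationForest(random_state=0)
    det.fit(segment[["a", "b"]])
    df.loc[segment.index, "score"] = -det.score_samples(segment[["a", "b"]])

One caveat is that scores produced by different models are not directly comparable, so it is usually safer to rank or threshold rows within each segment rather than across the full dataset.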

 
 

10.2.4 Running in parallel vs. in sequence
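
Outlier detection work often breaks into independent pieces, such as scoring separate chunks of rows or running separate detectors, and these pieces can be run in parallel to use all available cores, at the cost of higher memory use than running them one after another. A minimal sketch using joblib to score chunks of a large dataset in parallel (the chunk count and worker count are illustrative):

import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000_000, 10))

# Fit once on a sample, then score chunks of rows in parallel
det = IsolationForest(random_state=0).fit(X[:10_000])

def score_chunk(model, chunk):
    # Each worker scores one chunk with the already-fitted detector
    return -model.score_samples(chunk)

chunks = np.array_split(X, 8)
score_chunks = Parallel(n_jobs=8)(
    delayed(score_chunk)(det, chunk) for chunk in chunks
)
scores = np.concatenate(score_chunks)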

 
 
 
 

10.2.5 Tools for working with large datasets
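
When the data is too large to load into memory at all, tools that read and process it in chunks can help; libraries such as Dask and Spark offer similar chunked or distributed processing at larger scale. The sketch below uses pandas’ chunked CSV reading; the file name large_data.csv, the chunk size, and the assumption that every column is numeric are all hypothetical.

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Fit on a sample read from the start of the (hypothetical) file; a random
# sample is preferable where the first rows may not be representative
sample = pd.read_csv("large_data.csv", nrows=50_000)
det = IsolationForest(random_state=0).fit(sample)

# Score the full file one chunk at a time, never holding it all in memory
score_chunks = []
for chunk in pd.read_csv("large_data.csv", chunksize=100_000):
    score_chunks.append(-det.score_samples(chunk))
scores = np.concatenate(score_chunks)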

 
 
 

10.3 Working with very small datasets
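
With very small datasets, the problem is reversed: there is too little data to reliably fit the more complex detectors, and simple, robust statistical tests are often the safer choice. A minimal sketch using an interquartile range test on a single feature (the data and the 2.2 coefficient are illustrative; 1.5 is another common choice):

import numpy as np

x = np.array([23.1, 24.0, 22.8, 23.5, 24.2, 23.9, 31.7, 23.3])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Flag values far outside the interquartile range
lower, upper = q1 - 2.2 * iqr, q3 + 2.2 * iqr
outliers = x[(x < lower) | (x > upper)]
print(outliers)    # flags the 31.7 value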

 
 

Summary

 