12 Data distributions


This chapter covers

  • Applying statistical principles of distributions in machine learning
  • Understanding the differences between curated and uncurated datasets
  • Using population, sampling, and subpopulation distributions
  • Applying distribution concepts when training a model

As a data scientist and educator, I get a lot of questions from software engineers about how to improve the accuracy of a model. The five basic answers I give are as follows:

  • Increase training time.
  • Increase the depth (or width) of the model.
  • Add regularization.
  • Expand the dataset with data augmentation.
  • Increase hyperparameter tuning.
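As a minimal sketch of the fourth tip, here is one way to expand a dataset with data augmentation. The example is a pure-Python illustration, not tied to any particular framework: the image representation, the `augment` helper, and its parameters are all hypothetical, chosen only to show the idea of generating perturbed copies of existing examples.

```python
import random

def augment(image, noise=0.05, flip_prob=0.5):
    """Return a randomly perturbed copy of a grayscale image,
    represented as a list of rows of pixel values in [0, 1]."""
    rows = [row[:] for row in image]          # copy, leave the original intact
    if random.random() < flip_prob:           # random horizontal flip
        rows = [row[::-1] for row in rows]
    # add small per-pixel noise, clamped back into [0, 1]
    return [[min(1.0, max(0.0, p + random.uniform(-noise, noise)))
             for p in row] for row in rows]

random.seed(42)
image = [[0.0, 0.2, 0.9],
         [0.1, 0.5, 0.8]]
dataset = [image] + [augment(image) for _ in range(4)]  # 5x the data
print(len(dataset))  # -> 5
```

Each augmented copy is a plausible variation of the original, so the model sees more of the underlying distribution without any new labeling effort.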

These five are the most likely levers to pull, and working on one or more of them will often improve model accuracy. But it's important to understand that the ultimate limit on accuracy lies in the dataset used to train the model. That's what we are going to look at here: the nuances of datasets, and how and why they affect accuracy. And by nuances, I mean the distribution patterns of the data.

In this chapter, we do a deep dive into the three types of data distributions: population, sampling, and subpopulation. In particular, we will look at how these distributions affect the ability of the model to generalize accurately to data in the real world. You'll see that a model's accuracy in production often differs from the accuracy measured on the training or evaluation dataset, a gap referred to as serving skew and data drift.
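One common way to detect such drift is to compare the distribution of a feature at training time with its distribution at serving time. The sketch below does this with a two-sample Kolmogorov-Smirnov statistic, implemented from scratch in pure Python on synthetic data; the `ks_statistic` helper, the threshold, and the simulated "serving" samples are all assumptions for illustration, not a production drift monitor.

```python
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum
    vertical distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a + b)):
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(1000)]        # training-time feature
serve_ok = [random.gauss(0.0, 1.0) for _ in range(1000)]     # serving data, same distribution
serve_drift = [random.gauss(0.7, 1.0) for _ in range(1000)]  # serving data, shifted mean

print(f"no drift:   D = {ks_statistic(train, serve_ok):.3f}")
print(f"with drift: D = {ks_statistic(train, serve_drift):.3f}")
```

A small statistic means the serving data still looks like the training data; a large one signals that the real-world distribution has moved away from what the model was trained on.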

12.1 Distribution types

12.1.1 Population distribution

12.1.2 Sampling distribution

12.1.3 Subpopulation distribution

12.2 Out of distribution

12.2.1 The MNIST curated dataset

12.2.2 Setting up the environment

12.2.3 The challenge (“in the wild”)

12.2.4 Training as a DNN