12 Scaling Gaussian processes to large datasets


This chapter covers

  • Training a GP on a large dataset
  • Using mini-batch gradient descent when training a GP
  • Using an advanced gradient descent technique to train a GP faster

So far, we have seen that GPs offer great modeling flexibility. In chapter 3, we learned that we can model high-level trends using the GP’s mean function as well as variability using the covariance function. A GP also provides calibrated uncertainty quantification. That is, the predictions for datapoints near observations in the training dataset have lower uncertainty than those for points far away. This flexibility sets the GP apart from other ML models that produce only point estimates, such as neural networks. However, it comes at a cost: speed.
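As a quick refresher, the sketch below shows the kind of model this refers to. It assumes GPyTorch as the modeling library, in the spirit of the model built in chapter 3; the class name SimpleGP, the toy training data, and the test locations are illustrative only, and the hyperparameters are left at their defaults for brevity.

import torch
import gpytorch


# A minimal exact GP with a constant mean and an RBF covariance function
class SimpleGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


train_x = torch.tensor([0.0, 1.0, 2.0])
train_y = torch.sin(train_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = SimpleGP(train_x, train_y, likelihood)

model.eval()
likelihood.eval()
with torch.no_grad():
    # The predictive standard deviation is small near a training point (x = 1)
    # and much larger far away from the training data (x = 10).
    pred = likelihood(model(torch.tensor([1.0, 10.0])))
    print(pred.stddev)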

Training a GP and making predictions with it (specifically, computing the inverse of the covariance matrix) scale cubically with the size of the training data. That is, if our dataset doubles in size, a GP takes eight times as long to train and predict; if the dataset grows tenfold, it takes 1,000 times longer. This poses a challenge to scaling GPs to the large datasets that are common in many applications.
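To make the cubic cost concrete, here is a minimal sketch, not taken from the book's code, that uses NumPy to time the inversion of an n-by-n covariance-like matrix at two sizes. The matrix sizes are illustrative, and the measured ratio will vary with hardware and the underlying BLAS library, but it should land in the neighborhood of 2^3 = 8.

import time
import numpy as np


def time_inverse(n):
    # Build a well-conditioned covariance-like matrix (RBF kernel plus unit
    # noise) and time its inversion, the O(n^3) operation at the heart of
    # exact GP training and prediction.
    x = np.random.default_rng(0).uniform(size=(n, 1))
    cov = np.exp(-0.5 * (x - x.T) ** 2) + np.eye(n)
    start = time.perf_counter()
    np.linalg.inv(cov)
    return time.perf_counter() - start


ratio = time_inverse(4_000) / time_inverse(2_000)
print(f"Doubling the dataset made inversion {ratio:.1f} times slower")

Timing np.linalg.inv directly is a stand-in for the Cholesky-based solves a real GP library performs, but both share the same cubic complexity.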

12.1 Training a GP on a large dataset

12.1.1 Setting up the learning task

12.1.2 Training a regular GP

12.1.3 Problems with training a regular GP

12.2 Automatically choosing representative points from a large dataset

12.2.1 Minimizing the difference between two GPs

12.2.2 Training the model in small batches

12.2.3 Implementing the approximate model

12.3 Optimizing better by accounting for the geometry of the loss surface

12.4 Exercise

Summary