5 Sequential ensembles: Gradient boosting


This chapter covers

  • Using gradient descent to optimize loss functions for training models
  • Implementing gradient boosting
  • Training histogram gradient-boosting models efficiently
  • Gradient boosting with the LightGBM framework
  • Avoiding overfitting with LightGBM
  • Using custom loss functions with LightGBM

The previous chapter introduced boosting, where we train weak learners sequentially and “boost” them into a strong ensemble model. The key sequential ensemble method from chapter 4 is adaptive boosting (AdaBoost).

AdaBoost is a foundational boosting algorithm that trains each new weak learner to fix the misclassifications of the previous weak learner. It does this by maintaining and adaptively updating weights on the training examples. These weights reflect the extent of misclassification and signal to the base-learning algorithm which training examples it should prioritize.
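As a quick refresher, the following is a minimal sketch of a single AdaBoost reweighting round, assuming binary labels in {-1, +1}; the function and variable names here are illustrative, not part of any library:

```python
import numpy as np

def reweight_examples(weights, y_true, y_pred):
    """One AdaBoost-style weight update (assumes labels in {-1, +1} and 0 < err < 1)."""
    # Weighted error of the current weak learner
    err = np.sum(weights * (y_true != y_pred)) / np.sum(weights)
    # The weak learner's weight (its "say") in the final ensemble
    alpha = 0.5 * np.log((1 - err) / err)
    # Misclassified examples (where y_true * y_pred == -1) are up-weighted
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    # Normalize so the weights form a distribution over training examples
    return new_weights / new_weights.sum(), alpha
```

Gradient boosting replaces this explicit bookkeeping of example weights with a different signal, as described next.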

In this chapter, we look at an alternative way to convey misclassification information to a base-learning algorithm during boosting: instead of weights on training examples, we use loss-function gradients. Recall that we use loss functions to measure how well a model fits each training example in the data set. The negative gradient of the loss function for a single example is called the residual and, as we’ll see shortly, captures the deviation between the true and predicted labels. This error, or residual, measures the extent to which the current model misclassifies each training example.
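To make this connection concrete, here is a minimal sketch using the squared-error loss as an illustration (the chapter develops this idea in section 5.1). The function names are illustrative, and the numerical gradient is only there to verify the claim:

```python
import numpy as np

def squared_error(y_true, y_pred):
    """Squared-error loss for each example: L(y, f) = 0.5 * (y - f)**2."""
    return 0.5 * (y_true - y_pred) ** 2

def residual(y_true, y_pred):
    """Deviation between the true and predicted labels."""
    return y_true - y_pred

def numerical_gradient(y_true, y_pred, eps=1e-6):
    """Finite-difference gradient of the loss with respect to the prediction."""
    return (squared_error(y_true, y_pred + eps) -
            squared_error(y_true, y_pred - eps)) / (2 * eps)

y_true = np.array([3.0, -1.0, 2.5])
y_pred = np.array([2.5,  0.0, 2.0])

print(residual(y_true, y_pred))             # [ 0.5 -1.   0.5]
print(-numerical_gradient(y_true, y_pred))  # matches: negative gradient = residual
```

For the squared-error loss, the negative gradient with respect to the prediction is exactly the residual, which is why gradient boosting can use gradients in place of AdaBoost’s example weights.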

5.1 Gradient descent for minimization

5.1.1 Gradient descent with an illustrative example

5.1.2 Gradient descent over loss functions for training

5.2 Gradient boosting: Gradient descent + boosting

5.2.1 Intuition: Learning with residuals

5.2.2 Implementing gradient boosting

5.2.3 Gradient boosting with scikit-learn

5.2.4 Histogram-based gradient boosting

5.3 LightGBM: A framework for gradient boosting

5.3.1 What makes LightGBM “light”?

5.3.2 Gradient boosting with LightGBM

Summary