
8 Improving decision trees: Random forests and gradient boosting

 

This chapter covers:

  • What are ensemble methods?
  • What are bagging, boosting, and stacking, and why are they useful?
  • Using the random forest and XGBoost algorithms to predict animal classes
  • How to benchmark multiple algorithms against the same task

In the last chapter, I showed you how to use the recursive partitioning algorithm to train decision trees that are highly interpretable. We finished by highlighting an important limitation of decision trees: they have a tendency to overfit the training set, which results in models that generalize poorly to new data. As a result, individual decision trees are rarely used on their own, but they can become extremely powerful predictors when many trees are combined together.
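To make that idea concrete, here is a minimal sketch (not one of this book's worked examples) that contrasts a single tree with an ensemble of many trees, using R's built-in iris data and the rpart and randomForest packages:

library(rpart)          # recursive partitioning: a single decision tree
library(randomForest)   # an ensemble of many trees

# One tree: interpretable, but prone to overfitting the training set
single_tree <- rpart(Species ~ ., data = iris)

# Many trees: each is grown on a bootstrap sample of the data, and their
# individual predictions are aggregated into one, more stable prediction
forest <- randomForest(Species ~ ., data = iris, ntree = 500)

single_tree
forest   # prints an out-of-bag estimate of the error rate

Each tree in the forest sees a slightly different view of the data, and aggregating their predictions smooths out the overfitting of any single tree; exactly how and why this works is what we'll unpack in this chapter.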

By the end of this chapter, you'll understand the difference between ordinary decision trees and ensemble methods, such as random forest and gradient boosting, which combine multiple trees to make predictions. Finally, as this is the last chapter in the classification part of the book, you'll learn what benchmarking is and how to use it: benchmarking is the process of letting a bunch of different learning algorithms fight it out on the same task, so you can select the one that performs best for your particular problem.
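As a preview, here is a minimal benchmarking sketch using the mlr package; the learners chosen and the use of mlr's built-in iris.task are illustrative assumptions, not the chapter's own worked example:

library(mlr)

# Two learners to pit against each other (you could add more, such as a
# gradient boosting learner, in the same way)
learners <- list(
  makeLearner("classif.rpart"),        # a single decision tree
  makeLearner("classif.randomForest")  # a random forest
)

# The same 5-fold cross-validation scheme is applied to every learner
kFold <- makeResampleDesc("CV", iters = 5)

# Let the learners fight it out on the same task
bench <- benchmark(learners, iris.task, kFold)
bench

The output is a table of cross-validated performance for each learner on the same folds of the same task, which is exactly the like-for-like comparison that benchmarking is meant to give us.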

8.1  Ensemble techniques: bagging, boosting, and stacking

8.1.1  Training models on sampled data: bootstrap aggregating

8.1.2  Learning from the previous models' mistakes: boosting

8.1.3  Learning from predictions made by other models: stacking

8.2  Building our first random forest model

8.3  Building our first XGBoost model

8.4  Strengths and weaknesses of tree-based algorithms

8.5  Benchmarking algorithms against each other

8.6  Summary