
6 Decision trees and ensemble learning


This chapter covers

  • Decision trees and the decision tree learning algorithm
  • Random forest: putting multiple trees together into one model
  • Gradient boosting as an alternative way of combining decision trees

In Chapter 3, we described the binary classification problem and used the logistic regression model to predict whether a customer is going to churn.

In this chapter, we'll solve another binary classification problem, but with a different family of machine learning models: tree-based models. The decision tree, the simplest tree-based model, is nothing more than a sequence of if-then-else rules put together. Multiple decision trees can be combined into an ensemble to achieve better performance. We'll cover two tree-based ensemble models: random forest and gradient boosting.
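To see why a decision tree is "nothing more than if-then-else rules," here is a minimal sketch of a small tree written out as plain Python. The feature names ('records', 'job', 'assets') and the threshold are invented for illustration; they are not taken from the chapter's dataset:

# A tiny decision tree, written out as nested if-then-else rules.
# Each branch tests one feature; each leaf returns a prediction.
def assess_risk(client):
    if client['records'] == 'yes':
        if client['job'] == 'parttime':
            return 'default'
        else:
            return 'ok'
    else:
        if client['assets'] > 6000:
            return 'ok'
        else:
            return 'default'

client = {'records': 'no', 'job': 'fulltime', 'assets': 8000}
print(assess_risk(client))  # prints 'ok'

The decision tree learning algorithm covered later in this chapter is, in essence, a procedure for finding good rules like these automatically from data.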

The project we prepared for this chapter is default prediction: we'll predict whether or not a customer will pay back a loan. We'll learn how to train decision trees and random forest models with Scikit-Learn and explore XGBoost, a library for implementing gradient boosting models.
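As a preview, this is roughly what training these three kinds of models looks like. This sketch uses synthetic data from make_classification as a stand-in for the credit scoring dataset we prepare later in the chapter, and the parameter values are illustrative only:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the prepared credit data: X is the feature matrix,
# y is the binary target (1 = default).
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

# A single decision tree
dt = DecisionTreeClassifier(max_depth=6)
dt.fit(X, y)

# Random forest: an ensemble of decision trees
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X, y)

# Gradient boosting with XGBoost
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'eval_metric': 'auc'}
model = xgb.train(params, dtrain, num_boost_round=100)

All three models expose a similar train-then-predict workflow; the differences lie in how the trees are built and combined, which is what the rest of this chapter explores.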

6.1   Credit risk scoring project

Imagine that we work at a bank. When we receive a loan application, we need to make sure that if we give the money, the customer will be able to pay it back. With every application comes the risk of default: the failure to return the money.

6.1.1   Credit scoring dataset

6.1.2   Data cleaning

6.1.3   Dataset preparation

6.2   Decision trees

6.2.1   Decision tree classifier

6.2.2   Decision tree learning algorithm

6.2.3   Parameter tuning for decision tree

6.3   Random forest

6.3.1   Training a random forest

6.3.2   Parameter tuning for random forest

6.4   Gradient boosting

6.4.1   XGBoost: extreme gradient boosting

6.4.2   Model performance monitoring

6.4.3   Parameter tuning for XGBoost

6.4.4   Testing the final model

6.5   Next steps

6.5.1   Exercises

6.5.2   Other projects