6 Fitting a decision tree and a random forest

 

This chapter covers

  • Decision trees and random forests
  • Model interpretation and evaluation
  • Mathematical foundations
  • Data exploration through grouped bar charts and histograms
  • Common data wrangling techniques

In the previous chapter, we solved a classification problem using logistic regression, achieving 87% accuracy in predicting the variety of Turkish raisins based on their morphological features. In this chapter, we will approach a similar classification problem using two powerful modeling techniques: decision trees and random forests.

A decision tree is a simple, intuitive model that makes predictions by recursively splitting the data into subsets, choosing at each step the feature split that best separates the target values. It operates like a flowchart: each internal node tests a feature, each branch represents an outcome of that test, and each leaf node holds a class label (for classification) or a predicted value (for regression). Decision trees are easy to interpret and visualize, making them a popular choice for both classification and regression tasks.
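To make the flowchart idea concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier. Since we haven't loaded this chapter's dataset yet, it uses the built-in iris data as a stand-in; the exact parameter values (such as `max_depth=3`) are illustrative choices, not prescriptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in toy data standing in for a real classification dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# max_depth caps how many times the tree may split recursively
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Each printed line is one internal node (a feature threshold)
# or one leaf (a predicted class) -- the flowchart in text form
print(export_text(tree))
print("test accuracy:", tree.score(X_test, y_test))
```

The `export_text` output shows exactly which feature and threshold each node uses, which is the interpretability advantage discussed above.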

6.1 Understanding decision trees and random forests

6.2 Importing, wrangling, and exploring the data

6.2.1 Understanding the data

6.2.2 Wrangling the data

6.2.3 Exploring the data

6.3 Fitting a decision tree

6.3.1 Splitting the data

6.3.2 Fitting the model

6.3.3 Predicting responses

6.3.4 Evaluating the model

6.3.5 Plotting the decision tree

6.3.6 Interpreting and understanding decision trees

6.3.7 Advantages and disadvantages of decision trees

6.4 Fitting a random forest

6.4.1 Fitting the model

6.4.2 Predicting responses

6.4.3 Evaluating the model

6.4.4 Feature importance

6.4.5 Extracting random trees