chapter nine

9 Splitting data by asking questions: Decision trees

 

In this chapter

  • what is a decision tree?
  • using decision trees for classification and regression
  • building an app-recommendation system using users’ information
  • accuracy, Gini index, and entropy, and their role in building decision trees
  • using scikit-learn to train a decision tree on a university admissions dataset

In this chapter, we cover decision trees. Decision trees are powerful classification and regression models that also tell us a great deal about our dataset. Like the previous models we’ve learned in this book, decision trees are trained with labeled data, where the labels that we want to predict can be classes (for classification) or values (for regression). For most of this chapter, we focus on decision trees for classification, and near the end of the chapter, we describe decision trees for regression. The structure and training process of both types of tree are similar. In this chapter, we develop several use cases, including an app-recommendation system and a model for predicting admissions at a university.
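To make the difference between the two kinds of labels concrete, here is a minimal sketch of a tree that asks a single question. Everything in it is an assumption made for illustration: the toy data, the feature name `age`, the hand-picked threshold of 20, and the helper names `leaf_prediction`, `fit_stump`, and `predict` are hypothetical. It does not show how the best question is chosen. It only shows that the same structure serves both tasks, and that the one thing that changes is how each leaf summarizes its labels.

```python
from collections import Counter
from statistics import mean


def leaf_prediction(labels, task):
    """Summarize the labels that land in one leaf."""
    if task == "classification":
        # Classification: predict the most common class in the leaf.
        return Counter(labels).most_common(1)[0][0]
    # Regression: predict the average value in the leaf.
    return mean(labels)


def fit_stump(rows, labels, feature, threshold, task):
    """Split once on `feature < threshold` and store one prediction per side.

    The threshold is supplied by hand here (an assumption for this sketch);
    a real training procedure would search for a good question.
    """
    left = [y for x, y in zip(rows, labels) if x[feature] < threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": leaf_prediction(left, task),
        "right": leaf_prediction(right, task),
    }


def predict(stump, row):
    """Answer the stump's question for one row and return that leaf's value."""
    side = "left" if row[stump["feature"]] < stump["threshold"] else "right"
    return stump[side]


# Hypothetical toy data: one feature (age) and two kinds of labels.
rows = [{"age": 12}, {"age": 15}, {"age": 32}, {"age": 45}]
class_labels = ["game", "game", "news", "news"]  # classes (classification)
value_labels = [1.0, 3.0, 10.0, 14.0]            # values (regression)

clf = fit_stump(rows, class_labels, "age", 20, "classification")
reg = fit_stump(rows, value_labels, "age", 20, "regression")

print(predict(clf, {"age": 14}))  # majority class among rows with age < 20
print(predict(reg, {"age": 40}))  # mean value among rows with age >= 20
```

The two calls to `fit_stump` differ only in the `task` argument, which mirrors the point that both types of tree share their structure and training process.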

The problem: We need to recommend apps to users according to what they are likely to download

The solution: Building an app-recommendation system

First step to build the model: Asking the best question

Second step to build the model: Iterating

Last step: When to stop building the tree and other hyperparameters

The decision tree algorithm: How to build a decision tree and make predictions with it

Beyond questions like yes/no

Splitting the data using non-binary categorical features, such as dog/cat/bird

Splitting the data using continuous features, such as age

The graphical boundary of decision trees

Using Scikit-Learn to build a decision tree

Real-life application: Modeling student admissions with Scikit-Learn

Setting hyperparameters in Scikit-Learn

Decision trees for regression

Applications

Decision trees are widely used in health care

Decision trees are useful in recommendation systems

Summary

Exercises