9 Splitting data by asking questions: Decision trees

 

This chapter covers

  • What a decision tree is
  • Using decision trees for classification and regression
  • Building an app recommendation system using the demographic information of the users
  • Asking a series of successive questions to build a good classifier
  • Accuracy, Gini index, and entropy, and their role in building decision trees
  • Using scikit-learn to train a decision tree on a university admissions dataset

In this chapter, we cover decision trees. Decision trees are powerful classification and regression models that also give us a great deal of information about our dataset. Like the previous models we've learned in this book, decision trees are trained with labelled data, where the labels that we want to predict can be classes (for classification) or values (for regression). For most of this chapter, we focus on decision trees for classification; section 9.6 describes decision trees for regression. The structure and training process of both types of trees are very similar. We develop several use cases, including an app recommendation system and a model for predicting admissions at a university.
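As a small preview of two of the measures named in the list above, here is a minimal sketch of the Gini index and entropy of a set of labels. It assumes the standard definitions (Gini index as 1 minus the sum of squared class proportions, entropy as the negative sum of each proportion times its base-2 logarithm). The function names `gini_index` and `entropy` and the toy labels are hypothetical, chosen only for illustration, and are not taken from this chapter.

```python
from collections import Counter
from math import log2


def gini_index(labels):
    """Gini index of a non-empty list of labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy (in bits) of a non-empty list of labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


# Hypothetical labels: which app each of four users downloaded.
mixed = ["app_a", "app_a", "app_b", "app_b"]  # evenly split between two classes
pure = ["app_a", "app_a", "app_a", "app_a"]   # every user has the same label

print(gini_index(mixed), entropy(mixed))
print(gini_index(pure), entropy(pure))
```

Both measures are highest for the evenly mixed set and zero for the set with a single label, which is why they can serve as a score for how well a question separates the data.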

9.1    The problem: We need to recommend apps to users according to what they are likely to download

9.2    The solution: Building an app recommendation system

9.2.1   First step to build the model: Asking the best question

9.2.2   Second step to build the model: Iterating

9.2.3   Last step: When to stop building the tree and other hyperparameters

9.2.4   The decision tree algorithm - How to build a decision tree and make predictions with it

9.3    Beyond questions like yes/no

9.3.1   Splitting the data using non-binary categorical features, such as dog/cat/bird

9.3.2   Splitting the data using continuous features, such as age

9.4    The graphical boundary of decision trees

9.4.1   Using scikit-learn to build a decision tree

9.5    Real life application: Modeling student admissions with scikit-learn

9.5.1   Setting hyperparameters in sklearn

9.6    Decision trees for regression

9.7    Applications

9.7.1   Decision trees are widely used in health care

9.7.2   Decision trees are useful in recommendation systems

9.8    Summary

9.9    Exercises

9.9.1   Exercise 9.1

9.9.2   Exercise 9.2

9.9.3   Exercise 9.3