17 Classification


This chapter covers

  • Classifying with decision trees
  • Building a random forest classifier
  • Creating a support vector machine
  • Evaluating classification accuracy
  • Understanding complex models

Data analysts frequently need to predict a categorical outcome from a set of predictor variables. Some examples include:

  • Predicting whether an individual will repay a loan, given their demographics and financial history
  • Determining whether an ER patient is having a heart attack, based on their symptoms and vital signs
  • Deciding whether an email is spam, given the presence of key words, images, hypertext, header information, and origin

Each of these cases involves the prediction of a binary categorical outcome (good credit risk/bad credit risk; heart attack/no heart attack; spam/not spam) from a set of predictors (also called features). The goal is to find an accurate method of classifying new cases into one of the two groups.

The field of supervised machine learning offers numerous classification methods for predicting categorical outcomes, including logistic regression, decision trees, random forests, support vector machines, and artificial neural networks. The first four are discussed in this chapter. Artificial neural networks are beyond the scope of this book. See Ciaburro and Venkateswaran (2017) and Chollet and Allaire (2018) to learn more about them.

17.1 Preparing the data

17.2 Logistic regression

17.3 Decision trees

17.3.1 Classical decision trees

17.3.2 Conditional inference trees

17.4 Random forests

17.5 Support vector machines

17.5.1 Tuning an SVM

17.6 Choosing a best predictive solution

17.7 Understanding black box predictions

17.7.1 Break-down plots