Chapter 17. Classification

This chapter covers

  • Classifying with decision trees
  • Ensemble classification with random forests
  • Creating a support vector machine
  • Evaluating classification accuracy

Data analysts are frequently faced with the need to predict a categorical outcome from a set of predictor variables. Some examples include

  • Predicting whether an individual will repay a loan, given their demographics and financial history
  • Determining whether an ER patient is having a heart attack, based on their symptoms and vital signs
  • Deciding whether an email is spam, based on the presence of key words, images, and hypertext links, along with its header information and origin

Each of these cases involves the prediction of a binary categorical outcome (good credit risk/bad credit risk, heart attack/no heart attack, spam/not spam) from a set of predictors (also called features). The goal is to find an accurate method of classifying new cases into one of the two groups.

The field of supervised machine learning offers numerous classification methods that can be used to predict categorical outcomes, including logistic regression, decision trees, random forests, support vector machines, and neural networks. The first four are discussed in this chapter. Neural networks are beyond the scope of this book.
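To give a sense of what lies ahead, the sketch below shows one way the first four classifiers might be fit in R. It is a minimal illustration, not the chapter's worked example: it assumes a hypothetical data frame named df whose outcome is a two-level factor called class, and it uses base R's glm() function along with the rpart, randomForest, and e1071 packages (packages assumed to be installed).

    # Minimal sketch: fitting the four classifiers discussed in this chapter.
    # Assumes df is a data frame with a two-level factor outcome named class.
    library(rpart)           # decision trees
    library(randomForest)    # random forests
    library(e1071)           # support vector machines

    fit.logit  <- glm(class ~ ., data = df, family = binomial())  # logistic regression
    fit.tree   <- rpart(class ~ ., data = df, method = "class")   # decision tree
    fit.forest <- randomForest(class ~ ., data = df)              # random forest
    fit.svm    <- svm(class ~ ., data = df)                       # support vector machine

    # Each fitted model can then classify new cases with predict().

In each case the model is built from a training sample and then applied to new cases with predict(); later sections walk through these steps in detail and compare the resulting classification accuracy.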

17.1. Preparing the data

17.2. Logistic regression

17.3. Decision trees

17.4. Random forests

17.5. Support vector machines

17.6. Choosing a best predictive solution

17.7. Using the rattle package for data mining

17.8. Summary