Chapter 17. Classification
This chapter covers
- Classifying with decision trees
- Ensemble classification with random forests
- Creating a support vector machine
- Evaluating classification accuracy
Data analysts often need to predict a categorical outcome from a set of predictor variables. Some examples include
- Predicting whether an individual will repay a loan, given their demographics and financial history
- Determining whether an ER patient is having a heart attack, based on their symptoms and vital signs
- Deciding whether an email is spam, given the presence of keywords, images, hypertext, header information, and origin
Each of these cases involves the prediction of a binary categorical outcome (good credit risk/bad credit risk, heart attack/no heart attack, spam/not spam) from a set of predictors (also called features). The goal is to find an accurate method of classifying new cases into one of the two groups.
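To make the setup concrete, here is a minimal sketch of binary classification using the simplest possible rule: a single threshold on one predictor, which is essentially a decision tree with one split. It is written in plain Python purely for illustration and is not one of the methods developed in this chapter. The data are made up, and the names fit_stump and predict are hypothetical helpers, not part of any library.

```python
# Toy illustration of binary classification with a one-split rule.
# All data below are invented for the example (assumption, not real data).

def fit_stump(xs, ys):
    """Find the threshold t that best separates the classes when we
    predict 1 for x > t and 0 otherwise. Returns (threshold, accuracy)."""
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(xs)):
        preds = [1 if x > t else 0 for x in xs]
        accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
        # Keep the first threshold that achieves the highest accuracy.
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

def predict(threshold, x):
    """Classify a new case: 1 (repays) if x exceeds the threshold, else 0."""
    return 1 if x > threshold else 0

# Predictor: annual income in thousands. Outcome: 1 = repaid, 0 = defaulted.
incomes = [18, 22, 25, 31, 40, 47, 52, 60]
repaid = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, train_accuracy = fit_stump(incomes, repaid)
print("threshold:", threshold, "training accuracy:", train_accuracy)
print("new applicant with income 55 ->", predict(threshold, 55))
```

The accuracy reported here is measured on the same cases used to choose the threshold, so it tends to be optimistic. Judging a classifier fairly requires cases that were held out from fitting.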
The field of supervised machine learning offers numerous classification methods that can be used to predict categorical outcomes, including logistic regression, decision trees, random forests, support vector machines, and neural networks. The first four are discussed in this chapter. Neural networks are beyond the scope of this book.