Chapter 17. Classification

This chapter covers

  • Classifying with decision trees
  • Ensemble classification with random forests
  • Creating a support vector machine
  • Evaluating classification accuracy

Data analysts are frequently faced with the need to predict a categorical outcome from a set of predictor variables. Some examples include

  • Predicting whether an individual will repay a loan, given their demographics and financial history
  • Determining whether an ER patient is having a heart attack, based on their symptoms and vital signs
  • Deciding whether an email is spam, based on the presence of key words, images, and hypertext links, along with its header information and origin

Each of these cases involves the prediction of a binary categorical outcome (good credit risk/bad credit risk, heart attack/no heart attack, spam/not spam) from a set of predictors (also called features). The goal is to find an accurate method of classifying new cases into one of the two groups.

The field of supervised machine learning offers numerous classification methods that can be used to predict categorical outcomes, including logistic regression, decision trees, random forests, support vector machines, and neural networks. The first four are discussed in this chapter. Neural networks are beyond the scope of this book.
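To give a sense of what lies ahead, the sketch below shows one way the first four classifiers might be fit in R. It is a minimal illustration, not the chapter's worked example: it assumes a hypothetical data frame named df whose outcome is a two-level factor called class, and it uses base R's glm() function along with the rpart, randomForest, and e1071 packages (packages assumed to be installed).

    # Minimal sketch: fitting the four classifiers discussed in this chapter.
    # Assumes df is a data frame with a two-level factor outcome named class.
    library(rpart)           # decision trees
    library(randomForest)    # random forests
    library(e1071)           # support vector machines

    fit.logit  <- glm(class ~ ., data = df, family = binomial())  # logistic regression
    fit.tree   <- rpart(class ~ ., data = df, method = "class")   # decision tree
    fit.forest <- randomForest(class ~ ., data = df)              # random forest
    fit.svm    <- svm(class ~ ., data = df)                       # support vector machine

    # Each fitted model can then classify new cases with predict().

In each case the model is built from a training sample and then applied to new cases with predict(); later sections walk through these steps in detail and compare the resulting classification accuracy.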

17.1. Preparing the data

17.2. Logistic regression

17.3. Decision trees

17.4. Random forests

17.5. Support vector machines

17.6. Choosing a best predictive solution

17.7. Using the rattle package for data mining

17.8. Summary