4 Classifying based on odds: logistic regression

 

This chapter covers:

  • What the logistic regression algorithm is and how it works
  • What feature engineering is
  • What missing value imputation is
  • Building a logistic regression classifier to predict the survival of Titanic passengers

In this chapter, I’m going to add a new classification algorithm to your toolbox: logistic regression. Just like the k-nearest neighbors algorithm you learned about in the previous chapter, logistic regression is a supervised learning method that predicts class membership. Logistic regression relies on the equation of a straight line and produces models that are very easy to interpret and communicate.

Logistic regression can handle both continuous predictor variables (those without discrete categories) and categorical predictor variables (those with discrete categories). In its simplest form, logistic regression predicts a binary outcome (each case belongs to one of two classes), but variants of the algorithm can handle multiple classes as well. Its name comes from the algorithm’s use of the logistic function, an equation that calculates the probability that a case belongs to one of the classes.
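To give a feel for what the logistic function does, here is a minimal sketch in Python, assuming the standard form of the function. The name `logistic` and the input values are chosen purely for illustration and are not taken from the chapter's own code.

```python
import math


def logistic(z: float) -> float:
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Large negative inputs give probabilities near 0, large positive inputs
# give probabilities near 1, and an input of 0 gives exactly 0.5.
for z in (-4.0, 0.0, 4.0):
    print(f"z = {z:+.1f} -> p = {logistic(z):.3f}")
# z = -4.0 -> p = 0.018
# z = +0.0 -> p = 0.500
# z = +4.0 -> p = 0.982
```

Because the output always lies between 0 and 1, it can be read as the probability that a case belongs to one of the two classes.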

Although logistic regression is a classification algorithm, it borrows the equation of a straight line from linear regression to combine the information from multiple predictors. In this chapter, you’ll learn how the logistic function works and how the equation of a straight line is used to build a model.
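As a preview of how the straight line and the logistic function fit together, the standard formulation of logistic regression can be written as follows. The symbols are my own choice for illustration: p is the probability that a case belongs to one of the classes, x_1 to x_k are the predictor variables, and the β values are the parameters the model learns.

```latex
\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The left-hand form says that the log of the odds is a straight-line combination of the predictors. The right-hand form is the same relationship rearranged to give the probability directly.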

4.1  What is logistic regression?

4.1.1  How does logistic regression learn?

4.1.2  What if I have more than two classes?

4.2  Building our first logistic regression model

4.2.1  Loading and exploring the Titanic dataset

4.2.2  Making the most of the data: feature engineering and feature selection

4.2.3  Plotting the data

4.2.4  Training the model

4.2.5  Dealing with missing data

4.2.6  Training the model (take two)

4.3  Cross-validating our logistic regression model

4.3.1  Including missing value imputation in our cross-validation

4.3.2  Accuracy is the most important performance metric, right?

4.4  Interpreting the model: the odds ratio

4.4.1  Converting model parameters into odds ratios

4.4.2  When a one unit increase doesn’t make sense
