chapter seven

7 Number go up! (or down): Correlation and linear regression

 

This chapter covers

  • The Pearson correlation and how it serves as a hypothesis test for a linear relationship between two variables
  • How to predict values of correlated variables using linear regression
  • Metrics and assumptions for validating correlation and linear regression models

Linear regression is a type of statistical (and machine learning) model that fits a linear function between independent (input) and dependent (output) variables, given some data. The fitted line can then be used to make predictions for data it has not seen before, assuming there is indeed a linear relationship between the variables. So far, we have focused on only one variable at a time. But it is often helpful to understand or predict hypothesized relationships between multiple variables, such as how much a plant will grow given a certain number of hours of sunlight. Sometimes these relationships resemble a straight line, which makes predictions straightforward. Linear relationships may sound elementary, but they are a foundational part of even the most advanced models in statistics and machine learning. That makes linear regression a great building block to master!

Linear regression has many strengths that make it a workhorse for statistical and machine learning modeling.

Correlation

Observing a linear pattern between two variables

The Pearson correlation

Hypothesis testing the correlation coefficient

Assumptions of Pearson correlation

Correlation is not causation!

Linear regression

Simple linear regression

Interpolation versus extrapolation

Residuals and sum of squares

Overfitting and bias/variance tradeoff

Evaluating a simple linear regression

A real-world example

Summary

Reference