
12 Regression with distance and trees: k-nearest neighbors, random forest, and XGBoost


This chapter covers:

  • Using the k-nearest neighbors algorithm for regression
  • Using tree-based algorithms for regression
  • Comparing k-nearest neighbors, random forest, and XGBoost models for predicting the amount of heat released during combustion of fuels

You’re going to find this chapter a breeze. This is because you’ve done everything in it before (sort of). In chapter 3, I introduced you to the k-nearest neighbors (k-NN) algorithm as a tool for classification. In chapter 7, I introduced you to decision trees, and then expanded on this in chapter 8 to cover random forest and XGBoost for classification. Well, conveniently, these algorithms can also be used to predict continuous variables. So in this chapter, I’ll help you extend these skills to solve regression problems.
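To make that concrete, here is a minimal sketch of k-NN regression: instead of taking a majority vote of class labels among the k nearest neighbors (as in classification), the model predicts the mean of their outcome values. The sketch uses Python with scikit-learn and synthetic data; it isn't the chapter's own code, just an illustration of the idea.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(42)
    X = rng.uniform(0, 10, size=(100, 1))            # one continuous predictor
    y = np.sin(X.ravel()) + rng.normal(0, 0.2, 100)  # continuous outcome

    # Each prediction is the mean y value of the 5 nearest training cases
    knn = KNeighborsRegressor(n_neighbors=5)
    knn.fit(X, y)
    print(knn.predict([[2.5], [7.0]]))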

By the end of this chapter, I hope you’ll understand how k-NN and tree-based algorithms can be extended to predict continuous variables. As you learned in chapter 7, decision trees suffer from a tendency to overfit their training data, and so are often vastly improved by ensemble techniques. Therefore, in this chapter you’ll train a random forest model and an XGBoost model, and benchmark their performance against the k-NN algorithm.
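As a taste of what that benchmark looks like, here is a hedged sketch comparing cross-validated error for the three regressors. It assumes Python with scikit-learn and the xgboost package, and uses a synthetic dataset as a stand-in for the chapter's fuel data.

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor
    from xgboost import XGBRegressor

    # Synthetic stand-in for a real regression dataset
    X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

    models = {
        "k-NN": KNeighborsRegressor(n_neighbors=5),
        "random forest": RandomForestRegressor(n_estimators=200, random_state=1),
        "XGBoost": XGBRegressor(n_estimators=200, max_depth=3, random_state=1),
    }

    for name, model in models.items():
        # scikit-learn returns negated MSE, so flip the sign to report MSE
        mse = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_squared_error")
        print(f"{name}: mean CV MSE = {mse.mean():.1f}")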

12.1  Using k-nearest neighbors to predict a continuous variable

12.2  Using tree-based learners to predict a continuous variable

12.3  Building our first k-NN regression model

12.3.1  Loading and exploring the fuel dataset

12.3.2  Tuning the k hyperparameter

12.4  Building our first random forest regression model

12.5  Building our first XGBoost regression model

12.6  Benchmarking the k-NN, random forest, and XGBoost model-building processes

12.7  Strengths and weaknesses of k-NN, random forest, and XGBoost

12.8  Summary

12.9  Solutions to exercises