
10 Regression with distance and trees: k-nearest neighbors, random forest, and XGBoost


This chapter covers:

  • Using the k-nearest neighbors algorithm for regression
  • Using tree-based algorithms for regression
  • Comparing k-nearest neighbors, random forest, and XGBoost models for predicting the amount of heat released during combustion of fuels

You’re going to find this chapter a breeze. This is because you’ve done everything in it before (sort of). In chapter 3, I introduced you to the k-nearest neighbors (k-NN) algorithm as a tool for classification. In chapter 7, I introduced you to tree-based algorithms as tools for classification. Well, conveniently, these algorithms can also be used to predict continuous variables: k-NN can average the outcome values of the k nearest neighbors instead of taking their majority vote, and trees can return the mean outcome of the cases in each leaf instead of a class. So in this chapter, I’ll help you extend these skills to solve regression problems.
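To make that concrete before we dive in, here’s a minimal from-scratch sketch of k-NN regression in Python. This is illustrative only, not the chapter’s own code: the function name, toy data, and one-dimensional inputs are all my own simplifications. Notice that the only change from the classification version is the final step, where we average the neighbors’ outcome values rather than counting votes.

# Minimal k-NN regression sketch: average the k nearest neighbors'
# outcome values instead of taking a majority vote over their classes.
def knn_regress(train_x, train_y, query, k):
    # Sort training cases by distance to the query (1-D inputs for simplicity)
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda case: abs(case[0] - query))
    # Predict the mean outcome of the k nearest neighbors
    return sum(y for _, y in by_distance[:k]) / k

# Toy data: y is roughly 2 * x plus a little noise
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(knn_regress(xs, ys, query=3.5, k=2))  # 7.0, the mean of 6.2 and 7.8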

By the end of this chapter, I hope you’ll understand how k-NN and tree-based algorithms can be extended to predict continuous variables. As you learned in chapter 7, decision trees tend to overfit their training data, and so are often vastly improved by ensemble techniques. Therefore, in this chapter you’ll train a random forest model and an XGBoost model, and benchmark their performance against the k-NN algorithm.
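As a quick illustration of why ensembling helps, the sketch below compares a single decision tree against two ensemble regressors on synthetic data using scikit-learn. This is a sketch under assumed tooling, not the chapter’s own code; GradientBoostingRegressor stands in for XGBoost here so the example needs no extra dependency (the separate xgboost package offers a similar XGBRegressor).

# Compare a lone decision tree with two ensemble regressors using
# cross-validated R-squared on synthetic regression data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=1)

models = {
    "decision tree": DecisionTreeRegressor(random_state=1),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=1),
    "gradient boosting": GradientBoostingRegressor(random_state=1),
}

for name, model in models.items():
    # The unpruned tree typically trails both ensembles here
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")

On data like this, the lone tree’s cross-validated performance usually lags behind both ensembles, which is the pattern you’ll probe properly on the fuel dataset later in the chapter.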

10.1  Using k-nearest neighbors to predict a continuous variable

10.2  Using tree-based learners to predict a continuous variable

10.3  Building our first k-NN regression model

10.3.1  Loading and exploring the fuel dataset

10.3.2  Tuning the k hyperparameter

10.4  Building our first random forest regression model

10.5  Building our first XGBoost regression model

10.6  Benchmarking the k-NN, random forest, and XGBoost model-building processes

10.7  Strengths and weaknesses of k-NN, random forest, and XGBoost

10.8  Summary

10.9  Solutions to exercises