Chapter 3. Classifying based on similarities with k-nearest neighbors
This chapter covers
- Understanding the bias-variance trade-off
- Recognizing underfitting and overfitting
- Using cross-validation to assess model performance
- Building a k-nearest neighbors classifier
- Tuning hyperparameters
This is probably the most important chapter of the entire book. In it, I’m going to show you how the k-nearest neighbors (kNN) algorithm works, and we’re going to use it to classify potential diabetes patients. In addition, I’m going to use the kNN algorithm to teach you some essential concepts in machine learning that we will rely on for the rest of the book.
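Before we dive in, here's the intuition in a nutshell, sketched in a few lines of base R. This toy function is purely my illustration (it is not the implementation we'll use in the chapter), and it assumes the predictors arrive as a numeric matrix: to classify a new case, measure its distance to every training case, keep the k nearest, and let them vote.

```r
# Toy illustration of the kNN intuition (not the chapter's implementation):
# classify one new case by a majority vote of its k nearest training cases.
knn_classify <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from new_x to every training case
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[1:k]      # indices of the k closest cases
  votes <- table(train_y[nearest])  # count the neighbors' classes
  names(which.max(votes))           # the winning class label
}

# Example with the built-in iris data: vote among the 5 nearest flowers
knn_classify(as.matrix(iris[, 1:4]), iris$Species,
             new_x = c(5.0, 3.5, 1.5, 0.2), k = 5)
```

That really is the whole idea; everything else in this chapter is about doing it properly, with real-world data and honest estimates of how well it works.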
By the end of this chapter, not only will you understand the kNN algorithm and be able to use it to build classification models, but you will also be able to validate a model’s performance and tune it to improve that performance as much as possible. Once the model is built, you’ll learn how to pass new, unseen data into it and get the predicted classes (the values of the categorical, or grouping, variable we are trying to predict). I’ll introduce you to the extremely powerful mlr package in R, which contains a mouth-watering number of machine learning algorithms and greatly simplifies all of our machine learning tasks.
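To give you a taste of where we’re headed, here is a minimal sketch of that whole workflow in mlr. Treat every specific choice as a placeholder I’ve picked for illustration: the diabetes data shipped with the mclust package stands in for the chapter’s dataset, and k = 3, the repeated 10-fold cross-validation, and the tuning grid of k = 1 to 10 are arbitrary starting points that the chapter will justify properly.

```r
# Preview sketch of the mlr workflow (illustrative choices throughout):
# define a task, attach a kNN learner, train, cross-validate, tune, predict.
library(mlr)

# Stand-in data: the diabetes data that ships with the mclust package
data(diabetes, package = "mclust")

# A task couples the data with the variable we want to predict
diabetesTask <- makeClassifTask(data = diabetes, target = "class")

# A learner is the algorithm itself; k = 3 is an arbitrary starting point
knnLearner <- makeLearner("classif.knn", par.vals = list(k = 3))

# Train the model on the task
knnModel <- train(knnLearner, diabetesTask)

# Cross-validate to estimate how the model will perform on unseen data
kFold <- makeResampleDesc("RepCV", folds = 10, reps = 5, stratify = TRUE)
resample(knnLearner, diabetesTask, resampling = kFold,
         measures = list(mmce, acc))

# Tune k over a grid, using cross-validation inside the tuning loop
knnParamSpace <- makeParamSet(makeDiscreteParam("k", values = 1:10))
tunedK <- tuneParams(knnLearner, diabetesTask, resampling = kFold,
                     par.set = knnParamSpace,
                     control = makeTuneControlGrid())

# Pass new, unseen cases to the model to get their predicted classes
newPatients <- diabetes[1:3, c("glucose", "insulin", "sspg")]
predict(knnModel, newdata = newPatients)
```

Don’t worry if terms like task, learner, and resampling description mean nothing to you yet; teasing them apart, one step at a time, is exactly what this chapter is for.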