Chapter Five

5 K-Nearest Neighbors Method

 

“You can be a good neighbor only if you have good neighbors”

- Howard E. Koch, playwright and screenwriter.

After establishing our uniformly continuous modeling assumption in the prelude, we'll put it to work: when the target is uniformly continuous, similarity between objects is an indicator of similarity between their labels. Building on this idea, we'll develop the k-nearest neighbors (or k-NN) method, which simply searches for the k objects most similar to our input (its nearest neighbors) and uses their labels to predict a label for it. We'll motivate the discussion by applying the method to the problem of classifying whether a mushroom is edible or poisonous, and along the way we'll learn how to measure similarity, how to speed up the neighbor search with NumPy and k-d trees, and how to do all of this with scikit-learn.
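Before we build anything in earnest, here is a minimal sketch of that prediction step in Python. It is only a rough illustration, not the implementation we'll develop in this chapter: the function name knn_predict and the toy data are made up, Euclidean distance is assumed as the similarity measure, and ties are broken arbitrarily.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k=3):
        # Distance from x_new to every training point (Euclidean, assumed here)
        distances = np.linalg.norm(X_train - x_new, axis=1)
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote among the neighbors' labels
        return Counter(y_train[nearest]).most_common(1)[0][0]

    # Toy usage: two numeric features per object, binary labels
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0

The entire "model" is just the training data itself; all the work happens at prediction time, which is exactly why we'll later care about making the neighbor search fast.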

This chapter is going to be a little long; we're going to take advantage of the simplicity of k-NN and build it from scratch in several stages. In each stage we'll improve upon the previous one until we reach scikit-learn's implementation. This will let us see, on a manageable scale, what it takes to implement functional and efficient machine learning software that can work on large amounts of data, and grow our appreciation for scikit-learn, its supporting libraries, and the community behind them.

5.1       A Basic k-NN Classifier

5.1.1   The “Can I eat that?” App

5.1.2   The Intuition Behind k-NN

5.1.3   How to Measure Similarity?

5.1.4   k-NN in Action

5.1.5   Boosting Performance with NumPy

5.2       A Better k-NN Classifier

5.2.1   Doing Faster Neighborhood Search Using k-d Trees

5.2.2   Using k-d Trees with scikit-learn

5.2.3   Tuning the Value of k

5.2.4   Choosing the Metric

5.3       Is k-NN Reliable?

5.3.1   The Bayes Optimal Classifier

5.3.2   Reliability of 1-NN