Chapter 6. Memorization methods
This chapter covers
- Building single-variable models
- Cross-validated variable selection
- Building basic multivariable models
- Starting with decision trees, nearest neighbor, and naive Bayes models
The simplest methods in data science are what we call memorization methods: methods that generate answers by returning the majority category (in the case of classification) or the average value (in the case of scoring) of a subset of the original training data. These methods range from models that depend on a single variable (similar to the analyst’s pivot table), through decision trees (similar to what are called business rules), to nearest neighbor and naive Bayes methods.[1] In this chapter, you’ll learn how to use these memorization methods to solve classification problems (though the same techniques also work for scoring problems).
1 Be aware: “memorization methods” is a nonstandard grouping of techniques that we’re using to organize our discussion.
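To make the idea concrete, here is a minimal sketch in R of the simplest memorization model: a single-variable model that memorizes, for each level of a categorical input, the observed rate of the positive outcome in the training data, and predicts by looking that rate up. The data frame `train` and the column names `var` and `churn` are hypothetical stand-ins for illustration; they are not taken from the KDD Cup 2009 data.

```r
# Hypothetical training data: a categorical input and a 0/1 outcome.
set.seed(2009)
train <- data.frame(
  var = sample(c("a", "b", "c"), 100, replace = TRUE),
  churn = sample(c(0, 1), 100, replace = TRUE)
)

# "Training" is just aggregation: memorize the positive-outcome rate
# for each level of the input variable.
rates <- tapply(train$churn, train$var, mean)

# Prediction is a table lookup; levels never seen in training fall
# back to the overall training rate.
predict_single_var <- function(newvals) {
  preds <- rates[as.character(newvals)]
  preds[is.na(preds)] <- mean(train$churn)
  as.numeric(preds)
}

predict_single_var(c("a", "c", "z"))  # "z" is unseen, so it gets the overall rate
```

Everything this model “knows” is in the `rates` lookup table; there is no fitting step beyond summarizing the training data, which is exactly what makes it a memorization method.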
We’ll demonstrate all of the techniques in this chapter using the KDD Cup 2009 dataset as our running example. The Conference on Knowledge Discovery and Data Mining (KDD) is a premier conference on machine learning and data mining methods. Every year KDD hosts a data mining cup, in which teams analyze a dataset and are then ranked against each other. The KDD Cup is a high-profile event and an inspiration for later contests such as the famous Netflix Prize and Kaggle competitions.