Chapter 6. Memorization methods
This chapter covers
- Building single-variable models
- Cross-validated variable selection
- Building basic multivariable models
- Starting with decision trees, nearest neighbor, and naive Bayes models
The simplest methods in data science are what we call memorization methods: methods that generate answers by returning the majority category (in the case of classification) or the average value (in the case of scoring) of a subset of the original training data. These methods range from models that depend on a single variable (similar to the analyst’s pivot table), through decision trees (similar to what are called business rules), to nearest neighbor and naive Bayes methods.[1] In this chapter, you’ll learn how to use these memorization methods to solve classification problems (though the same techniques also work for scoring problems).
1 Be aware: “memorization methods” is a nonstandard grouping of techniques that we’re using to organize our discussion.
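To make the idea concrete, here is a minimal sketch in R of the simplest memorization model: a single-variable model that memorizes, for each level of a categorical input, the observed rate of the positive outcome in the training data, and predicts by looking that rate up. The data frame `train` and the column names `var` and `churn` are hypothetical stand-ins for illustration; they are not taken from the KDD Cup 2009 data.

```r
# Hypothetical training data: a categorical input and a 0/1 outcome.
set.seed(2009)
train <- data.frame(
  var = sample(c("a", "b", "c"), 100, replace = TRUE),
  churn = sample(c(0, 1), 100, replace = TRUE)
)

# "Training" is just aggregation: memorize the positive-outcome rate
# for each level of the input variable.
rates <- tapply(train$churn, train$var, mean)

# Prediction is a table lookup; levels never seen in training fall
# back to the overall training rate.
predict_single_var <- function(newvals) {
  preds <- rates[as.character(newvals)]
  preds[is.na(preds)] <- mean(train$churn)
  as.numeric(preds)
}

predict_single_var(c("a", "c", "z"))  # "z" is unseen, so it gets the overall rate
```

Everything this model “knows” is in the `rates` lookup table; there is no fitting step beyond summarizing the training data, which is exactly what makes it a memorization method.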
We’ll demonstrate all of the techniques in this chapter using the KDD Cup 2009 dataset as our running example. The Conference on Knowledge Discovery and Data Mining (KDD) is a premier conference on machine learning and data mining methods. Every year KDD hosts a data mining cup, in which teams analyze a dataset and are then ranked against each other. The KDD Cup is a high-profile event and an inspiration for later contests such as the famous Netflix Prize and Kaggle competitions.