Chapter 8. Advanced data preparation

This chapter covers

Using the vtreat package for advanced data preparation
Cross-validated data preparation

In the last chapter, we built substantial models on well-behaved data. In this chapter, we will learn how to prepare, or treat, messy real-world data for modeling. We will use the principles of chapter 4 and the vtreat package, which is built for advanced data preparation. We will revisit the issues that arise with missing values, categorical variables, recoding variables, redundant variables, and having too many variables. We will also spend some time on variable selection, which remains an important step even with current machine learning methods. The mental model (figure 8.1) emphasizes that this chapter is about working with data and preparing it for machine learning modeling. We will first introduce the vtreat package, then work through a detailed real-world problem, and finally go into more detail about using the vtreat package.

Figure 8.1. Mental model

8.1. The purpose of the vtreat package

8.2. KDD and KDD Cup 2009

8.2.1. Getting started with KDD Cup 2009 data

8.2.2. The bull-in-the-china-shop approach

8.3. Basic data preparation for classification

8.3.1. The variable score frame

8.3.2. Properly using the treatment plan

8.4. Advanced data preparation for classification

8.4.1. Using mkCrossFrameCExperiment()

8.4.2. Building a model

Building a multivariable model

8.5. Preparing data for regression modeling

8.6. Mastering the vtreat package

8.6.1. The vtreat phases

8.6.2. Missing values