This chapter covers
- Using the vtreat package for advanced data preparation
- Cross-validated data preparation
In the last chapter, we built substantial models on well-behaved data. In this chapter, we will learn how to prepare, or treat, messy real-world data for modeling, using the principles of chapter 4 and the advanced data preparation package vtreat. We will revisit the issues that arise with missing values, categorical variables, recoding variables, redundant variables, and having too many variables. We will also spend some time on variable selection, which remains an important step even with current machine learning methods. The mental model summary (figure 8.1) emphasizes that this chapter is about working with data and preparing it for machine learning modeling. We will first introduce the vtreat package, then work through a detailed real-world problem, and finally go into more detail about using the package.