4 Managing data

This chapter covers:

Fixing data quality problems
Transforming data before modeling
Organizing your data for the modeling process

In chapter 3, you learned how to explore your data and to identify common data issues. In this chapter, you’ll see how to fix the data issues that you’ve discovered. After that, we’ll talk about transforming and organizing the data for the modeling process. Most of the examples in this chapter use the same customer data that you used in the previous chapter.^[18]

Figure 4.1. Chapter 4 Mental Model

As shown in the mental model (figure 4.1), this chapter again emphasizes the science of managing the data in a statistically valid way, prior to the model-building step.

4.1 Cleaning data

In this section, we’ll address issues that you discovered during the data exploration/visualization phase, in particular invalid and missing values. Missing values are common, and the way you treat them is generally the same from project to project. Handling invalid values is often domain-specific: which values are invalid, and what you do about them, depends on the problem that you’re trying to solve.
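To make the distinction concrete, here is a minimal base R sketch. The data frame, its column names, and the validity thresholds are all hypothetical and chosen only for illustration; they are not the chapter's customer data. The domain-specific step decides which values count as invalid, and the generic step simply counts the resulting missing values.

```r
# Hypothetical customer records; column names and values are assumptions.
customers <- data.frame(
  custid = c("a1", "a2", "a3", "a4"),
  age    = c(34, 0, 150, NA),
  income = c(52000, -3000, 41000, NA)
)

# Domain-specific rule (an assumption for this sketch): ages of 0 or
# above 120, and negative incomes, are treated as invalid and set to NA.
cleaned <- customers
cleaned$age[!is.na(cleaned$age) & (cleaned$age <= 0 | cleaned$age > 120)] <- NA
cleaned$income[!is.na(cleaned$income) & cleaned$income < 0] <- NA

# Generic check: count the missing values in each column.
na_counts <- colSums(is.na(cleaned))
print(na_counts)
```

Whether a negative income is an error or a legitimate loss is exactly the kind of question that depends on the domain, so the rules above would change from project to project, while the `is.na()` count would not.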

4.1.1 Domain-specific data cleaning

4.1.2 Treating missing values (NAs)

4.1.3 The `vtreat` package for automatically treating missing variables

4.2 Data transformations

4.2.1 Normalization

4.2.2 Centering and scaling

4.2.3 Log transformations for skewed and wide distributions

4.3 Sampling for modeling and validation

4.3.1 Test and training splits

4.3.2 Creating a sample group column

4.3.3 Record grouping

4.3.4 Data provenance

4.4 Summary