4 Managing data
This chapter covers:
- Fixing data quality problems
- Transforming data before modeling
- Organizing your data for the modeling process
In chapter 3, you learned how to explore your data and to identify common data issues. In this chapter, you’ll see how to fix the data issues that you’ve discovered. After that, we’ll talk about transforming and organizing the data for the modeling process. Most of the examples in this chapter use the same customer data that you used in the previous chapter.[18]
Figure 4.1. Chapter 4 Mental Model

As shown in the mental model (figure 4.1), this chapter again emphasizes the science of managing the data in a statistically valid way, prior to the model-building step.
In this section, we’ll address issues that you discovered during the data exploration/visualization phase, in particular invalid and missing values. Missing values are common, and the way you treat them is generally the same from project to project. Handling invalid values is often domain-specific: which values are invalid, and what you do about them, depends on the problem that you are trying to solve.
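The split between the two kinds of problem can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records, the column names (custid, age, income), and the validity rules (ages outside 1 to 120 and negative incomes are treated as invalid) are hypothetical and are not taken from the chapter's customer data. The first function holds the domain-specific decisions; the second is the generic bookkeeping that tends to look the same from project to project.

```python
# Hypothetical customer records; None marks a value that is missing.
customers = [
    {"custid": "c1", "age": 34, "income": 52000.0},
    {"custid": "c2", "age": 0, "income": 41000.0},    # age 0: assumed to be a placeholder
    {"custid": "c3", "age": 51, "income": -6900.0},   # negative income: assumed invalid
    {"custid": "c4", "age": 47, "income": None},      # income not recorded
]


def mark_invalid_as_missing(record):
    """Domain-specific step: turn values we consider invalid into None.

    The rules below are illustrative assumptions, not universal truths.
    """
    cleaned = dict(record)  # copy, so the original record is left untouched
    if cleaned["age"] is not None and not (0 < cleaned["age"] <= 120):
        cleaned["age"] = None
    if cleaned["income"] is not None and cleaned["income"] < 0:
        cleaned["income"] = None
    return cleaned


def count_missing(records, columns):
    """Generic step: count missing (None) values in each column."""
    return {col: sum(r[col] is None for r in records) for col in columns}


cleaned = [mark_invalid_as_missing(r) for r in customers]
missing_counts = count_missing(cleaned, ["age", "income"])
print(missing_counts)
```

Converting invalid values to missing ones first means that a single missing-value treatment can then be applied to both kinds of problem. Whether that conversion is appropriate is itself a judgment that depends on the domain.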