In part 1, we discussed the initial stages of a data science project. Once you’ve precisely defined the questions you want to answer and the scope of the problem you want to solve, it’s time to analyze the data and find the answers. In part 2, we work with powerful modeling methods from statistics and machine learning.
Chapter 6 covers how to identify appropriate modeling methods to address your specific business problem. It also discusses how to evaluate the quality and effectiveness of models that you or others have built.
Chapter 7 covers basic linear models: linear regression, logistic regression, and regularized linear models. Linear models are the workhorses of many analytical tasks and are especially helpful for identifying key variables and gaining insight into the structure of a problem. A solid understanding of them is immensely valuable for a data scientist.
Chapter 8 temporarily moves away from the modeling task to cover advanced data preparation with the vtreat package. vtreat prepares messy real-world data for the modeling step. Because understanding how vtreat works requires some familiarity with linear models and model evaluation metrics, it seemed best to defer this topic until part 2.