3 Exploring data

 

This chapter covers:

  • Using summary statistics to explore data
  • Exploring data using visualization
  • Finding problems and issues during data exploration
Figure 3.1. Chapter 3 Mental Model

In the last two chapters, you learned how to set the scope and goal of a data science project, and how to start working with your data in R. In this chapter, we’ll start to get our hands into the data. As shown in the mental model (Figure 3.1), this chapter emphasizes the science of exploring the data, prior to the model-building step. Your goal is to have data that is as clean and useful as possible.

Example Scenario.  Suppose your goal is to build a model to predict which of your customers don’t have health insurance. You’ve collected a dataset of customers whose health insurance status you know. You’ve also identified some customer properties that you believe help predict the probability of insurance coverage: age, employment status, income, information about residence and vehicles, and so on.

You’ve put all your data into a single data frame called customer_data that you’ve loaded into R.[14] Now you’re ready to start building the model to identify the customers you’re interested in.
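Before any modeling, it helps to take a first look at the data frame. The following is a minimal sketch of what that might look like, not the chapter's own listing. The column names (age, income, is_employed, health_ins) and every value are invented for illustration; only the name customer_data comes from the scenario above, and the plausibility thresholds are arbitrary assumptions.

```r
# Toy stand-in for customer_data. Column names and values are
# hypothetical, chosen only to show the kinds of issues a summary reveals.
customer_data <- data.frame(
  age         = c(24, 37, 52, 0, 41, 120),
  income      = c(22000, 61000, NA, 48000, -6900, 35000),
  is_employed = c(TRUE, TRUE, NA, FALSE, TRUE, NA),
  health_ins  = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
)

# summary() reports per-column ranges, quartiles, and NA counts.
print(summary(customer_data))

# Count missing values in each column.
na_counts <- colSums(is.na(customer_data))
print(na_counts)

# Flag values outside a plausible range (thresholds are assumptions).
suspicious_age    <- customer_data$age <= 0 | customer_data$age > 110
suspicious_income <- !is.na(customer_data$income) & customer_data$income < 0

# Show the rows that deserve a closer look.
print(customer_data[suspicious_age | suspicious_income, ])
```

In this toy data, the summary would surface missing values, an age of zero, an implausibly high age, and a negative income, each of which raises a question to resolve before building a model.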

3.1  Using summary statistics to spot problems

3.1.1  Typical problems revealed by data summaries

3.2  Spotting problems using graphics and visualization

3.2.1  Visually checking distributions for a single variable

3.2.2  Visually checking relationships between two variables

3.3  Summary
