Chapter 3. Exploring data
This chapter covers
- Using summary statistics to explore data
- Exploring data using visualization
- Finding problems and issues during data exploration
In the last two chapters, you learned how to set the scope and goal of a data science project, and how to load your data into R. In this chapter, we’ll start to get our hands into the data.
Suppose your goal is to build a model to predict which of your customers don’t have health insurance; perhaps you want to market inexpensive health insurance packages to them. You’ve collected a dataset of customers whose health insurance status you know. You’ve also identified some customer properties that you believe help predict the probability of insurance coverage: age, employment status, income, information about residence and vehicles, and so on. You’ve put all your data into a single data frame called custdata that you’ve input into R.[1] Now you’re ready to start building the model to identify the customers you’re interested in.
1 We have a copy of this synthetic dataset available for download from https://github.com/WinVector/zmPDSwR/tree/master/Custdata, and once saved, you can load it into R with the command custdata <- read.table('custdata.tsv',header=T,sep='\t').