Chapter 3. Exploring data

This chapter covers

Using summary statistics to explore data
Exploring data using visualization
Finding problems and issues during data exploration

In the last two chapters, you learned how to set the scope and goal of a data science project, and how to load your data into R. In this chapter, we’ll start to get our hands into the data.

Suppose your goal is to build a model to predict which of your customers don’t have health insurance; perhaps you want to market inexpensive health insurance packages to them. You’ve collected a dataset of customers whose health insurance status you know. You’ve also identified some customer properties that you believe help predict the probability of insurance coverage: age, employment status, income, information about residence and vehicles, and so on. You’ve put all your data into a single data frame called custdata that you’ve loaded into R.¹ Now you’re ready to start building the model to identify the customers you’re interested in.

¹ A copy of this synthetic dataset is available for download from https://github.com/WinVector/zmPDSwR/tree/master/Custdata. Once you've saved it, you can load it into R with the command custdata <- read.table('custdata.tsv', header=TRUE, sep='\t').
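
As a rough sketch of that loading step followed by a first look at the data, the code below writes a tiny stand-in file and reads it back with the same read.table call. The column names and values are made up for illustration and aren't taken from the real custdata.tsv; with the downloaded file you would pass its path instead.

```r
# Tiny stand-in for custdata.tsv. Column names and values are
# hypothetical, chosen only to make the example self-contained.
toy <- data.frame(
  age         = c(34, 51, 27, NA),
  income      = c(42000, 68000, 15000, 30000),
  is.employed = c(TRUE, TRUE, NA, FALSE),
  health.ins  = c(TRUE, TRUE, FALSE, TRUE)
)
path <- tempfile(fileext = ".tsv")
write.table(toy, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Same call as in the footnote, pointed at the temporary file.
custdata <- read.table(path, header = TRUE, sep = "\t")

str(custdata)             # column types and a preview of the values
summary(custdata)         # per-column summaries, including NA counts
colSums(is.na(custdata))  # number of missing values in each column
```

Checking the summaries and missing-value counts before modeling is one way to notice problems early, such as columns that didn't parse as the type you expected.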

3.1. Using summary statistics to spot problems

3.2. Spotting problems using graphics and visualization

3.3. Summary
