Chapter 5. Data assessment: poking and prodding

This chapter covers

  • Descriptive statistics and other techniques for learning about your data
  • Checking assumptions you have about your data and its contents
  • Sifting through your data for examples of things you want to find
  • Performing quick, rough analyses to gain insight before spending a lot of time on software or product development

Figure 5.1 shows where we are in the data science process: assessing the available data and the progress we’ve made so far. In previous chapters we searched for, captured, and wrangled data. You’ve most likely learned a lot along the way, but you’re not yet ready to throw the data at the problem and hope that questions get answered. First, you have to learn as much as you can about what you have: its contents, scope, and limitations, among other features.

Figure 5.1. The fourth and final step of the preparation phase of the data science process: assessing available data and progress so far

It can be tempting to start developing a data-centric product or sophisticated statistical methods as soon as possible, but getting to know your data is well worth a little time and effort. If you know more about your data, and if you stay aware of it and of how you might analyze it, you’ll make more informed decisions at every step of your data science project and reap the benefits later.

5.1. Example: the Enron email data set

5.2. Descriptive statistics

5.3. Check assumptions about the data

5.4. Looking for something specific

5.5. Rough statistical analysis

Exercises

Summary