Chapter 3. Data all around us: the virtual wilderness

This chapter covers

Discovering data you may need
Interacting with data in various environments
Combining disparate data sets

This chapter discusses the principal species of study of the data scientist: data. Having possession of data—namely, useful data—is often taken as a foregone conclusion, but it’s not usually a good idea to assume anything of the sort. As with any topic worthy of scientific examination, data can be hard to find and capture and is rarely completely understood. Any mistaken notion about a data set that you possess or would like to possess can lead to costly problems, so in this chapter, I discuss the treatment of data as an object of scientific study.

3.1. Data as the object of study

In recent years, there has been a seemingly never-ending discussion about whether the field of data science is merely a reincarnation or an offshoot, in the Big Data Age, of any of a number of older fields that combine software engineering and data analysis: operations research, decision sciences, analytics, data mining, mathematical modeling, or applied statistics, for example. As with any trendy term or topic, the debate over its definition will cease only when the popularity of the term dies down. I don’t think I can define data science any better than many of those who have done so before me, so let a definition from Wikipedia (https://en.wikipedia.org/wiki/Data_science), paraphrased, suffice:
