Chapter 1. Data science in a big data world

This chapter covers

  • Defining data science and big data
  • Recognizing the different types of data
  • Gaining insight into the data science process
  • Introducing the fields of data science and big data
  • Working through examples of Hadoop

Big data is a blanket term for any collection of data sets so large or complex that it becomes difficult to process using traditional data management techniques such as relational database management systems (RDBMS). The widely adopted RDBMS has long been regarded as a one-size-fits-all solution, but the demands of handling big data have shown otherwise. Data science involves using methods to analyze massive amounts of data and extract the knowledge it contains. You can think of the relationship between big data and data science as being like the relationship between crude oil and an oil refinery: big data is the raw material, and data science is the process that turns it into something usable. Data science and big data evolved from statistics and traditional data management but are now considered to be distinct disciplines.

The characteristics of big data are often referred to as the three Vs:

  • Volume: How much data is there?
  • Variety: How diverse are the types of data?
  • Velocity: At what speed is new data generated?
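
To make the three Vs concrete, here is a minimal sketch that measures each one on a tiny, made-up collection of records. The records, the `three_vs` function name, and the choice to express velocity as records per second over the observed window are all assumptions made for illustration; real big data systems measure these quantities at a far larger scale and in many different ways.

```python
from datetime import datetime

# Hypothetical toy records: (arrival timestamp, data type, payload size in bytes).
records = [
    (datetime(2024, 1, 1, 12, 0, 0), "text", 2_000),
    (datetime(2024, 1, 1, 12, 0, 10), "image", 500_000),
    (datetime(2024, 1, 1, 12, 0, 20), "text", 1_000),
    (datetime(2024, 1, 1, 12, 0, 30), "sensor", 200),
]


def three_vs(records):
    """Return (volume in bytes, number of distinct types, records per second)."""
    # Volume: how much data is there?
    volume_bytes = sum(size for _, _, size in records)

    # Variety: how many different types of data appear?
    variety = len({kind for _, kind, _ in records})

    # Velocity: how fast does new data arrive? (assumed measure: records
    # divided by the seconds between the first and last arrival)
    timestamps = [ts for ts, _, _ in records]
    span_seconds = (max(timestamps) - min(timestamps)).total_seconds()
    velocity = len(records) / span_seconds if span_seconds > 0 else float("nan")

    return volume_bytes, variety, velocity


volume, variety, velocity = three_vs(records)
print(f"Volume:   {volume} bytes")
print(f"Variety:  {variety} distinct types")
print(f"Velocity: {velocity:.3f} records/second")
```

On a data set this small, all three numbers are trivial to compute on one machine. The point of the three Vs is that once any of them grows large enough, this kind of simple, single-machine processing stops being practical.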

1.1. Benefits and uses of data science and big data

1.2. Facets of data

1.3. The data science process

1.4. The big data ecosystem and data science

1.5. An introductory working example of Hadoop

1.6. Summary