Chapter 2. The data science process


This chapter covers

  • Understanding the flow of a data science process
  • Discussing the steps in a data science process

The goal of this chapter is to give an overview of the data science process without diving into big data yet. You’ll learn how to work with big data sets, streaming data, and text data in subsequent chapters.

2.1. Overview of the data science process

Following a structured approach to data science maximizes your chances of success in a project at the lowest cost. It also makes it possible to take up a project as a team, with each team member focusing on what they do best. Take care, however: this approach isn't suitable for every type of project, nor is it the only way to do good data science.

The typical data science process consists of six steps through which you’ll iterate, as shown in figure 2.1.

Figure 2.1. The six steps of the data science process

Figure 2.1 summarizes the data science process and shows the main steps and actions you’ll take during a project. The following list is a short introduction; each of the steps will be discussed in greater depth throughout this chapter.

1.  The first step of this process is setting a research goal. The main purpose here is to make sure all the stakeholders understand the what, how, and why of the project. In every serious project this will result in a project charter.
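
To make the idea of iterating through the steps concrete, here is a minimal Python sketch. It is an illustration only, not a prescribed implementation: the step names are paraphrased from the section titles of this chapter, and `Step`, `run_process`, and the `explore` handler are hypothetical names made up for this example. Each step may have a handler that returns an earlier step to signal that the project needs to go back.

```python
from enum import IntEnum


class Step(IntEnum):
    """The six steps, named after this chapter's section titles (paraphrased)."""
    SET_RESEARCH_GOAL = 1
    RETRIEVE_DATA = 2
    PREPARE_DATA = 3
    EXPLORE_DATA = 4
    BUILD_MODELS = 5
    PRESENT_FINDINGS = 6


def run_process(handlers, max_visits=20):
    """Walk the steps in order and return the list of steps visited.

    `handlers` maps a Step to a function that receives the steps visited
    so far and returns either None (continue) or a Step to jump back to.
    `max_visits` is a safety cap so a handler cannot loop forever.
    """
    visited = []
    current = Step.SET_RESEARCH_GOAL
    while len(visited) < max_visits:
        visited.append(current)
        handler = handlers.get(current)
        jump_to = handler(visited) if handler else None
        if jump_to is not None:
            current = Step(jump_to)        # iterate: revisit an earlier step
        elif current == Step.PRESENT_FINDINGS:
            break                          # last step finished, stop
        else:
            current = Step(current + 1)    # move on to the next step
    return visited


def explore(visited):
    # Hypothetical scenario: the first exploration reveals missing data,
    # so the project returns to data retrieval exactly once.
    if visited.count(Step.EXPLORE_DATA) == 1:
        return Step.RETRIEVE_DATA
    return None


trace = run_process({Step.EXPLORE_DATA: explore})
print([step.name for step in trace])
```

In this scenario the trace visits data retrieval, preparation, and exploration twice before moving on to modeling and presentation, which is the kind of looping back the process allows for.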

2.2. Step 1: Defining research goals and creating a project charter

2.3. Step 2: Retrieving data

2.4. Step 3: Cleansing, integrating, and transforming data

2.5. Step 4: Exploratory data analysis

2.6. Step 5: Build the models

2.7. Step 6: Presenting findings and building applications on top of them

2.8. Summary