Chapter 1 The data science process

 

Chapter 2 from Introducing Data Science by Davy Cielen, Arno D. B. Meysman, and Mohamed Ali.

This chapter covers:

    Understanding the flow of a data science process

    Discussing the steps in a data science process

    The goal of this chapter is to give an overview of the data science process without diving into big data yet. You’ll learn how to work with big data sets, streaming data, and text data in subsequent chapters.

    2.1 Overview of the data science process

    A structured approach to data science maximizes your chances of success in a project at the lowest cost. It also makes it possible to take up a project as a team, with each team member focusing on what they do best. Take care, however: this approach may not suit every type of project, nor is it necessarily the only way to do good data science.

    The typical data science process consists of six steps through which you’ll iterate, as shown in figure 2.1.

    Figure 2.1 The six steps of the data science process
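As a rough illustration of "six steps through which you'll iterate", here is a minimal Python sketch. The step names are paraphrased from this chapter's section headings. The `Step` enum and the `next_step` helper are hypothetical names introduced only for this example; they are not part of the book or of any library. The sketch assumes that iterating means either moving forward one step or returning to an earlier step.

```python
from enum import IntEnum
from typing import Optional


class Step(IntEnum):
    """The six steps, named after this chapter's section headings."""
    DEFINE_RESEARCH_GOALS = 1
    RETRIEVE_DATA = 2
    PREPARE_DATA = 3       # cleansing, integrating, and transforming data
    EXPLORE_DATA = 4       # exploratory data analysis
    BUILD_MODELS = 5
    PRESENT_FINDINGS = 6


def next_step(current: Step, revisit: Optional[Step] = None) -> Optional[Step]:
    """Return the step to work on next (hypothetical helper).

    If `revisit` names an earlier step, go back to it; otherwise move
    forward one step, returning None once the last step is finished.
    """
    if revisit is not None and revisit < current:
        return revisit
    if current is Step.PRESENT_FINDINGS:
        return None
    return Step(current + 1)
```

For instance, `next_step(Step.EXPLORE_DATA, revisit=Step.RETRIEVE_DATA)` returns `Step.RETRIEVE_DATA`, which models discovering during exploration that more data is needed.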

    Figure 2.1 summarizes the data science process and shows the main steps and actions you’ll take during a project. The following list is a short introduction; each of the steps will be discussed in greater depth throughout this chapter.

    2.1.1 Don’t be a slave to the process

    2.2 Step 1: Defining research goals and creating a project charter

    2.2.1 Spend time understanding the goals and context of your research

    2.2.2 Create a project charter

    2.3 Step 2: Retrieving data

    2.3.1 Start with data stored within the company

    2.3.2 Don’t be afraid to shop around

    2.3.3 Do data quality checks now to prevent problems later

    2.4 Step 3: Cleansing, integrating, and transforming data

    2.4.1 Cleansing data

    2.4.2 Correct errors as early as possible

    2.4.3 Combining data from different data sources

    2.4.4 Transforming data

    2.5 Step 4: Exploratory data analysis

    2.6 Step 5: Build the models

    2.6.1 Model and variable selection

    2.6.2 Model execution

    2.6.3 Model diagnostics and model comparison

    2.7 Step 6: Presenting findings and building applications on top of them

    2.8 Summary
