1 The data science process


This chapter covers

  • Defining data science
  • Defining data science project roles
  • Understanding the stages of a data science project
  • Setting expectations for a new data science project

Data science is a cross-disciplinary practice that draws on methods from data engineering, descriptive statistics, data mining, machine learning, and predictive analytics. Much like operations research, data science focuses on implementing data-driven decisions and managing their consequences. In this book, we concentrate on data science as applied to business and scientific problems, using the techniques listed above.

The data scientist is responsible for guiding a data science project from start to finish. Success in a data science project comes not from access to any one exotic tool, but from having quantifiable goals, good methodology, cross-discipline interactions, and a repeatable workflow.

This chapter walks you through what a typical data science project looks like: the kinds of problems you encounter, the types of goals you should have, the tasks that you’re likely to handle, and the sort of results that are expected.

We’ll use a concrete, real-world example to motivate the discussion in this chapter.[1]


Suppose you’re working for a German bank. The bank feels that it’s losing too much money to bad loans and wants to reduce its losses. To do so, it wants a tool to help loan officers more accurately detect risky loans.

This is where your data science team comes in.

1.1. The roles in a data science project

1.1.1. Project roles

1.2. Stages of a data science project

1.2.1. Defining the goal

1.2.2. Data collection and management

1.2.3. Modeling

1.2.4. Model evaluation and critique

1.2.5. Presentation and documentation

1.2.6. Model deployment and maintenance

1.3. Setting expectations

1.3.1. Determining lower bounds on model performance