chapter five

5 Diving into the problem

This chapter covers:

Getting and verifying access to the data
Revisiting, verifying, and refining business understanding
Developing UX and model utilization concepts
Getting the versioning and pipelining system in place and working
Building the initial pipelines to deliver a data set to the team
Starting to build data tests to make your pipelines robust

In sprint 1, the team puts in place and starts using the infrastructure that supports the delivery project, then opens up the data that will underpin the ML project. To crack the data open, they use the infrastructure they have just built, particularly the pipelines and testing systems.

5.1 Sprint 1 backlog

The sprint 1 backlog contains the tasks described in this chapter (S1.1 to S1.4) and in chapter 6 (S1.5 to S1.7). Sprint 1 prepares you for the core ML activity of creating and evaluating useful models with ML algorithms. The work is to dig deeper into the data resources and to develop the team’s expertise and capability to use them for modeling. You also need to build the supporting infrastructure that lifts and shifts the data from where it rests to where you need it.

5.2 Understanding the data

5.2.1 The data survey

5.2.2 Surveying numerical data

5.2.3 Surveying categorical data

5.2.4 Surveying unstructured data

5.2.5 Reporting and using the survey

5.3 Business problem refinement, UX, and application design

5.4 Building data pipelines

5.4.1 Data fusion challenges

5.4.2 Pipeline jungles

5.4.3 Data testing

5.5 Model repository and model versioning

5.5.1 Features, foundational models, and training regimes