chapter six

6 Operation patterns

This chapter covers

Recognizing areas of improvement in machine learning systems, such as job scheduling and metadata
Preventing resource starvation and avoiding deadlocks using scheduling techniques, such as fair-share scheduling, priority scheduling, and gang scheduling
Using the metadata pattern to handle failures more effectively and reduce any negative effect on users

In chapter 5, we focused on machine learning workflows and the challenges of building them in practice. A workflow is an essential part of a machine learning system because it connects all the other components. It can be as simple as chaining data ingestion, model training, and model serving. It can also become very complex in real-world scenarios that require additional steps and performance optimizations as part of the overall workflow.
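To make the simplest case concrete, the following is a minimal sketch of a workflow that chains the three steps in order. The function names (`ingest_data`, `train_model`, `serve_model`, `run_workflow`), the in-memory dataset, and the "model" (a plain mean) are hypothetical stand-ins chosen for illustration; a real system would replace each step with a distributed component.

```python
from typing import Callable, List


# All names and data below are illustrative assumptions, not a real API.
def ingest_data() -> List[float]:
    """Stand-in for data ingestion: returns a tiny in-memory dataset."""
    return [1.0, 2.0, 3.0, 4.0]


def train_model(data: List[float]) -> float:
    """Stand-in for model training: the 'model' is just the dataset mean."""
    return sum(data) / len(data)


def serve_model(model: float) -> Callable[[float], float]:
    """Stand-in for model serving: returns a function that answers requests."""
    def predict(x: float) -> float:
        # Report how far the input is from the learned mean.
        return x - model
    return predict


def run_workflow() -> Callable[[float], float]:
    """Chain the steps: each step consumes the previous step's output."""
    data = ingest_data()
    model = train_model(data)
    return serve_model(model)


predict = run_workflow()
print(predict(5.0))  # 2.5
```

Each step depends only on the output of the step before it, which is what makes this a linear chain. The more complex workflows mentioned above add steps and optimizations around this same backbone.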

6.1 What are operations in machine learning systems?

6.2 Scheduling patterns: Assigning resources effectively in a shared cluster

6.2.1 The problem

6.2.2 The solution

6.2.3 Discussion

6.2.4 Exercises

6.3 Metadata pattern: Handle failures appropriately to minimize the negative effect on users

6.3.1 The problem

6.3.2 The solution

6.3.3 Discussion

6.3.4 Exercises

6.4 Answers to exercises

Section 6.2

Section 6.3

Summary