This chapter covers
- Distinguishing the traditional model training process from the distributed training process
- Using parameter servers to build models that cannot fit in a single machine
- Improving distributed model training performance using the collective communication pattern
- Handling unexpected failures during the distributed model training process
The previous chapter introduced a couple of practical patterns that can be incorporated into data ingestion, which is usually the first step in a distributed machine learning system: it monitors incoming data and performs the preprocessing needed to prepare for model training.
Distributed training, the next step after data ingestion, is what distinguishes distributed machine learning systems from other distributed systems, and it is the most critical part of such a system.
The system design needs to be scalable and reliable enough to handle datasets and models of different sizes and levels of complexity. Some models are so large and complex that they cannot fit on a single machine, while for medium-size models that do fit on a single machine, the challenge is instead to speed up training by distributing the computation efficiently.
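To make the second case more concrete, the sketch below (not taken from this book's example code) illustrates the collective communication pattern introduced later in this chapter, using PyTorch's torch.distributed package as one possible backend. Each worker averages its local gradients with its peers through an all-reduce operation instead of exchanging them with a central parameter server; the environment variables it relies on are the standard ones set by a launcher such as torchrun, which is an assumption of this sketch rather than a requirement of the pattern.

```python
# Minimal sketch of gradient averaging via all-reduce (collective communication).
# Assumes RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are set by a launcher
# such as torchrun; this is illustrative, not the book's implementation.
import torch
import torch.distributed as dist


def run_worker():
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each worker computes its own local "gradient" tensor (here, a dummy value).
    local_grad = torch.ones(4) * (rank + 1)

    # All-reduce sums the tensors across all workers in place;
    # dividing by the world size yields the averaged gradient.
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    local_grad /= world_size

    print(f"rank {rank}: averaged gradient = {local_grad.tolist()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    run_worker()
```

Run with, for example, `torchrun --nproc_per_node=2 allreduce_sketch.py`: every worker ends up with the same averaged gradient without any single node holding the full set of parameters on behalf of the others.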