
3 Distributed training patterns

 

This chapter covers

  • Distinguishing traditional model training from distributed training, which leverages multiple machines in a distributed cluster.
  • Using the parameter server pattern to build large, complex models that cannot fit on a single machine (a minimal illustrative sketch follows this list).
  • Using the collective communication pattern to improve distributed training performance for small and medium-sized models and to avoid the communication overhead between parameter servers and workers.
  • Handling unexpected failures during distributed model training caused by corrupted datasets, unstable networks, and preempted machines.
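
The parameter server and collective communication patterns are covered in detail in sections 3.2 and 3.3. As a rough preview, the following is a minimal, framework-free sketch in plain Python with NumPy; every class and function name here is illustrative rather than an API from any particular library. It shows the core division of labor: workers compute gradients on their own data shards and either push them to a central parameter server or average them directly among themselves.

# A minimal sketch of the parameter server idea (section 3.2): workers
# compute gradients on their own data shards and push them to a central
# server, which applies the updates to the shared model parameters.
# All names here are illustrative, not from any particular framework.
import numpy as np

class ParameterServer:
    def __init__(self, num_params, lr=0.1):
        self.weights = np.zeros(num_params)   # shared model parameters
        self.lr = lr

    def push(self, gradient):
        # Apply a gradient sent by a worker (plain SGD update).
        self.weights -= self.lr * gradient

    def pull(self):
        # Workers fetch the latest parameters before computing gradients.
        return self.weights.copy()

class Worker:
    def __init__(self, data, labels):
        self.data, self.labels = data, labels  # this worker's data shard

    def compute_gradient(self, weights):
        # Gradient of mean squared error for a linear model on this shard.
        preds = self.data @ weights
        return 2 * self.data.T @ (preds - self.labels) / len(self.labels)

# One synchronous training loop: each worker pulls the current weights,
# computes a gradient on its shard, and pushes it back to the server.
rng = np.random.default_rng(0)
server = ParameterServer(num_params=3)
workers = [Worker(rng.normal(size=(32, 3)), rng.normal(size=32))
           for _ in range(4)]

for step in range(10):
    weights = server.pull()
    for worker in workers:
        server.push(worker.compute_gradient(weights))

# In the collective communication pattern (section 3.3) there is no central
# server: workers average their gradients directly, e.g. via an all-reduce
# operation. The averaging below stands in for what all-reduce produces.
grads = [w.compute_gradient(server.pull()) for w in workers]
averaged_gradient = np.mean(grads, axis=0)

Real implementations shard the parameters across many server processes, run workers on separate machines, and use optimized communication libraries, but the division of responsibilities is the same as in this sketch.
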

In the previous chapter, we introduced a couple of practical patterns that can be incorporated into the data ingestion process. Data ingestion is usually the first step in a distributed machine learning system; it is responsible for monitoring incoming data and performing the preprocessing steps needed to prepare the data for model training.

Distributed training is the next step once data ingestion is complete, and it is what distinguishes distributed machine learning systems from other distributed systems. It is the most critical part of a distributed machine learning system.

3.1 What is distributed training?

3.2 Parameter server pattern: Tagging entities in 8 million YouTube videos

3.2.1 Problem

3.2.2 Solution

3.2.3 Discussion

3.2.4 Exercises

3.3 Collective communication pattern: Improving performance when parameter servers become a bottleneck

3.3.1 Problem

3.3.2 Solution

3.3.3 Discussion

3.3.4 Exercises

3.4 Elasticity and fault-tolerance pattern: Handling unexpected failures when training with limited computational resources

3.4.1 Problem

3.4.2 Solution

3.4.3 Discussion

3.4.4 Exercises

3.5 References

3.6 Summary