3 Distributed training patterns
This chapter covers
- Distinguishing the traditional model training process from distributed training, which leverages multiple machines in a cluster
- Using parameter servers to build large, complex models that cannot fit on a single machine
- Improving distributed training performance for small and medium-sized models with the collective communication pattern, avoiding the communication overhead between parameter servers and workers (a brief sketch follows this list)
- Handling unexpected failures caused by corrupted datasets, unstable networks, and preempted worker machines during distributed training
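To give a concrete flavor of the collective communication pattern before we dive in, the following is a minimal sketch of data-parallel training in which every worker averages its gradients with all other workers through an allreduce operation. It uses PyTorch's torch.distributed package with the gloo backend; the tiny linear model, the train_step helper, and the torchrun launch command are illustrative assumptions rather than code from this chapter.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def train_step(model, batch, targets, loss_fn, lr=0.01):
    """One data-parallel step: compute local gradients, then average
    them across all workers with an allreduce before updating weights."""
    loss = loss_fn(model(batch), targets)
    model.zero_grad()
    loss.backward()
    world_size = dist.get_world_size()
    for param in model.parameters():
        # Sum each gradient tensor across workers, then divide to average.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
    with torch.no_grad():
        for param in model.parameters():
            param -= lr * param.grad  # plain SGD update for illustration
    return loss.item()

if __name__ == "__main__":
    # Launched with, for example, `torchrun --nproc_per_node=2 allreduce_sketch.py`,
    # which sets the RANK, WORLD_SIZE, and MASTER_ADDR/PORT environment variables.
    dist.init_process_group(backend="gloo")
    torch.manual_seed(0)  # identical initial weights on every worker
    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()
    batch, targets = torch.randn(32, 10), torch.randn(32, 1)
    loss = train_step(model, batch, targets, loss_fn)
    print(f"rank {dist.get_rank()} loss: {loss:.4f}")
    dist.destroy_process_group()
```

Because every worker holds a full copy of the model and exchanges only gradients with its peers, this pattern avoids routing updates through a central parameter server; we examine when that trade-off pays off later in the chapter.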
In the previous chapter, we introduced several practical patterns for the data ingestion process. Data ingestion is usually the first stage of a distributed machine learning system: it monitors incoming data and performs the preprocessing steps needed to prepare for model training.
Distributed training is the next step once data ingestion is complete, and it is what sets distributed machine learning systems apart from general distributed systems. It is also the most critical part of a distributed machine learning system.