3 Distributed training patterns
This chapter covers
- Distinguishing the traditional model training process from distributed training, which leverages multiple machines in a cluster
- Using parameter servers to build large, complex models that cannot fit on a single machine
- Improving distributed training performance for small and medium-sized models with the collective communication pattern, avoiding the communication overhead between parameter servers and workers (a brief sketch follows this list)
- Handling unexpected failures caused by corrupted datasets, unstable networks, and preempted worker machines during distributed training
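To give a concrete flavor of the collective communication pattern before we dive in, the following is a minimal sketch of data-parallel training in which every worker averages its gradients with all other workers through an allreduce operation. It uses PyTorch's torch.distributed package with the gloo backend; the tiny linear model, the train_step helper, and the torchrun launch command are illustrative assumptions rather than code from this chapter.

```python
import torch
import torch.distributed as dist
import torch.nn as nn

def train_step(model, batch, targets, loss_fn, lr=0.01):
    """One data-parallel step: compute local gradients, then average
    them across all workers with an allreduce before updating weights."""
    loss = loss_fn(model(batch), targets)
    model.zero_grad()
    loss.backward()
    world_size = dist.get_world_size()
    for param in model.parameters():
        # Sum each gradient tensor across workers, then divide to average.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size
    with torch.no_grad():
        for param in model.parameters():
            param -= lr * param.grad  # plain SGD update for illustration
    return loss.item()

if __name__ == "__main__":
    # Launched with, for example, `torchrun --nproc_per_node=2 allreduce_sketch.py`,
    # which sets the RANK, WORLD_SIZE, and MASTER_ADDR/PORT environment variables.
    dist.init_process_group(backend="gloo")
    torch.manual_seed(0)  # identical initial weights on every worker
    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()
    batch, targets = torch.randn(32, 10), torch.randn(32, 1)
    loss = train_step(model, batch, targets, loss_fn)
    print(f"rank {dist.get_rank()} loss: {loss:.4f}")
    dist.destroy_process_group()
```

Because every worker holds a full copy of the model and exchanges only gradients with its peers, this pattern avoids routing updates through a central parameter server; we examine when that trade-off pays off later in the chapter.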
In the previous chapter, we introduced several practical patterns for the data ingestion process. Data ingestion is usually the first stage of a distributed machine learning system: it monitors incoming data and performs the preprocessing steps needed to prepare for model training.
Distributed training is the next step once data ingestion is complete, and it is what sets distributed machine learning systems apart from general distributed systems. It is also the most critical part of a distributed machine learning system.