This chapter covers
- Distinguishing the traditional model training process from the distributed training process
- Using parameter servers to build models that cannot fit in a single machine
- Improving distributed model training performance using the collective communication pattern
- Handling unexpected failures during the distributed model training process
The previous chapter introduced a couple of practical patterns that can be incorporated into data ingestion, which is usually the first step in a distributed machine learning system: it monitors incoming data and performs the preprocessing needed to prepare for model training.
Distributed training, the next step after data ingestion, is what distinguishes distributed machine learning systems from other distributed systems, and it is the most critical part of such a system.
The system design needs to be scalable and reliable enough to handle datasets and models of different sizes and levels of complexity. Some models are so large and complex that they cannot fit on a single machine, while for medium-size models that do fit on a single machine, the challenge is instead to speed up training by distributing the computation efficiently.
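To make the second case more concrete, the sketch below (not taken from this book's example code) illustrates the collective communication pattern introduced later in this chapter, using PyTorch's torch.distributed package as one possible backend. Each worker averages its local gradients with its peers through an all-reduce operation instead of exchanging them with a central parameter server; the environment variables it relies on are the standard ones set by a launcher such as torchrun, which is an assumption of this sketch rather than a requirement of the pattern.

```python
# Minimal sketch of gradient averaging via all-reduce (collective communication).
# Assumes RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are set by a launcher
# such as torchrun; this is illustrative, not the book's implementation.
import torch
import torch.distributed as dist


def run_worker():
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each worker computes its own local "gradient" tensor (here, a dummy value).
    local_grad = torch.ones(4) * (rank + 1)

    # All-reduce sums the tensors across all workers in place;
    # dividing by the world size yields the averaged gradient.
    dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)
    local_grad /= world_size

    print(f"rank {rank}: averaged gradient = {local_grad.tolist()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    run_worker()
```

Run with, for example, `torchrun --nproc_per_node=2 allreduce_sketch.py`: every worker ends up with the same averaged gradient without any single node holding the full set of parameters on behalf of the others.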