6 Distributed Training
This chapter covers
- Distributed training theories: data parallelism, model parallelism and pipeline parallelism
- Touring a sample training service that supports data-parallel training in Kubernetes
- Training large models with multiple GPUs
One obvious trend in deep learning research is that, in pursuit of better model performance, datasets are getting larger and models are getting bigger (their architectures grow more complex and their parameter counts rise). As a consequence, model training takes a long time to complete, and model development velocity slows down sharply. For example, training a BERT model on a single GPU can take months.
To address the problem of growing dataset and model parameter sizes, researchers have come up with different distributed training theories. Major training frameworks, such as TensorFlow and PyTorch, provide SDKs that implement these theories. With the help of these SDKs, data scientists can write training code that runs in parallel across multiple devices (CPUs or GPUs).
In this chapter, we will focus on how to support distributed training from a software engineer's perspective. More specifically, we will look at how to write a training service that executes different distributed training code (developed by data scientists) on a group of machines.