4 Distributed training
This chapter covers
- Understanding data parallelism, model parallelism, and pipeline parallelism
- A sample training service that supports data parallel training in Kubernetes
- Training large models with multiple GPUs
One clear trend in deep learning research is to improve model performance by using larger datasets and bigger models with increasingly complex architectures. But more data and bulkier models have consequences: they slow down both model training and the overall model development process. As is often the case in computing, performance comes at the cost of speed. For example, training a BERT natural language processing (NLP) model on a single GPU can take several months.
To address the problem of ever-growing datasets and model parameter sizes, researchers have come up with different distributed training strategies. And major training frameworks, such as TensorFlow and PyTorch, provide SDKs that implement these strategies. With the help of these training SDKs, data scientists can write training code that runs across multiple devices (CPUs or GPUs) in parallel.
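To give a concrete sense of what these SDKs look like, the following is a minimal sketch (not code from this book's sample service) of data parallel training with PyTorch's DistributedDataParallel wrapper. The linear model, random tensors, and hyperparameters are placeholders chosen purely for illustration; a real job would load its own data shard per worker.

```python
# Minimal sketch of data parallel training with PyTorch DistributedDataParallel.
# Model, data, and hyperparameters are illustrative placeholders.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    # The launcher (e.g., torchrun) sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT;
    # init_process_group wires the worker processes together.
    dist.init_process_group(backend="gloo")  # use "nccl" when training on GPUs
    rank = dist.get_rank()

    model = nn.Linear(10, 1)        # placeholder model
    ddp_model = DDP(model)          # wrapping enables gradient synchronization
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(10):
        inputs = torch.randn(32, 10)   # each rank would normally read its own data shard
        targets = torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()                # gradients are averaged across all ranks here
        optimizer.step()
        if rank == 0:
            print(f"step {step}, loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    train()
```

Note that the training loop itself looks almost identical to single-device code; the SDK handles process coordination and gradient averaging. Such a script would typically be launched with a command like `torchrun --nproc_per_node=2 train.py`, which starts one worker process per device.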
In this chapter, we will look at how to support distributed training from a software engineer's perspective. More specifically, we will see how to write a training service that executes distributed training code (developed by data scientists) across a group of machines.