6 Distributed Training

This chapter covers

Understanding distributed training methods: data parallelism, model parallelism, and pipeline parallelism
Touring a sample training service that supports data parallel training in Kubernetes
Training large models with multiple GPUs

One clear trend in deep learning research is that, in pursuit of better model performance, datasets are getting larger and models are getting bigger (model architectures are becoming more complex). As a consequence, model training takes longer to complete, and model development velocity slows sharply. For example, training a BERT model on a single GPU can take months.

To address the problem of growing datasets and model parameter sizes, researchers have come up with different distributed training methods. Major training frameworks, such as TensorFlow and PyTorch, provide SDKs that implement these methods. With the help of these SDKs, data scientists can write training code that runs in parallel across multiple devices (CPUs or GPUs).

In this chapter, we will focus on how to support distributed training from a software engineer's perspective. More specifically, we will look at how to write a training service that executes different distributed training code (developed by data scientists) on a group of machines.

6.1 Types of distributed training methods

6.2 Data Parallelism

6.2.1 Understanding Data Parallelism

6.2.2 Multi-worker training challenges

6.2.3 Writing distributed training (data parallelism) code in different training frameworks

6.2.4 Engineering effort in data parallel distributed training

6.3 A sample service that supports data parallel distributed training

6.3.1 Service overview

6.3.2 Play with the service

6.3.3 Launching training jobs

6.3.4 Updating and fetching job status

6.3.5 Converting the training code to run in distributed mode

6.3.6 Improvements

6.4 Training large models that can’t load on one GPU

6.4.1 Traditional methods - Memory saving

6.4.2 Pipeline model parallelism

6.4.3 How can software engineers support pipeline parallelism?

6.5 Summary