
4 Distributed training

 

This chapter covers

  • Understanding data parallelism, model parallelism, and pipeline parallelism
  • A sample training service that supports data parallel training in Kubernetes
  • Training large models with multiple GPUs

One clear trend in deep learning research is to improve model performance with larger datasets and bigger models with increasingly complex architectures. But more data and bulkier models have a consequence: they slow down model training and, in turn, model development. As is often the case in computing, performance is pitted against speed. For example, training a BERT natural language processing model on a single GPU can take several months.

To address the problem of ever-growing datasets and model parameter sizes, researchers have devised different distributed training strategies, and major training frameworks, such as TensorFlow and PyTorch, provide SDKs that implement these strategies. With the help of these training SDKs, data scientists can write training code that runs in parallel across multiple devices (CPUs or GPUs).
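To give a flavor of what such framework-level support looks like, the following is a minimal sketch of data parallel training using PyTorch's DistributedDataParallel (DDP) API. It is an illustrative example, not code from this chapter's sample service; the model, data, and hyperparameters are placeholders.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Every worker runs this same script and joins the same process group.
    # "gloo" works on CPU-only machines; use "nccl" (and move the model to
    # the local GPU) when training on GPUs.
    dist.init_process_group(backend="gloo")

    model = torch.nn.Linear(10, 1)                 # placeholder model
    ddp_model = DDP(model)                         # gradients are averaged across workers
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(10):                         # placeholder training loop
        inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(inputs), targets)
        loss.backward()                            # DDP all-reduces gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with, for example, torchrun --nproc_per_node=2 train_ddp.py, this script starts two worker processes on one machine; each worker trains on its own batches while DDP keeps the model replicas synchronized.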

In this chapter, we will look at how to support distributed training from a software engineer’s perspective. More specifically, we will see how to build a training service that executes different distributed training code (developed by data scientists) across a group of machines.

4.1  Types of distributed training methods

4.2  Data parallelism

4.2.1  Understanding data parallelism

4.2.2  Multi-worker training challenges

4.2.3  Writing distributed training (data parallelism) code for different training frameworks

4.2.4  Engineering effort in data parallel distributed training

4.3  A sample service that supports data parallel distributed training

4.3.1  Service overview

4.3.2  Playing with the service

4.3.3  Launching training jobs

4.3.4  Updating and fetching job status

4.3.5  Converting the training code to run in a distributed manner

4.3.6  Improvements

4.4  Training large models that can’t fit on one GPU

4.4.1  Traditional methods: memory saving

4.4.2  Pipeline model parallelism

4.4.3  How can software engineers support pipeline parallelism?

4.5  Summary