One obvious trend in deep learning research is to improve model performance with larger datasets and bigger models with increasingly complex architectures. But more data and bulkier models have consequences: they slow down both model training and model development. As is often the case in computing, performance is pitted against speed. For example, it can take several months to train a BERT (Bidirectional Encoder Representations from Transformers) natural language processing model on a single GPU.
To address the problem of ever-growing datasets and model parameter counts, researchers have created various distributed training strategies, and major training frameworks, such as TensorFlow and PyTorch, provide SDKs that implement them. With the help of these SDKs, data scientists can write training code that runs in parallel across multiple devices (CPUs or GPUs), as sketched below.
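As a minimal sketch of what such an SDK looks like in practice, the snippet below uses TensorFlow's tf.distribute.MirroredStrategy for single-machine, multi-GPU data parallelism; the toy MNIST model, batch size, and other hyperparameters are illustrative assumptions rather than details from the text.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU (falling back to
# CPU if none are found) and aggregates gradients across replicas each step.
strategy = tf.distribute.MirroredStrategy()
print(f"Number of replicas in sync: {strategy.num_replicas_in_sync}")

# The model and its variables must be created inside the strategy scope so
# that each replica receives a mirrored copy.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# The fit() call is unchanged; each global batch is sharded across replicas.
# (MNIST is used here purely as a stand-in dataset.)
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
model.fit(x_train, y_train, epochs=1, batch_size=256)
```

The key point of such SDKs is visible here: only model construction moves inside strategy.scope(), while the rest of the training code stays the same, because the strategy transparently handles replication and gradient aggregation across devices.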