16 Training models on multiple GPUs


This chapter covers

  • Distributed training concepts
  • PyTorch’s distributed package (torch.distributed)
  • Different forms of parallelism

In previous chapters we have mostly focused on training models on a single GPU. However, as models grow larger and datasets grow bigger, training on a single GPU becomes infeasible. Model sizes have exploded over the last several years with the popularity of large language models. In case the naming wasn’t clear, large language models are, in fact, quite large. For example, Meta’s open source[1] LLaMa 3.1 model comes in 8 billion, 70 billion, and 405 billion parameter variants. The 405 billion parameter variant requires about 800 GB of memory just to run inference [2]: at 2 bytes per parameter in 16-bit precision, the weights alone account for roughly that much.

To address this, we will explore how to leverage PyTorch’s distributed subpackage for training models across multiple GPUs, covering distributed training concepts and various forms of parallelism.
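To make that concrete before we dig in, the short sketch below shows the overall shape of a torch.distributed program: two processes are spawned, each joins the same process group, and a single all_reduce sums a tensor across them. The CPU-only gloo backend, the hard-coded address and port, and the world size of 2 are illustrative assumptions; the sections that follow cover each of these pieces in detail.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Every process joins the same process group. The gloo backend is
    # CPU-only, so this sketch runs even without multiple GPUs.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Each rank contributes its own tensor; all_reduce sums them in place,
    # so every rank ends up holding the same result.
    x = torch.ones(1) * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {x.item()}")  # prints 3.0 on both ranks (1 + 2)

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)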

16.1 Introduction to parallel programming

Parallel programming is the practice of breaking a problem into smaller tasks that can be executed simultaneously, so that the overall work finishes faster and makes fuller use of the available hardware.
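As a toy illustration of this idea (independent of PyTorch), the sketch below splits the sum of a large list into four smaller tasks and executes them simultaneously with Python’s multiprocessing.Pool; the four-way split and the partial_sum helper are arbitrary choices made only for this example.

from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker process handles one piece of the problem independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Break the problem into four smaller tasks...
    chunks = [data[i::4] for i in range(4)]
    # ...execute them simultaneously in separate processes...
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)
    # ...and combine the partial results into the final answer.
    print(sum(partials) == sum(data))  # True

The same divide, compute in parallel, and combine pattern underlies the distributed training techniques in the rest of this chapter, with tensors and gradients taking the place of list chunks.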

16.1.1 Distributed computing terminology

16.1.2 Hardware requirements

16.1.3 Initializing a distributed program

16.2 Collective communication

16.3 Introduction to parallelisms

16.4 Data parallelism

16.5 Model parallelism

16.5.1 Pipeline parallelism

16.6 Tensor parallelism

16.6.1 Deciding between pipeline and tensor parallelism

16.7 N-dimensional parallelism

16.8 Fully sharded data parallelism

16.9 Tying all parallelisms together

16.10 Conclusion

16.11 Exercises

16.12 Summary