16 Training models on multiple GPUs
This chapter covers
- Distributed training concepts
- PyTorch's distributed package (torch.distributed)
- Different forms of parallelism
In previous chapters, we mostly focused on training models on a single GPU. But as models grow larger and datasets grow bigger, training on a single GPU becomes infeasible. Model sizes have exploded over the last several years with the popularity of large language models. In case the naming wasn't clear, large language models are, in fact, quite large. For example, Meta's open LLaMA 3.1 model has 8-billion-, 70-billion-, and 405-billion-parameter variants. The 405-billion-parameter variant requires about 800 GB of memory just to run inference (without optimizations such as quantization).
Note Readers should check a model's official license before using it, as permitted uses vary between models.
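The 800 GB figure follows from a back-of-the-envelope calculation: each parameter stored as a 16-bit float takes 2 bytes. The sketch below (the function name and interface are our own, for illustration) computes this lower bound, ignoring activations and framework overhead.

```python
def inference_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough lower bound on memory needed to hold a model's weights.

    Assumes 16-bit (2-byte) parameters by default; real inference also
    needs memory for activations, the KV cache, and framework overhead.
    """
    return num_params * bytes_per_param / 1e9

print(inference_memory_gb(405e9))  # 810.0 -- about 800 GB for the weights alone
```

Quantizing to 8-bit or 4-bit parameters shrinks this proportionally, which is why quantization is a common optimization for serving large models.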
To address this, we will explore how to use PyTorch’s distributed subpackage for training models across multiple GPUs, covering distributed training concepts and various forms of parallelism.
16.1 Introduction to parallel programming
Parallel programming is the process of breaking down a problem into smaller tasks that can be executed simultaneously. This is done to improve performance and efficiency.
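As a concrete illustration of this idea, the sketch below uses Python's standard-library `multiprocessing.Pool` (not PyTorch) to split one task, squaring a list of numbers, across several worker processes; the same divide-and-conquer pattern underlies the distributed training techniques in this chapter.

```python
from multiprocessing import Pool

def square(x: int) -> int:
    # The small unit of work each worker executes independently.
    return x * x

if __name__ == "__main__":
    # Break the problem (squaring eight numbers) into tasks and
    # run them simultaneously on four worker processes.
    with Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Each worker handles a slice of the input independently, and `map` reassembles the results in order, so the output matches a sequential loop while the work happens in parallel.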