16 Training models on multiple GPUs
This chapter covers
- Distributed training concepts
- PyTorch’s distributed package (torch.distributed)
- Different forms of parallelism
In previous chapters we have mostly focused on training models on a single GPU. However, as models grow larger and datasets grow bigger, training on a single GPU becomes infeasible. Model sizes have exploded over the last several years with the popularity of large language models. In case the naming wasn’t clear, large language models are, in fact, quite large. For example, Meta’s open source[1] Llama 3.1 model comes in 8 billion, 70 billion, and 405 billion parameter variants. The 405 billion parameter variant requires about 800 GB of memory just to run inference [2] (at 16-bit precision, 405 billion parameters alone occupy roughly 405 × 10⁹ × 2 bytes ≈ 810 GB, before counting activations)!
To address this, we will explore how to leverage PyTorch’s distributed subpackage for training models across multiple GPUs, covering distributed training concepts and various forms of parallelism.
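To give a flavor of what is coming, here is a minimal sketch of a torch.distributed program. It assumes the script is launched with torchrun (which sets the RANK, WORLD_SIZE, and MASTER_ADDR environment variables for us) on a machine with at least one GPU per process; the specific backend and launch details are discussed later in the chapter.

```python
import torch
import torch.distributed as dist

def main():
    # Join the process group; "nccl" is the usual backend for GPUs,
    # while "gloo" can be used for CPU-only experimentation.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    # Each process contributes one value; all_reduce sums the values
    # across every process, so all of them end up with the same result.
    t = torch.tensor([float(rank)], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size}: sum of ranks = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Every process runs the same script; they differ only in their rank, and they coordinate through collective operations such as all_reduce. This pattern underlies all of the forms of parallelism covered in this chapter.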
16.1 Introduction to parallel programming
Parallel programming is the practice of breaking a problem into smaller tasks that can be executed simultaneously, typically to reduce overall run time and make better use of the available hardware.
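As a toy illustration of the idea (nothing GPU-specific yet), the sketch below splits a large sum into chunks and computes the chunks simultaneously in a process pool; the choice of four workers is arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Each chunk is summed in its own process; the partial results are
    # then combined in the main process. This "split, compute, combine"
    # pattern is the same one distributed training relies on.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(partial_sum, chunks)

    print(sum(partials))
```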