chapter ten
10 Distributed computing
This chapter covers
- Mapping multi-GPU interconnects and understanding NCCL collectives
- Implementing tensor parallelism across 8–16 GPUs
- Building pipeline parallel schedules that keep every stage busy
- Coordinating tensor and pipeline parallelism across nodes
- Evaluating scaling behavior and communication trade-offs
- Choosing the right parallelism mix for real workloads
Before we write a single line of distributed code, we have to understand the ground truth of our hardware. When you have multiple GPUs in a single machine, the links between them aren’t all created equal. How any two GPUs talk to each other is dictated by physical wires on the motherboard, and understanding this physical reality is the first step to writing fast code.
The two main ways GPUs communicate inside a server are over the PCIe bus and NVLink.
- PCIe (Peripheral Component Interconnect Express): This is the standard bus that connects everything on a motherboard—your GPU, network cards, storage drives—to the CPU. It’s a general-purpose interconnect. Think of it like the main city roads; they get you anywhere, but they’re shared and can have traffic jams.
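- NVLink: This is a proprietary, point-to-point interconnect developed by NVIDIA specifically for connecting GPUs to each other and, on some platforms, to the CPU. Think of it as a private, multi-lane superhighway built exclusively for GPU traffic. It offers significantly higher bandwidth and lower latency than PCIe: an A100, for example, exposes up to 600 GB/s of total NVLink bandwidth, versus roughly 64 GB/s bidirectional for a PCIe 4.0 x16 slot.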
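To see which of these two paths connects a given pair of GPUs, we can read the matrix that `nvidia-smi topo -m` prints. The following is a minimal sketch, not a definitive tool: `SAMPLE_TOPO` is an illustrative matrix for a hypothetical 4-GPU machine, and `parse_topo` and `is_nvlink` are helper names made up for this example. It assumes the plain layout shown here; real output adds affinity columns and a legend, and may need extra cleanup on some driver versions.

```python
# Illustrative matrix in the style of `nvidia-smi topo -m` for a hypothetical
# 4-GPU machine. This is sample data, not output captured from real hardware.
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3
GPU0     X      NV2     SYS     SYS
GPU1    NV2      X      SYS     SYS
GPU2    SYS     SYS      X      NV2
GPU3    SYS     SYS     NV2      X
"""


def parse_topo(text):
    """Return {(i, j): link_label} for every GPU pair with i < j."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    # Positions of the GPU columns within the header row.
    gpu_cols = [k for k, name in enumerate(header) if name.startswith("GPU")]
    links = {}
    for ln in lines[1:]:
        cells = ln.split()
        if not cells[0].startswith("GPU"):
            continue  # skip legend lines and non-GPU rows such as NICs
        i = int(cells[0][3:])
        for k in gpu_cols:
            j = int(header[k][3:])
            if i < j:
                # Data rows carry a leading row label, so shift by one.
                links[(i, j)] = cells[k + 1]
    return links


def is_nvlink(label):
    """Labels such as NV1 or NV2 mean the path uses bonded NVLinks."""
    return label.startswith("NV")


links = parse_topo(SAMPLE_TOPO)
for (i, j), label in sorted(links.items()):
    kind = "NVLink" if is_nvlink(label) else "PCIe path"
    print(f"GPU{i} <-> GPU{j}: {label} ({kind})")
```

In this sample, GPU0 and GPU1 share an NVLink connection, as do GPU2 and GPU3, while every other pair is labeled `SYS`, meaning the traffic crosses PCIe and the interconnect between CPU sockets. On a real machine, we would pass the captured output of `nvidia-smi topo -m` to `parse_topo` instead of the sample string.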