chapter ten

10 Distributed computing

This chapter covers

Mapping multi-GPU interconnects and understanding NCCL collectives
Implementing tensor parallelism across 8–16 GPUs
Building pipeline parallel schedules that keep every stage busy
Coordinating tensor and pipeline parallelism across nodes
Evaluating scaling behavior and communication trade-offs
Choosing the right parallelism mix for real workloads

Before we write a single line of distributed code, we have to understand the ground truth of our hardware. When a single machine holds multiple GPUs, the links between them aren’t all created equal. How any two GPUs talk to each other is dictated by the physical wiring inside the server, and understanding that layout is the first step to writing fast code.

The two main ways GPUs communicate inside a server are over the PCIe bus and NVLink.

PCIe (Peripheral Component Interconnect Express): This is the standard bus that connects everything on a motherboard—your GPU, network cards, storage drives—to the CPU. It’s a general-purpose interconnect. Think of it like the main city roads; they get you anywhere, but they’re shared and can have traffic jams.
NVLink: This is a proprietary, point-to-point interconnect developed by NVIDIA specifically for connecting GPUs to each other and, on some platforms, to the CPU. Think of it as a private, multi-lane superhighway built exclusively for GPU traffic. It offers several times the bandwidth of a PCIe x16 link, along with lower latency.
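To put rough numbers on the road-versus-superhighway picture, here is a minimal back-of-the-envelope sketch. The bandwidth figures are approximate nominal per-direction peaks that we assume for illustration; the link names and the `transfer_time_ms` helper are hypothetical, and measured throughput on real hardware will be lower.

```python
# Approximate nominal peak bandwidth per direction, in GB/s.
# Assumed figures for illustration only; real transfers achieve less.
LINK_BANDWIDTH_GB_PER_S = {
    "pcie_gen4_x16": 32.0,   # ~32 GB/s each way
    "pcie_gen5_x16": 64.0,   # ~64 GB/s each way
    "nvlink_a100": 300.0,    # 600 GB/s aggregate bidirectional per GPU
    "nvlink_h100": 450.0,    # 900 GB/s aggregate bidirectional per GPU
}


def transfer_time_ms(num_bytes: int, link: str) -> float:
    """Ideal one-way transfer time in milliseconds, ignoring latency and protocol overhead."""
    bytes_per_second = LINK_BANDWIDTH_GB_PER_S[link] * 1e9
    return num_bytes / bytes_per_second * 1e3


if __name__ == "__main__":
    payload = 10**9  # a 1 GB tensor, e.g. 500 million fp16 values
    for link in LINK_BANDWIDTH_GB_PER_S:
        print(f"{link:>14}: {transfer_time_ms(payload, link):7.2f} ms")
```

Under these assumptions, moving 1 GB takes about 31 ms over PCIe Gen4 x16 but only about 3 ms over A100-class NVLink. That order-of-magnitude gap is why the path between two GPUs matters before any distributed code is written.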

10.1 Understanding the hardware

10.1.1 Lab 1: mapping your hardware

10.1.2 Inter-node communication: InfiniBand

10.1.3 Understanding NCCL’s primitives

10.1.4 Ring versus tree algorithms

10.2 Tensor parallelism

10.2.1 Initialization and GPU assignment

10.2.2 Memory allocation

10.2.3 Local matrix multiplication

10.2.4 Combining the results with AllReduce

10.3 Pipeline parallelism

10.3.1 Naive pipeline parallelism implementation: sequential processing

10.3.2 Optimized implementation: CUDA streams

10.3.3 Why streams work

10.4 Scaling to multiple nodes

10.4.1 Tensor parallelism: 16 GPUs

10.5 Evaluating scaling strategies