5 Transformer inference in CUDA
This chapter covers
- Understanding the decoder-only transformer architecture that powers GPT-style models
- The fundamental split between prefill (processing the prompt) and decode (generating tokens)
- Building naive CUDA kernels for inference: GEMV, softmax, layer normalization, and attention
- KV caching to avoid redundant computation during autoregressive generation
- Mixture-of-Experts routing with custom TopK kernels
- The CUDA-Python binding pipeline that connects our kernels to PyTorch
Chapter 4 gave you the complete training story: forward passes, backpropagation, gradient updates. Those principles apply to any architecture. But inference is a different beast. When generating text token by token, the computational profile shifts dramatically. Batch sizes shrink to one. Matrix multiplications become matrix-vector products. Memory bandwidth, not compute throughput, becomes the bottleneck. This chapter tackles those challenges head-on.
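You can see the shift with a back-of-the-envelope arithmetic-intensity calculation. The sketch below is illustrative (the function name and fp16 assumption are ours, not from the chapter): for a single d×d weight matrix, prefill over many tokens does hundreds of floating-point operations per byte loaded, while decoding one token does roughly one.

```python
# Illustrative sketch: arithmetic intensity (FLOPs per byte) of applying one
# d x d weight matrix, assuming fp16 storage (2 bytes per element).
def arithmetic_intensity(num_tokens, d, bytes_per_elem=2):
    flops = 2 * num_tokens * d * d  # one multiply-add per weight per token
    # Bytes moved: the full weight matrix, plus input and output activations.
    bytes_moved = bytes_per_elem * (d * d + 2 * num_tokens * d)
    return flops / bytes_moved

d = 4096
print(arithmetic_intensity(512, d))  # prefill, 512-token prompt: compute-bound
print(arithmetic_intensity(1, d))    # decode, 1 token: memory-bandwidth-bound
```

With a 512-token prompt the ratio lands around 400 FLOPs per byte, comfortably compute-bound on modern GPUs; with a single decode token it falls to about 1, so the kernel spends its time streaming weights from memory.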
5.1 Getting Started: Transformer Architecture and Setup
Before writing inference kernels, we need to understand what we’re optimizing. This section establishes the transformer architecture, the terminology we’ll use throughout, and the PyTorch baseline that our CUDA kernels must match numerically.
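"Match numerically" in practice means comparing the custom kernel's output against the reference within floating-point tolerances rather than bit-for-bit. The sketch below shows the shape of such a check; it uses a NumPy layer-norm reference so the example is self-contained (the book's actual baseline is PyTorch, and the names here are illustrative, not from the chapter).

```python
import numpy as np

# Reference layer normalization, standing in for the PyTorch baseline.
def layer_norm_ref(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768)).astype(np.float32)
gamma = np.ones(768, dtype=np.float32)
beta = np.zeros(768, dtype=np.float32)

expected = layer_norm_ref(x, gamma, beta)
actual = layer_norm_ref(x, gamma, beta)  # in practice: the CUDA kernel's output

# Elementwise tolerance check; exact bit equality is too strict for fp32 kernels.
assert np.allclose(actual, expected, rtol=1e-4, atol=1e-5)
```

The tolerances here (`rtol=1e-4`, `atol=1e-5`) are typical starting points for fp32 comparisons; looser bounds are usually needed once kernels run in fp16 or bf16.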