5 Transformer inference in CUDA
This chapter covers
- Understanding the decoder-only transformer architecture that powers GPT-style models
- The fundamental split between prefill (processing the prompt) and decode (generating tokens)
- Building naive CUDA kernels for inference: GEMV, softmax, layer normalization, and attention
- KV caching to avoid redundant computation during autoregressive generation
- Mixture-of-Experts routing with custom TopK kernels
- The CUDA-Python binding pipeline that connects our kernels to PyTorch
Chapter 4 gave you the complete training story: forward passes, backpropagation, gradient updates. Those principles apply to any architecture. But inference is a different beast. When generating text token by token, the computational profile shifts dramatically. Batch sizes shrink to one. Matrix multiplications become matrix-vector products. Memory bandwidth, not compute throughput, becomes the bottleneck. This chapter tackles those challenges head-on.
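You can see the shift with a back-of-the-envelope arithmetic-intensity calculation. The sketch below is illustrative (the function name and fp16 assumption are ours, not from the chapter): for a single d×d weight matrix, prefill over many tokens does hundreds of floating-point operations per byte loaded, while decoding one token does roughly one.

```python
# Illustrative sketch: arithmetic intensity (FLOPs per byte) of applying one
# d x d weight matrix, assuming fp16 storage (2 bytes per element).
def arithmetic_intensity(num_tokens, d, bytes_per_elem=2):
    flops = 2 * num_tokens * d * d  # one multiply-add per weight per token
    # Bytes moved: the full weight matrix, plus input and output activations.
    bytes_moved = bytes_per_elem * (d * d + 2 * num_tokens * d)
    return flops / bytes_moved

d = 4096
print(arithmetic_intensity(512, d))  # prefill, 512-token prompt: compute-bound
print(arithmetic_intensity(1, d))    # decode, 1 token: memory-bandwidth-bound
```

With a 512-token prompt the ratio lands around 400 FLOPs per byte, comfortably compute-bound on modern GPUs; with a single decode token it falls to about 1, so the kernel spends its time streaming weights from memory.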
5.1 Getting Started: Transformer Architecture and Setup
Before writing inference kernels, we need to understand what we’re optimizing. This section establishes the transformer architecture, the terminology we’ll use throughout, and the PyTorch baseline that our CUDA kernels must match numerically.
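"Match numerically" in practice means comparing the custom kernel's output against the reference within floating-point tolerances rather than bit-for-bit. The sketch below shows the shape of such a check; it uses a NumPy layer-norm reference so the example is self-contained (the book's actual baseline is PyTorch, and the names here are illustrative, not from the chapter).

```python
import numpy as np

# Reference layer normalization, standing in for the PyTorch baseline.
def layer_norm_ref(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768)).astype(np.float32)
gamma = np.ones(768, dtype=np.float32)
beta = np.zeros(768, dtype=np.float32)

expected = layer_norm_ref(x, gamma, beta)
actual = layer_norm_ref(x, gamma, beta)  # in practice: the CUDA kernel's output

# Elementwise tolerance check; exact bit equality is too strict for fp32 kernels.
assert np.allclose(actual, expected, rtol=1e-4, atol=1e-5)
```

The tolerances here (`rtol=1e-4`, `atol=1e-5`) are typical starting points for fp32 comparisons; looser bounds are usually needed once kernels run in fp16 or bf16.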