6 Optimizing our transformer kernels
This chapter covers
- Profiling transformer kernels with Nsight Compute to pinpoint real bottlenecks.
- Upgrading GEMV, Softmax, LayerNorm, top-K, and GEMM through a repeatable optimization playbook.
- Balancing memory bandwidth, register pressure, and occupancy as first-class design constraints.
- Deciding when shared memory, vectorization, and warp-level intrinsics actually pay off.
- Validating every improvement with progressive benchmarks and checks.
- Packaging the resulting patterns so they transfer cleanly to production transformer stacks.
A naive CUDA kernel that computes the right answer is only the beginning. The gap between "correct" and "fast" can span two orders of magnitude, and closing it requires understanding how GPUs actually move and process data. This chapter takes five kernels central to transformer inference - GEMV, Softmax, LayerNorm, top-K, and GEMM - and optimizes each one from first principles. Rather than presenting a checklist of tricks, we will profile each kernel, identify its bottleneck, apply a targeted fix, and measure the result. The patterns that emerge transfer directly to any memory-bound or compute-bound workload you encounter in production.
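To make the starting point concrete, here is a sketch of the kind of naive kernel this chapter begins from: a one-thread-per-row GEMV computing y = A*x for a row-major matrix. The kernel name and launch configuration are illustrative, not the chapter's exact code. Note how each thread walks its own row: at any given instant the 32 threads of a warp are reading 32 different rows, so their loads from A are spread far apart in memory. That access pattern is precisely the kind of bottleneck the profiling in this chapter will expose and fix.

```cuda
#include <cuda_runtime.h>

// Naive GEMV: y = A * x, with A stored row-major as rows x cols.
// One thread computes one output element. (Illustrative sketch.)
__global__ void matvec_naive(const float *A, const float *x, float *y,
                             int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.0f;
    for (int col = 0; col < cols; ++col) {
        // Thread i reads A[i*cols + col]: neighboring threads in a warp
        // touch addresses `cols` floats apart, so these loads do not coalesce.
        acc += A[row * cols + col] * x[col];
    }
    y[row] = acc;
}

// Typical launch: one thread per row, 256 threads per block.
// matvec_naive<<<(rows + 255) / 256, 256>>>(d_A, d_x, d_y, rows, cols);
```

This version is correct and easy to reason about, which makes it a useful baseline: every optimization that follows can be benchmarked and validated against its output.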