6 Optimizing our transformer kernels

This chapter covers

  • Profiling transformer kernels with Nsight Compute to pinpoint real bottlenecks.
  • Upgrading GEMV, Softmax, LayerNorm, top-K, and GEMM through a repeatable optimization playbook.
  • Balancing memory bandwidth, register pressure, and occupancy as first-class design constraints.
  • Deciding when shared memory, vectorization, and warp-level intrinsics actually pay off.
  • Validating every improvement with progressive benchmarks and correctness checks.
  • Packaging the resulting patterns so they transfer cleanly to production transformer stacks.

A naive CUDA kernel that computes the right answer is only the beginning. The gap between "correct" and "fast" can span two orders of magnitude, and closing that gap requires understanding how GPUs actually move and process data. This chapter takes five kernels central to transformer inference (GEMV, Softmax, LayerNorm, top-K, and GEMM) and optimizes each one from first principles. Rather than presenting a checklist of tricks, we will profile each kernel, identify its bottleneck, apply a targeted fix, and measure the result. The patterns that emerge transfer directly to any memory-bound or compute-bound workload you encounter in production.
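
To ground "correct but slow," the sketch below shows the kind of naive GEMV kernel section 6.2 starts from. It is an illustrative reconstruction, not the chapter's own listing: the name gemv_naive, the one-thread-per-output-row launch, and the row-major layout of A are all assumptions made for the example. Each thread walks an entire row of A, so consecutive threads in a warp read addresses a full row apart and their loads cannot coalesce.

// Naive GEMV: y = A * x, with A row-major and sized M x N.
// One thread computes one element of y (illustrative sketch only).
__global__ void gemv_naive(const float *A, const float *x, float *y,
                           int M, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M) return;

    float acc = 0.0f;
    for (int col = 0; col < N; ++col) {
        // Consecutive threads in a warp handle consecutive rows, so for a
        // fixed col their addresses are N floats apart: each warp-wide
        // load can touch up to 32 distinct cache lines instead of one.
        acc += A[row * N + col] * x[col];
    }
    y[row] = acc;
}

Profiled with Nsight Compute, a kernel like this typically reports memory throughput far below the hardware peak; sections 6.2.1 through 6.2.5 walk through recovering that bandwidth step by step.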

6.1 Preparing to optimize

6.1.1 Profiling CUDA kernels

6.2 GEMV

6.2.1 Understanding the memory access problem

6.2.2 Achieving coalesced access

6.2.3 Profiling the coalescing improvement

6.2.4 Scaling up with block-level reduction

6.2.5 Maximizing bandwidth with vectorization

6.2.6 Final GEMV performance analysis

6.2.7 GEMV optimization summary

6.3 Softmax

6.3.1 Online softmax: a better sequential algorithm

6.3.2 Parallel softmax with warp shuffles

6.3.3 Fused softmax with shared memory

6.3.4 Vectorized softmax

6.3.5 Profiling softmax optimizations

6.3.6 Softmax optimization summary

6.4 Layer normalization

6.4.1 Fused LayerNorm with warp operations

6.4.2 Fused LayerNorm kernel

6.5 Top-K