7 Tensor cores
This chapter covers
- Tensor core architecture and when to use tensor cores versus CUDA cores
- WMMA fragments, tiling, and the high-level tensor core API (Volta and newer)
- WGMMA asynchronous operations with inline PTX on Hopper
- TMA producer-consumer pipelines and circular buffering
- Progressive optimization from 71 TFLOPS (WMMA) to 618 TFLOPS (WGMMA)
- Mapping manual CUDA core patterns to their tensor core equivalents
Every major deep learning workload spends the majority of its GPU time in matrix multiplications. Large language model training, diffusion model inference, protein structure prediction, real-time speech recognition: each of these reduces, at its core, to dense matrix multiply-accumulate. Tensor cores exist to accelerate exactly this bottleneck.
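To make that bottleneck concrete, the sketch below shows the kind of kernel this chapter builds up from: a single warp using the WMMA API to multiply one 16×16×16 half-precision tile into a float accumulator. The tile size, layouts, and kernel name are illustrative choices for this sketch, not the chapter's final, optimized kernel.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C for a single 16x16x16 tile.
// A and B hold FP16 inputs; C accumulates in FP32 (requires sm_70 or newer).
__global__ void wmma_tile_16x16x16(const half *a, const half *b, float *c) {
    // Declare the per-warp fragments for the A, B, and accumulator tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Zero the accumulator, load the input tiles (leading dimension 16),
    // issue the tensor core multiply-accumulate, and write the result back.
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

Every thread in the warp must participate in each `_sync` call; the later sections show how to tile many such fragments across a block and, on Hopper, how to replace this warp-level path with asynchronous WGMMA and TMA pipelines.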