7 Tensor cores

This chapter covers

  • Tensor core architecture and when to use tensor cores versus CUDA cores
  • WMMA fragments, tiling, and the high-level tensor core API (Volta and newer)
  • WGMMA asynchronous operations with inline PTX on Hopper
  • TMA producer-consumer pipelines and circular buffering
  • Progressive optimization from 71 TFLOPS (WMMA) to 618 TFLOPS (WGMMA)
  • Mapping manual CUDA core patterns to their tensor core equivalents

Every major deep learning workload spends the majority of its GPU time inside matrix multiplications. Large language model training, diffusion model inference, protein structure prediction, real-time speech recognition: each of these reduces to dense matrix multiply-accumulate at its core. Tensor cores exist to accelerate exactly this bottleneck.
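
To make that concrete before the detailed sections begin, here is a minimal sketch of the primitive this chapter builds toward: one warp computing a single 16×16×16 half-precision tile through the WMMA API covered in section 7.2. The kernel name and the row-major/column-major layout choices are illustrative assumptions, not the chapter's own listings.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp cooperatively computes D = A * B + C on a single
// 16x16x16 tile. A and B hold half-precision inputs; the
// accumulator fragment holds float for better precision.
__global__ void wmma_tile_kernel(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // start with C = 0
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // tensor core MMA
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

Launched with a single warp, for example wmma_tile_kernel<<<1, 32>>>(dA, dB, dD), and compiled for sm_70 or newer, all 32 threads cooperate on one tile. That warp-level cooperation is the key difference from thread-level CUDA core code, and it is the pattern mapping that section 7.5 makes explicit.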

7.1 Tensor core fundamentals

7.1.1 What are tensor cores?

7.1.2 When to use tensor cores

7.1.3 A brief history of tensor core instructions

7.1.4 Tensor core performance across GPU generations

7.2 WMMA: the high-level tensor core API

7.2.1 WMMA fragments and tiling strategy

7.2.2 Loading data with WMMA

7.2.3 Matrix multiply with WMMA

7.2.4 Storing results

7.3 WGMMA: asynchronous warp group operations

7.3.1 Basic WGMMA: first asynchronous tensor cores

7.3.2 Understanding PTX inline assembly

7.3.3 WGMMA with larger tiles

7.3.4 WGMMA with asynchronous memory operations

7.3.5 WGMMA with maximum tiles

7.4 Performance analysis: CUDA cores vs. tensor cores

7.4.1 Performance progression analysis

7.4.2 Decision framework: when to use tensor cores

7.5 From CUDA cores to tensor cores: pattern mapping

7.5.1 Manual tiling becomes hardware tiling

7.5.2 Vectorized loads become TMA

7.5.3 Shared memory becomes implicit