
1 When PyTorch just isn’t enough

 

This chapter covers

  • The CUDA programming model and parallel computing fundamentals
  • Why deep learning operations are suited for GPU parallelization
  • When to use CUDA versus PyTorch
  • The optimization stack, from naive kernels to production performance

Large language models (LLMs) consume staggering compute budgets. The difference between a model that trains in days rather than weeks, or serves inference in milliseconds rather than seconds, often comes down to how effectively you use the GPU hardware. CUDA is the programming layer where that optimization happens: the bridge between high-level frameworks like PyTorch and the physical silicon that executes your computations.

Raw CUDA programming remains the most direct path to squeezing every last ounce of performance from modern hardware.[1] This book arms you with the ability to recognize when you need a custom CUDA kernel instead of an existing library and, more importantly, to create entirely new algorithms that push the boundaries of what’s computationally possible. By mastering CUDA at this level, you become one of the rare engineers capable of removing the performance bottlenecks that limit the next generation of AI systems.
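To make that concrete, here is a minimal sketch of what "raw CUDA" looks like. It is not taken from the book's code; the kernel name vector_add, the array sizes, and the launch configuration are illustrative choices. The program hand-writes the element-wise addition that PyTorch expresses as c = a + b, including the device memory management and kernel launch that the framework normally handles for you.

// A minimal sketch: the element-wise add that PyTorch writes as c = a + b,
// expressed as a hand-written CUDA kernel plus its launch and memory traffic.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Each thread computes exactly one output element.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard against overrun
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;                 // one million elements (arbitrary)
    size_t bytes = n * sizeof(float);

    // Allocate and fill host (CPU) memory.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device (GPU) memory and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and spot-check it.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);         // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Compiled with nvcc and run on any CUDA-capable GPU, this prints c[0] = 3.0. Every piece glossed over here (host versus device memory, kernels, thread indexing) is covered properly in section 1.3.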

1.1 What Is CUDA?

1.2 When Do You Need Custom CUDA?

1.3 CUDA Basics

1.3.1 Host and Device: Two Worlds Working Together

1.3.2 Kernels: Functions That Run on Thousands of Threads

1.4 Recognizing Parallel Opportunities

1.4.1 The Memory Hierarchy: Your Performance Bottleneck

1.5 Deep Learning Through the CUDA Lens

1.5.1 The Nature of Deep Learning Computation

1.5.2 Why GPUs Excel at Deep Learning

1.5.3 Transformers and Modern AI: The Perfect Match

1.5.4 When We Don’t Need Parallel Computing

1.6 The Optimization Stack

1.6.1 Starting Point: Naive Kernels

1.6.2 Optimization Layers

1.6.3 The Compounding Effect

1.6.4 Scaling to Multiple GPUs

1.6.5 Why This Matters

1.7 Getting Ready to Build

1.7.1 Our Roadmap

1.8 Summary