
2 Your first CUDA program


This chapter covers

  • Deconstructing your first CUDA kernel, vectorAdd, line by line.
  • Understanding the core CUDA programming model: Host vs. Device and data transfers.
  • Mastering the GPU’s threading hierarchy: Grids, Blocks, and Threads.
  • Introducing essential C concepts like pointers "just-in-time" as they appear in the code.
  • Compiling and running your program with the nvcc compiler.

This chapter is all about demystifying your first piece of CUDA code. We are going to take the vectorAdd kernel, put it under a microscope, and dissect it piece by piece. We won’t just look at the kernel itself; we’ll look at the entire program required to launch it: the "host" code that runs on your CPU. By the end of this chapter, you will have written, compiled, and run your very first complete CUDA program. You will understand not just what the code does, but why it’s structured the way it is. This is our first real step from being a user of .to("cuda") to being a builder.
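As a preview of where we are headed, here is a minimal sketch of the kind of kernel this chapter dissects (the chapter's actual listing may differ slightly in names and types):

```cuda
// vectorAdd: each GPU thread adds one pair of elements.
// __global__ marks a function that runs on the device (GPU)
// but is launched from the host (CPU).
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    // Compute this thread's global index from its block and thread IDs.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {          // boundary check: the grid may overshoot n
        c[i] = a[i] + b[i];
    }
}
```

Every piece of this — the qualifier, the pointers, the index arithmetic, the guard — gets unpacked in the sections that follow.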

2.1 The "Why": Throughput vs. Latency

We’re going to peek into some code in a second, but let’s first tackle the most fundamental question: why even bother with a GPU? Why not just use a faster CPU? The answer comes down to two crucial concepts: throughput and latency.

2.2 Your First Run
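The compile-and-run cycle itself is short. Assuming the source file is saved as vecadd.cu, it typically looks like this (flags and options are discussed later):

```
# Compile with NVIDIA's nvcc compiler, then run the resulting binary.
nvcc vecadd.cu -o vecadd
./vecadd
```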

2.3 Anatomy of a CUDA Program: The vecadd.cu File
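A hedged sketch of the host-side skeleton may help orient you before the line-by-line walkthrough; it assumes the vectorAdd kernel from this chapter is in scope, and omits error checking for brevity:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    int n = 1 << 20;                       // illustrative problem size
    size_t bytes = n * sizeof(float);

    // 1. Allocate and fill host (CPU) memory.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // 2. Allocate device (GPU) memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // 3. Copy the inputs host -> device.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // 4. Launch: enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // 5. Copy the result device -> host, then clean up.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

The allocate / copy in / launch / copy out / free rhythm is the backbone of nearly every CUDA program, and each step is examined in turn below.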

2.4 Where the "Parallelism" Comes In
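For contrast, here is the serial baseline the kernel replaces — a sketch of a plain CPU loop where one thread walks the whole array. On the GPU, the loop disappears and each iteration becomes the work of one thread:

```c
// Serial CPU baseline: one loop iteration per element.
// In the CUDA version, each iteration is handled by its own thread,
// whose index i comes from blockIdx, blockDim, and threadIdx.
void vector_add_cpu(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```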

2.5 Scaling Up: Grids, Blocks, and Boundary Checks
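The boundary check exists because the block count is computed by rounding up. A quick sketch of that arithmetic — a ceiling division — shows why some threads can land past the end of the array:

```c
// Ceiling division: the smallest number of blocks of size `threads`
// that covers n elements. When n is not a multiple of `threads`,
// the last block has spare threads whose global index is >= n,
// which is exactly why kernels guard with `if (i < n)`.
int blocks_needed(int n, int threads) {
    return (n + threads - 1) / threads;
}
```

For example, 1000 elements with 256-thread blocks needs 4 blocks (1024 threads), leaving 24 idle threads that the guard filters out.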

2.6 Expanding to Three-Dimensional Arrays

2.6.1 3D Tensor Addition: CPU Implementation
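As a sketch of what a serial 3D addition looks like (dimension names here are illustrative), three nested loops visit every element of tensors stored flat in row-major order:

```c
// Serial 3D tensor addition: c[z][y][x] = a[z][y][x] + b[z][y][x].
// The tensors are stored flat in row-major order, so element (z, y, x)
// lives at offset z * ny * nx + y * nx + x.
void tensor_add_3d_cpu(const float *a, const float *b, float *c,
                       int nz, int ny, int nx) {
    for (int z = 0; z < nz; z++)
        for (int y = 0; y < ny; y++)
            for (int x = 0; x < nx; x++) {
                int idx = z * ny * nx + y * nx + x;
                c[idx] = a[idx] + b[idx];
            }
}
```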

2.6.2 The CUDA Kernel for 3D Tensor Addition
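A plausible shape for such a kernel (the chapter's version may differ in naming): each thread recovers a 3D coordinate from its block and thread IDs, guards all three bounds, and then flattens to a 1D offset:

```cuda
__global__ void tensorAdd3D(const float *a, const float *b, float *c,
                            int nz, int ny, int nx) {
    // One thread per element: recover (x, y, z) from the 3D grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;

    // Guard every dimension: the grid may overshoot in x, y, and z.
    if (x < nx && y < ny && z < nz) {
        int idx = z * ny * nx + y * nx + x;   // row-major flattening
        c[idx] = a[idx] + b[idx];
    }
}
```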

2.6.3 3D Kernel Launch Configuration
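Launching a 3D kernel uses dim3 for both the block and grid shapes, with the same round-up division applied per dimension. A sketch, with an illustrative block shape:

```cuda
// 8 x 8 x 8 = 512 threads per block (an illustrative choice).
dim3 threads(8, 8, 8);
dim3 blocks((nx + threads.x - 1) / threads.x,
            (ny + threads.y - 1) / threads.y,
            (nz + threads.z - 1) / threads.z);
tensorAdd3D<<<blocks, threads>>>(d_a, d_b, d_c, nz, ny, nx);
```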

2.6.4 Real-World Example: Batch Image Processing

2.7 The Universal Pattern: From nD to 1D
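The pattern in question is the row-major flattening rule: an n-dimensional index collapses to a single offset by repeatedly multiplying out the trailing dimensions. A general sketch (function name is illustrative):

```c
// Flatten an n-dimensional row-major index to a 1D offset:
// for dims (d0, d1, d2, ...), the offset of (i0, i1, i2, ...) is
// ((i0 * d1 + i1) * d2 + i2) * ... — Horner's rule over the dimensions.
int flatten(const int *idx, const int *dims, int ndim) {
    int offset = 0;
    for (int d = 0; d < ndim; d++)
        offset = offset * dims[d] + idx[d];
    return offset;
}
```

The same rule, specialized to three dimensions, is the z * ny * nx + y * nx + x expression used throughout this chapter.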

2.8 Summary