2 Your first CUDA program
This chapter covers
- Deconstructing your first CUDA kernel, vectorAdd, line by line.
- Understanding the core CUDA programming model: Host vs. Device and data transfers.
- Mastering the GPU’s threading hierarchy: Grids, Blocks, and Threads.
- Introducing essential C concepts like pointers "just-in-time" as they appear in the code.
- Compiling and running your program with the nvcc compiler.
This chapter is all about demystifying the first piece of code. We are going to take a vectorAdd kernel, put it under a microscope, and dissect it piece by piece. We won’t just look at the kernel itself; we’ll look at the entire program required to launch it—the "host" code that runs on your CPU. By the end of this chapter, you will have written, compiled, and run your very first complete CUDA program. You will understand not just what the code does, but why it’s structured the way it is. This is our first real step from being a user of .to("cuda") to being a builder.
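To give you a sense of where we are headed, here is a minimal sketch of a complete vectorAdd program of the kind this chapter dissects. The names, sizes, and launch configuration here are illustrative choices, not necessarily the chapter's exact listing, and error checking is omitted for brevity:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Kernel: runs on the device (GPU). Each thread adds one pair of elements.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) {                                    // guard: don't run past the array
        c[i] = a[i] + b[i];
    }
}

int main(void) {
    const int n = 1 << 20;            // one million elements (illustrative size)
    const size_t bytes = n * sizeof(float);

    // Host (CPU) allocations and initialization
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device (GPU) allocations, then host-to-device copies
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

    // Copy the result back to the host and spot-check one element
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);     // 1.0 + 2.0 should give 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

A file like this would be compiled with something like `nvcc vector_add.cu -o vector_add`. Don't worry if most of it looks opaque right now; the rest of this chapter takes it apart piece by piece.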
2.1 The "Why": Throughput vs. Latency
We’re going to peek into some code in a second, but let’s first tackle the most fundamental question: why even bother with a GPU? Why not just use a faster CPU? The answer comes down to two crucial concepts: throughput and latency.