10 GPU programming model

 

This chapter covers

  • Developing a general GPU programming model
  • Understanding how it maps to different vendors’ hardware
  • Learning what details of the programming model influence performance
  • Mapping the programming model to different GPU programming languages

In this chapter, we will develop an abstract model of how work is performed on GPUs. This programming model fits a variety of GPU devices from different vendors and across each vendor's product lines. It is also simpler than the real hardware, capturing just the essential aspects required to develop an application. Fortunately, GPUs have a lot of structural similarities; this is a natural result of the demands of high-performance graphics applications.

The choice of data structures and algorithms has a long-range impact on both the performance and the ease of programming for the GPU. With a good mental model of the GPU, you can plan how data structures and algorithms map onto its parallelism. On GPUs especially, our primary job as application developers is to expose as much parallelism as we can. With thousands of threads to harness, we need to fundamentally restructure the work so that there are many small tasks to distribute across the threads. A GPU language, like any other parallel programming language, must provide several components: a way to express the massive parallelism of the device, a way to decompose data into independent units of work, and a way to address the device's memory resources.

10.1 GPU programming abstractions: A common framework

10.1.1 Massive parallelism

10.1.2 Inability to coordinate among tasks

10.1.3 Terminology for GPU parallelism

10.1.4 Data decomposition into independent units of work: An NDRange or grid

10.1.5 Work groups provide a right-sized chunk of work

10.1.6 Subgroups, warps, or wavefronts execute in lockstep

10.1.7 Work item: The basic unit of operation

10.1.8 SIMD or vector hardware

10.2 The code structure for the GPU programming model

10.2.1 “Me” programming: The concept of a parallel kernel

10.2.2 Thread indices: Mapping the local tile to the global world

10.2.3 Index sets

10.2.4 How to address memory resources in your GPU programming model

10.3 Optimizing GPU resource usage

10.3.1 How many registers does my kernel use?

10.3.2 Occupancy: Making more work available for work group scheduling