
9 GPU architectures and concepts

 

This chapter covers

  • The GPU hardware and its connected components
  • Estimating the theoretical performance of your GPU
  • Measuring the performance of your GPU
  • Different application models for using a GPU effectively

Why do we care about GPUs for high-performance computing? GPUs provide a massive source of parallel operations that can greatly exceed what is available on a conventional CPU architecture. To exploit these capabilities, it is essential to understand GPU architectures. Though GPUs were originally designed for graphics processing, they are now also used for general-purpose parallel computing. This chapter provides an overview of the hardware on a graphics processing unit (GPU)-accelerated platform. What systems today are GPU accelerated? Virtually every computing system provides the powerful graphics capabilities expected by today’s users. These GPUs range from small components integrated into the main CPU to large peripheral cards that occupy much of the space in a desktop case. HPC systems increasingly come equipped with multiple GPUs, and even personal computers used for simulation or gaming sometimes connect two GPUs for higher graphics performance.

In this chapter, we present a conceptual model that identifies the key hardware components of a GPU-accelerated system. The model consists of the components shown in figure 9.1.

9.1   The CPU-GPU system as an accelerated computational platform

9.1.1   Integrated GPUs: an underused option on commodity-based systems

9.1.2   Dedicated GPUs: the workhorse option

9.2   The GPU and the thread engine

In this section, we explore each of the components in our model of a GPU. For each component, we discuss models for theoretical peak performance, and we show how to measure actual performance with micro-benchmark tools.

9.2.1   The compute unit is the streaming multiprocessor

A GPU compute device has multiple compute units. Compute unit (CU) is the term adopted by the community for the OpenCL standard; Nvidia calls these streaming multiprocessors (SMs), and Intel refers to them as subslices.
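One practical way to see the compute unit count on your own hardware is to ask the device. The following is a minimal sketch (not one of this chapter's listings), assuming an installed OpenCL runtime and at least one GPU device; error checking is omitted for brevity.

/* Minimal sketch: query the number of compute units on the first
   GPU device through the standard OpenCL API.
   Build (Linux): gcc query_cu.c -lOpenCL */
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
   cl_platform_id platform;
   cl_device_id   device;
   cl_uint        compute_units;

   /* Take the first platform and its first GPU device */
   clGetPlatformIDs(1, &platform, NULL);
   clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

   /* CL_DEVICE_MAX_COMPUTE_UNITS maps to SMs on Nvidia,
      CUs on AMD, and subslices on Intel */
   clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                   sizeof(compute_units), &compute_units, NULL);

   printf("Compute units: %u\n", compute_units);
   return 0;
}

On an Nvidia V100, for example, this should report 80, matching the SM count in the vendor specifications.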

 

9.2.2   Processing elements are the individual processors

9.2.3   Multiple data operations by each processing element

Within each processing element, it may be possible to perform an operation on more than one data item at a time. Depending on the details of the GPU microarchitecture and the GPU vendor, these are referred to as SIMT, SIMD, or vector operations. Similar functionality may also be provided by ganging processing elements together.
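OpenCL also exposes how wide these per-element data operations are. The sketch below, under the same assumptions as before (an OpenCL runtime, one GPU device, no error checking), queries the preferred and native vector widths for single-precision floats; a width of 1 indicates a scalar architecture in which parallelism comes from the thread engine rather than from vector units.

/* Minimal sketch: query per-processing-element vector widths
   for single-precision floats via the OpenCL API. */
#include <stdio.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
   cl_platform_id platform;
   cl_device_id   device;
   cl_uint        preferred, native;

   clGetPlatformIDs(1, &platform, NULL);
   clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

   /* Preferred and native SIMD/vector widths for float;
      a value of 1 means the device is scalar per work item */
   clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                   sizeof(preferred), &preferred, NULL);
   clGetDeviceInfo(device, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT,
                   sizeof(native), &native, NULL);

   printf("Vector width (float): preferred=%u native=%u\n",
          preferred, native);
   return 0;
}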

9.2.4   Calculating the peak theoretical flops for some leading GPUs

With an understanding of the GPU hardware, we can now calculate the peak theoretical flops for some recent GPUs: the Nvidia V100, the AMD Vega20, and the integrated Gen11 GPU on the Intel Ice Lake CPU. The specifications for these three GPUs are listed in table 9.3.
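The calculation itself is simple arithmetic: multiply the clock rate by the number of compute units, the processing elements per compute unit, and the flops each processing element completes per cycle (two, when a fused multiply-add counts as a multiply plus an add). As a worked example, the sketch below plugs in the commonly published FP32 specifications for the Nvidia V100 (80 SMs, 64 FP32 cores per SM, 1,530 MHz boost clock); substitute the values from table 9.3 or your vendor's data sheet for other GPUs.

/* Worked example: peak theoretical flops from published specs.
   The values below are the published FP32 numbers for the
   Nvidia V100; replace them for other GPUs. */
#include <stdio.h>

int main(void)
{
   double clock_ghz       = 1.530; /* boost clock in GHz */
   int    compute_units   = 80;    /* SMs on Nvidia */
   int    pes_per_cu      = 64;    /* FP32 processing elements per SM */
   int    flops_per_cycle = 2;     /* fused multiply-add = 2 flops */

   double gflops = clock_ghz * compute_units * pes_per_cu * flops_per_cycle;
   printf("Peak theoretical FP32 rate: %.1f GFLOPs (%.2f TFLOPs)\n",
          gflops, gflops / 1000.0);
   return 0;
}

The result, about 15.7 TFLOPs, matches Nvidia's published single-precision figure for the V100.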

9.3   Characteristics of GPU memory spaces

9.3.1   Calculating theoretical peak memory bandwidth

9.3.2   Measuring the GPU stream benchmark

9.3.3   Roofline performance model for GPUs

9.3.4   Using the mixbench performance tool to choose the best GPU for a workload

9.4   The PCI bus: CPU to GPU data transfer overhead