9 GPU architectures and concepts

 

This chapter covers

  • The GPU hardware and its connected components
  • Estimating the theoretical performance of your GPU
  • Measuring the performance of your GPU
  • Different application use models for effectively using a GPU

Why do we care about GPUs for high-performance computing? GPUs provide a massive source of parallel operations that can greatly exceed what is available on a conventional CPU architecture. To exploit this capability, it is essential to understand GPU architectures. Although GPUs were originally designed for graphics processing, they are now widely used for general-purpose parallel computing. This chapter provides an overview of the hardware in a graphics processing unit (GPU)-accelerated platform. What systems today are GPU accelerated? Virtually every computing system provides the powerful graphics capabilities expected by today’s users. These GPUs range from small components integrated into the main CPU to large peripheral cards that occupy much of a desktop case. HPC systems increasingly come equipped with multiple GPUs, and even personal computers used for simulation or gaming sometimes connect two GPUs for higher graphics performance.

In this chapter, we present a conceptual model that identifies the key hardware components of a GPU-accelerated system. The model consists of the components shown in figure 9.1.

9.1      The CPU-GPU system as an accelerated computational platform

 
 
 
 

9.1.1   Integrated GPUs: an underused option on commodity-based systems

 

9.1.2   Dedicated GPUs: the workhorse option

 
 

9.2      The GPU and the thread engine

 

9.2.1   The compute unit is the streaming multiprocessor

 
 

9.2.2   Processing elements are the individual processors

 
 

9.2.3   Multiple data operations by each processing element

 

9.2.4   Calculating the peak theoretical flops for some leading GPUs
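As a preview of the calculation this section covers, the peak theoretical flops of a GPU is the product of its clock rate, the number of compute units, the processing elements per compute unit, and the flops each element can retire per cycle. A minimal sketch in Python, using illustrative specifications rather than any particular GPU:

```python
# Peak theoretical flops = clock rate x compute units
#                          x processing elements per unit x flops per cycle
# The specs below are illustrative placeholders, not any specific product.
clock_ghz = 1.5          # boost clock in GHz
compute_units = 80       # streaming multiprocessors (NVIDIA) or compute units (AMD)
pe_per_unit = 64         # processing elements per compute unit
flops_per_cycle = 2      # a fused multiply-add counts as two flops

peak_gflops = clock_ghz * compute_units * pe_per_unit * flops_per_cycle
print(f"Peak theoretical performance: {peak_gflops:.0f} GFlops/s")
```

For a real device, substitute the clock rate and unit counts from the vendor's specification sheet.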

 
 

9.3      Characteristics of GPU memory spaces

 
 
 

9.3.1   Calculating theoretical peak memory bandwidth
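The peak theoretical memory bandwidth follows the same multiplicative pattern: the memory clock times the bus width in bytes times the number of transfers per clock. A sketch with illustrative GDDR-style numbers (not a specific product):

```python
# Peak theoretical bandwidth (GB/s) =
#   memory clock (GHz) x bus width (bytes) x transfers per clock
# Illustrative values, not taken from any particular GPU.
memory_clock_ghz = 1.75      # memory clock in GHz
bus_width_bytes = 384 / 8    # a 384-bit bus is 48 bytes wide
transfers_per_clock = 2      # double data rate (DDR) memory

peak_bw_gbs = memory_clock_ghz * bus_width_bytes * transfers_per_clock
print(f"Peak theoretical bandwidth: {peak_bw_gbs:.0f} GB/s")
```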

 

9.3.2   Measuring the GPU stream benchmark
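Stream benchmarks measure achievable, rather than theoretical, bandwidth by timing simple vector kernels and dividing the bytes moved by the elapsed time. The host-side sketch below illustrates the idea with a stream-triad-style operation; an actual GPU measurement would run the equivalent kernel on the device:

```python
import time
import numpy as np

# Illustrative stream-triad-style bandwidth measurement (host memory only;
# a GPU stream benchmark applies the same accounting to a device kernel).
n = 20_000_000
a = np.empty(n)
b = np.full(n, 1.0)
c = np.full(n, 2.0)
scalar = 3.0

start = time.perf_counter()
np.multiply(c, scalar, out=a)   # a = scalar * c
np.add(a, b, out=a)             # a = b + scalar * c  (the triad)
elapsed = time.perf_counter() - start

# The triad touches three arrays of 8-byte doubles: two reads and one write
gbytes = 3 * n * 8 / 1.0e9
print(f"Triad bandwidth: {gbytes / elapsed:.1f} GB/s")
```

Comparing the measured number against the theoretical peak shows how much of the hardware's bandwidth a simple kernel can actually reach.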

 
 
 

9.3.3   Roofline performance model for GPUs
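The roofline model caps a kernel's attainable performance at the lesser of the peak compute rate and the memory bandwidth multiplied by the kernel's arithmetic intensity (flops per byte). A sketch using illustrative peak numbers:

```python
# Roofline model: attainable flops are limited either by peak compute
# or by memory bandwidth times arithmetic intensity (flops/byte).
# The peak values below are illustrative placeholders.
peak_gflops = 15360.0    # peak compute rate (GFlops/s)
peak_bw_gbs = 168.0      # peak memory bandwidth (GB/s)

def attainable_gflops(arithmetic_intensity):
    """Roofline: min(peak compute, bandwidth x intensity)."""
    return min(peak_gflops, peak_bw_gbs * arithmetic_intensity)

# The machine balance is the intensity where the two limits cross;
# kernels below it are bandwidth-bound, kernels above it compute-bound.
machine_balance = peak_gflops / peak_bw_gbs
print(f"Machine balance: {machine_balance:.1f} flops/byte")
print(f"At 1 flop/byte: {attainable_gflops(1.0):.0f} GFlops/s")
```

Most scientific kernels have low arithmetic intensity and land on the bandwidth-limited slope of the roofline, which is why memory bandwidth usually matters more than peak flops.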

 
 
 
 

9.3.4   Using the mixbench performance tool to choose the best GPU for a workload

 
 
 

9.4      The PCI bus: CPU to GPU data transfer overhead

 
 

9.4.1   Theoretical bandwidth of the PCI bus
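The theoretical bandwidth of a PCIe link is the per-lane transfer rate times the number of lanes, reduced by the line-encoding overhead. For example, a PCIe Gen3 x16 link runs at 8 GT/s per lane with 128b/130b encoding:

```python
# PCIe theoretical bandwidth =
#   transfer rate per lane x lanes x encoding efficiency / 8 bits per byte
transfer_rate_gts = 8.0          # gigatransfers/s per lane (PCIe Gen3)
lanes = 16                       # x16 slot
encoding_efficiency = 128 / 130  # 128b/130b line encoding

# Each transfer moves one bit per lane; divide by 8 to get bytes
peak_pcie_gbs = transfer_rate_gts * lanes * encoding_efficiency / 8
print(f"PCIe Gen3 x16 theoretical bandwidth: {peak_pcie_gbs:.2f} GB/s")
# about 15.75 GB/s in each direction
```

This is far below GPU memory bandwidth, which is why minimizing data transfer across the PCI bus is central to GPU performance.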

 
 
 
 