9 GPU architectures and concepts

 

This chapter covers

  • Understanding the GPU hardware and connected components
  • Estimating the theoretical performance of your GPU
  • Measuring the performance of your GPU
  • Different ways applications can make effective use of a GPU

Why do we care about graphics processing units (GPUs) for high-performance computing? GPUs provide a massive source of parallel operations that can greatly exceed what is available on a conventional CPU architecture. To exploit their capabilities, it is essential that we understand GPU architectures. Though GPUs were originally designed for graphics processing, they are now also used for general-purpose parallel computing. This chapter provides an overview of the hardware on a GPU-accelerated platform.

What systems today are GPU accelerated? Virtually every computing system provides the powerful graphics capabilities expected by today’s users. These GPUs range from small components integrated into the main CPU to large peripheral cards that take up much of the space in a desktop case. HPC systems increasingly come equipped with multiple GPUs. Even personal computers used for simulation or gaming sometimes connect two GPUs for higher graphics performance. In this chapter, we present a conceptual model that identifies the key hardware components of a GPU-accelerated system. Figure 9.1 shows these components.

9.1 The CPU-GPU system as an accelerated computational platform

 

9.1.1 Integrated GPUs: An underused option on commodity-based systems

 
 
 

9.1.2 Dedicated GPUs: The workhorse option

 
 
 
 

9.2 The GPU and the thread engine

 

9.2.1 The compute unit is the streaming multiprocessor (or subslice)

 
 
 

9.2.2 Processing elements are the individual processors

 
 

9.2.3 Multiple data operations by each processing element

 
 
 
 

9.2.4 Calculating the peak theoretical flops for some leading GPUs
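Peak theoretical flops can be estimated as the product of the clock rate, the number of processing elements, and the flops each element can retire per cycle (two, when fused multiply-add is counted). A minimal sketch of this calculation, using approximate published specifications for an NVIDIA V100 purely as an example:

```python
# Peak theoretical flops = clock rate x processing elements x flops per cycle.
# Example numbers are approximate published FP32 specs for an NVIDIA V100:
# 80 streaming multiprocessors x 64 FP32 cores = 5,120 processing elements,
# ~1.53 GHz boost clock, and 2 flops per cycle from fused multiply-add (FMA).
def peak_gflops(clock_ghz, processing_elements, flops_per_cycle=2):
    """Theoretical peak performance in GFLOPs/s."""
    return clock_ghz * processing_elements * flops_per_cycle

print(peak_gflops(1.53, 80 * 64))  # ~15,667 GFLOPs/s, i.e., ~15.7 TFLOPs
```

Real applications rarely approach this number; it is an upper bound that assumes every processing element issues an FMA every cycle at the boost clock.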

 
 
 

9.3 Characteristics of GPU memory spaces

 
 

9.3.1 Calculating theoretical peak memory bandwidth
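Theoretical peak memory bandwidth follows the same pattern as peak flops: multiply the memory clock by the transfers per cycle and the bus width in bytes. A minimal sketch, using approximate HBM2 specifications for an NVIDIA V100 as an example:

```python
# Peak memory bandwidth = memory clock x transfers per cycle x bus width (bytes).
# Example numbers are approximate HBM2 specs for an NVIDIA V100:
# ~0.877 GHz memory clock, 2 transfers per cycle (double data rate),
# and a 4,096-bit-wide memory bus.
def peak_bandwidth_gbs(clock_ghz, transfers_per_cycle, bus_width_bits):
    """Theoretical peak memory bandwidth in GB/s."""
    return clock_ghz * transfers_per_cycle * (bus_width_bits / 8)

print(peak_bandwidth_gbs(0.877, 2, 4096))  # ~898 GB/s
```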

 
 

9.3.2 Measuring the GPU stream benchmark

 
 
 

9.3.3 Roofline performance model for GPUs
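The roofline model caps attainable performance at the lesser of the peak flops (the compute-bound roof) and the memory bandwidth times the kernel's arithmetic intensity in flops per byte (the bandwidth-bound slope). A minimal sketch, with illustrative, roughly V100-class numbers:

```python
# Roofline model: attainable performance is limited either by peak flops
# (compute bound) or by memory bandwidth x arithmetic intensity
# (bandwidth bound). The peak values below are illustrative, roughly V100-class.
def roofline_gflops(arithmetic_intensity, peak_gflops, peak_bw_gbs):
    """Attainable GFLOPs/s for a kernel with the given flops/byte ratio."""
    return min(peak_gflops, arithmetic_intensity * peak_bw_gbs)

# A stream-like kernel (low arithmetic intensity) is bandwidth bound...
print(roofline_gflops(0.25, 15700.0, 900.0))  # 225.0 GFLOPs/s
# ...while a dense matrix multiply (high intensity) hits the compute roof.
print(roofline_gflops(50.0, 15700.0, 900.0))  # 15700.0 GFLOPs/s
```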

 
 

9.3.4 Using the mixbench performance tool to choose the best GPU for a workload

 
 
 

9.4 The PCI bus: CPU to GPU data transfer overhead

 
 
 

9.4.1 Theoretical bandwidth of the PCI bus
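The theoretical bandwidth of a PCIe slot is the number of lanes times the per-lane transfer rate, discounted by the line-encoding overhead. A minimal sketch, assuming a PCIe Gen3 x16 slot (8 GT/s per lane with 128b/130b encoding):

```python
# PCIe theoretical bandwidth = lanes x transfer rate x encoding efficiency.
# PCIe Gen3 runs at 8 GT/s per lane with 128b/130b encoding, so 128/130 of
# the raw bits carry data; dividing by 8 converts gigabits to gigabytes.
def pcie_bandwidth_gbs(lanes, gtransfers_per_s, encoding_efficiency):
    """Theoretical one-direction PCIe bandwidth in GB/s."""
    return lanes * gtransfers_per_s * encoding_efficiency / 8

print(pcie_bandwidth_gbs(16, 8, 128 / 130))  # ~15.75 GB/s for a Gen3 x16 slot
```

Note that this is far below GPU memory bandwidth, which is why minimizing CPU-GPU transfers matters so much in practice.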

 
 
 

9.4.2 A benchmark application for PCI bandwidth

 

9.5 Multi-GPU platforms and MPI

 
 