chapter thirteen

13 GPU profiling and tools

This chapter covers

Available profiling tools for the GPU
A sample workflow for these tools
How to use the output from the GPU profiling tools

In this chapter, we will cover the tools and the different workflows that you can use to accelerate your application development. We’ll show you how profiling tools for the GPU can be helpful. In addition, we’ll discuss how to deal with the challenges of using profiling tools when working on a remote HPC cluster. Because the profiling tools continue to change and improve, we’ll focus on the methodology rather than the details of any one tool. The main takeaway of this chapter will be understanding how to create a productive workflow when using the powerful GPU profiling tools.

13.1 An overview of profiling tools

Profiling tools allow for quicker optimization, improved hardware utilization, and a better understanding of application performance and hotspots. We’ll discuss how profiling tools expose bottlenecks and assist you in attaining better hardware usage. The following bulleted list highlights the commonly used tools in GPU profiling. We specifically show the NVIDIA tools for use with their GPUs because these tools have been around the longest. If you have a different vendor’s GPU on your system, substitute their tools in the workflow. Don’t forget about the standard Unix profiling tools such as gprof, which we’ll use later in section 13.4.2.
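Before settling on a workflow, it can help to check which of these tools are actually installed on the machine you are working on, especially on a remote cluster. The following is a minimal sketch, not part of the chapter's own examples. The list of executable names (nvidia-smi, nvprof, nvvp, nsys, ncu, and gprof) is an assumption about a typical NVIDIA and Unix installation, and the function name find_profiling_tools is hypothetical; adjust both for your system or for another vendor's tools.

```python
import shutil

# Assumed executable names for a typical NVIDIA + Unix setup; edit for your system.
CANDIDATE_TOOLS = ["nvidia-smi", "nvprof", "nvvp", "nsys", "ncu", "gprof"]


def find_profiling_tools(names=CANDIDATE_TOOLS):
    """Return a dict mapping each tool name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Print one line per tool so missing tools are easy to spot.
    for name, path in find_profiling_tools().items():
        print(f"{name:12s} {path or 'not found'}")
```

On an HPC cluster, the tools often only appear on the path after loading the appropriate environment module, so a "not found" result may simply mean the module has not been loaded yet.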

13.2 How to select a good workflow

13.3 Example problem: Shallow water simulation

13.4 A sample of a profiling workflow

13.4.1 Run the shallow water application

13.4.2 Profile the CPU code to develop a plan of action

13.4.3 Add OpenACC compute directives to begin the implementation step

13.4.4 Add data movement directives

13.4.5 Guided analysis can give you some suggested improvements

13.4.6 The NVIDIA Nsight suite of tools can be a powerful development aid

13.4.7 CodeXL for the AMD GPU ecosystem

13.5 Don’t get lost in the swamp: Focus on the important metrics

13.5.1 Occupancy: Is there enough work?

13.5.2 Issue efficiency: Are your warps on break too often?

13.5.3 Achieved bandwidth: It always comes down to bandwidth

13.6 Containers and virtual machines provide alternate workflows

13.6.1 Docker containers as a workaround