
12 GPU languages: Getting down to basics

 

This chapter covers

  • Understanding the current landscape of native GPU languages
  • Creating simple GPU programs in each language
  • Tackling more complex multi-kernel operations
  • Porting between various GPU languages

This chapter covers lower-level languages for GPUs. We call these native languages because they directly reflect features of the target GPU hardware. We cover two widely used languages, CUDA and OpenCL, as well as HIP, a newer variant for AMD GPUs. In contrast to the pragma-based implementations in chapter 11, these GPU languages rely less on the compiler. You should use these languages when you want more fine-tuned control over your program's performance. How are these languages different from those presented in chapter 11? Our distinction is that these languages have grown up from the characteristics of the GPU and CPU hardware, while OpenACC and OpenMP started with high-level abstractions and rely on a compiler to map those abstractions to different hardware.
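To make that contrast concrete, here is a minimal CUDA sketch (the kernel name, array sizes, and launch configuration are illustrative, not code from this chapter). The thread and block indices it uses map directly onto the GPU's hardware hierarchy, which is what we mean by a native language.

#include <stdio.h>

// Illustrative kernel: each hardware thread computes its own global
// index from its block and thread coordinates.
__global__ void scale(double *x, double factor, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) x[i] *= factor;
}

int main(void)
{
   int n = 1000;
   double *x;
   cudaMallocManaged(&x, n * sizeof(double));
   for (int i = 0; i < n; i++) x[i] = 1.0;

   // The launch configuration (blocks x threads per block) mirrors
   // the hardware's grouping of threads into thread blocks.
   scale<<<(n + 255) / 256, 256>>>(x, 2.0, n);
   cudaDeviceSynchronize();

   printf("x[0] = %g\n", x[0]);
   cudaFree(x);
   return 0;
}

Nothing in this sketch is hidden behind a compiler directive; the programmer chooses the decomposition into blocks and threads explicitly, which is the source of both the control and the extra effort these languages demand.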

Figure 12.1 The interoperability map for the GPU languages shows an increasingly complex situation. Four GPU languages are shown at the top, with the various hardware devices at the bottom. The arrows show the code-generation pathways from the languages to the hardware. The dashed lines indicate hardware that is still in development.

12.1 Features of a native GPU programming language

12.2 CUDA and HIP GPU languages: The low-level performance option

12.2.1 Writing and building your first CUDA application

12.2.2 A reduction kernel in CUDA: Life gets complicated

Figure 12.2 Pairwise reduction tree for a warp that sums values in log₂ n steps.
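For readers who want to see the shape of that tree in code, the following device function is a minimal sketch of a warp-level pairwise sum using CUDA's warp shuffle intrinsics. It is not the chapter's exact kernel, just the technique the figure depicts.

// Sketch of the pairwise (tree) reduction in figure 12.2. Each of the
// warp's 32 lanes starts with one value; after log2(32) = 5 shuffle
// steps, lane 0 holds the sum of all 32 values.
__device__ double sum_within_warp(double value)
{
   // At each step, every lane adds the value held by the lane
   // `offset` positions above it, halving the active width.
   for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
      value += __shfl_down_sync(0xffffffff, value, offset);
   }
   return value;   // result is valid in lane 0
}

Because the shuffle intrinsics exchange registers directly between lanes of a warp, this stage of the reduction needs no shared memory and no explicit synchronization.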

12.2.3 Hipifying the CUDA code

12.3 OpenCL for a portable open source GPU language

12.3.1 Writing and building your first OpenCL application

12.3.2 Reductions in OpenCL

Figure 12.3 Comparison of OpenCL and CUDA reduction kernels: sum_within_block

Figure 12.4 Comparison for the first of two kernel passes for the OpenCL and CUDA reduction kernels

Figure 12.5 Comparison of the second pass for the reduction sum
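As a sketch of the pattern these figures compare, here is the CUDA side only: a sum_within_block routine built on the warp reduction above, plus the two kernel passes. The kernel names reduce_pass1 and reduce_pass2 and the simplifying assumptions noted in the comments are illustrative, not the book's exact code.

// From the figure 12.2 sketch: warp-level tree reduction.
__device__ double sum_within_warp(double value)
{
   for (int offset = warpSize / 2; offset > 0; offset >>= 1)
      value += __shfl_down_sync(0xffffffff, value, offset);
   return value;
}

// Reduce across a whole thread block (assumes blockDim.x is a
// multiple of warpSize, at most 1024 threads).
__device__ double sum_within_block(double value)
{
   __shared__ double partial[32];          // one slot per warp
   int lane = threadIdx.x % warpSize;
   int warp = threadIdx.x / warpSize;

   value = sum_within_warp(value);         // step 1: reduce each warp
   if (lane == 0) partial[warp] = value;   // step 2: stash warp sums
   __syncthreads();

   int nwarps = blockDim.x / warpSize;     // step 3: first warp reduces
   if (warp == 0) {                        //         the warp sums
      value = (lane < nwarps) ? partial[lane] : 0.0;
      value = sum_within_warp(value);
   }
   return value;                           // valid in thread 0
}

// Pass 1: each block reduces its portion of x to one partial sum.
__global__ void reduce_pass1(const double *x, double *block_sums, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   double value = (i < n) ? x[i] : 0.0;
   value = sum_within_block(value);
   if (threadIdx.x == 0) block_sums[blockIdx.x] = value;
}

// Pass 2: a single block reduces the partial sums (assumes the block
// count from pass 1 fits in one block; a real kernel loops otherwise).
__global__ void reduce_pass2(const double *block_sums, double *result,
                             int nblocks)
{
   double value = (threadIdx.x < nblocks) ? block_sums[threadIdx.x] : 0.0;
   value = sum_within_block(value);
   if (threadIdx.x == 0) *result = value;
}

The two-pass structure exists because threads in different blocks cannot synchronize with each other during a kernel; the only global synchronization point available is the boundary between the two kernel launches.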

12.4 SYCL: An experimental C++ implementation goes mainstream

12.5 Higher-level languages for performance portability

12.5.1 Kokkos: A performance portability ecosystem

12.5.2 RAJA for a more adaptable performance portability layer