Chapter 10. General coding principles

This chapter covers

Determining values for global size and local size
Implementing the reduction algorithm in OpenCL
Synchronizing work-items in different work-groups

In the preceding chapters, the example host applications have executed kernels using a single work-item. This is fine when you’re learning OpenCL or testing a new application, but it’s unacceptable for production code. OpenCL’s great strength is that you can execute kernels using millions or even billions of work-items, and if you’re not going to put them to use, you might as well program in regular C.

Making use of all this processing power isn’t easy. You need a clear understanding of how work-items and work-groups access memory, and how synchronization can be used to coordinate their operation. To reach this understanding, it helps to look at a fully optimized example application. Most of this chapter is concerned with reduction, which here means adding together the elements of an array. Specifically, we’re going to compute the sum of 2²⁰ floating-point values using 2²⁰ work-items. We’ll spend some time examining the reduction algorithm, but remember that the method matters more than the sum it produces. This example will illuminate the issues that arise when processing large amounts of data, such as memory bandwidth, memory bank conflicts, and work-group synchronization. The better you understand these issues, the better your own OpenCL applications will perform.
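To make the idea of reduction concrete before the chapter develops it, here is a minimal sketch in plain C, not OpenCL kernel code and not the chapter's optimized implementation. It assumes a common pairwise scheme: each group of values is summed by repeatedly folding the upper half of the group onto the lower half, and the per-group partial sums are then added together. The function names reduce_group and reduce_array are hypothetical, and the inner loop that a device would run as parallel work-items runs sequentially here.

```c
#include <assert.h>
#include <stddef.h>

/* Sum one group of values in place, mimicking how the work-items of a
   single work-group might reduce data held in local memory.
   Assumption: local_size is a power of two. The group is overwritten. */
static float reduce_group(float *local, size_t local_size) {
    assert(local_size > 0 && (local_size & (local_size - 1)) == 0);

    /* Each pass halves the number of active "work-items". */
    for (size_t stride = local_size / 2; stride > 0; stride /= 2) {
        /* On a device, every lid below stride would run in parallel. */
        for (size_t lid = 0; lid < stride; lid++) {
            local[lid] += local[lid + stride];
        }
        /* A kernel would need barrier(CLK_LOCAL_MEM_FENCE) at this point
           so that all additions finish before the next pass starts. */
    }
    return local[0];
}

/* Sum n values by reducing each group and then adding the partial sums.
   Assumption: n is a multiple of local_size. The input is overwritten. */
static float reduce_array(float *data, size_t n, size_t local_size) {
    assert(local_size > 0 && n % local_size == 0);

    float total = 0.0f;
    for (size_t group = 0; group < n / local_size; group++) {
        /* One iteration stands in for one work-group. */
        total += reduce_group(data + group * local_size, local_size);
    }
    return total;
}
```

Calling reduce_array(data, 1024, 64) treats the array as 16 groups of 64 values. The comment about the barrier marks the place where work-items within one group must wait for each other. Combining the partial sums of different groups is a separate problem, which is why synchronizing work-groups gets its own treatment.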

10.1. Global size and local size

10.2. Numerical reduction

10.3. Synchronizing work-groups

10.4. Ten tips for high-performance kernels

10.5. Summary