Chapter 10. General coding principles
This chapter covers
In the preceding chapters, the example host applications have executed kernels using a single work-item. That's fine when you’re learning OpenCL or testing a new application, but it's unacceptable for production code. OpenCL’s great strength is that you can execute kernels using millions or even billions of work-items, and if you’re not going to put them to use, you might as well program in regular C.
Making use of all this processing power isn’t easy. You need a clear understanding of how work-items and work-groups access memory, and how synchronization can be used to coordinate their operation. To reach this understanding, it helps to look at a fully optimized example application. Most of this chapter will be concerned with the process of reduction, which here means adding together the elements of an array. Specifically, we’re going to compute the sum of 2^20 floating-point values using 2^20 work-items. We’ll spend some time examining the reduction algorithm, but remember that it’s the method that’s important, not the sum itself. This example will illuminate the issues that arise when processing large amounts of data, such as memory bandwidth, memory bank conflicts, and work-group synchronization. The better you understand these issues, the better your own OpenCL applications will perform.
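To make the idea of reduction concrete before any OpenCL is involved, here is a minimal sketch in plain C of a tree-shaped sum. It is not the chapter's kernel: the function name reduce_sum and the power-of-two restriction are assumptions made for illustration. Each pass of the outer loop halves the number of active elements; in a kernel, each iteration of the inner loop would correspond to one work-item, and a barrier would separate the passes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums n floats in place with a tree-shaped reduction.
 * Assumes n is a power of two. The input array is overwritten, and the
 * total ends up in data[0]. */
float reduce_sum(float *data, size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);  /* n must be a power of two */

    /* Each pass folds the upper half of the active range onto the lower half. */
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* In a kernel, each value of i would be handled by its own
         * work-item, and the work-group would synchronize after this loop. */
        for (size_t i = 0; i < stride; i++) {
            data[i] += data[i + stride];
        }
    }
    return data[0];
}
```

With n = 2^20 the outer loop runs only 20 times, which is why this shape suits parallel hardware: the work inside each pass is independent, and only the boundaries between passes need synchronization.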