6 Vectorization: FLOPs for free
This chapter covers
- Is vectorization important?
- What kind of parallelization is provided by a vector unit?
- What are the different ways you can access vector parallelization?
- What kind of performance benefits can be expected?
Processors have special vector units that can load and operate on more than one data element at a time. Vectorization is the process of grouping operations together so that more than one can be done at a time. If an application is limited by floating-point operations, vectorization is necessary to reach peak hardware capability. If the application is memory bound, as most applications are, adding more flops to the hardware capability has limited benefit.

Compilers are powerful, but as you will see, the real performance gain from compiler vectorization may be smaller than advertised. Still, part of that gain comes with little effort and should not be ignored. We will show how programmers, with a bit of effort and knowledge, can achieve a performance boost through vectorization. Some of these techniques require only the right compiler flags and programming style, while others require much more work. Real-world examples will demonstrate the various ways vectorization is achieved.[3]
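To make the idea of grouping operations concrete, here is a minimal sketch of the kind of loop a compiler can often vectorize on its own. The function name `scaled_add` and its parameters are hypothetical, chosen only for illustration. The sketch assumes a C99 or later compiler; the `restrict` qualifiers promise that the arrays do not overlap, which is commonly what a compiler needs before it will process several elements per instruction.

```c
#include <stddef.h>

/* Hypothetical example: c[i] = a[i] + scale * b[i].
 * Each iteration is independent of the others, so a vector unit can
 * work on several consecutive elements at once. The restrict
 * qualifiers tell the compiler that c, a, and b do not alias. */
void scaled_add(size_t n,
                double *restrict c,
                const double *restrict a,
                const double *restrict b,
                double scale)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + scale * b[i];
    }
}
```

With GCC, for example, compiling with `-O3 -fopt-info-vec-optimized` asks the compiler to report the loops it vectorized. Whether this particular loop is vectorized, and how much speedup results, depends on the compiler, its version, and the target hardware.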