Processors have special vector units that can load and operate on more than one data element at a time. Vectorization is the process of grouping operations together so that more than one can be done at a time. If we’re limited by floating-point operations, vectorization is essential for reaching peak hardware capability. But adding floating-point capability has limited benefit when an application is memory bound, and most applications are memory bound. Compilers can be powerful, but as you will see, real performance gains from vectorization might not come as easily as the compiler documentation suggests. Still, these gains can be achieved with a little effort and should not be ignored.
In this chapter, we will show how programmers, with a little effort and knowledge, can achieve a performance boost through vectorization. Some of these techniques require only the right compiler flags and programming style, while others require much more work. Real-world examples demonstrate the various ways vectorization is achieved.
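To make the idea concrete before the detailed examples, here is a minimal sketch of the kind of loop a compiler can usually vectorize. It is a hypothetical illustration rather than code from the chapter's repository: the function name `stream_triad` and its parameters are assumptions chosen for this example. The `restrict` qualifiers (C99) promise the compiler that the three arrays do not overlap, which is often what it needs to know before grouping iterations into vector operations.

```c
#include <stddef.h>

/* Hypothetical example (not from the chapter's repository):
 * a stream-triad style loop, a[i] = b[i] + scalar * c[i].
 *
 * Each iteration is independent of the others, so the compiler is free
 * to process several elements with a single vector instruction.
 * The restrict qualifiers assert that a, b, and c never alias. */
void stream_triad(size_t n, double scalar,
                  double *restrict a,
                  const double *restrict b,
                  const double *restrict c)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + scalar * c[i];
    }
}
```

With GCC, for instance, compiling at `-O3` with `-fopt-info-vec-optimized` should print a remark when a loop like this is vectorized; other compilers have their own reporting options. Note that each iteration performs only two floating-point operations while moving at least 24 bytes (two 8-byte loads and one 8-byte store), so on large arrays this loop is likely to be memory bound, and the speedup from vectorization may be smaller than the vector width suggests.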
Note
We encourage you to follow along with the examples for this chapter at https://github.com/EssentialsofParallelComputing/Chapter6.