6 Vectorization: FLOPs for free

 

This chapter covers

  • The importance of vectorization
  • The kind of parallelization provided by a vector unit
  • Different ways you can access vector parallelization
  • Performance benefits you can expect

Processors have special vector units that can load and operate on more than one data element at a time. Vectorization is the process of grouping operations together so that several are performed with a single instruction. If an application is limited by floating-point operations, vectorization is essential to reach peak hardware performance. But adding more FLOPs of hardware capability brings limited benefit when an application is memory bound, and take note: most applications are memory bound. Compilers can be powerful, but as you will see, real performance gains from vectorization may not come as easily as the compiler documentation suggests. Still, those gains can be achieved with modest effort and should not be ignored.
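To make the idea concrete, here is a minimal sketch of the kind of loop a compiler can vectorize. It is illustrative only; the function and variable names are our own and not taken from the chapter's example repository.

   // A simple loop that most compilers can auto-vectorize.
   // Each iteration is independent, so the compiler can pack several
   // iterations into one vector instruction (for example, four doubles
   // at a time with 256-bit AVX registers).
   void scale_and_add(int n, double *restrict x, const double *restrict y,
                      const double *restrict z, double a)
   {
      for (int i = 0; i < n; i++) {
         x[i] = y[i] + a * z[i];
      }
   }

With recent GCC versions, compiling with optimization and a vectorization report (for example, gcc -O3 -march=native -fopt-info-vec) shows whether the loop was vectorized; the exact flags differ between compilers and are covered in section 6.5.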

In this chapter, we show how programmers, with a little effort and knowledge, can achieve a performance boost through vectorization. Some of these techniques simply require the right compiler flags and programming styles; others require much more work. Real-world examples demonstrate the various ways vectorization is achieved.

Note

We encourage you to follow along with the examples for this chapter at https://github.com/EssentialsofParallelComputing/Chapter6.

6.1 Vectorization and single instruction, multiple data (SIMD) overview

6.2 Hardware trends for vectorization

6.3 Vectorization methods

6.3.1 Optimized libraries provide performance for little effort

6.3.2 Auto-vectorization: The easy way to vectorization speedup (most of the time)

6.3.3 Teaching the compiler through hints: Pragmas and directives

6.3.4 Crappy loops, we got them: Use vector intrinsics

6.3.5 Not for the faint of heart: Using assembler code for vectorization

6.4 Programming style for better vectorization

6.5 Compiler flags relevant for vectorization for various compilers

6.6 OpenMP SIMD directives for better portability

6.7 Further explorations

6.7.1 Additional reading

6.7.2 Exercises

Summary