This chapter covers
- What parallel computing is and why it’s growing in importance
- Where the parallelism exists in modern hardware
- Why the amount of parallelism in applications is important
- The software approaches to exploit the parallelism
Parallel computing is the execution of many operations at a single instant in time. Fully exploiting parallel computing does not happen automatically; it requires some effort from the programmer. First, you must identify and expose the potential for parallelism in an application. Potential parallelism, or concurrency, means that you certify it is safe to conduct the operations in any order as the resources become available on the system. Parallel computing adds a further requirement: these operations must occur at the same time, so you must also properly leverage the resources to execute them simultaneously. Parallel computing introduces new concerns that are not present in a serial world. A change in thought process is needed to adapt to the additional complexities of parallel execution, but with practice, it becomes second nature. This book begins your journey toward accessing the power of parallel computing.
Examples of parallel processing are numerous in everyday life and often are the basis for strategies developed for the computer. Think about a supermarket checkout line (figure 1.1). The goal is to have customers quickly pay for the items they want to purchase. This can be done by employing multiple cashiers to process, or check out, customers one at a time. In this case, the cashiers can become experienced in executing the checkout process and check out customers faster. Another strategy is to employ many self-checkout stations and allow customers to execute the checkout process on their own. This strategy requires fewer human resources from the supermarket and can open more lanes to process customers. Customers may not be able to check themselves out as efficiently as a trained cashier, but perhaps more customers can check out quickly due to increased parallelism resulting in shorter lines.
Figure 1.1 Everyday parallelism in supermarket checkout queues. The checkout cashiers (with caps) process their queue of customers (with baskets). On the left side, one cashier processes four self-checkout lanes simultaneously. On the right side, one cashier is required for each checkout lane. Each option has impacts on costs to the supermarket and the checkout rate.

Computational problems are solved by developing algorithms, a set of steps to achieve a desired result. In the supermarket analogy, the algorithm is the process of checking out. In this case, it includes unloading items from a basket, scanning the items to obtain a price, and paying for the items. This algorithm is sequential, or “serial”, and must follow this order. When there are hundreds of customers that need to execute this task, the algorithm for checking out many customers contains parallelism that can be taken advantage of; theoretically, there is no dependency between any two customers going through the checkout process. By having multiple checkout lines or self-checkout stations, supermarkets have exposed parallelism to increase the rate at which customers can buy goods and leave the store. Each choice in how the parallelism is exposed results in different costs and benefits. Parallel computing is the practice of identifying and exposing parallelism in algorithms, expressing this parallelism in software, and understanding the costs, benefits, and limitations of the chosen parallel implementation.
In the end, parallel computing is about performance. This includes more than just speed; it also includes the size of the problem and energy efficiency. Our goal in this book is to give you an understanding of the breadth of the current parallel computing field and familiarize you with enough of the most commonly used languages, techniques, and tools so that you can tackle a parallel computing project with confidence. Important decisions about how to incorporate parallelism are often made at the outset of a project. A reasoned design is an important step toward success, and skipping it can lead to problems much later down the road. It is equally important to keep expectations realistic and informed by both the resources available and the nature of the project.
Another goal of this chapter is to introduce the terminology used in parallel computing with clear definitions. We also want to point you to the glossary in the Appendices for reference on terminology as you read the book. Partially because the field and technology have grown incrementally, the use of many of the terms in the parallel community has been sloppy and imprecise. With the increased complexity of the hardware and of parallelism within applications, clear, unambiguous use of terminology is very important.
Figure 1.2 The single thread performance, CPU clock frequency (MHz), CPU power consumption (Watts), and the number of cores on CPU chips are shown from 1970 to 2018. The Parallel Computing era begins about 2005, when the core count in CPU chips began to rise, while the clock frequency and power consumption plateaued, yet performance steadily increased. (Horowitz et al. and Rupp)

Welcome to the world of parallel computing. As you delve deeper, the techniques and approaches become more natural and the power is captivating. Problems that you never thought to attempt become commonplace.
The future is parallel. The increase in serial performance has plateaued as processor designs have hit the limits of miniaturization, clock frequency, power, and heat. Figure 1.2 shows the trends in clock frequency (the rate at which an instruction can be executed), power consumption, number of computational cores (or cores for short), and hardware performance over time for commodity processors. In 2005 the number of cores abruptly started increasing from a single core to multiple cores. At the same time, the clock frequency and power consumption flattened out. Theoretical performance steadily increased because performance is proportional to the product of the clock frequency and the number of cores. This shift toward increasing the core count rather than the clock speed means that the best performance of a central processing unit (CPU) is only accessible through parallel computing.
Modern consumer-grade computing hardware comes equipped with multiple central processing units (CPUs) and/or graphics processing units (GPUs) that can process many sets of instructions simultaneously. These smaller systems often rival the computing power of the supercomputers of two decades ago. Making full use of compute resources (on laptops, workstations, and smartphones) requires you, the programmer, to have a working knowledge of the tools available for writing parallel applications. You must also understand the hardware features used to boost parallelism; because there are so many of them, they add complexity for the programmer. One of these features is hyperthreading, introduced by Intel, which makes a single physical core appear as two cores to the operating system by having two instruction queues interleave work onto the hardware logic units. Vector processors are another hardware feature that appeared in commodity processors starting in about 2000. A vector processor, also called a vector unit, executes a single instruction on multiple data values at once; the number of simultaneous operations is determined by the width of the unit in bits. Thus a 256-bit-wide vector unit can operate on four 64-bit (double-precision) or eight 32-bit (single-precision) values simultaneously.
Example
Let’s take a 16-core CPU with hyperthreading and a 256-bit-wide vector unit, commonly found in home desktops. A serial program, using a single core and no vectorization, can only use 0.8% of the theoretical processing capability of this processor! The calculation is as follows:
16 cores × 2 hyperthreads × (256-bit-wide vector unit)/(64-bit double) = 128-way parallelism
1 serial path / 128 parallel paths = 0.008, or 0.8%
This is a very small fraction of the total processing power, as shown in figure 1.3.
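The same arithmetic can be written as a few lines of C; this is just a sketch using the assumed hardware counts from the example above.

```c
#include <stdio.h>

int main(void)
{
   // Assumed hardware from the example: 16 cores, 2 hyperthreads per core,
   // and a 256-bit vector unit operating on 64-bit doubles
   int cores        = 16;
   int hyperthreads = 2;
   int vector_bits  = 256;
   int double_bits  = 64;

   int vector_lanes = vector_bits / double_bits;             // 4 doubles per instruction
   int parallelism  = cores * hyperthreads * vector_lanes;   // 128-way parallelism

   printf("Theoretical parallelism: %d-way\n", parallelism);
   printf("Fraction used by a serial program: %.1f%%\n", 100.0 / parallelism);
   return 0;
}
```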
Calculating theoretical and realistic expectations for serial and parallel performance, as illustrated in the example, is an important skill and will be discussed in more depth in chapter 3.
Figure 1.3 A serial application can access only 0.8% of the processing power of a commonly found 16 core CPU.

There has been some improvement in the software development tools to help add parallelism, and there is even more being done in the research community, but it is a long way from really addressing the performance gap. This puts a lot of the burden on the software developer to get the most out of this new generation of processors.
But software developers have been slow to adapt to this fundamental change in computing power. This has meant that transitioning current applications to make use of modern parallel architectures can be daunting with the explosion of new programming languages and application programming interfaces (APIs) available. But a good working knowledge of your application, an ability to see and expose parallelism, and a solid understanding of the tools available can result in substantial benefits. So exactly what kind of benefits would applications see? Let’s take a closer look.
Parallel computing can reduce the time-to-solution, increase the energy efficiency in your application, and enable you to tackle larger problems on currently existing hardware. The excitement today about parallel computing is that it is no longer the sole domain of the largest computing systems. The technology is now present in everybody’s desktop, laptop and even hand-held devices. This makes it possible for every software developer to create parallel software on their local systems thereby greatly expanding the opportunity for new applications.
Cutting edge research from both industry and academia has revealed new application areas for parallel computing. Interest has broadened out from scientific computing into machine learning, big data, computer graphics and consumer applications. The emergence of new technologies such as self-driving cars, computer vision, voice recognition and artificial intelligence require large computational capabilities both within the consumer device and in the development sphere, where massive training datasets must be consumed and processed. In scientific computing, which has long been the exclusive domain of parallel computing, there are new, exciting possibilities. More extensive data is becoming available from the proliferation of remote sensors and hand-held devices that can feed into larger, more realistic computations to better inform decision-making around natural and man-made disasters.
It must be remembered that parallel computing itself is not the goal. Rather, the goals are what results from parallel computing: reducing run-time, performing larger calculations, or reducing energy consumption.
Reduction of an application’s runtime, or the speedup, is often thought to be the primary goal of parallel computing. Indeed, this is usually the biggest impact. Parallel computing can speed up intensive calculations, multimedia processing, and operations on big data. Your applications may take days or even weeks to process, or the results may be needed in real time. In the past, the programmer would spend large efforts on serial optimization to squeeze out a few percent improvement. Now there is the potential for orders-of-magnitude improvement, with multiple avenues to choose from. This creates a new problem in exploring the possible parallel paradigms: more opportunities than resources. But a thorough knowledge of your application and awareness of parallelization opportunities can lead you down a clear path toward reducing your application’s runtime.
By exposing parallelism in your application, you can scale up your problem size to sizes that were out of reach with a serial application. This is because the amount of compute resources dictates what can be done, and exposing parallelism permits you to operate on larger resources, presenting opportunities that could never be considered before. The larger problem sizes are enabled by the larger amounts of main memory, disk storage, bandwidth over networks and to disk, and CPUs. In analogy with the supermarket, exposing parallelism is equivalent to employing more cashiers or opening more self-checkout lanes to handle a larger number of customers.
One of the new impact areas of parallel computing is energy efficiency. With the emergence of parallel resources in hand-held devices, parallelization can speed up applications, allowing the device to return to sleep mode sooner and allowing the use of slower but more parallel processors that consume less power. Moving heavy-weight multimedia applications to run on the GPU can have an even more dramatic effect on energy efficiency while also resulting in vastly improved performance. The net result of employing parallelism can be reduced power consumption and extended battery life, which is a strong competitive advantage in this market niche.
Another area where energy efficiency is important is in remote sensors, network devices, and field-deployed operational devices such as remote weather stations. Often lacking large power supplies, these devices must be able to function in small packages with few resources. Parallelism opens up what can be done on these devices and offloads the work from the central computing system in a growing trend called “edge computing”. Moving the computation to the very edge of the network enables processing at the source of the data, condensing it into a smaller result that can be more easily sent over the network.
Accurately calculating the energy costs of an application is challenging without direct measurements of power usage. However, you can estimate the cost by multiplying the manufacturer’s thermal design power by the runtime of the application and the number of processors used. The thermal design power is the rate at which energy is expended under typical operational loads. The energy consumption for your application can be estimated using the formula
P = (N processors) × (R watts/processor) × (T hours)
where P is the energy consumption, N is the number of processors, R is the thermal design power, and T is the run-time of your application.
Example
Intel’s 16 core Xeon E5-4660 Processor has a thermal design power of 120W. Suppose that your application uses 20 of these processors for 24 hours to run to completion. The estimated energy usage for your application is
P = (20 processors) × (120 W/processor) × (24 hrs) = 57.60 kWhrs
In general, GPUs have a higher thermal design power than modern CPUs, but they can potentially reduce the runtime or require only a few GPUs to obtain the same result. The same formula can be used as before, where N is now the number of GPUs.
Example:
Suppose that you’ve ported your application to a multi-GPU platform. You can now run your application on four Nvidia Tesla V100 GPUs in 24 hrs! Nvidia’s Tesla V100 GPU has a maximum thermal design power of 300 W. The estimated energy usage for your application is
P = (4 GPUs) × (300 W/GPU) × (24 hrs) = 28.80 kWhrs
In this example, the GPU-accelerated application runs at half the energy cost of the CPU-only version. Note that, in this case, even though the time to solution remained the same, the energy expense was cut in half!
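Because both estimates come from the same formula, they can be captured in a small helper function; the following C sketch simply reproduces the two calculations above.

```c
#include <stdio.h>

// Estimated energy (kWhrs) = processors x thermal design power (W) x run-time (hrs) / 1000
double energy_kwhrs(int nprocs, double tdp_watts, double hours)
{
   return nprocs * tdp_watts * hours / 1000.0;
}

int main(void)
{
   printf("CPU run: %.2f kWhrs\n", energy_kwhrs(20, 120.0, 24.0));  // 57.60 kWhrs
   printf("GPU run: %.2f kWhrs\n", energy_kwhrs( 4, 300.0, 24.0));  // 28.80 kWhrs
   return 0;
}
```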
Achieving a reduction in energy cost through accelerator devices, like GPUs, requires that the application has sufficient parallelism that can be exposed and the resources on the device are efficiently utilized.
Actual monetary cost is becoming a more visible concern for software development teams, software users, and researchers alike. As the sizes of applications and systems grow, they must perform a cost-benefit analysis on the resources available to inform their development. For example, for the next large High Performance Computing (HPC) systems, the power costs are projected to be three times the cost of the hardware acquisition. The usage costs have promoted cloud computing as an alternative, and it is being increasingly adopted across academia, start-ups, and industry. In general, cloud providers bill by the type and quantity of resources used and the amount of time spent using them. Although GPUs are generally more expensive than CPUs per unit time, some applications can leverage GPU accelerators such that there are sufficient reductions in run-time, relative to the CPU expense, to yield lower costs.
And yet, parallel computing is not a panacea. Many applications are neither large enough nor long-running enough to need parallel computing. Some may not even have enough inherent parallelism to exploit. Transitioning applications to leverage multicore and many-core (GPU) hardware requires a dedicated effort that can temporarily shift attention away from direct research or product goals. The investment of time and effort must first be deemed worthwhile. It is always more important that the application run and generate the desired result before making it fast and scaling it up to larger problems.
We strongly recommend that you start your parallel computing project with a plan. It’s important to know what options are available for accelerating the application and select the most appropriate for your project. Then it is crucial to have a reasonable estimate of the effort involved and the potential payoffs (in terms of dollar cost, energy consumption, time-to-solution, and other metrics that may be important).
We begin here to give you the knowledge and skills to make decisions on parallel computing projects up front.
In serial computing, all operations speed up as the clock frequency increases. In contrast, with parallel computing, some thought and modification to applications is necessary to fully exploit parallel hardware. Why is the amount of parallelism important? Let’s look at the parallel computing laws.
We will need a way to calculate the potential speedup of a calculation based on the amount of the code that is parallel. This can be done using Amdahl’s Law, proposed by Gene Amdahl in 1967. This law describes the speedup of a fixed-size problem as the number of processors increases, as shown in the following equation, where P is the parallel fraction of the code, S is the serial fraction (so that P + S = 1), and N is the number of processors.
SpeedUp(N) = 1 / (S + P/N)
This law highlights that no matter how fast we make the parallel part of the code, we will always be limited by the serial portion. Figure 1.4 visualizes this limit. This scaling of a fixed size problem is referred to as Strong Scaling.
Definition:
Strong Scaling—how the solution time varies with the number of processors for a fixed total problem size.
Figure 1.4 Speedup for a fixed problem size, according to Amdahl’s Law, is shown as a function of the number of processors. Lines are shown for ideal speedup, when 100% of an algorithm is parallelized, and for 90%, 75%, and 50%. Amdahl’s Law shows that the speedup is limited by the fraction of code that remains serial.
Figure 1.5 Speedup for a problem size that grows with the number of available processors, according to Gustafson-Barsis’s Law, is shown as a function of the number of processors. Lines are shown for ideal speedup, when 100% of an algorithm is parallelized, and for 90%, 75%, and 50%.


Gustafson and Barsis pointed out in 1988 that parallel runs should increase the size of the problem as more processors are added. This gives us an alternate way to calculate the potential speedup of an application. If the problem size grows proportionally to the number of processors, the speedup is now expressed as:
SpeedUp(N) = N − S × (N − 1)
where N is the number of processors and S is the serial fraction, as before. The result is that a larger problem can be solved in the same time by using more processors, which opens up additional opportunities to exploit the parallelization. Indeed, growing the size of the problem with the number of processors makes sense, since the application user will want to benefit from more than just the processing power of the additional processors and will also want to use the additional memory. The run-time scaling for this scenario, shown in figure 1.5, is called weak scaling.
Definition:
Weak Scaling—how the solution time varies with the number of processors for a fixed problem size per processor.
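Both laws are simple enough to tabulate with a few lines of C; the sketch below assumes a code that is 90% parallel (an illustrative value) and prints the predicted speedups side by side.

```c
#include <stdio.h>

// Amdahl's Law: speedup of a fixed-size problem (strong scaling)
double amdahl(double p, int n)    { return 1.0 / ((1.0 - p) + p / n); }

// Gustafson-Barsis's Law: speedup when the problem grows with the processors (weak scaling)
double gustafson(double p, int n) { return n - (1.0 - p) * (n - 1); }

int main(void)
{
   double parallel_fraction = 0.90;   // assumed: 90% of the code is parallel
   for (int n = 1; n <= 64; n *= 2) {
      printf("N = %3d   Amdahl %6.2fx   Gustafson %6.2fx\n",
             n, amdahl(parallel_fraction, n), gustafson(parallel_fraction, n));
   }
   return 0;
}
```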
Figure 1.6 Strong scaling keeps the same overall problem size and splits it up across the additional processors. In weak scaling, the size of the mesh stays the same for each processor and the total size increases.

Figure 1.6 shows the difference between strong and weak scaling in a visual representation. The weak scaling argument that the mesh size should stay constant on each processor makes good use of the resources of the additional processor. The strong scaling perspective is primarily concerned with speedup of the calculation. In practice, both strong scaling and weak scaling are important since they address different user scenarios.
The term scalability is often used to refer to whether more parallelism can be added in either the hardware or software and whether there is an overall limit to how much improvement can occur. While the traditional focus has been on the run-time scaling, we will make the argument that memory scaling is often more important. Shown in figure 1.7 is an application that has limited memory scalability. A replicated array is a dataset that is duplicated across all the processors. A distributed array would be partitioned and split across the processors. For example, in a game simulation, 100 characters may be distributed across four processors with 25 characters on each processor. But the map of the game board may be copied to every processor. This is shown in figure 1.7 with the replicated array ‘R’ duplicated across the mesh. Since this figure is for weak scaling, the problem size grows as the number of processors increases. For four processors, the array is four times as big on every processor. As the number of processors and the problem size grow, there soon is not enough memory on a processor for the job to run. Limited run-time scalability means the job runs slowly; limited memory scalability means the job can’t run at all. It is also the case that if the memory of an application can be distributed, the run-time will usually scale as well. The reverse is not necessarily true.
Figure 1.7 Distributed arrays stay the same size as the problem and number of processors doubles (weak scaling). But replicated (copied) arrays need all the data on each processor and the memory size grows rapidly with the number of processors. Even if the run-time weakly scales (stays constant), the memory requirements will limit scalability.

One view of a computationally intensive job is that every byte of memory will get touched in every cycle of processing. To first order, then, the run-time is a function of the memory size, and reducing the memory size will necessarily reduce the run-time. So the first focus in parallelization should be to reduce the memory size on each processor as the number of processors grows.
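A toy calculation makes the difference concrete. This sketch assumes a weak-scaling run with a fixed local mesh of one million cells of eight bytes each (values chosen only for illustration) and compares the per-processor footprint of a distributed array with that of a replicated array.

```c
#include <stdio.h>

int main(void)
{
   double cells_per_proc = 1.0e6;  // assumed local mesh size under weak scaling
   double bytes_per_cell = 8.0;    // one double per cell

   // A distributed array keeps a constant footprint on each processor,
   // while a replicated array must hold the entire (growing) problem everywhere.
   for (int nprocs = 1; nprocs <= 1024; nprocs *= 4) {
      double distributed_mb = cells_per_proc * bytes_per_cell / 1.0e6;
      double replicated_mb  = nprocs * cells_per_proc * bytes_per_cell / 1.0e6;
      printf("%5d processors: distributed %8.1f MB/proc   replicated %10.1f MB/proc\n",
             nprocs, distributed_mb, replicated_mb);
   }
   return 0;
}
```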
Parallel computing requires combining an understanding of hardware, software, and parallelism to develop an application. It is more than just message passing or threading. Current hardware and software give many different options for bringing parallelism to your application. Some of these options can be combined to yield even greater efficiency and speedup.
Figure 1.8 Parallelism is expressed in an application software layer that gets mapped to the computer hardware through the compiler and operating system.

It is important to have an understanding of the parallelism in your application and the way different hardware components allow you to expose it. Further, you need to recognize that, between your source code and the hardware, your application must traverse additional layers, including a compiler and an operating system, as shown in figure 1.8.
As a developer, you are responsible for the application software layer, which includes your source code. In the source code, you make choices about the programming language and parallel software interfaces you use to leverage the underlying hardware. Additionally, you decide how to break up your work into parallel units. A compiler is designed to translate your source code into a form the hardware can execute. With these instructions at hand, an operating system manages executing these instructions on the computer hardware.
We will show you, through an example, how parallelism is introduced to an algorithm through a prototype application. This is the process that takes place in the application software layer, but requires an understanding of the computer hardware. For now, we refrain from discussing the choice in compiler and operating system. We will incrementally add each layer of parallelism so that you can see how this works. With each parallel strategy, we will explain how the available hardware influences the choices that are made. The purpose in doing this is to demonstrate how hardware features influence the parallel strategies.
Following the example, we will introduce a model for thinking about modern hardware. This model breaks modern compute hardware down into individual components and the variety of compute devices. A simplified view of the memory is included in this chapter; a more detailed look at the memory hierarchy will be presented in chapter 3. Finally, we will discuss the application and software layer in more detail. We categorize the parallel approaches a developer can take into process-based parallelism, thread-based parallelism, vectorization, and stream processing. Parallelism based on individual processes, each with its own memory space, can be distributed across different nodes of a computer or can stay within a node. Stream processing is generally associated with GPUs. The model for modern hardware and the application software will help you better understand how to plan the port of your application to modern parallel hardware.
For this introduction to parallelism in an application, we will look at a data parallel approach. This is one of the most common parallel computing application strategies. We perform the computation on a spatial mesh composed of a regular xy grid of rectangular elements or cells. The steps (summarized here and described in detail below) to create the spatial mesh and prepare for the calculation are:
- Discretize (breakup) the problem into smaller cells or elements
- Define a computational kernel (operation) to conduct on each element of the mesh
Then to perform the calculation, we will add the following layers of parallelism on CPUs and GPUs:
- Vectorization (work on more than one unit of data at a time)
- Threads (deploy more than one compute pathway) to engage more processing cores
- Processes (separate program instances) to spread out the calculation into separate memory spaces
- Off-loading the calculation to GPUs (send the data to the graphics processor to calculate)
Figure 1.9 An example 2D spatial domain for a numerical simulation. Numerical simulations typically involve stencil operations or processing large matrix-vector systems. These types of operations are often used in fluids modeling to yield predictions of tsunami arrival times, weather forecasts, smoke plume spreading, and other processes necessary for informing decision making.

We start out with a 2D problem domain of a region of space. For purposes of illustration, we will use the 2D image of the Krakatau volcano in Figure 1.9 for our example. The goal of our calculation could be to model the volcanic plume, the resulting tsunami, or the early detection of a volcanic eruption using machine learning. For all of these, calculation speed is critical if we want real-time results to inform decision-making.
Figure 1.10 The domain is discretized into cells. At each cell in the computational domain, properties such as wave height, fluid velocity, or smoke density are solved for according to physical laws. Ultimately, this discrete system is represented by a stencil operation or a matrix-vector system.

For any detailed calculation, we must first break up the domain of the problem into smaller pieces as shown in figure 1.10, a process called discretization. In image processing, these pieces are often just the pixels in a bitmap image. For a computational domain, they are called cells or elements. The collection of cells or elements forms a computational mesh that covers the spatial region for the simulation. The data values for each cell might be integers, floats, or doubles.
Figure 1.11 A five-point stencil operator is shown as a cross pattern on the computational mesh. The data marked by the stencil are the data being read in the operation and stored in the center cell. This pattern is repeated for every cell. The blur operator, one of the simpler stencil operators, is a weighted sum of the five points marked with the red dots and is used to update a value at the central point of the stencil. This type of operation is done for smoothing operations or wave propagation numerical simulations.

The calculations on this discretized data are often some form of a stencil operation, so-called because it involves a pattern of adjacent cells to calculate the new value for each cell. This could be an average (blur operation, which blurs the image or makes it fuzzier), gradient (edge-detection, which sharpens the edges in the image), or another more complex operation associated with solving physical systems described by partial differential equations (PDEs). The stencil operation shown in figure 1.11 is a five-point stencil that does a blur operation by using a weighted average of the stencil values.
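A minimal serial sketch of the five-point blur stencil follows; it assumes the mesh arrays carry a one-cell halo so the stencil never reads outside the domain, and the equal 1/5 weights are just the simple average described above.

```c
// Five-point blur stencil: each interior cell becomes the average of itself
// and its four neighbors. xold and xnew are (ny+2) x (nx+2) meshes with a
// one-cell halo around the computational domain.
void blur(int nx, int ny, double xold[ny + 2][nx + 2], double xnew[ny + 2][nx + 2])
{
   for (int j = 1; j <= ny; j++) {
      for (int i = 1; i <= nx; i++) {
         xnew[j][i] = ( xold[j][i]
                      + xold[j][i - 1] + xold[j][i + 1]
                      + xold[j - 1][i] + xold[j + 1][i] ) / 5.0;
      }
   }
}
```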
But just what are these partial differential equations? Let’s go back to our example and imagine this time that it is a color image composed of separate red, green, and blue arrays that make up an RGB color model. The term ‘partial’ means that there is more than one variable and that we are separating out the change of red with space and time from that of green and blue. We would then apply the blur operator separately to each of these colors. One more requirement is that we apply a rate of change with time and space. In other words, the red would spread at one rate, and the green and blue at others. This could be done to produce a special effect on an image, or it may describe how real colors bleed and merge in a photographic image during development. In the scientific world, instead of red, green, and blue, we might have mass, x velocity, and y velocity. With the addition of a little more physics, we might have the motion of a wave or an ash plume.
We start out introducing parallelism by looking at vectorization.
Figure 1.12 In this example, a vector operation is conducted on four doubles at once. This operation can be executed in a single clock cycle, with little additional energy cost compared to the serial operation.

So what is vectorization? Some processors have the ability to operate on more than one piece of data at a time, a capability referred to as vector operations. As shown by the shaded blocks in figure 1.12, multiple data values are operated on simultaneously in a processor's vector unit with one instruction in one clock cycle.
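One portable way to request this behavior is the OpenMP simd directive, as in the sketch below; the loop and function name are illustrative, and compiler support for the directive is assumed (many compilers will also auto-vectorize such a loop on their own).

```c
// Ask the compiler to vectorize the cell update so that, for example,
// a 256-bit vector unit processes four doubles with each instruction.
void scale_cells(int ncells, double *cell, double factor)
{
   #pragma omp simd
   for (int i = 0; i < ncells; i++) {
      cell[i] = cell[i] * factor;
   }
}
```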
Figure 1.13 In this example, four threads are used to process four rows of vector units simultaneously.

Most of today’s CPUs have at least four processing cores, so we use threading to put the cores to work simultaneously, operating on four rows at a time as shown in figure 1.13.
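In OpenMP, spreading the rows across threads while keeping the vectorized inner loop can look like the following sketch (again with illustrative names and an assumed OpenMP-capable compiler).

```c
// Distribute the rows of the mesh across the available cores; each thread's
// rows are still processed through the vector unit by the inner loop.
void scale_mesh(int nx, int ny, double mesh[ny][nx], double factor)
{
   #pragma omp parallel for
   for (int j = 0; j < ny; j++) {
      #pragma omp simd
      for (int i = 0; i < nx; i++) {
         mesh[j][i] *= factor;
      }
   }
}
```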
Figure 1.14 This algorithm can be parallelized further by distributing the 4x4 blocks amongst distinct processes. Each process uses four threads that each process a 4-wide vector unit in a single clock cycle. The process boundaries are illustrated here by the additional white space.

We can further split the work between processors on two desktops, often called nodes in parallel processing. When the work is split across nodes, the memory spaces for each node are distinct and separate. This is indicated by putting a gap between the rows in figure 1.14.
Even for this fairly modest hardware scenario, there is a potential theoretical speedup of 32x.
2 desktops (nodes) x 4 cores x (256 bit-wide vector unit)/(64-bit double) = 32x potential speedup
If we look at a high-end cluster with 16 nodes, 36 cores per node and a 512-bit vector processor, the potential theoretical speedup is 4,608x faster than a serial process.
16 nodes x 36 cores x (512 bit-wide vector unit)/(64-bit double) = 4608x potential speedup
Figure 1.15 On a GPU, the vector length is much larger than on a CPU. Here, 8x8 tiles are distributed across GPU workgroups.

The GPU is another hardware resource for supercharging parallelism. With GPUs, there are many streaming multiprocessors that can be harnessed for work. For example, as shown in figure 1.15, the work can be split up into 8x8 tiles to be worked on separately. Using the hardware specifications for the Nvidia Volta GPU, these tiles can be operated on by 32 double-precision cores spread over 84 streaming multiprocessors, for a total of 2,688 double-precision cores working simultaneously. If we have one GPU on each node of a 16-node cluster, each with 2,688 double-precision cores, this gives 43,008-way parallelism across the 16 GPUs.
These are impressive numbers, but at this point we must temper expectations by acknowledging that actual speedup falls far short of this full potential. Our challenge now becomes organizing such extreme and disparate layers of parallelism to obtain as much speedup as possible.
For this high-level application walk-through, we have left out a lot of important details which will be covered in the later chapters of the book. But even at this level of detail, it highlights some of the strategies for exposing parallelism of an algorithm. To be able to develop similar strategies for other problems, an understanding of modern hardware and software is necessary. We now dive deeper into the current hardware and software models. These conceptual models are simplified representations of the diverse real-world hardware to avoid the complexity and maintain generality over the quickly evolving systems.
To build a basic understanding of how parallel computing works, we start with an explanation of the components that comprise modern hardware. Information (data) is stored in main memory composed of Dynamic Random Access Memory, called DRAM. A computational core, or core for short, is a component that can perform arithmetic operations (add, subtract, multiply, divide), evaluate logical statements, and load and store data from DRAM. When an operation is performed on data, the instructions and data must be loaded from memory onto the core, operated on, and stored back in memory. Modern CPUs, sometimes also called processors, are outfitted with many cores capable of executing these operations in parallel. It is becoming more common to find systems outfitted with accelerator hardware, like Graphics Processing Units (GPUs). GPUs are equipped with 1000s of cores and a memory space that is separate from the CPU’s DRAM.
A compute node is composed of a processor (or two), DRAM, and an accelerator, and may be a single home desktop or a “rack” in a supercomputer. Compute nodes can be connected to each other by one or more networks, sometimes called an interconnect. Conceptually, a node runs a single instance of the operating system that manages and controls all of the hardware resources.
Hardware is becoming more complex and heterogeneous. We start with simplified models of the components of the system so that each is easier to understand.
One of the first and most scalable approaches to parallel computing is the distributed memory cluster shown in figure 1.16. Each Central Processing Unit (CPU) has its own local memory, composed of Dynamic Random-Access Memory (DRAM), and is connected to other CPUs by a communication network. The good scalability of distributed memory clusters arises from its seemingly limitless ability to incorporate more nodes.
Figure 1.16 The distributed memory architecture links nodes composed of separate memory spaces. These nodes can be workstations or racks.

A further strength of this architecture is that it provides some memory locality by dividing the total addressable memory into smaller subspaces for each node, thereby making the accessing of memory off-node clearly different than on-node. This forces the programmer to explicitly access different memory regions. The disadvantage is that the programmer must manage the partitioning of the memory spaces at the very outset of the application.
An alternative approach is to connect two CPUs directly to the same shared memory, as shown in figure 1.17. The strength of this approach is that the processors share the same address space, which simplifies programming. But it introduces potential memory conflicts, resulting in correctness and performance issues. Synchronizing memory access and values between the CPUs, or between the processing cores of a multi-core CPU, is complicated and expensive.
Figure 1.17 The shared memory architecture provides parallelism within a node.

The addition of more CPUs and processing cores does not increase the amount of memory available to the application. This, along with the synchronization costs, limits the scalability of the shared memory architecture.
So why not just increase the clock frequency of the processor to get greater throughput, as has been done in the past? The biggest limitation on increasing CPU clock frequencies is that it requires more power and produces more heat. Whether it is an HPC supercomputing center with limits on installed power lines or your cell phone with limited battery capacity, devices today all have power limitations. This problem is called “the power wall”. Rather than increasing the clock frequency, why not do more than one operation per cycle? This is the idea behind the resurgence of vectorization on many processors. It takes only a little more energy to do multiple operations in a vector unit than a single operation (more formally called a scalar operation). Vectorization can process more data in a single clock cycle than a serial process. There is little change to the power requirements for multiple operations versus just one, and the reduction in execution time can lead to a reduction in energy consumption for an application.
Figure 1.18 Vector processing example with 4 array elements operated on at once.

Much like a four-lane freeway allows four cars to move simultaneously in comparison to a single-lane road, the vector operation gives greater processing throughput. Indeed, the four pathways through the vector unit, shown with different shading in the figure, are commonly called the lanes of a vector operation. Most current CPU and GPU processors have some capability for vectorization or equivalent operations. The amount of data processed in one clock cycle, the vector length, depends on the size of the vector units on the processor. Currently, the most commonly available vector length is 256 bits. If the discretized data are 64-bit doubles, then we can do four floating-point operations simultaneously as a vector operation. As shown in figure 1.18, vector hardware units load a block of data at a time, perform a single operation on the data simultaneously, and then store the result.
An accelerator device is a discrete piece of hardware designed for executing specific tasks at a fast rate. The most common accelerator device is the GPU. When used for computation, the device is sometimes referred to as General Purpose Graphics Processing Unit (GPGPU). The GPU contains many small processing cores, called streaming multiprocessors (SMs). These SMs are simpler than a CPU core, but they do provide a massive amount of processing power. There is usually a small integrated GPU on the CPU. Most modern computers also have a separate discrete GPU connected to the CPU by the Peripheral Component Interface (PCI) bus as shown in figure 1.19. The bus introduces a data communication cost but the discrete card is often more powerful. We will discuss this interesting hardware architecture more in chapters 9-12.
Figure 1.19 GPUs come in two varieties, integrated and discrete. Discrete, or dedicated, GPUs typically have a large number of streaming multiprocessors and have their own DRAM. Accessing data on a discrete GPU requires communication over a PCI bus.

Now let’s combine all of these different hardware architectures into one combined model as shown in figure 1.20. There are two nodes, each with two CPUs sharing the same DRAM main memory. Each CPU is a dual-core processor with an integrated GPU. There is also a discrete GPU on the PCI bus attached to one of the CPUs. Though the CPUs share main memory, they are commonly in different Non-Uniform Memory Access (NUMA) regions, which means that accessing the other CPU’s memory is more expensive than accessing its own memory.
Figure 1.20 This is a model of a general heterogeneous parallel architecture. It consists of two nodes connected by a network. Each node has a multicore CPU with integrated and discrete GPU and some memory (DRAM). Modern compute hardware will have some arrangement of these components.

Throughout this hardware discussion, we have presented a simplified model of the memory hierarchy, showing just DRAM or main memory. We’ve shown a cache in the combined model, but no detail on its composition or how it functions. The complexities of memory management, including the multiple levels of cache, are reserved for discussion in chapter 3. We have presented a model for modern hardware to help you identify the components available and to help select the parallel strategy best suited for your application and hardware choices.
The software model for parallel computing is necessarily motivated by the underlying hardware model, but is nonetheless distinct from it. The interface between the two is through the operating system. Parallel operations do not spring to life on their own; rather, source code must indicate how to parallelize by spawning processes or threads; offloading data, work, and instructions to a compute device; or operating on blocks of data at a time. The programmer must first expose the parallelism, determine the best technique to operate in parallel, and then explicitly direct its operation in a safe, correct, and efficient manner. The following methods are the most common techniques for parallelization.
The message passing approach was developed for distributed memory architectures. The name “message passing” comes from the explicit messages that are needed to move data between processes. In this model, your application spawns separate processes, called ranks in message passing, each with its own memory space and instruction pipeline, as shown in figure 1.21. The figure also shows that the processes are handed to the operating system for placement on the processors. The application lives in the part of the diagram marked as user space, where the user has permission to operate. The part below is kernel space and is protected from dangerous operations by the user.
Figure 1.21 The message passing library spawns processes. The operating system places the processes on the cores of two nodes. The question marks indicate that the operating system controls the placement of the processes and may move them during the run as suggested by the dashed arrows. The operating system also allocates memory for each process from the node main memory.

Keep in mind that the processors, i.e. the CPUs, have multiple processing cores which are not equivalent to the processes. Processes are an operating system concept and processors a hardware component. For however many processes the application spawns, they are scheduled by the operating system to the processing cores. You can actually run eight processes on your quad-core laptop and they will just swap in and out of the processing cores. For this reason, mechanisms have been developed to tell the operating system how to place processes and whether to “bind” the process to a processing core.
To move data between processes, explicit messages must be programmed into the application. These messages may be sent over a network or through shared memory. There have been many message passing libraries, but they coalesced into the Message Passing Interface (MPI) standard in 1992. Since then, MPI has taken over this niche and is present in almost all parallel applications that scale beyond a single node. Many different implementations of MPI libraries are available.
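A minimal MPI sketch of this model is shown below: each rank is a separate process with its own memory, and an explicit message carries a value from one rank to the next. The program is illustrative only; it compiles with an MPI wrapper compiler such as mpicc and runs under mpirun.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
   MPI_Init(&argc, &argv);

   int rank, nprocs;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank
   MPI_Comm_size(MPI_COMM_WORLD, &nprocs); // total number of ranks

   // Explicit message passing: send a value to the next higher rank
   double value = (double)rank;
   if (rank + 1 < nprocs)
      MPI_Send(&value, 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
   if (rank > 0)
      MPI_Recv(&value, 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);

   printf("Rank %d of %d has value %g\n", rank, nprocs, value);

   MPI_Finalize();
   return 0;
}
```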
Distributed Computing versus Parallel Computing:
There are parallel applications that use a lower-level approach to parallelism called distributed computing. We define distributed computing as a set of loosely coupled processes that cooperate via operating-system-level calls. While distributed computing is a subset of parallel computing, the distinction is important. Examples of distributed computing applications include peer-to-peer networks, the World-Wide Web, and internet mail. The Search for Extraterrestrial Intelligence (SETI@home) project is just one example of many scientific distributed computing applications.
Figure 1.22 The application process spawns threads. The threads are restricted to the domain of the node. The question marks show that the operating system decides where to place the threads. Some memory is shared between threads.

Each process usually resides on a separate node and is created through the operating system using something like a remote procedure call (RPC) or a network protocol. The processes then exchange information by passing messages using inter-process communication (IPC), of which there are several varieties. Simple parallel applications often use a distributed computing approach, frequently through a higher-level language such as Python with specialized parallel modules or libraries.
The thread approach to parallelism spawns separate instruction pointers within the same process as shown in figure 1.22. As a result, portions of the process memory can be easily shared between threads. This comes with correctness and performance pitfalls. The programmer is left to determine which sections of the instruction set and data are independent and can support threading. These considerations are discussed in more detail in chapter 6 where we will also look at OpenMP, one of the leading threading systems. OpenMP provides the capability to spawn threads and divide up the work among them.
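As a taste of what is coming in chapter 6, the sketch below spawns a team of OpenMP threads within a single process and has each thread report its identity; it assumes an OpenMP-aware compiler (for example, compiling with an OpenMP flag such as -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
   // Spawn a team of threads inside this one process. The threads share
   // the process memory but each has its own thread number.
   #pragma omp parallel
   {
      printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
   }
   return 0;
}
```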
Figure 1.23 Vector instructions in source code can get different performance from different compilers.

There are many varieties of threading approaches, ranging from heavy-weight to light-weight and managed either in user space or by the operating system. While threading systems are limited to scaling within a single node, they are an attractive option for modest speedup. The memory limitations of the single node have larger implications for the application.
Vectorizing an application can be far more cost effective than expanding compute resources at an HPC center and may be absolutely necessary on portable devices like cell phones. When vectorizing, work is done in blocks of 2-16 data items at a time. The more formal term for this class of operation is Single Instruction, Multiple Data or SIMD. The term SIMD is used a lot when talking about vectorization. SIMD is just one category of parallel architectures that will be discussed later in section 1.4.
Invoking vectorization from a user’s application is most often done through source code pragmas or through compiler analysis. Pragmas and directives are hints given to the compiler to guide how to parallelize or vectorize a section of code. Both pragmas and compiler analysis are highly dependent on the compiler capabilities, as shown in figure 1.23. Here we are dependent on the compiler, whereas the previous parallel mechanisms depended on the operating system. Also, without explicit compiler flags, the generated code targets the least powerful processor and vector length, significantly reducing the effectiveness of the vectorization. There are mechanisms by which the compiler can be bypassed, but they require much more programming effort and are not portable.
Stream processing is a dataflow concept in which a stream of data is processed by a simpler, special-purpose processor. Long used in embedded computing, the technique was adapted for rendering large sets of geometric objects for computer displays in a specialized processor called a Graphics Processing Unit (GPU). These GPUs were filled with a broad set of arithmetic operations and multiple streaming multiprocessors (SMs) to process geometric data in parallel. Scientific programmers soon found ways of adapting this approach to large sets of simulation data, such as cells, expanding the role of the GPU to what some have termed a General Purpose Graphics Processing Unit (GPGPU). In figure 1.24, the data and kernel are shown off-loaded over the PCI bus to the GPU for computation. GPUs are still limited in functionality in comparison to CPUs, but where their specialized functionality can be used, they provide extraordinary compute capability at a lower power requirement. Other specialized processors also fit this category, though we focus on the GPU in our discussions.
Figure 1.24 In the stream processing approach, data and compute kernel are offloaded to the GPU and its streaming multiprocessors. Processed data, or output, are transferred back to the CPU for file IO or other work.
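The offload pattern can be expressed in several ways (CUDA, OpenCL, OpenACC, and others covered later in the book). Purely as an illustrative sketch in standard C, an OpenMP target region can express the same idea, assuming a compiler with GPU offload support; the map clause describes the data movement over the PCI bus.

```c
// Offload the cell update to an attached GPU. The map clause copies the
// data to the device and back, and the loop iterations are spread across
// the GPU's streaming multiprocessors.
void scale_on_gpu(int ncells, double *cell, double factor)
{
   #pragma omp target teams distribute parallel for map(tofrom: cell[0:ncells])
   for (int i = 0; i < ncells; i++) {
      cell[i] *= factor;
   }
}
```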

If you read more about parallel computing, you will encounter acronyms such as SIMD (single-instruction, multiple data) and MIMD (multiple-instruction, multiple data). These terms refer to categories of computer architectures proposed by Michael Flynn in 1966 in what has become known as Flynn’s Taxonomy. The classes help inform the different ways of viewing potential parallelism in architectures. The categorization is based on breaking up instructions and data into either serial or multiple operations, as shown in figure 1.25. Be aware that, though the taxonomy is useful, some architectures and algorithms do not fit neatly within a category. Its usefulness comes from recognizing patterns, such as SIMD's potential difficulty with conditionals: each data item might want to take a different branch of the code, but they all have to execute the same instruction.
Figure 1.25 Flynn’s Taxonomy categorizes different parallel architectures.

A serial architecture is single-instruction, single-data (SISD). Two categories have only partial parallelism, in that either the instructions or the data are parallel, but not both.
In the case where there is more than one instruction sequence, the category is called multiple-instruction, single-data (MISD). This is not a common architecture; the best example is redundant computation on the same data, which is used in highly fault-tolerant approaches such as spacecraft controllers.
Vectorization is a prime example of single-instruction, multiple-data (SIMD) in which the same instruction is performed across multiple data. A variant of SIMD is single-instruction, multi-thread (SIMT) which is commonly used to describe GPU workgroups.
The final category has parallelism in both instructions and data, and is referred to as multiple-instruction, multiple-data (MIMD). The MIMD category describes multi-core parallel architectures which comprise the majority of large parallel systems.
So far, in our initial example in section 1.3.1, we looked at the data parallel strategy for cells or pixels. But the data parallel approach can also be used for particles and other data objects. Data parallel is the most common approach and often the simplest. Essentially, each process executes the same program, but operates on a unique subset of data, as illustrated in figure 1.26. The data parallel approach has the advantage that it scales well as the problem size and number of processors grows.
Figure 1.26 Various parallel strategies, including master-worker, pipeline or bucket-brigade and data parallel.

Another approach is task parallelism. This includes the master-worker, pipeline, and bucket-brigade strategies, also shown in figure 1.26. The pipeline approach is used in superscalar processors, where address and integer calculations are done with a separate logic unit from the floating-point unit, allowing these to be done in parallel. The bucket-brigade uses each processor to operate on and transform the data in a sequence of operations. In the master-worker approach, one processor schedules the tasks for all the workers, and each worker checks for the next work item as it returns its completed task. It is also possible to combine different parallel strategies to expose a greater degree of parallelism.
We will be presenting a lot of comparative performance numbers and speedups. Often the term speedup is used to compare two different run-times with little explanation or context to understand what it means. Speedup is a general term that is used in many contexts such as quantifying the effects of optimization. To clarify the difference between the two major categories of parallel performance numbers, we’ll define two different terms.
Definition: Parallel Speedup -
This should really be called serial-to-parallel speedup. The speedup is relative to a baseline serial run on a standard platform, usually a single CPU. The parallel speedup could be due to running on a GPU or with OpenMP or MPI on all the cores of a node of a computer system.
Definition: Comparative Speedups -
The full term should be comparative speedup between architectures. This is usually a performance comparison between two parallel implementations or another comparison between reasonably constrained sets of hardware. For example, it may be between a parallel MPI implementation on all the cores of the node of a computer and the GPU(s) on that node.
These two categories of performance comparisons represent two different goals. The first is to understand how much speedup can be obtained by adding a particular type of parallelism; it is not a fair comparison between architectures, it is about parallel speedup. For example, comparing a GPU run-time to a serial CPU run is not a fair comparison of a multi-core CPU and the GPU. A comparative speedup between architectures is more appropriate when trying to compare a multi-core CPU to the performance of one or more GPUs on a node. In recent years, some have normalized the two architectures so that the relative performance is for similar power or energy requirements rather than for an arbitrary node.
Still, there are so many different architectures and possible combinations that almost any performance number can be obtained. You can pick a fast GPU and a slow CPU, or a quad-core CPU versus a 16-core processor. We suggest adding the following terms in parentheses to comparisons to give the performance numbers more context and to make them fairer.
- Add (Best 2016) to each term as in Parallel Speedup (Best 2016) and Comparative Speedup (Best 2016) to indicate that the comparison is between the best hardware released in a particular year. This might be a high-end GPU vs a high-end CPU released in 2016.
- Add (Common 2016) or (2016) if the two architectures were released in 2016, but are not the highest-end hardware. This might be relevant to developers and users who have more mainstream parts than that found in the top-end systems.
- Add (Mac 2016) if the GPU and the CPU were released in a 2016 Mac laptop or desktop or similar for other brands with fixed components over a period of time. Performance comparisons of this type are valuable to users of a commonly available system.
- Add (GPU 2016:CPU 2013) to show that there is a possible mismatch in the hardware release year of the components being compared.
- No qualifications added to comparison numbers. Who knows what the numbers mean?
Because of the explosion in CPU and GPU models, performance numbers will necessarily be more of a comparison between apples and oranges than a well-defined metric. But for more formal settings, we should at least indicate the nature of the comparison so that others have a better idea of the meaning of the number and to be more fair to the hardware vendors.
highlight, annotate, and bookmark
You can automatically highlight by performing the text selection while keeping the alt/ key pressed.

This book is written with the application code developer in mind and no previous knowledge of parallel computing is assumed. You should just have a desire to improve the performance and scalability of your application. The application areas include scientific computing, machine learning, and analysis of big data on systems ranging from a desktop to the largest supercomputers.
To fully benefit from this book, readers should be proficient programmers, preferably with a compiled, HPC language such as C, C++, or Fortran. We also assume a rudimentary knowledge of hardware architectures. In addition, readers should be comfortable with computer technology terms such as bits, bytes, ops, cache, RAM, etc. It is also helpful to have a basic understanding of the functions of an operating system and how it manages and interfaces with the hardware components.
After reading this book, some of the skills you will have are:
- Determining when message passing (MPI) is more appropriate than threading (OpenMP) and vice-versa
- Estimating how much speedup is possible with vectorization
- Discerning which sections of your application have the most potential for speedup
- Deciding when it might be beneficial to leverage a GPU to accelerate your application
- Establishing what is the peak potential performance for your application
- Estimating the energy cost for your application
Even after this first chapter, you should feel comfortable with the different approaches to parallel programming. We suggest that you work through the exercises to help you integrate the many concepts that have been presented so far. If you are beginning to feel a little overwhelmed by the complexity of the current parallel architectures, you are not alone. It has become very challenging to grasp all the possibilities. We’ll break it down, piece-by-piece, in the following chapters.
- What are some other examples of parallel operations in your daily life? How would you classify your example? What does the parallel design appear to optimize for? Can you compute a parallel speedup for this example?
- For your [desktop, laptop, cell-phone], what is the theoretical parallel processing power of your system in comparison to its serial processing power? What kinds of parallel hardware are present in it?
- Which parallel strategies do you see in the store checkout example in figure 1.1? Are there some present that are not shown? How about in your examples from exercise 1?
- You have an image processing application that needs to process 1000 images daily that are 4 mebibytes (MiB, 2^20 or 1,048,576 bytes) each in size. It takes 10 minutes in serial to process each image. Your cluster is composed of multi-core nodes with 16 cores and a total of 16 gibibytes (GiB, 2^30 bytes, or 1,024 mebibytes) of main memory storage per node. Note that we are using the proper binary terms, MiB and GiB, rather than MB and GB, which are the metric terms for 10^6 and 10^9 bytes, respectively.
- a. What parallel processing design best handles this workload?
- b. Now customer demand increases by 10x. Does your design handle this? What changes would you have to make?
- An Intel Xeon E5-4660 Processor has a thermal design power of 130 W; this is the average power consumption rate when all 16 cores are used. Nvidia’s Tesla V100 GPU and AMD’s MI25 Radeon GPU have a thermal design power of 300 W. Suppose you port your software to use one of these GPUs. How much faster should your application run on the GPU to be considered more energy efficient than your 16 core CPU application?
- This is the era of Parallel Computing where most of the compute capabilities of hardware are only accessible through parallelism. Programmers should be well-versed in the techniques used to exploit parallelism.
- Applications must have parallel work. The most important job of a parallel programmer is to expose more parallelism.
- Improvements to hardware are nearly all enhancements to parallel components. Relying on increasing serial performance alone will not yield future speedup. The key to increasing application performance lies in the parallel realm.
- A variety of parallel software languages are emerging to help access the hardware capabilities. Programmers should know which are suitable for different situations.