HPC

A very brief introduction to parallel (high-performance) computing, excerpted from my Master’s Thesis, follows:

Parallel Computing

Imagine a post office in a large city with only one person responsible for processing all of the incoming mail. This person takes stacks of mail, walks around to the different bins, and sorts the mail into the bins by ZIP code. A single person can only do this so fast. One way to speed this up is to add another person to help. Working together, the two can sort the mail twice as fast. If two more people are added, the mail is sorted four times as fast as a single person could manage.

In addition to the rate at which the task is accomplished, other factors such as efficiency can affect overall performance. In the previous case, all the workers are walking around to all of the bins. A more efficient method of splitting up the workload can be used: the total number of bins can be divided among the workers. Each worker is then responsible for a smaller number of bins and takes a portion of the incoming mail. This is an example of task partitioning, or a divide-and-conquer approach. Each worker files the mail that matches the worker’s own bins and passes non-matching mail to the worker responsible for it. This cuts down on the amount of walking around that each worker does.

This may, however, increase the efficiency only to a certain point. If the workers are required to verbally communicate when passing along mail to the responsible worker, this adds some communication overhead. As the number of workers increases, the communication burden grows. At some point, the communication requirement starts to dominate a worker’s time, and adding even more workers only makes the situation worse; performance can only decrease. If workers are added beyond even this point, it could take them longer to sort the mail than it did for a single worker.

The mailroom analogy carries over to many computer programs. Computer programs often perform the same task many times. Instead of adding workers, additional processors or computers can be added to increase performance. For instance, the mailroom case is analogous to sorting routines, which vary in speed and efficiency and sometimes benefit from the addition of processors working in parallel. Two such parallel sorting algorithms are:

  • Odd-Even Transposition, with a worst-case time of O(n) (a sketch follows this list), and
  • Shear Sort, with a worst-case time of O(√n log n).
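
Neither algorithm is spelled out in the excerpt; a minimal single-process sketch of odd-even transposition sort in Python, simulating the n compare-exchange phases that the n-processor version would run concurrently, might look like the following.

    # Single-process simulation of odd-even transposition sort. On a linear
    # array of n processors holding one element each, the n phases below run
    # concurrently, giving the O(n) worst-case time quoted above.
    def odd_even_transposition_sort(values):
        a = list(values)
        n = len(a)
        for phase in range(n):
            # Even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
            start = 0 if phase % 2 == 0 else 1
            for i in range(start, n - 1, 2):
                if a[i] > a[i + 1]:
                    a[i], a[i + 1] = a[i + 1], a[i]  # compare-exchange
        return a

    print(odd_even_transposition_sort([5, 3, 8, 1, 9, 2]))  # [1, 2, 3, 5, 8, 9]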

How much performance increases is determined primarily by the amount of communication required between processes. Two important measures of how well a program can be executed in parallel are speedup and efficiency.

Control Parallelism vs. Data Parallelism

Two major categories of parallel programming are data-parallel programming and control-parallel programming (ref. Quinn). Control-parallel algorithms can apply different operations to different data concurrently. This is considered a pipelined approach, and the data flow in this case can vary widely in complexity. Any manufacturing assembly line is a good example of this type of parallelism. On the other hand, data-parallel algorithms split the data among the different nodes and apply the same operations to that data. The operations on the data are typically independent, and the communication between nodes is much simpler than in control-parallel algorithms. These differences in operation and structure between parallel algorithms and their serial equivalents make performance much more difficult to measure.
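
As a hedged illustration (not code from the thesis), the sketch below contrasts the two styles using Python’s standard multiprocessing module: the data-parallel function applies the same invented kernel to a split of the data, while the control-parallel pipeline applies different stages in sequence, each of which would run on its own processor in a real implementation.

    from multiprocessing import Pool

    def square(x):
        # Data-parallel kernel: the same operation applied to every element.
        return x * x

    def parse_stage(item):
        return int(item)

    def scale_stage(value):
        return value * 10

    def data_parallel(items):
        # The data is split among worker processes; each applies the identical operation.
        with Pool(processes=4) as pool:
            return pool.map(square, items)

    def control_parallel(items):
        # A pipeline of *different* operations, like stations on an assembly line.
        # Here the stages run in one process; a real control-parallel program would
        # give each stage its own processor and stream items between the stages.
        return [scale_stage(parse_stage(item)) for item in items]

    if __name__ == "__main__":
        print(data_parallel(range(8)))       # [0, 1, 4, 9, 16, 25, 36, 49]
        print(control_parallel(["1", "2"]))  # [10, 20]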

Measuring Performance

Parallel programs can’t be evaluated in the same simple terms, such as execution time and input size, used for a serial program. This is due to additional issues such as the use of multiple processors and different communication architectures. The combination of a program written for parallel execution and a parallel architecture is known as a parallel system. Once a program has been developed or converted to run in a parallel environment, it must be tested using metrics that have meaning for a parallel system.

Several metrics have been developed for parallel systems with the most common being the parallel run-time Tp.  Tp is measured from the start of the program to the end of the last process.  Tp is used in the computation of other common parallel metrics such as speedup, efficiency, and isoefficiency.  These three metrics are discussed in the next three sections.
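
A small sketch of measuring Tp in this sense, wall-clock time from the start of the program until the last process finishes, is shown below; the busy-work function is invented purely for illustration.

    import time
    from multiprocessing import Process

    def busy_work(n):
        # Stand-in workload; any real computation could go here.
        total = 0
        for i in range(n):
            total += i * i

    if __name__ == "__main__":
        start = time.perf_counter()
        workers = [Process(target=busy_work, args=(2_000_000,)) for _ in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()                      # wait until the last process ends
        tp = time.perf_counter() - start  # parallel run-time Tp
        print(f"Tp = {tp:.3f} s")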

Speedup

Speedup is defined by Kumar as “the ratio of the serial run time of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processors.”  The speedup equation is

      S = Ts / Tp

where Ts is the best serial execution time and Tp is the parallel execution time. This metric provides a basic measure of the gain in performance by parallelizing the algorithm.

In theory, the best speedup that can be obtained is a linear speedup. Ideally, if a program runs in time Ts on a single processor, then it should run in time Ts / p on p processors. In practice, this can never quite be achieved because of the communication overhead required to distribute the work among multiple processes. A deceptive condition known as super-linear speedup occurs when the observed speedup is greater than p, that is, when Tp is less than Ts / p. Unless some other constraint is involved, such as the serial algorithm being memory bound, this really implies that the parallel algorithm is coded more efficiently and that the serial algorithm could be recoded to match. By itself, speedup doesn’t provide a completely accurate portrayal of the algorithm’s performance.
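
As a small worked example (with made-up timings), speedup can be computed directly from the two measured times and compared against the linear ideal of p:

    def speedup(ts, tp):
        # S = Ts / Tp: best serial run time over parallel run time.
        return ts / tp

    ts, tp, p = 120.0, 35.0, 4  # hypothetical measurements on 4 processors
    s = speedup(ts, tp)
    print(f"S = {s:.2f}; linear speedup on {p} processors would be {p}")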

Efficiency

A better understanding of the algorithm’s performance is achieved by looking at the parallel algorithm’s efficiency in addition to its speedup. Although the speedup for a program may continue to increase as processors are added, it will eventually taper off. This is because the parallel program does not get one hundred percent of each processor’s time as nodes are added. The small percentage of time taken by other tasks accumulates as nodes are added and causes the speedup to fall off. The efficiency function, E, is

      E = S / p = Ts / (p x Tp)

where S is the speedup and p is the number of processors. This metric measures the benefit of adding more processors.

Because of the previously mentioned upper bound of p on speedup, the efficiency of a parallel program will never exceed one. If the efficiency can be maintained as processors are added and the problem size is scaled accordingly, the program is said to be scalable.
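
Continuing the hypothetical timings above, the sketch below evaluates E = S / p for several processor counts; the measured times are invented, but they show the typical pattern of efficiency tapering off while never exceeding one.

    def efficiency(ts, tp, p):
        # E = S / p = Ts / (p x Tp)
        return ts / (p * tp)

    ts = 120.0                              # hypothetical best serial time
    measured = {2: 65.0, 4: 35.0, 8: 21.0}  # hypothetical Tp for each processor count
    for p, tp in measured.items():
        print(f"p={p}: S={ts / tp:.2f}, E={efficiency(ts, tp, p):.2f}")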

Isoefficiency

The rate at which the problem size has to be scaled as processors are added to maintain a fixed (iso) efficiency is given by the isoefficiency function

      W = [E / (1 – E)] x To(W,p)

where W is the workload or problem size, E is the efficiency, and To is the parallel overhead. To is a function of the problem size and the number of processors used. The idea is to maintain the same (iso) efficiency as the number of processors changes.

The main difference between the efficiency function and the isoefficiency function is that the isoefficiency function takes the overhead of parallelization into account. The lower the value of the isoefficiency function, the more scalable the program is. If, however, a fixed efficiency cannot be maintained regardless of how fast the problem size is scaled, the program is considered unscalable.
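
As a hedged sketch, the snippet below evaluates the isoefficiency relation for an assumed overhead function To(W, p) = p log2 p, chosen only to make the example concrete (it is not from the thesis). Because this To does not depend on W, the required problem size can be read off directly rather than solved for iteratively.

    import math

    def isoefficiency_workload(e, p):
        # W = [E / (1 - E)] x To(W, p), with the assumed To(W, p) = p * log2(p)
        k = e / (1.0 - e)
        to = p * math.log2(p)
        return k * to

    for p in (2, 4, 8, 16):
        w = isoefficiency_workload(0.8, p)
        print(f"p={p:2d}: problem size must grow to about W = {w:.0f}")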
