by Damjan Jevtic and Ted Unnikumaran

Introduction to Parallel Processing

Parallel processing is an integral part of everyday life. The idea is such a big part of our daily lives that we benefit from it without realizing. Lets think of some examples of parallel processing we find in the world. In nature, parallel processing is evident when ants work together to carry a heavy load. In a factory, an assembly line where at every station somebody is doing part of the work to complete the finished product, is another example of parallel processing. The goal of parallel processing is thus to solve a given problem more rapidly, or to enable the solution of a problem that would otherwise be inefficient by one agent. Pipelining is to divide a task T into subtasks T1, T2, ..., Tk and assign them to a chain of processing stations. These stations are called pipeline segments. Parallelism is achieved when several segments operate simultaneously. Lets look at the simple problem of emptying a swimming pool using buckets. A single person will complete the job, in a certain time. However this process may be speeded up by utilizing additional workers. Ideally, two people should be able to empty the pool in half the time, and a large number or workers will be able to complete the job in a small fraction of the original time. However, there are limitations. The solution will need some co-operation between workers. Also we have to think of access to the pool and we will need to avoid collision. The time that we spend achieving this co- operation will take time and can be termed overhead. Another factor to consider is dependencies. Think of the assembly plant idea of pipelining mentioned earlier. The first person completes his task and passes it along to the next person who completes his task and passes it on to the next person and this will continue until the product gets completed. The completion of the product then is dependent on the time taken by each person. If one person takes a long time, then people after this person will stand idle awaiting the product, while the people before the slow worker will be unable to move their products to the next stage of the assembly line. Additionally, there has to be a limit on the number of workers. Consider, the idea of workers emptying a room with only one door. If we have a large number of workers, a bottleneck could occur when all the workers try to move their chairs through the door at the same time. The delays caused by the bottleneck could be so great that the time taken to empty the room of chairs by all these workers may be longer than the original time taken by the single worker. Also we have to consider the cost of all these workers, we might not be able to afford all of these workers. So in this case, reducing the number of workers can alleviated the bottleneck, and reduce solution time and make the task profitable. Okay, so now you should understand some basic concepts about parallel processing, so lets start applying those concepts to computers. A traditional sequential computer is the von Neumann model. This model contains a processor, an associated memory, an input/output interface and various busses connecting these devices. The processor in the model is responsible for fetching, decoding and executing a programs instructions. Parallel processing may be added to this model by using functional units that repeatedly performs the same operation on data received from the preceding functional unit. So for the simplest case, the processor could contain three functional units, one to fetch the instructions from memory, one to decode these instructions and one to execute the decoded instructions. As we saw in the assembly line, a pipeline is only as effective as the slowest component, so any delays in the pipeline affects the whole system. Brief History The computers of the late 1940s and 1950s used vacuum tube technology which was expensive and unreliable. The performance and reliability was dramatically improved with the invention of the transistor in 1948 and the integration of a number of these transistors into single "Small Scale Integration" packages. "Medium Scale Integration" and "Large Scale" integration followed, until the present "Very Large Scale Integration" (VLSI) computers. I am assuming the next integration will be called the "Very Very Large Scale Integration". Anyway, the performance of sequential computers (the computers that didn't use parallelism) has improved leaps and bounds so that now, their descendants are capable of a lot better performance and are in use everyday throughout the world. Due to the programming and design difficulties associated with parallel computers, they were not accepted by the commercial world. In 1969, Control Data Corporation produced their CDC 7600 computer which exploited pipelining. This machine had superior performance over other existing computers. In 1976, the Cray-1, was delivered to Los Alamos National Laboratory. Since then progress on parallel architectures has accelerated rapidly with the rise of new manufacturers.


Since the first computer was developed, there have been numerous computer architectures that have been proposed, and built. This has led to the need to classify the designs so we could better evaluate and compare architectures. Early computers which adopted the model of a single central processing unit (CPU) connected to a memory unit have been described as von Neumann architectures, which we briefly mentioned earlier. In this model, different instructions operate on a single sequence of data. As computer architectures started to get more complicated, other classifications were developed. In 1972, some guy named Flynn proposed a classification, that has since become known as Flynn's taxonomy. This categorizes architectures into four areas. 1. SISD - Single Instruction Single Data 2. SIMD - Single Instruction Multiple Data 3. MISD - Multiple Instruction Single Data 4. MIMD - Multiple Instruction Multiple Data While people have criticized Flynn's taxonomy as being too broad, it is simple, and people like simple things so that is why it is popular. This web page will only deal with three of the four architectures, leaving out MISD because no architecture falls obviously into the MISD category.

SISD Computers

Conventional single processor computers are classified as SISD systems. Each arithmetic instruction initiates an operation on a data item taken from a single stream of data elements. Historical supercomputers such as the Control Data Corporation 6600 and 7600 fit this category as do most contemporary microprocessors.

Vector processors such as the Cray-1 and its descendants are often classified as SIMD machines, although they are more properly regarded as SISD machines. Vector processors achieve their high performance by passing successive elements of vectors through separate pieces of hardware dedicated to independent phases of a complex operation. For example, in order to add two numbers such as  and , the numbers must have the same exponent. The processor must shift the mantissa (and decrement the exponent) of one number until its exponent matches the exponent of the other number. In this example  is adjusted to  so it can be added to , and the sum is . A vector processor is specially constructed to feed a data stream into the processor at a high rate, so that as one part of the processor is adding the mantissas in the pair  another part of the processor is adjusting the exponents in .

The ambiguity over the classification of vector machines depends on how one views the flow of data. A static ``snapshot'' of the processor during the processing of a vector would show several pieces of data being operated on at one time, and under this view one could say one instruction (a vector add) initiates several data operations (adjust exponents, add mantissas, etc.) and the machine might be classified SIMD. A more dynamic view shows that there is just one stream of data, and elements of this stream are passed sequentially through a single pipeline (which implements addition in this example). Another argument for not including vector machines in the SIMD category will be presented when we see how SIMD machines implement vector addition.

Figure 1. SISD Model

The SISD Model (Single Instruction, Single Data Stream) can be compared to the assembly of a car before the advent of the assembly line. In this case, one person would do each task (single instruction) required for the assembly of one (single data stream) car. The fastest that the car can be built is dependent on how fast that one person can work. If that person is gone for a week, nothing gets done.

SIMD Computers

SIMD machines have one instruction processing unit, sometimes called a controller and indicated by a K in the PMS notation, and several data processing units, generally called D-units or processing elements (PEs). The first operational machine of this class was the ILLIAC-IV, a joint project by DARPA, Burroughs Corporation, and the University of Illinois Institute for Advanced Computation. Later machines included the Distributed Array Processor (DAP) from the British corporation ICL, and the Goodyear MPP.

The control unit is responsible for fetching and interpreting instructions. When it encounters an arithmetic or other data processing instruction, it broadcasts the instruction to all PEs, which then all perform the same operation. For example, the instruction might be `` add R3,R0.'' Each PE would add the contents of its own internal register R3 to its own R0. To allow for needed flexibility in implementing algorithms, a PE can be deactivated. Thus on each instruction, a PE is either idle, in which case it does nothing, or it is active, in which case it performs the same operation as all other active PEs. Each PE has its own memory for storing data. A memory reference instruction, for example ``load R0,100'' directs each PE to load its internal register with the contents of memory location 100, meaning the 100th cell in its own local memory.

One of the advantages of this style of parallel machine organization is a savings in the amount of logic. Anywhere from 20% to 50% of the logic on a typical processor chip is devoted to control, namely to fetching, decoding, and scheduling instructions. The remainder is used for on-chip storage (registers and cache) and the logic required to implement the data processing (adders, multipliers, etc.). In an SIMD machine, only one control unit fetches and processes instructions, so more logic can be dedicated to arithmetic circuits and registers. For example, 32 PEs fit on one chip in the MasPar MP-1, and a 1024- processor system is built from 32 chips, all of which fit on a single board (the control unit occupies a separate board).

Vector processing is performed on an SIMD machine by distributing elements of vectors across all data memories. For example, suppose we have two vectors, a and b, and a machine with 1024 PEs. We would store  in location 0 of memory i and  in location 1 of memory i. To add a and b, the machine would tell each PE to load the contents of location 0 into one register, the contents of location 1 into another register, add the two registers, and write the result. As long as the number of PEs is greater than the length of the vectors, vector processing on an SIMD machine is done in constant time, i.e. it does not depend on the length of the vectors. Vector operations on a pipelined SISD vector processor, however, take time that is a linear function of the length of the vectors.

MIMD Computers

The processors in the MIMD classification independently obey their own instruction sequence and apply these instruction to their own data. Unlike the SIMD processors where the processors bound to a synchronous method, MIMD processors can operate asynchronously. If we provide these processors with the ability to communicate with each other, they can interact and co-operate and solve a single problem. The level of communication has led MIMD systems to sometimes being classified as tightly coupled if the degree of communication is high, and loosely coupled if the degree of interaction is low. Okay, so now we need to look at how they make this communication between processors work. There are two methods. 1. Shared memory 2. Distributed memory Shared memory systems allow the processors to communicate by reading and writing to a common address space. Controls are necessary to prevent processors updating the same portion of the shared memory simultaneously. In distributed memory systems, on the other hand, processors address only their private memory and communicate by passing messages along some form of communication path. The interconnection method of shared memory allows all the processors to be connected to the shared memory. If two or more processors wish to access the same portions of this shared memory at the same time then some arbitration mechanism must be used to ensure that only one processor accesses that memory portion at a time. This problem of memory contention may restrict the number of processors that can be interconnected using the shared memory model. The interconnection method of the distributed memory system connects the processors in some fashion and if one or more processors wish to access another processor's private memory, it, or they, can only do so by sending a message to the appropriate processor along their interconnection network. There is thus no memory contention. However, the density of the messages that result in distributed memory systems may still limit the number of processors that may be interconnected, although this number is generally large than shared memory systems.

If you still don't understand the difference, let me explain again. With shared memory all the processors share memory, so they all write to one big block of memory called shared memory. With distributed memory each processor has their own private memory and so they write to their own private memory. An example of this could be a family. If the entire family has a joint bank account then they are using the shared memory. If each member of the family has a separate bank account then they are using distributed memory. If they are using shared memory, the family needs to figure some method to resolve situations when two people want money, additionally the family needs to communicate so they don't run out of money. As a result you wouldn't want your entire extended family including your unemployed, beer-guzzling uncle bob, in this joint account because, well you would run out of money real quick. If you were using distributed memory and therefore separate bank accounts, you would not have a problem of fighting for money, and uncle bob won't be a bother. If the situation arises when you run out of money, the only way you can access someone else's money would be to send a message to the other person asking for money. Busses have been used successfully as an interconnection structure to connect low numbers of processors together. However, if more than one processor wished to send a message on the bus at the same time, an arbiter must decide which message gets access to the bus first. As the number of processors increases, so the contention for the use of the bus grows. Therefore, a bus in not good for large multi-processor systems. An alternative to the bus method is to connect processors via dedicated links to form large networks. This removes bus-contention problems by spreading the communication load across many independent links. Theoretically we can build very large systems, so we have a huge company with a 'spokesman'. Practically, the overheads will restrict the size of the machine. The flexibility and potential scalability of distributed memory MIMD architectures make them the most suitable for 'practical parallel processing'.

Figure 2:

Test Questions:

What are parallel processors?

How do single processors work?

What is the history of parallel processors?

How is the advantages of a parallel processors over single processors?

What are the disadvantages of parallel processors?

What are the different types of parallel processors?