20 Exploiting ILP with Software Approaches I

Dr A. P. Shanthi

 

The objective of this module is to discuss the software approaches used for exploiting ILP. The concept of loop unrolling is discussed in detail.

 

All processors since about 1985 have used pipelining to overlap the execution of instructions and improve the performance of the processor. This overlap among instructions is called Instruction Level Parallelism since the instructions can be evaluated in parallel. There are two largely separable techniques to exploit ILP:

  • Dynamic techniques, which depend on the hardware to locate the parallelism
  • Static techniques, which rely much more on software

 

Static techniques are adopted by the compiler to improve performance. Processors that predominantly use hardware approaches apply their dynamic techniques on top of these compiler optimizations to improve performance further. The dynamic, hardware-intensive approaches dominate processors like the Pentium III and Pentium 4, and even the later Core i3, i5 and i7 processors. The software-intensive approaches are used in processors like the Itanium.

 

We know that,

 

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control hazard stalls

The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation.
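For example, with illustrative (hypothetical) values of an ideal pipeline CPI of 1.0 and average stall counts of 0.1, 0.2 and 0.1 cycles per instruction for structural, data hazard and control hazard stalls respectively, the pipeline CPI would be 1.0 + 0.1 + 0.2 + 0.1 = 1.4; reducing any of the stall terms moves the pipeline closer to its ideal CPI.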

 

When we look at performing optimizations, if we only consider the basic block, then neither the compiler nor the hardware has too many options. A basic block is a straight-line code sequence with no branches in, except to the entry, and no branches out, except at the exit. With an average dynamic branch frequency of 15% to 25%, we normally have only 4 to 7 instructions executing between a pair of branches. Additionally, the instructions in the basic block are likely to depend on each other. So, the basic block does not offer much scope for exploiting ILP. In order to obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks. The simplest method is to exploit loop-level parallelism, that is, parallelism among the iterations of a loop. Vector architectures are one way of handling loop-level parallelism. Otherwise, we have to look at either dynamic or static methods of exploiting loop-level parallelism. One of the common methods is loop unrolling, done dynamically by the hardware using dynamic branch prediction, or statically by the compiler using static branch prediction.

 

While performing such optimizations, both the hardware and the software must preserve program order, that is, the order in which instructions would execute if they were executed sequentially, one at a time, as determined by the original source program. The data flow, the actual flow of data values among instructions that produce results and those that consume them, must also be maintained. Also, instructions involved in a name dependence can execute simultaneously only if the name used in the instructions is changed so that these instructions do not conflict. This is done by register renaming, which resolves name dependences for registers and can be done either by the compiler or by the hardware. Exception behavior should also be preserved, i.e., any change in the order of instruction execution must not change how exceptions are raised in the program.
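As a small illustration of renaming (a hypothetical C fragment in the style of the module's examples, not taken from it), the reuse of the single temporary t below creates anti- and output dependences even though no data flows between the two pairs of statements:

double a, b, c, d, e, f, t;

t = a + b;         /* S1: writes t                                     */
c = t * 2.0;       /* S2: reads t                                      */
t = d + e;         /* S3: writes t again - output dependence with S1,
                      anti-dependence with S2                          */
f = t * 3.0;       /* S4: reads the new value of t                     */

Renaming the second use of the name t to a different variable (or register), say t2, removes the name dependences, so the pair S3/S4 can be scheduled in parallel with S1/S2:

double t1, t2;

t1 = a + b;        /* S1 */
c  = t1 * 2.0;     /* S2 */
t2 = d + e;        /* S3: now independent of S1 and S2 */
f  = t2 * 3.0;     /* S4 */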

 

In the previous modules we looked at how dynamic scheduling can be done by the hardware, where the hardware does dynamic unrolling of the loop. In this module, we will discuss how the compiler exploits loop-level parallelism by statically unrolling loops. Determining instruction dependence is critical to exploiting loop-level parallelism. If two instructions are

  • parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
  • dependent, they are not parallel and must be executed in order, although they may often be partially overlapped.

 

Let us consider the following iterative loop and its assembly equivalent.

for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

Loop:   LD     F0,0(R1)     ;F0=vector element
        ADDD   F4,F0,F2     ;add scalar from F2
        SD     0(R1),F4     ;store result
        SUBI   R1,R1,8      ;decrement pointer 8 bytes (DW)
        BNEZ   R1,Loop      ;branch R1!=zero
        NOP                 ;delayed branch slot

 

The body of the loop consists of three instructions, each dependent on the previous one. The true data dependences in the loop are indicated in the comments below. The other two instructions (the SUBI and the BNEZ) form the loop maintenance code.

 

Loop:   LD     F0,0(R1)     ;F0=vector element
        ADDD   F4,F0,F2     ;add scalar in F2 (depends on the LD for F0)
        SD     0(R1),F4     ;store result (depends on the ADDD for F4)
        SUBI   R1,R1,8      ;decrement pointer 8B (DW)
        BNEZ   R1,Loop      ;branch R1!=zero (depends on the SUBI for R1)
        NOP                 ;delayed branch slot

 

Let us consider the following set of latencies for discussion purposes:

Instruction producing result    Instruction using result     Latency in clock cycles
FP ALU operation                Another FP ALU operation     3
FP ALU operation                Store double                 2
Load double                     FP ALU operation             1
Load double                     Store double                 0
Integer operation               Integer operation            0

 

Original schedule: If we try to schedule the code with the above latencies, the schedule will be as indicated below. There is one stall between the Load and the dependent Add, and two stalls between the Add and the dependent Store. In the loop maintenance code also, one stall has to be introduced because of the data dependence between the SUBI and the Branch. The last stall is the branch delay slot. One iteration of the loop takes ten clock cycles to execute.

 

1 Loop:           LD       F0, 0(R1)       ;F0=vector element

2                     stall

3                     ADDD F4, F0, F2      ;add scalar in F2

4                     stall

5                     stall

6                     SD      0(R1), F4       ;store result

7                     SUBI   R1, R1, 8       ;decrement pointer 8 Byte (DW)

8                     stall

9                     BNEZ R1, Loop        ;branch R1!=zero

10                   stall                             ;delayed branch slot

 

Schedule with reorganization of code: The compiler can try to optimize the schedule by reorganizing the code. It identifies that the SUBI instruction can be moved up and that the Store can be placed in the branch delay slot, with its address offset adjusted (from 0(R1) to 8(R1)) because the Store now executes after the SUBI. The reorganized code is shown below. It takes only six clock cycles per loop iteration, with only one stall.

 

1 Loop:           LD       F0,0(R1)

2                     SUBI   R1,R1,8

3                     ADDD F4,F0,F2

4                     stall

5                     BNEZ R1,Loop         ;delayed branch

6                     SD      8(R1),F4        ;altered when move past SUBI

 

Unrolled and not reorganized loop: Now, let us assume that the loop is unrolled, say, four times. While unrolling the loop, we have to bear in mind that the intention in replicating the loop body is to generate a sufficient number of independent instructions. Therefore, except for the true data dependences, all the name dependences have to be removed by register renaming. Also, since the body of the loop gets replicated, the loop maintenance code involving the Branch instruction (a control dependence) vanishes for the replicated iterations. Observe the code shown below. Different destination registers have been used for the load operations (F0, F6, F10, F14) and the add operations (F4, F8, F12, F16), and the addresses of the load and store instructions have been appropriately modified to 0(R1), -8(R1), -16(R1) and -24(R1). The unrolled copies of the loop body do not carry their own loop maintenance code; the single remaining loop maintenance code decrements the pointer by 32, setting it up for the next pass of the unrolled loop. However, these unrolled instructions are not yet scheduled properly, and there are a number of stalls in the code. It takes 28 clock cycles, or seven per original iteration.

 

1 Loop:           LD       F0,0(R1)

2                     stall

3                     ADDD F4,F0,F2

4                     stall

5                     stall

6                     SD      0(R1),F4

7                     LD       F6,-8(R1)

8                     stall

9                     ADDD F8,F6,F2

10                    stall

11                    stall

12                    SD      -8(R1),F8

13                    LD       F10,-16(R1)

14                    stall

15                    ADDD F12,F10,F2

16                    stall

17                    stall

18                    SD      -16(R1),F12

19                    LD       F14,-24(R1)

20                    stall

21                    ADDD F16,F14,F2

22                    stall

23                    stall

24                    SD      -24(R1),F16

25                    SUBI   R1,R1,#32

26                    stall

27                    BNEZ R1,LOOP

28                    stall                             ;delayed branch slot

 

Unrolled and reorganized loop: We will now look at how the unrolled instructions can be properly scheduled by reorganizing the code. This is shown below. Observe that all the dependent instructions have been separated far enough apart, leading to a stall-free schedule of 14 clock cycles for four iterations, i.e., only 3.5 clock cycles per iteration.

 

1 Loop:           LD       F0,0(R1)

2                      LD       F6,-8(R1)

3                      LD       F10,-16(R1)

4                      LD       F14,-24(R1)

5                      ADDD F4,F0,F2

6                      ADDD F8,F6,F2

7                      ADDD F12,F10,F2

8                      ADDD F16,F14,F2

9                      SD      0(R1),F4

10                    SD      -8(R1),F8

11                    SD      -16(R1),F12

12                    SUBI   R1,R1,#32

13                    BNEZ R1,LOOP

14             SD      8(R1),F16      ; 8-32 = -24

 

How many times to unroll the loop? Suppose we do not know the upper bound of the loop; assume it is n, and suppose we would like to unroll the loop to make k copies of the body. Then, instead of a single unrolled loop, we generate a pair of consecutive loops:

  • The first executes (n mod k) times and has a body that is the original loop
  • The second is the unrolled body surrounded by an outer loop that iterates (n/k) times (integer division)

 

For example, if we have to execute the loop 100 times and we are able to unroll it 6 times, the 6-times unrolled loop is iterated 16 times, accounting for 96 iterations, and the remaining 4 iterations are done with the original rolled loop. For large values of n, most of the execution time is spent in the unrolled loop, which provides a large number of independent instructions and hence ample opportunities for the compiler to do proper scheduling.
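A rough C sketch of this arrangement is given below, with the unroll factor fixed at k = 4 to match the unrolling used in this module. The zero-based indexing and the variable n are illustrative assumptions, not taken from the module; x and s are assumed to be declared as in the earlier example.

/* Hypothetical sketch: unrolling by k = 4 when n may not be a multiple of 4. */
int i = 0;

/* First loop: executes (n mod 4) times using the original body. */
for (; i < n % 4; i++)
    x[i] = x[i] + s;

/* Second loop: the body unrolled four times, iterating n/4 (integer division) times. */
for (; i < n; i += 4) {
    x[i]     = x[i]     + s;
    x[i + 1] = x[i + 1] + s;
    x[i + 2] = x[i + 2] + s;
    x[i + 3] = x[i + 3] + s;
}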

 

Points to Remember with Loop Unrolling: The general points to remember while unrolling a loop are as follows:

  • Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset
  • Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code
  • Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations
  • Eliminate the extra tests and branches and adjust the loop maintenance code
  • Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address
  • Schedule the code, preserving any dependence needed to yield the same result as the original code.

There are, however, limits to loop unrolling:

  1. Decrease in the amount of overhead amortized with each extra unrolling
    • Amdahl's Law of diminishing returns applies.
    • When we unrolled the loop four times, it generated sufficient parallelism among the instructions that the loop could be scheduled with no stall cycles. In fact, of the 14 clock cycles, only 2 cycles were loop overhead: the SUBI, which maintains the index value, and the BNEZ, which terminates the loop. If the loop is unrolled eight times, the same two overhead cycles are shared by eight original iterations, so the overhead drops from 1/2 cycle per original iteration to 1/4.
  2. Growth in code size
    • For larger loops, unrolling increases code size enormously, and this becomes a matter of concern if it increases the instruction cache miss rate.
  3. Register pressure
    • Aggressive unrolling and scheduling create a potential shortfall in registers, known as register pressure. As the number of live values increases, it may not be possible to allocate all live values to registers, and we may lose some or all of the advantages of unrolling.

 

Loop unrolling thus reduces the impact of branches on a pipeline. An alternative is to use branch prediction.

 

Compiler perspectives on code movement: The compiler is concerned with the dependences in a program; whether a dependence results in an actual hardware hazard depends on the given pipeline. The compiler tries to schedule the code so as to avoid hazards. The following points need to be remembered.

  • The compiler looks for data dependences (a RAW hazard for the hardware):
    • Instruction i produces a result used by instruction j, or
    • Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
  • If there is such a dependence, we cannot execute the two instructions in parallel:
    • The dependence is easy to determine for registers (fixed names).
    • It is hard to resolve for memory locations, as 32(R4) and 20(R6) might refer to the same or to different effective addresses.
    • Also, across different loop iterations, the compiler does not know whether 20(R6) in one iteration and 20(R6) in another refer to the same or to different effective addresses.
  • Another kind of dependence is the name dependence: two instructions use the same name (register or memory location) but do not exchange data.
    • Anti-dependence (a WAR hazard for the hardware): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
    • Output dependence (a WAW hazard for the hardware): instructions i and j write the same register or memory location; the ordering between them must be preserved.
  • The final kind of dependence is the control dependence. An example is given below.

 

if p1 {S1;};

if p2 {S2;};

S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.

Two (obvious) constraints on control dependences:

 

An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch and

 

An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch.
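A small C sketch of both constraints, reusing the p1 and S1 notation from the example above; the statement S3 is hypothetical, added here only for illustration:

if (p1) {
    S1;    /* control dependent on p1: cannot be hoisted above the if (p1) test  */
}
S3;        /* not control dependent on p1: cannot be sunk into the if (p1) body,
              since it would then execute only when p1 is true                   */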

 

Control dependences can be relaxed to get more parallelism, provided the transformations preserve the exception behavior (for example, an address in a register that is checked by a branch before it is used) and the data flow (the value in a register may depend on which way a branch went).

 

In the example loop we considered, we handled all these issues. The registers were renamed for the Load, Add and Store instructions. Control dependences were removed by replicating the loop body. Also, our example required the compiler to know that, if R1 does not change, then 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1); there were then no dependences between some of the loads and stores, so they could be moved around each other. More importantly, the loop that we considered did not have any loop-carried dependences. Now consider the loop given below, where A, B and C are distinct and non-overlapping arrays:

 

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];       /*S1*/
    B[i+1] = B[i] + A[i+1];     /*S2*/
}

  1. S2 uses the value A[i+1] computed by S1 in the same iteration.
  2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

The second of these is a “loop-carried dependence”: a dependence between iterations, implying that the iterations are dependent on one another and cannot be executed in parallel.
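For contrast, a hypothetical loop such as the following (not taken from the module) has no loop-carried dependence: each iteration reads and writes only its own elements, so the iterations are independent and can be unrolled and scheduled freely, assuming A and B do not overlap.

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* each iteration touches only element i */
}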

 

Compilers can be smart enough to identify and eliminate name dependences, determine which loops contain parallelism, and produce good schedules for the code.

 

To summarize, we have seen that compilers can exploit ILP in loops without a loop-carried dependence. Loop unrolling is a common way of exploiting ILP, and name dependences are handled by register renaming.

 

Web Links / Supporting Materials
Computer Organization and Design – The Hardware / Software Interface, David A. Patterson and John L. Hennessy, 4th Edition, Morgan Kaufmann, Elsevier, 2009.
Computer Architecture – A Quantitative Approach, John L. Hennessy and David A. Patterson, 5th Edition, Morgan Kaufmann, Elsevier, 2011.