Part One: In the beginning...

Keeping it simple...

First, we'll take a look at a rather simple loop:

int i, x[100];
for (i=0;i<100;i++) x[i]*=2;
Here is (roughly) how a compiler might translate this loop into DLX:
      addi r1, r0, #1000   ;address of array
      addi r3, r0, #100    ;loop counter
loop: lw   r2, 0(r1)
      add  r2, r2, r2
      sw   0(r1), r2
      addi r1, r1, #4      ;next array elt
      subi r3, r3, #1      ;decrement loop counter
      bnez r3, loop        ;are we done yet?
      nop                  ;branch delay slot
Note: a compiler could apply a few optimizations and shortcuts to this DLX code, but they would only get in the way once we start unrolling and rescheduling the loop.

Here is the pipeline diagram for the DLX:

Instruction          1  2  3  4  5  6  7  8  9  10 11 12 13 14
addi r1, r0, #1000   I  D  X  M  W
addi r3, r0, #100       I  D  X  M  W
loop: lw r2, 0(r1)         I  D  X  M  W
add r2, r2, r2                I  D  s  X  M  W
sw 0(r1), r2                     I  s  D  X  M  W
addi r1, r1, #4                        I  D  X  M  W
subi r3, r3, #1                           I  D  X  M  W
bnez r3, loop                                I  s  D  X  M  W
nop (delay slot)                                   I
next iteration: lw                                    I

We have a loop initiation interval of 9 cycles and a loop latency of 12 cycles: each iteration spends 12 cycles in flight for 7 instructions, giving a loop latency CPI of 12/7 = 1.71. For the entire code fragment, we have 2 startup instructions plus 100 iterations of 7 instructions: 702 instructions. This takes 2 cycles for the startup, 9*100 for the loop iterations, and an additional 2 cycles for the M and W stages of the final branch: 904 cycles for 702 instructions gives an overall CPI of about 1.29. Not bad, but we can do better. We have two stall cycles per iteration: the load-use (RAW) hazard between the lw and the add (the sw's stall is just that same bubble backing up behind it), and the stall between the subi and the bnez.
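If you'd rather let a program do this bookkeeping, here is a small C sketch of the cycle accounting used above. The helper loop_cpi and its parameter names are ours, purely for illustration; it just encodes the model we've been using: one cycle per startup instruction, (instructions + stalls) cycles per iteration, and two extra cycles to drain the final branch's M and W stages.

    #include <stdio.h>

    /* cycle accounting for the simple DLX pipeline model used above */
    double loop_cpi(long startup, long iters, long per_iter, long stalls)
    {
        long instrs = startup + iters * per_iter;                /* instructions executed */
        long cycles = startup + iters * (per_iter + stalls) + 2; /* +2 to drain M and W   */
        printf("%ld instructions in %ld cycles\n", instrs, cycles);
        return (double)cycles / (double)instrs;
    }

    int main(void)
    {
        /* original loop: 2 startup instructions, then 100 iterations
           of 7 instructions with 2 stall cycles each */
        printf("CPI = %.2f\n", loop_cpi(2, 100, 7, 2));  /* 904/702 = 1.29 */
        return 0;
    }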

Now we take the loop and unroll it, performing five of the original iterations in each pass through the loop body.

Here is the resulting code:

      addi r1, r0, #1000   ;address of array
      addi r3, r0, #100    ;loop counter
loop:
      lw   r10, 0(r1)
      lw   r11, 4(r1)
      lw   r12, 8(r1)
      lw   r13, 12(r1)
      lw   r14, 16(r1)
      add  r10, r10, r10
      add  r11, r11, r11
      add  r12, r12, r12
      add  r13, r13, r13
      add  r14, r14, r14
      sw   0(r1), r10
      sw   4(r1), r11
      sw   8(r1), r12
      sw   12(r1), r13
      sw   16(r1), r14
      addi r1, r1, #20     ;advance past all five elements
      subi r3, r3, #5      ;five iterations' worth of counting
      bnez r3, loop
      nop                  ;branch delay slot

What have we done? Instead of operating on one array element per trip through the loop, we now operate on five at a time. This requires a LOT of registers and a LOT of compiler intelligence (oxymoron?). Register pressure is one major limitation on the number of times you can unroll a loop without reusing registers. An intelligent compiler could unroll the entire loop, but it would be an extremely complex operation: it would have to determine which registers it could reuse and when.
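At the source level, the transformation looks something like the sketch below (a real compiler would do this on its intermediate representation, not on the C text, and would need a cleanup loop if the trip count weren't a multiple of five):

    /* the same loop, unrolled to process five elements per pass;
       100 is a multiple of 5, so no cleanup loop is needed */
    for (i = 0; i < 100; i += 5) {
        x[i]   *= 2;
        x[i+1] *= 2;
        x[i+2] *= 2;
        x[i+3] *= 2;
        x[i+4] *= 2;
    }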

Now to find the CPI. Would you like to diagram this? :-) Didn't think so. It doesn't take very long to see that we've eliminated most of the stalls here: each lw is now five instructions ahead of the add that uses its result (and each add is five ahead of its sw), so those hazards are gone. We still have one stall just before the BNEZ.

The loop initiation interval is now 20 cycles over 19 instructions: the only stall left is the one before the bnez. For the entire loop, we have 2 cycles of startup, 20 cycles for each of 20 iterations, plus 2 for the branch to finish, for a total of 404 cycles. We have executed 2+20*19=382 instructions, for a CPI of about 1.06. Much better!
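(As a sanity check, the loop_cpi sketch from earlier agrees: loop_cpi(2, 20, 19, 1) reports 382 instructions in 404 cycles.)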

What else can we do to improve the CPI? We still have a stall before the branch, but that can be eliminated with a little bit of reordering. The stores can happen at any time, so we can reorder the last lines to:

      sw   4(r1), r11
      sw   8(r1), r12
      sw   12(r1), r13
      subi r3, r3, #5      ;moved up so the bnez doesn't stall
      sw   16(r1), r14     ;still uses the old r1
      bnez r3, loop
      addi r1, r1, #20     ;fills the branch delay slot

We can now make use of the branch delay slot: the addi executes while the branch is being taken, so r1 is updated in time for the loads at the top of the next iteration. And since the subi now sits two instructions ahead of the bnez, its result is forwarded in time and the last stall disappears: 18 instructions in 18 cycles per pass.

Please sir, I want some more!

Okay, here's another loop to look at:

int i, x[100], y[100], z[100];
for (i=0; i<100; i++) { z[i]=x[i]*y[i]; }
For those of you who have seen linear algebra before, this should look quite familiar: multiplying two vectors element by element. (It's not quite a dot product, since we never sum the products into a single scalar.) Okay, so an integer product isn't too useful, but it makes for good loop unrolling practice. :-)
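To see the difference, compare the two in C: a true dot product reduces everything to one value, while our loop keeps each product separate.

    /* what our loop computes: an elementwise product */
    for (i = 0; i < 100; i++) z[i] = x[i] * y[i];

    /* a true dot product would accumulate instead */
    int dot = 0;
    for (i = 0; i < 100; i++) dot += x[i] * y[i];

Anyway, back to our loop; here it is in DLX: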

      addi r1, r0, #1000   ;address of array x
      addi r2, r0, #2000   ;                 y
      addi r3, r0, #3000   ;                 z
      addi r4, r0, #100    ;loop counter
loop: lw   r10, 0(r1)      ;load x
      lw   r11, 0(r2)      ;load y
      mult r12, r10, r11   ;multiply
      sw   0(r3), r12      ;store z
      addi r1, r1, #4      ;increment three array pointers
      addi r2, r2, #4
      addi r3, r3, #4
      subi r4, r4, #1      ;decrement loop counter
      bnez r4, loop        ;are we done yet?
      nop                  ;branch delay slot

Here's just a portion of our lovely pipeline diagram - only the stuff inside the loop is what we care about.

Instruction          1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
loop: lw r10, 0(r1)  I  D  X  M  W
lw r11, 0(r2)           I  D  X  M  W
mult r12, r10, r11         I  D  s  X  M  W
sw 0(r3), r12                    I  s  D  X  M  W
addi r1, r1, #4                      I  D  X  M  W
addi r2, r2, #4                         I  D  X  M  W
addi r3, r3, #4                            I  D  X  M  W
subi r4, r4, #1                               I  D  X  M  W
bnez r4, loop                                    I  s  D  X  M  W
nop (delay slot)                                       I
next iteration: lw                                        I

Again, this doesn't look so bad. We have only a minimal number of stalls to work around. The big evil here is all the overhead required to perform only ONE iteration: of the 10 instructions executed in each loop, only four (the two loads, the multiply, and the store) do useful work. The loop initiation interval is 12 cycles and the loop latency is 15 cycles, so our loop latency CPI is 15/10=1.5. For the entire code segment, we have 12 cycles * 100 iterations + 4 startup cycles + 2 finish cycles = 1206 cycles for 4+10*100=1004 instructions, for an overall CPI of about 1.2. This isn't bad, but look at all the work that needs to be done just for one iteration! Unrolling will improve the work-to-overhead ratio inside the loop, and rescheduling can remove the load-use stall before the MULT (the stall on the SW is just that same bubble moving down the pipe).
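(Our loop_cpi sketch agrees again: loop_cpi(4, 100, 10, 2) reports 1004 instructions in 1206 cycles, a CPI of about 1.20.)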

The unrolled loop will look something like this:

loop2:
      lw   r10, 0(r1)
      lw   r11, 0(r2)
      lw   r12, 4(r1)
      lw   r13, 4(r2)
      lw   r14, 8(r1)
      lw   r15, 8(r2)
      lw   r16, 12(r1)
      lw   r17, 12(r2)
      lw   r18, 16(r1)
      lw   r19, 16(r2)
      mult r20, r10, r11
      mult r21, r12, r13
      mult r22, r14, r15
      mult r23, r16, r17
      mult r24, r18, r19
      sw   0(r3), r20
      sw   4(r3), r21
      sw   8(r3), r22
      sw   12(r3), r23
      sw   16(r3), r24
      addi r1, r1, #20
      addi r2, r2, #20
      addi r3, r3, #20
      subi r4, r4, #5
      bnez r4, loop2
      nop                  ;branch delay slot

It is easy to see that we still have one stall remaining (before the BNEZ), plus the wasted branch delay slot. The payoff shows up in the overall CPI: 26*20+4=524 instructions in 27*20+4+2=546 cycles, or a CPI of about 1.04. We get a better CPI because the loop overhead and the stall are now amortized over five elements per pass, and we execute roughly half as many instructions overall. Rescheduling would reorder the tail end of this loop a bit, putting one of the adds in the branch delay slot and moving the loop counter subtract up one statement, just as we did before.
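(One last check with the loop_cpi sketch: loop_cpi(4, 20, 26, 1) reports 524 instructions in 546 cycles.)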

Okay, integers are nice and all, what's so different about multicycle?

Seems easy, no? Guess what: it's going to get a little more complicated. In the multicycle pipeline, some instructions take longer than one cycle in the execute stage. That makes unrolled loops much harder to schedule, and you actually get *more* stalls per loop iteration.

