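Okay, so you've now seen how reordering can lower the CPI. However, we still hit the nasty control hazard every time the branch executes, in this case 1000 times. We can reduce this by rescheduling and unrolling the loop. The listing below shows the loop unrolled once; the pipeline diagram after it shows the same instructions after rescheduling.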
-- Assume R1 contains the base address of the 'a' array,
-- R2 the base address of the 'b' array, and
-- R3 the base address of the 'c' array. R4 starts at 1000.

Loop: LW R5, 0(R2)      ; element of b
      LW R6, 0(R3)      ; element of c
      ADD R7, R6, R5    ; compute element of a
      LW R8, 4(R2)      ; next element of b
      LW R9, 4(R3)      ; next element of c
      ADD R10, R8, R9   ; compute next element of a
      SW 0(R1), R7      ; store element of a
      SW 4(R1), R10     ; store next element of a
      ADDI R1, R1, #8   ; increment addresses by two words
      ADDI R2, R2, #8
      ADDI R3, R3, #8
      SUBI R4, R4, #2   ; decrement loop var by two
      BNEZ R4, Loop     ; repeat until R4 reaches zero
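
For comparison, here is a minimal C sketch of what the unrolling does at the source level. It assumes the original loop computes a[i] = b[i] + c[i] over 4-byte integers, which is what the comments and the 4-byte offsets in the listing suggest. The function names are hypothetical and are not part of the original example.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element and one branch test per pass. */
void add_arrays(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Unrolled by two: two elements per pass, so half as many branch tests.
   Like the assembly listing, this assumes n is even. */
void add_arrays_unrolled2(int *a, const int *b, const int *c, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        a[i]     = b[i]     + c[i];
        a[i + 1] = b[i + 1] + c[i + 1];
    }
}
```

Both functions produce the same result; the second one just tests the loop condition n / 2 times instead of n times.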

Instruction         1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Loop: LW R5, 0(R2)  I  D  X  M  W
LW R6, 0(R3)           I  D  X  M  W
LW R8, 4(R2)              I  D  X  M  W
ADD R7, R6, R5               I  D  X  M  W
LW R9, 4(R3)                    I  D  X  M  W
SW 0(R1), R7                       I  D  X  M  W
ADD R10, R8, R9                       I  D  X  M  W
SW 4(R1), R10                            I  D  X  M  W
ADDI R1, R1, #8                             I  D  X  M  W
ADDI R2, R2, #8                                I  D  X  M  W
ADDI R3, R3, #8                                   I  D  X  M  W
SUBI R4, R4, #2                                      I  D  X  M  W
The average CPI for the loop is now 16 clock cycles / 12 instructions = 1.333. This is a significant improvement over the original code, which had a CPI of 1.625. In addition, the rescheduled and unrolled version executes the BNEZ only half as many times as the original (500 times instead of 1000), so it incurs 50% fewer control hazard stalls overall. This number can be lowered even more by unrolling the loop further. Remember, though, that unrolling is limited by the number of registers you have, because each extra copy of the loop body needs its own registers.
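
The arithmetic above can be checked with a few lines of C. This is only a sketch under the assumptions used in the diagram: a 5-stage pipeline, one instruction issued per cycle, and no stalls among the 12 instructions shown. The function names are made up for illustration.

```c
#include <assert.h>

/* Cycles needed to run a stall-free sequence through a 5-stage pipeline:
   the first instruction takes 5 cycles, each later one finishes 1 cycle after. */
int cycles_no_stall(int instructions)
{
    assert(instructions > 0);
    return instructions + 4;
}

/* Average clock cycles per instruction. */
double cpi(int cycles, int instructions)
{
    assert(instructions > 0);
    return (double)cycles / instructions;
}

/* How many times the loop branch executes for n elements
   when each pass handles `unroll` elements. */
int branches_executed(int n, int unroll)
{
    assert(unroll > 0 && n % unroll == 0);
    return n / unroll;
}
```

With these, cpi(cycles_no_stall(12), 12) gives 16 / 12, and branches_executed(1000, 2) gives 500, compared with 1000 for the loop that is not unrolled. Unrolling four times would bring the branch count down to 250, at the cost of more registers.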

