Okay, you say you want a longer example? Here it is. So far we've looked at unrolling and rescheduling on the simple 5-stage DLX pipeline. Now we will look at an example using the multi-cycle DLX.

The DLX code is as follows:


ADDI R4, R0, #5200 ; make a float 5200
MVI2FP F4, R4 ;
CVTI2FP F4, F4 ; F4 has a float constant
ADD R1, R0, R0 ; init counter to 0
Loop: LF F2, 100(R1) ; F2 is array element, R1 has offset of lowest unused array element
LF F3, 500(R1) ; F3 holds array element
SUBF F5, F3, F2 ; perform subtraction
ADDF F5, F5, F4 ; perform addition of a constant
SF 1000(R1), F5 ; store the results
ADDI R1, R1, #4 ; increment pointer
SUBI R5, R1, #400 ; check pointer
BNEZ R5, Loop ; branch while not done
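
If the assembly is hard to follow, here is roughly what the loop computes, written as a small Python sketch. The names a, b, and c are made up for the arrays at offsets 100, 500, and 1000; the DLX code never names them. The #4 increment and the #400 limit suggest 100 four-byte elements, but that is an inference from the code, not something the example states.

```python
def rolled_loop(a, b, constant=5200.0):
    """Reference for the DLX loop: c[i] = (b[i] - a[i]) + constant.

    a, b, and the returned list c are hypothetical names for the arrays
    at offsets 100, 500, and 1000 from R1.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    c = []
    for x, y in zip(a, b):
        diff = y - x               # SUBF F5, F3, F2  (F3 - F2)
        c.append(diff + constant)  # ADDF F5, F5, F4, then SF 1000(R1), F5
    return c


print(rolled_loop([1.0, 2.0], [3.0, 5.0]))  # [5202.0, 5203.0]
```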

Instruction           1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18
Loop: LF F2, 100(R1)  I  D  X  M  W
LF F3, 500(R1)           I  D  X  M  W
SUBF F5, F3, F2             I  D  s  A1 A2 A3 A4 M  W
ADDF F5, F5, F4                I  s  D  s  s  s  A1 A2 A3 A4 M  W
SF 1000(R1), F5                      I  s  s  s  D  s  s  s  X  M  W
ADDI R1, R1, #4                                  I  s  s  s  D  X  M  W
SUBI R5, R1, #400                                            I  D  X  M  W
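
One way to read the total cycle count off the diagram above: every row occupies one slot per cycle starting at its I stage, so a row finishes at (start cycle + number of slots - 1), and the loop body finishes when the last row does. The sketch below just copies the rows out of the diagram and does that arithmetic. It reads s as a stall slot, which is the usual convention but is an assumption here, and the helper names are made up for illustration.

```python
# Each entry: (cycle of the I stage, slots copied from the diagram row).
DIAGRAM = [
    (1,  "I D X M W"),                    # LF   F2, 100(R1)
    (2,  "I D X M W"),                    # LF   F3, 500(R1)
    (3,  "I D s A1 A2 A3 A4 M W"),        # SUBF F5, F3, F2
    (4,  "I s D s s s A1 A2 A3 A4 M W"),  # ADDF F5, F5, F4
    (6,  "I s s s D s s s X M W"),        # SF   1000(R1), F5
    (10, "I s s s D X M W"),              # ADDI R1, R1, #4
    (14, "I D X M W"),                    # SUBI R5, R1, #400
]


def last_cycle(start, slots):
    """Cycle in which a row's final slot falls."""
    return start + len(slots.split()) - 1


def total_cycles(diagram):
    """Cycle in which the last instruction in the diagram completes."""
    return max(last_cycle(start, slots) for start, slots in diagram)


def stall_slots(diagram):
    """Number of slots marked s across all rows."""
    return sum(slots.split().count("s") for _, slots in diagram)


print(total_cycles(DIAGRAM))  # 18
print(stall_slots(DIAGRAM))   # 14
```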

Not counting the BNEZ, which the diagram omits, the average CPI for this loop is 18 clock cycles / 7 instructions = 2.571. Now let's see what happens when we unroll the loop 4 times and reschedule to avoid stalls. The new code is as follows.


ADDI R4, R0, #5200 ; make a float 5200
MVI2FP F14, R4 ;
CVTI2FP F14, F14 ; F14 has a float constant
ADD R1, R0, R0 ; init counter to 0
Loop: LF F2, 100(R1) ;
LF F6, 500(R1)
LF F3, 104(R1) ;
LF F7, 504(R1) ;
SUBF F10, F6, F2 ;
LF F4, 108(R1) ;
SUBF F11, F7, F3 ;
LF F8, 508(R1)
LF F5, 112(R1)
LF F9, 512(R1)
SUBF F12, F8, F4
SUBF F13, F9, F5
ADDI R1, R1, #16
ADDF F10, F10, F14
ADDF F11, F11, F14
ADDF F12, F12, F14
ADDF F13, F13, F14
SUBI R5, R1, #400
SF 984(R1), F10
SF 988(R1), F11
SF 992(R1), F12
SF 996(R1), F13
BNEZ R5, Loop ; branch while not done
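
To see that the unrolled code still does the same work, here is a Python sketch of both versions side by side. As before, a, b, and c are made-up names for the arrays at offsets 100, 500, and 1000, and the sketch assumes the element count is a multiple of 4 (100 elements would be, if the #400 limit means what it appears to). The store offsets 984 through 996 fall out of the fact that R1 has already been bumped by 16 when the SF instructions run.

```python
def rolled(a, b, k=5200.0):
    """Original loop: one element per pass."""
    return [(y - x) + k for x, y in zip(a, b)]


def unrolled_by_4(a, b, k=5200.0):
    """Unrolled loop: four elements per pass, one branch per pass."""
    n = len(a)
    if n != len(b) or n % 4 != 0:
        raise ValueError("need equal lengths and a multiple of 4 elements")
    c = [0.0] * n
    for i in range(0, n, 4):
        # Four independent subtractions (the SUBF instructions).
        d0 = b[i] - a[i]
        d1 = b[i + 1] - a[i + 1]
        d2 = b[i + 2] - a[i + 2]
        d3 = b[i + 3] - a[i + 3]
        # Four independent additions (the ADDF instructions), then the stores.
        c[i] = d0 + k
        c[i + 1] = d1 + k
        c[i + 2] = d2 + k
        c[i + 3] = d3 + k
    return c


# The stores run after ADDI R1, R1, #16, so 1000(R1) becomes 984(R1), and so on.
STORE_OFFSETS = [1000 - 16 + 4 * j for j in range(4)]
print(STORE_OFFSETS)  # [984, 988, 992, 996]
```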

Instruction           1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Loop: LF F2, 100(R1)  I  D  X  M  W
LF F6, 500(R1)           I  D  X  M  W
LF F3, 104(R1)              I  D  X  M  W
LF F7, 504(R1)                 I  D  X  M  W
SUBF F10, F6, F2                  I  D  A1 A2 A3 A4 M  W
LF F4, 108(R1)                       I  D  X  M  W
SUBF F11, F7, F3                        I  D  A1 A2 A3 A4 M  W
LF F8, 508(R1)                             I  D  X  s  M  W
LF F5, 112(R1)                                I  D  s  X  s  M  W
LF F9, 512(R1)                                   I  s  D  s  X  M  W
SUBF F12, F8, F4                                       I  s  D  A1 A2
SUBF F13, F9, F5                                             I  D  A1
ADDI R1, R1, #16                                                I  D
ADDF F10, F10, F14                                                 I


Instruction           17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
SUBF F12, F8, F4      A3 A4 M  W
SUBF F13, F9, F5      A2 A3 A4 M  W
ADDI R1, R1, #16      X  M  W
ADDF F10, F10, F14    D  A1 A2 A3 A4 M  W
ADDF F11, F11, F14    I  D  A1 A2 A3 A4 M  W
ADDF F12, F12, F14       I  D  A1 A2 A3 A4 M  W
ADDF F13, F13, F14          I  D  A1 A2 A3 A4 M  W
SUBI R5, R1, #400              I  D  X  s  s  s  M  W
SF 984(R1), F10                   I  D  s  s  s  X  M  W
SF 988(R1), F11                      I  s  s  s  D  X  M  W
SF 992(R1), F12                                  I  D  X  M  W
SF 996(R1), F13                                     I  D  X  M  W

Again leaving out the BNEZ, the average CPI for this loop is now 31 clock cycles / 22 instructions = 1.409. This is significantly less than the original 2.571 CPI. The unrolled loop also executes the BNEZ only 25% as often as the original.
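
The arithmetic behind those numbers can be checked with a few lines of Python. The per-element comparison is an extra that the text above does not state: it assumes the cycle counts from the two diagrams are a fair cost for one pass through each loop body, ignoring the BNEZ and any overlap between passes, so treat it as a rough estimate rather than a measured result.

```python
def cpi(cycles, instructions):
    """Average clock cycles per instruction."""
    return cycles / instructions


original_cpi = cpi(18, 7)    # rolled loop, BNEZ not counted
unrolled_cpi = cpi(31, 22)   # unrolled loop, BNEZ not counted

# Each pass of the unrolled loop handles 4 elements instead of 1.
# (Rough estimate: assumes the diagram cycle counts are the cost of a pass.)
cycles_per_element_original = 18 / 1
cycles_per_element_unrolled = 31 / 4

# R1 runs from 0 to 400 in steps of 4 (original) or 16 (unrolled),
# so that is how many times each version reaches the BNEZ.
branches_original = 400 // 4
branches_unrolled = 400 // 16

print(round(original_cpi, 3))                 # 2.571
print(round(unrolled_cpi, 3))                 # 1.409
print(cycles_per_element_original,
      cycles_per_element_unrolled)            # 18.0 7.75
print(branches_unrolled / branches_original)  # 0.25
```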

