Okay, you say you want a longer example? Here it is. So far we've looked at unrolling and rescheduling on the simple 5-stage DLX pipeline. Now we will look at an example using the multi-cycle DLX.
- Assume the multi-cycle DLX.
- Assume forwarding is implemented.
- Ignore the BNEZ instruction when computing the CPI.
- Ignore any cycles contributed by the branch.
- Use the latencies given in the book (Figure 3.43).
 
The DLX code is as follows:
  
 
- ADDI R4, R0, #5200 ; put the integer 5200 in R4
- MVI2FP F4, R4 ; move it to an FP register
- CVTI2FP F4, F4 ; convert to float; F4 now holds the float constant
- ADD R1, R0, R0 ; init counter to 0
- Loop: LF F2, 100(R1) ; F2 is an array element, R1 holds the offset of the lowest unused array element
- LF F3, 500(R1) ; F3 holds an array element
- SUBF F5, F3, F2 ; perform subtraction
- ADDF F5, F5, F4 ; add the constant
- SF 1000(R1), F5 ; store the result
- ADDI R1, R1, #4 ; increment pointer
- SUBI R5, R1, #400 ; check pointer
- BNEZ R5, Loop ; branch while not done
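
As a reading aid, here is a rough Python sketch of what the loop above computes. It is not part of the original example: memory is modelled as a hypothetical dictionary `mem` keyed by byte address, and the names `CONST`, `rolled_loop` and the test data are assumptions made for illustration only.

```python
CONST = 5200.0  # the float constant held in F4

def rolled_loop(mem):
    """Model of the DLX loop: mem[1000+r1] = mem[500+r1] - mem[100+r1] + CONST."""
    r1 = 0                        # ADD R1, R0, R0
    while True:
        f2 = mem[100 + r1]        # LF F2, 100(R1)
        f3 = mem[500 + r1]        # LF F3, 500(R1)
        f5 = f3 - f2              # SUBF F5, F3, F2
        f5 = f5 + CONST           # ADDF F5, F5, F4
        mem[1000 + r1] = f5       # SF 1000(R1), F5
        r1 += 4                   # ADDI R1, R1, #4
        if r1 - 400 == 0:         # SUBI R5, R1, #400 / BNEZ R5, Loop
            break
    return mem

# Hypothetical test data: two source arrays of 100 four-byte elements each.
mem = {100 + 4 * i: float(i) for i in range(100)}
mem.update({500 + 4 * i: float(2 * i) for i in range(100)})
rolled_loop(mem)
```

The pointer runs from 0 to 396 in steps of 4, so the loop body executes 100 times and writes 100 results starting at address 1000.
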
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Loop: LF F2, 100(R1) | I | D | X | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  | 
| LF F3, 500(R1) |  | I | D | X | M | W |  |  |  |  |  |  |  |  |  |  |  |  | 
| SUBF F5, F3, F2 |  |  | I | D | s | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  | 
| ADDF F5, F5, F4 |  |  |  | I | s | D | s | s | s | A1 | A2 | A3 | A4 | M | W |  |  |  | 
| SF 1000(R1), F5 |  |  |  |  |  | I | s | s | s | D | s | s | s | X | M | W |  |  | 
| ADDI R1, R1, #4 |  |  |  |  |  |  |  |  |  | I | s | s | s | D | X | M | W |  | 
| SUBI R5, R1, #400 |  |  |  |  |  |  |  |  |  |  |  |  |  | I | D | X | M | W | 
The average CPI for this loop is 18 clock cycles / 7 instructions = 2.571 (the BNEZ is not counted). Now let's see what happens when we unroll the loop 4 times and reschedule to avoid stalls. The new code is as follows.
  
 
- ADD R1, R0, R0
- ADDI R4, R0, #5200 ; put the integer 5200 in R4
- MVI2FP F14, R4 ; move it to an FP register
- CVTI2FP F14, F14 ; convert to float; F14 now holds the float constant
- ADD R1, R0, R0 ; init counter to 0
- Loop: LF F2, 100(R1)
- LF F6, 500(R1)
- LF F3, 104(R1)
- LF F7, 504(R1)
- SUBF F10, F6, F2
- LF F4, 108(R1)
- SUBF F11, F7, F3
- LF F8, 508(R1)
- LF F5, 112(R1)
- LF F9, 512(R1)
- SUBF F12, F8, F4
- SUBF F13, F9, F5
- ADDI R1, R1, #16 ; advance the pointer by four elements
- ADDF F10, F10, F14
- ADDF F11, F11, F14
- ADDF F12, F12, F14
- ADDF F13, F13, F14
- SUBI R5, R1, #400 ; check pointer
- SF 984(R1), F10 ; R1 has already advanced by 16, so 984 = 1000 - 16
- SF 988(R1), F11
- SF 992(R1), F12
- SF 996(R1), F13
- BNEZ R5, Loop ; branch while not done
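
One way to convince yourself that the unrolled code still computes the same thing is to model both versions and compare the results. The sketch below is only an illustration under stated assumptions: memory is a hypothetical dictionary keyed by byte address, the function names and test data are invented, and the F3 load is taken at offset 104, as in the pipeline table.

```python
CONST = 5200.0  # float constant (F4 in the original loop, F14 in the unrolled one)

def make_mem():
    # Hypothetical test data: two source arrays of 100 four-byte elements.
    mem = {100 + 4 * i: float(i) for i in range(100)}
    mem.update({500 + 4 * i: float(3 * i + 1) for i in range(100)})
    return mem

def rolled(mem):
    # One element per pass, as in the original loop.
    for r1 in range(0, 400, 4):
        mem[1000 + r1] = mem[500 + r1] - mem[100 + r1] + CONST
    return mem

def unrolled(mem):
    r1 = 0
    while True:
        f10 = mem[500 + r1] - mem[100 + r1]   # LF F2, LF F6, SUBF F10
        f11 = mem[504 + r1] - mem[104 + r1]   # LF F3, LF F7, SUBF F11
        f12 = mem[508 + r1] - mem[108 + r1]   # LF F4, LF F8, SUBF F12
        f13 = mem[512 + r1] - mem[112 + r1]   # LF F5, LF F9, SUBF F13
        r1 += 16                              # ADDI R1, R1, #16 (before the stores)
        f10 += CONST                          # ADDF F10, F10, F14
        f11 += CONST                          # ADDF F11, F11, F14
        f12 += CONST                          # ADDF F12, F12, F14
        f13 += CONST                          # ADDF F13, F13, F14
        mem[984 + r1] = f10                   # SF 984(R1): 984 = 1000 - 16
        mem[988 + r1] = f11                   # SF 988(R1)
        mem[992 + r1] = f12                   # SF 992(R1)
        mem[996 + r1] = f13                   # SF 996(R1)
        if r1 - 400 == 0:                     # SUBI R5, R1, #400 / BNEZ R5, Loop
            break
    return mem

same = rolled(make_mem()) == unrolled(make_mem())
```

Because the pointer is incremented before the stores, the store offsets have to drop by 16, which is where 984 through 996 come from.
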
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Loop: LF F2, 100(R1) | I | D | X | M | W |  |  |  |  |  |  |  |  |  |  |  | 
| LF F6, 500(R1) |  | I | D | X | M | W |  |  |  |  |  |  |  |  |  |  | 
| LF F3, 104(R1) |  |  | I | D | X | M | W |  |  |  |  |  |  |  |  |  | 
| LF F7, 504(R1) |  |  |  | I | D | X | M | W |  |  |  |  |  |  |  |  | 
| SUBF F10, F6, F2 |  |  |  |  | I | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  | 
| LF F4, 108(R1) |  |  |  |  |  | I | D | X | M | W |  |  |  |  |  |  | 
| SUBF F11, F7, F3 |  |  |  |  |  |  | I | D | A1 | A2 | A3 | A4 | M | W |  |  | 
| LF F8, 508(R1) |  |  |  |  |  |  |  | I | D | X | s | M | W |  |  |  | 
| LF F5, 112(R1) |  |  |  |  |  |  |  |  | I | D | s | X | s | M | W |  | 
| LF F9, 512(R1) |  |  |  |  |  |  |  |  |  | I | s | D | s | X | M | W | 
| SUBF F12, F8, F4 |  |  |  |  |  |  |  |  |  |  |  | I | s | D | A1 | A2 | 
| SUBF F13, F9, F5 |  |  |  |  |  |  |  |  |  |  |  |  |  | I | D | A1 | 
| ADDI R1, R1, #16 |  |  |  |  |  |  |  |  |  |  |  |  |  |  | I | D | 
| ADDF F10, F10, F14 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | I | 

| Instruction | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SUBF F12, F8, F4 | A3 | A4 | M | W |  |  |  |  |  |  |  |  |  |  |  |  | 
| SUBF F13, F9, F5 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  |  |  |  |  | 
| ADDI R1, R1, #16 | X | M | W |  |  |  |  |  |  |  |  |  |  |  |  |  | 
| ADDF F10, F10, F14 | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  |  |  | 
| ADDF F11, F11, F14 | I | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  |  | 
| ADDF F12, F12, F14 |  | I | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  |  | 
| ADDF F13, F13, F14 |  |  | I | D | A1 | A2 | A3 | A4 | M | W |  |  |  |  |  |  | 
| SUBI R5, R1, #400 |  |  |  | I | D | X | s | s | s | M | W |  |  |  |  |  | 
| SF 984(R1), F10 |  |  |  |  | I | D | s | s | s | X | M | W |  |  |  |  | 
| SF 988(R1), F11 |  |  |  |  |  | I | s | s | s | D | X | M | W |  |  |  | 
| SF 992(R1), F12 |  |  |  |  |  |  |  |  |  | I | D | X | M | W |  |  | 
| SF 996(R1), F13 |  |  |  |  |  |  |  |  |  |  | I | D | X | M | W |  | 
The average CPI for this loop is now 31 clock cycles / 22 instructions = 1.409. This is significantly less than the original 2.571 CPI. Also, the BNEZ is executed only 25% as often as in the original loop, since each pass now handles four elements.
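
The arithmetic behind the two CPI figures can be checked with a few lines of Python. The total-cycle estimate at the end is a rough sketch that goes beyond the original text: it assumes 100 array elements (R1 steps by 4 up to 400), ignores the BNEZ as above, and pretends successive passes through the loop do not overlap in the pipeline.

```python
def cpi(cycles, instructions):
    """Average clock cycles per instruction for one pass through the loop."""
    return cycles / instructions

original_cpi = cpi(18, 7)    # 7 instructions finish in 18 cycles
unrolled_cpi = cpi(31, 22)   # 22 instructions finish in 31 cycles

# Assumption: passes do not overlap, so total cycles = passes * cycles per pass.
original_total = 100 * 18    # 100 passes, one element each
unrolled_total = 25 * 31     # 25 passes, four elements each
speedup = original_total / unrolled_total

print(round(original_cpi, 3), round(unrolled_cpi, 3))
print(original_total, unrolled_total, round(speedup, 2))
```

Under those assumptions the unrolled loop needs 775 cycles instead of 1800, a little better than a factor of two. Note that the gain comes both from the lower CPI and from executing fewer instructions overall (22 per four elements instead of 28).
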