A Longer example

Okay, you say you want a longer example? Here it is. So far we've looked at unrolling and rescheduling on the simple 5-stage DLX pipeline. Now we will look at an example using the multi-cycle DLX.

Assume the multi-cycle DLX.
Assume forwarding is implemented.
Ignore the BNEZ instruction when computing the CPI.
Ignore the branch's contribution to the cycle count.
Use the latencies given in the book (Figure 3.43).

The DLX code is as follows:


      ADDI    R4, R0, #5200   ; make a float 5200
      MVI2FP  F4, R4
      CVTI2FP F4, F4          ; F4 has a float constant
      ADD     R1, R0, R0      ; init counter to 0
Loop: LF      F2, 100(R1)     ; F2 is array element, R1 has offset of lowest unused array element
      LF      F3, 500(R1)     ; F3 holds array element
      SUBF    F5, F3, F2      ; perform subtraction
      ADDF    F5, F5, F4      ; perform addition of a constant
      SF      1000(R1), F5    ; store the results
      ADDI    R1, R1, #4      ; increment pointer
      SUBI    R5, R1, #400    ; check pointer
      BNEZ    R5, Loop        ; branch while not done
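To make the loop's behaviour concrete, here is a rough Python model of it. This is only a sketch: the names `a`, `b`, `c`, `N`, and `run_loop` are invented for illustration, and it assumes the three arrays (at offsets 100, 500, and 1000) each hold 100 four-byte floats, which is what the 400-byte loop bound implies.

```python
# Hypothetical high-level model of the DLX loop; all names are illustrative.
N = 100          # 400 bytes / 4 bytes per float = 100 iterations
CONST = 5200.0   # the float constant held in F4

def run_loop(a, b):
    """Return c where c[i] = (b[i] - a[i]) + CONST, mirroring SUBF then ADDF."""
    c = [0.0] * N
    r1 = 0                    # byte offset, like R1
    while True:
        i = r1 // 4           # element index for this byte offset
        f2 = a[i]             # LF   F2, 100(R1)
        f3 = b[i]             # LF   F3, 500(R1)
        f5 = f3 - f2          # SUBF F5, F3, F2
        f5 = f5 + CONST       # ADDF F5, F5, F4
        c[i] = f5             # SF   1000(R1), F5
        r1 += 4               # ADDI R1, R1, #4
        r5 = r1 - 400         # SUBI R5, R1, #400
        if r5 == 0:           # BNEZ R5, Loop falls through once R5 is zero
            break
    return c
```

Each trip through the loop handles exactly one array element, so the branch is reached 100 times.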

Instruction

Loop: LF      F2, 100(R1)
      LF      F3, 500(R1)
      SUBF    F5, F3, F2
      ADDF    F5, F5, F4
      SF      1000(R1), F5
      ADDI    R1, R1, #4
      SUBI    R5, R1, #400

The average CPI for this loop is 18 clock cycles / 7 instructions = 2.571 (the BNEZ is left out of the count, per the assumptions above). Now let's see what happens when we unroll the loop four times and reschedule to avoid stalls. The new code is as follows:


      ADDI    R4, R0, #5200   ; make a float 5200
      MVI2FP  F14, R4
      CVTI2FP F14, F14        ; F14 has a float constant
      ADD     R1, R0, R0      ; init counter to 0
Loop: LF      F2, 100(R1)
      LF      F6, 500(R1)
      LF      F3, 104(R1)
      LF      F7, 504(R1)
      SUBF    F10, F6, F2
      LF      F4, 108(R1)
      SUBF    F11, F7, F3
      LF      F8, 508(R1)
      LF      F5, 112(R1)
      LF      F9, 512(R1)
      SUBF    F12, F8, F4
      SUBF    F13, F9, F5
      ADDI    R1, R1, #16
      ADDF    F10, F10, F14
      ADDF    F11, F11, F14
      ADDF    F12, F12, F14
      ADDF    F13, F13, F14
      SUBI    R5, R1, #400
      SF      984(R1), F10
      SF      988(R1), F11
      SF      992(R1), F12
      SF      996(R1), F13
      BNEZ    R5, Loop        ; branch while not done
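One detail worth checking is the store offsets: because the stores are scheduled after ADDI R1, R1, #16, the base offset drops from 1000 to 1000 - 16 = 984. The following Python sketch checks that the unrolled loop writes the same results as the original. It is a toy model, not a DLX simulator: the byte-addressed dictionary `mem`, the helper names, and the test data are all made up for illustration.

```python
CONST = 5200.0  # float constant (F4 in the original loop, F14 in the unrolled one)

def make_memory():
    """Hypothetical byte-addressed memory: address -> float, 4 bytes per element."""
    mem = {}
    for i in range(100):
        mem[100 + 4 * i] = float(i)       # first input array (illustrative data)
        mem[500 + 4 * i] = float(3 * i)   # second input array (illustrative data)
    return mem

def rolled(mem):
    """Original loop: one element per iteration."""
    r1 = 0
    while True:
        mem[1000 + r1] = (mem[500 + r1] - mem[100 + r1]) + CONST
        r1 += 4                           # ADDI R1, R1, #4
        if r1 - 400 == 0:                 # SUBI / BNEZ
            break
    return mem

def unrolled(mem):
    """Unrolled loop: four elements per iteration, stores after the increment."""
    r1 = 0
    while True:
        # Loads and subtractions use the old value of R1.
        f10 = mem[500 + r1] - mem[100 + r1]
        f11 = mem[504 + r1] - mem[104 + r1]
        f12 = mem[508 + r1] - mem[108 + r1]
        f13 = mem[512 + r1] - mem[112 + r1]
        r1 += 16                          # ADDI R1, R1, #16
        # Stores see the new R1, hence base offset 984 instead of 1000.
        mem[984 + r1] = f10 + CONST
        mem[988 + r1] = f11 + CONST
        mem[992 + r1] = f12 + CONST
        mem[996 + r1] = f13 + CONST
        if r1 - 400 == 0:                 # SUBI / BNEZ
            break
    return mem
```

Running both functions on fresh copies of the same memory should leave identical contents, which is what makes the rescheduling legal.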

Instruction

Loop: LF      F2, 100(R1)
      LF      F6, 500(R1)
      LF      F3, 104(R1)
      LF      F7, 504(R1)
      SUBF    F10, F6, F2
      LF      F4, 108(R1)
      SUBF    F11, F7, F3
      LF      F8, 508(R1)
      LF      F5, 112(R1)
      LF      F9, 512(R1)
      SUBF    F12, F8, F4
      SUBF    F13, F9, F5
      ADDI    R1, R1, #16
      ADDF    F10, F10, F14
      ADDF    F11, F11, F14
      ADDF    F12, F12, F14
      ADDF    F13, F13, F14
      SUBI    R5, R1, #400
      SF      984(R1), F10
      SF      988(R1), F11
      SF      992(R1), F12
      SF      996(R1), F13

The average CPI for this loop is now 31 clock cycles / 22 instructions = 1.409. This is significantly less than the original 2.571 CPI. Also, the BNEZ is executed only 25% as often as in the original loop (25 times instead of 100).
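The arithmetic can be double-checked with a few lines of Python. This is a back-of-the-envelope sketch: the function and dictionary names are invented here, and the total-cycle figures assume that every iteration costs exactly the per-iteration cycle count quoted above, with the branch and the setup code ignored as in the stated assumptions.

```python
def cpi(cycles, instructions):
    """Average clock cycles per instruction."""
    return cycles / instructions

ELEMENTS = 400 // 4  # 100 array elements (400-byte bound, 4 bytes per float)

# Per-iteration figures quoted in the text (BNEZ excluded).
original = {"cycles": 18, "instructions": 7, "elements_per_iter": 1}
unrolled = {"cycles": 31, "instructions": 22, "elements_per_iter": 4}

def summarize(loop):
    iterations = ELEMENTS // loop["elements_per_iter"]
    return {
        "cpi": round(cpi(loop["cycles"], loop["instructions"]), 3),
        "iterations": iterations,  # also the number of times BNEZ executes
        "total_cycles": iterations * loop["cycles"],
    }

before = summarize(original)
after = summarize(unrolled)
speedup = before["total_cycles"] / after["total_cycles"]

print(before)             # {'cpi': 2.571, 'iterations': 100, 'total_cycles': 1800}
print(after)              # {'cpi': 1.409, 'iterations': 25, 'total_cycles': 775}
print(round(speedup, 2))  # 2.32
```

Under those assumptions the unrolled loop needs 775 cycles where the original needs 1800. The gain in total cycles is larger than the CPI ratio alone suggests, because unrolling also removes three of every four ADDI and SUBI instructions.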

