1. This problem addresses how a common vector loop runs on a variety of pipeline versions of the DLX. The loop is the so-called SAXPY loop, shown below, which implements the vector operation Y = a X + Y for a vector of length 100.

 foo: LD F2, 0(R1) ; load X(i) MULTD F4, F2, F0 ; multiply a*X(i) F6, 0(R2) ; load Y(i) ADDD F6, F4, F6 ; add a*X(i) + Y(i) SD 0(R2), F6 ; store new Y(i) ADDI R1,R1,#8 ; increment X index ADDI R2,R2,#8 ; increment Y index SUBI R3,R1,#800 ; test if done BEQZ R3, foo ; loop if not done

Assume that all of the functional units are fully bypassed and that the FP functional units are fully pipelined except for the divide. Other relevant information is given in the tables below.

FU type Time in EX
Integer 1
FP multiplier 4
FP divider 12
FU type # of FUs
Integer 2
FP multiplier 1
FP divider 1

Producer Consumer Latencies
FP MULTIPLY any FP ALU op 4
FP add any FP ALU op 3
FP MULTIPLY FP store 3
FP load any FP ALU op 1
I ALU op any I ALU op 0
I load any I ALU op 1

a. Using the single issue DLX pipeline (IF/ID/EX/MEM/WB), fill in the table below showing when each instruction begins execution (i.e., enters its first EX cycle), the number of stall cycles for each instruction, and why a stall occurs for the first iteration of the loop. Assume that contention for MEM or WB is resolved by allowing the instruction that issued earliest to complete. Ignore the branch delay.

Instruction First EX Cycle # Stall Cycles Why Stall Occured
 LD F2, 0(R1)
3 0
 MULTD F4,F2,F0
5 1 RAW
 LD F6, 0(R2)
6 1 structural hazard
8 2 RAW
 SD 0(R2), F6
9 1 structural hazard in decode stage
10 1 structural hazard in fetch stage
12 1 WAR
 SUBI R3,R1,#800
13 1 structural hazard in decode stage
 BEQZ R3,foo
14 1 structural hazard in fetch stage
 LD F2, 0(R1)
15 0
 MULTD F4,F2,F0
18 1 RAW

Using the DLX with scoreboarding

Instruction Unit IS RO EX WB Comments
 LD F2, 0(R1)
INT1 1 2 3 4
 MULTD F4,F2,F0
MULT 2 4 5-8 9 stalls in cc 3, waiting for F2
 LD F6, 0(R2)
INT2 3 4 5 6
ADD1 4 9 10-12 13 stalls in cc 5-8, waiting for F4,F6
 SD 0(R2), F6
INT1 5 13 14 15 stalls in cc 6-12, waiting for F6
INT2 6 7 8 9
INT2 9 10 11 14 stall 12-13
 SUBI R3,R1,#800
INT2 14 15 16 17
 BEQZ R3,foo
INT1 15 16 17 18

Using the DLX with Tomasulo's Algoritm

Instruction UNIT IS EX WR Comments
 LD F2, 0(R1)
INT1 1 2-3 4
 MULTD F4,F2,F0
MULT1 2 4-8 9 waiting for F2
 LD F6, 0(R2)
INT2 3 4-5 6
ADD1 4 9-12 13 waiting for F4, F6, on cc 5-8
 SD 0(R2), F6
INT1 5 13-14 15 waiting for F6 on cc 6-12