1. This problem addresses how a common vector loop runs on a variety of pipeline
versions of the DLX. The loop is the so-called SAXPY loop, shown below, which
implements the vector operation Y = a X + Y for a vector of length 100.
foo: |
LD F2, 0(R1) |
| | |
; load X(i) |
|
MULTD F4, F2, F0 |
| | |
; multiply a*X(i) |
|
F6, 0(R2) |
| | |
; load Y(i) |
|
ADDD F6, F4, F6 |
| | |
; add a*X(i) + Y(i) |
|
SD 0(R2), F6 |
| | |
; store new Y(i) |
|
ADDI R1,R1,#8 |
| | |
; increment X index |
|
ADDI R2,R2,#8 |
| | |
; increment Y index |
|
SUBI R3,R1,#800 |
| | |
; test if done |
|
BEQZ R3, foo |
| | |
; loop if not done |
Assume that all of the functional units are fully bypassed and that the FP functional units are fully
pipelined except for the divide. Other relevant information is given in the tables below.
FU type |
Time in EX |
Integer |
1 |
FP adder |
3 |
FP multiplier |
4 |
FP divider |
12 |
|
|
FU type |
# of FUs |
Integer |
2 |
FP adder |
2 |
FP multiplier |
1 |
FP divider |
1 |
|
Producer |
Consumer |
Latencies |
FP MULTIPLY | any FP ALU op | 4 |
FP add | any FP ALU op | 3 |
FP MULTIPLY | FP store | 3 |
FP add | FP store | 2 |
FP load | FP store | 0 |
FP load | any FP ALU op | 1 |
I ALU op | any I ALU op | 0 |
I load | any I ALU op | 1 |
a. Using the single issue DLX pipeline (IF/ID/EX/MEM/WB), fill in the table below
showing
when each instruction begins execution (i.e., enters its first EX cycle), the number of
stall cycles for each instruction, and why a stall occurs for the first iteration of
the loop. Assume that contention for MEM or WB is resolved by allowing the instruction
that issued earliest to complete. Ignore the branch delay.
Instruction |
First EX Cycle |
# Stall Cycles |
Why Stall Occured |
|
3 |
0 |
|
|
5 |
1 |
RAW |
|
6 |
1 |
structural hazard |
|
8 |
2 |
RAW |
|
9 |
1 |
structural hazard in decode stage |
|
10 |
1 |
structural hazard in fetch stage |
|
12 |
1 |
WAR |
|
13 |
1 |
structural hazard in decode stage |
|
14 |
1 |
structural hazard in fetch stage |
|
15 |
0 |
|
|
18 |
1 |
RAW |
Using the DLX with scoreboarding
Instruction |
Unit |
IS |
RO |
EX |
WB |
Comments |
|
INT1 |
1 |
2 |
3 |
4 |
|
|
MULT |
2 |
4 |
5-8 |
9 |
stalls in cc 3, waiting for F2 |
|
INT2 |
3 |
4 |
5 |
6 |
|
|
ADD1 |
4 |
9 |
10-12 |
13 |
stalls in cc 5-8, waiting for F4,F6 |
|
INT1 |
5 |
13 |
14 |
15 |
stalls in cc 6-12, waiting for F6 |
|
INT2 |
6 |
7 |
8 |
9 |
|
|
INT2 |
9 |
10 |
11 |
14 |
stall 12-13 |
|
INT2 |
14 |
15 |
16 |
17 |
|
|
INT1 |
15 |
16 |
17 |
18 |
|
Using the DLX with Tomasulo's Algoritm
Instruction |
UNIT |
IS |
EX |
WR |
Comments |
|
INT1 |
1 |
2-3 |
4 |
|
|
MULT1 |
2 |
4-8 |
9 |
waiting for F2 |
|
INT2 |
3 |
4-5 |
6 |
|
|
ADD1 |
4 |
9-12 |
13 |
waiting for F4, F6, on cc 5-8 |
|
INT1 |
5 |
13-14 |
15 |
waiting for F6 on cc 6-12 |
|
INT2 |
6 |
7-8 |
9 |
on cc 9 there is a structural hazard |
|
INT2 |
10 |
11-12 |
14 |
structural hazard bus on wb on cc 13 |
|
INT2 |
14 |
15-16 |
17 |
structural hazard Int on 11-13 |
|
|
|
|
|
|
|
|
INT1 |
15 |
16-17 |
18 |
structural hazard int |
|
INT2 |
17 |
19-23 |
24 |
wait for f2 |