CMSC 411 Fall 1998 Midterm Solutions - set 1
The following is a set of solutions to the midterm as supplied by a
fellow student.
Problem 1
---------
A. False. A RISC machine usually supports fewer register address modes
that CISC machines (DLX supports 3). Also, more complex
modes cause higher CPI.
B. True. Data hazards can be solved using forwarding.
C. False. See p.44 of textbook.
D. False. This is loading a double word, which occupies 8 bytes in
memory. 100 mod 8 = 4 and this causes a memory alignment
error.
E. False. There isn't a strict relationship between pipe depth and CPI.
F. True. Rate = time/job, Performance = 1/Rate, Performance = job/time.
G. False. See p.33 of textbook.
H. True/False. True for a pipelined machine, false for a sequential one.
I. False. Structural hazards are fixed by splitting the instruction
and data memories (called a Harvard Architecture) or by
adding more read/write ports.
J. True. M is in memory, and (1,2) means that at most one operand to
an ALU instruction can be in memory.
K. False. Latency is the number of clock cycles between the first EX
stage (M1) and the availability of the sum (M4). 4 - 1 = 3.
L. False. Length of the clock cycle is determined by the slowest
stage in the pipeline.
M. False. The floating-point divide is not pipelined.
Problem 2.1
-----------
CRnew = .8 Crold
ICnew = .6 ICold Xeq = (CPI*IC)/CR
CPInew = 1.3 CPIold
Speedup Overall=XeqOld/ XeqNew =(CPIold*ICold*CRnew)/(CRold*CPInew*ICnew)
Speedup Overall = .8/(.6*1.3) = 1.026 Very little speedup
Problem 2.2
-----------
Old New
Inst time inst time
ALU .5 XeqOld ALU .25 XeqNew
Load .2 XeqOld Load .1 XeqNew
Other .3 XeqOld Other .65 XeqNew
Time spent on Other inst. Old = Time spent on Other inst. New
(Because the problems says other instructions are unaffected by the compiler.)
.3 XeqOld = .65 XeqNew
XeqNew = .3/.65 XeqOld = .46 XeqOld
Speedup Overall = XeqOld/XeqNew = 1/.46 = 2.17
Problem 2.3
-----------
Old
Inst. freq. (%)
ALU .4
Loads .2
Stores .15
Other .25
= 1.0
A quarter of the ALU instructions cause a load, so the addition of the
ALU-L eliminates a quarter of the ALUs ( or .1). The same number of loads
must also be eliminated by the ALU-L, so the loads decrease by .1 also.
The new frequencies are:
Inst. Freq. (%)
ALU .3
ALU-L .1
Loads .1
Stores .15
Other .25
=.9 We have to re-normalize^Å
Inst. Freq. (%)
ALU .3/.9 = .334
ALU-L .1/.9 = .111
Loads .1/.9 = .111
Stores .15/.9 = .167
Other .25/.9 = .277
= 1.0
Problem 3.1
-----------
DLX-LAX is a (0,3) RISC ISA, just like good old DLX. The changes made
to the original DLX don't affect the format of the ALU instructions, which
the (m,n) notation refers to.
Problem 3.2
-----------
ADDI R1, R0, #100 // R1 IS FOR INDEXING
LW R2, (R1) // R2 <- MEM[R1] (A)
ADDI R1, R1, #4 // INCREMENT FOR NEXT WORD
LW R3, (R1) // R3 <- MEM[R1] (B)
ADDI R1,R1, #4 // R1 <- R1 +4
LW R4,(R1) // R4 <- MEM[R1] (C)
ADDI R1, R1,#4 // R1 <- R1 +4
LW R5, (R1) // R5 <- MEM[R1] (D)
ADD R6,R2,R3 // R6 = A + B
ADD R7,R4,R5 // R7 = C + D
ADD R8, R6,R7 // R8 = A + B + C + D
ADDI R1,R0,#200 // R1 <- 200
SW (R1), R8 // MEM[R1] <- R8
Problem 3.3
-----------
The changes made to the DLX architecture to make it DLX-LAX do not
affect the way we handle branches. We know an inst is a branch by the ID
stage, and we have the necessary data to test by ID as well. So, we can
test the branch condition in ID just as in the 5 stage DLX.
Problem 3.4
-----------
Structural hazards - resource contention: if for example, ID and ALUWB
occur simultaneously (register file can't handle both), or if IF1, IF2 and
MEM1, MEM2 occur simultaneously (no shared memory).
Data hazards - RAR: no such thing.
WAR: may be possible.
WAW: if ALUWB completes after a subsequent EX1/LWB, the
old value may overwrite new one.
RAW: ALU's don't get written until stage 8, but
dependent ALU's need them in EX1.
Control hazards - just as in 5 stage DLX (see 3.2)
Problem 3.5
-----------
The clock cycle time for DLX-LAX would probably be lower, since the
longer stages involving memory (fetch, mem) have been split up. CPI on
DLX-LAX would probably be slightly higher also (but not conclusively)
because the clock cycle time is shorter. IC would increase because the
lack of any addressing modes besides register deferred means we need extra
ALU operations to calculate address for loads and stores.
Problem 4.1
-----------
Assume still no forwarding or [WB, ID], no structural hazards present
(because of separation of instruction/data memories and multiple read/write
ports).
Stage A (Producers) - ALUWB, EX1/LWB
Stage B (Consumers) - EX1/LWB, MEM1
1. (EX2, EX1/LWB) 1 2 3 4 5 6 7 8 9 10 11
ADD R3,R1,R1 F1 F2 ID M1 M2 EX1 EX2 WB
ADD R4,R1,R3 F1 F2 ID M1 M2 S S EX1 EX2 WB
2 stalls needed
2. (EX2,MEM1) 1 2 3 4 5 6 7 8 9 10 11
ADDI R2,R0,#100 F1 F2 ID M1 M2 EX1 EX2 WB
LW R4,(R2) F1 F2 ID S S S S M1 M2 EX1
4 stalls needed
3. (MEM2, EX1/LWB) 1 2 3 4 5 6 7 8 9 10 11
LW R2,(R1) F1 F2 ID M1 M2 EX1 EX2 WB
ADD R4,R2,R3 F1 F2 ID M1 M2 EX1 EX2 WB
No stalls needed
4. (MEM2, MEM1) 1 2 3 4 5 6 7 8 9 10 11
LW R2,(R3) F1 F2 ID M1 M2 EX1 EX2 WB
LW R4,(R2) F1 F2 ID S S M1 M2 EX1 EX2 WB
2 stalls needed
Problem 4.2
-----------
Speedup from pipeline = Xeq Unpiped/ Xeq Piped
Xeq = CPI*IC*CC
Piped Unpiped
Same IC Same
8+1= 9ns CC 44ns
from overhead sum of all stages
1 CPI 1
ideal because no stalls present note: 1 inst/CC = 1 CC/inst -> 1 CPI
Xeq Unpiped = 1 * IC * 44 = 44IC
Xeq Piped = 1 * IC * 9 = 9IC
Speedup from pipeline = Xeq Unpiped/ Xeq Piped = 44/9 = 4.89
Note: this makes sense because in a perfect world^Å
Speedup from pipeline goes like = Time Unpiped/# stages in pipeline
= 44/8 = 5.5
the decrease from ideal performance comes from the 1ns overhead in each stage.
Back