CMSC 411 Fall 1998 Midterm Solutions - set 1

The following is a set of solutions to the midterm as supplied by a fellow student.

Problem 1
---------
A. False.  A RISC machine usually supports fewer register address modes
           that CISC machines (DLX supports 3).  Also, more complex
           modes cause higher CPI.
B. True.   Data hazards can be solved using forwarding.
C. False.  See p.44 of textbook.
D. False.  This is loading a double word, which occupies 8 bytes in
           memory. 100 mod 8 = 4 and this causes a memory alignment
           error. 
E. False.  There isn't a strict relationship between pipe depth and CPI.
F. True.   Rate = time/job, Performance = 1/Rate, Performance = job/time.
G. False.  See p.33 of textbook.
H. True/False.  True for a pipelined machine, false for a sequential one.
I. False.  Structural hazards are fixed by splitting the instruction
           and data memories (called a Harvard Architecture) or by
           adding more read/write ports.
J. True.   M is in memory, and (1,2) means that at most one operand to
           an ALU instruction can be in memory.
K. False.  Latency is the number of clock cycles between the first EX
           stage (M1) and the availability of the sum (M4).  4 - 1 = 3.        
L. False.  Length of the clock cycle is determined by the slowest
           stage in the pipeline.
M. False.  The floating-point divide is not pipelined.



Problem 2.1
-----------
CRnew = .8 Crold
ICnew = .6 ICold            Xeq = (CPI*IC)/CR
CPInew = 1.3 CPIold 

Speedup Overall=XeqOld/ XeqNew =(CPIold*ICold*CRnew)/(CRold*CPInew*ICnew)

Speedup Overall = .8/(.6*1.3) = 1.026       Very little speedup

Problem 2.2
-----------
    Old                         New
    Inst        time            inst        time
    ALU         .5 XeqOld       ALU         .25 XeqNew
    Load        .2 XeqOld       Load        .1 XeqNew
    Other       .3 XeqOld       Other       .65 XeqNew

Time spent on Other inst. Old = Time spent on Other inst. New
(Because the problems says other instructions are unaffected by the compiler.)
               .3 XeqOld = .65 XeqNew
                  XeqNew = .3/.65 XeqOld = .46 XeqOld

Speedup Overall = XeqOld/XeqNew = 1/.46 = 2.17

Problem 2.3
-----------
    Old     
Inst.       freq. (%)
ALU         .4
Loads       .2
Stores      .15
Other       .25
            = 1.0

A quarter of the ALU instructions cause a load, so the addition of the
ALU-L eliminates a quarter of the ALUs ( or .1).  The same number of loads
must also be eliminated by the ALU-L, so the loads decrease by .1 also.
The new frequencies are:

Inst.       Freq. (%)
ALU         .3
ALU-L       .1
Loads       .1
Stores      .15
Other       .25
            =.9     We have to re-normalize^Å

Inst.       Freq. (%)
ALU         .3/.9 = .334
ALU-L       .1/.9 = .111
Loads       .1/.9 = .111
Stores      .15/.9 = .167
Other       .25/.9 = .277
            = 1.0

Problem 3.1
-----------
DLX-LAX is a (0,3) RISC ISA, just like good old DLX.  The changes made
to the original DLX don't affect the format of the ALU instructions, which
the (m,n) notation refers to.

Problem 3.2
-----------
ADDI R1, R0, #100       // R1 IS FOR INDEXING
LW R2, (R1)             // R2 <- MEM[R1]    (A)
ADDI R1, R1, #4         // INCREMENT FOR NEXT WORD
LW R3, (R1)             // R3 <- MEM[R1]    (B)
ADDI R1,R1, #4          // R1 <- R1 +4
LW R4,(R1)              // R4 <- MEM[R1]   (C)
ADDI R1, R1,#4          // R1 <- R1 +4
LW R5, (R1)             // R5 <- MEM[R1]  (D)
ADD R6,R2,R3            // R6 = A + B
ADD R7,R4,R5            // R7 = C + D
ADD R8, R6,R7           // R8 = A + B + C + D
ADDI R1,R0,#200         // R1 <- 200
SW (R1), R8             // MEM[R1] <- R8

Problem 3.3
-----------
The changes made to the DLX architecture to make it DLX-LAX do not
affect the way we handle branches.  We know an inst is a branch by the ID
stage, and we have the necessary data to test by ID as well.  So, we can
test the branch condition in ID just as in the 5 stage DLX.

Problem 3.4
-----------
Structural hazards - resource contention:  if for example, ID and ALUWB
occur simultaneously (register file can't handle both), or if IF1, IF2 and
MEM1, MEM2 occur simultaneously (no shared memory).

Data hazards -  RAR: no such thing.
                WAR: may be possible.
                WAW: if ALUWB completes after a subsequent EX1/LWB, the
                     old value may overwrite new one.
                RAW: ALU's don't get written until stage 8, but
                     dependent ALU's need them in EX1.

Control hazards - just as in 5 stage DLX (see 3.2)

Problem 3.5
-----------
The clock cycle time for DLX-LAX would probably be lower, since the
longer stages involving memory (fetch, mem) have been split up.  CPI on
DLX-LAX would probably be slightly higher also (but not conclusively)
because the clock cycle time is shorter.  IC would increase because the
lack of any addressing modes besides register deferred means we need extra
ALU operations to calculate address for loads and stores.


Problem 4.1
-----------
Assume still no forwarding or [WB, ID], no structural hazards present
(because of separation of instruction/data memories and multiple read/write
ports).

Stage A (Producers) - ALUWB, EX1/LWB
Stage B (Consumers) - EX1/LWB, MEM1

1. (EX2, EX1/LWB)     1    2    3    4    5    6    7    8    9   10   11
    ADD R3,R1,R1     F1   F2   ID   M1   M2   EX1  EX2  WB
    ADD R4,R1,R3          F1   F2   ID   M1   M2    S    S   EX1  EX2  WB
                      2 stalls needed

2. (EX2,MEM1)         1    2    3    4    5    6    7    8    9   10   11
    ADDI R2,R0,#100  F1   F2   ID   M1   M2   EX1  EX2  WB
    LW R4,(R2)            F1   F2   ID    S    S    S    S   M1   M2   EX1
                      4 stalls needed

3. (MEM2, EX1/LWB)    1    2    3    4    5    6    7    8    9   10   11
    LW R2,(R1)       F1   F2   ID   M1   M2   EX1  EX2  WB
    ADD R4,R2,R3          F1   F2   ID   M1   M2   EX1  EX2  WB
                      No stalls needed

4. (MEM2, MEM1)       1    2    3    4    5    6    7    8    9   10   11
    LW R2,(R3)       F1   F2   ID   M1   M2   EX1  EX2  WB
    LW R4,(R2)            F1   F2   ID    S    S   M1   M2   EX1  EX2  WB
                      2 stalls needed

Problem 4.2
-----------
Speedup from pipeline = Xeq Unpiped/ Xeq Piped
Xeq = CPI*IC*CC

    Piped               Unpiped
    Same        IC      Same
    8+1= 9ns    CC      44ns
from overhead           sum of all stages
    1           CPI     1
ideal because no stalls present note:  1 inst/CC = 1 CC/inst -> 1 CPI

    Xeq Unpiped = 1 * IC * 44 = 44IC
    Xeq Piped   = 1 * IC * 9  =  9IC

Speedup from pipeline = Xeq Unpiped/ Xeq Piped = 44/9 = 4.89

Note:  this makes sense because in a perfect world^Å
    Speedup from pipeline goes like  =  Time Unpiped/# stages in pipeline
                          =  44/8 = 5.5
the decrease from ideal performance comes from the 1ns overhead in each stage.
Back