CMSC 411 Fall 1998 Midterm Solutions - set 2

Courtesy of a fellow student ...
Problem 1: True Lies
--------------------
A: False:   RISC has very few register addressing modes to simplyify
the instructions.

B: True

C: False:  The MIPS rating Meas`ns absolutely squat.  The overhead
on one machine may be totally different than on another.

D:  False:  There is a byte alignment error.  100 is not evenly 
divisable by 8

E: False:  This really depends on the instructions and the hardware.

F: True

G: False: The RISC Arcitechture is proff that this is false.  The RISC
Arcitechure uses computer hardware to solve the hazards, thus 
increasing the CPI.

H:  False.  In actual fact this question is false, since the next
instruction on a sequential machine (CISC) waits til the last 
instruction is finished, and the PC is then waiting.  In a RISC 
machine, however, this is ALWAYS true.

I: False:  

J: True

K: False:  Latency is the number of clocks from when it enters the 
FPADD, til the sum is ready.

L: False:  The length of the clock cycle is determined by the SLOWEST
Stage.

M: False:  DIV is not fully Pipelined, but multiply, add, and subtract 
are.


Problem 2:  Feirce Creatures
----------------------------
2.1: This question is similar to the one on the last Quiz.
     The following information is used to solve the problem:

     The clock RATE was DECREASED by 20%. 
     CRnew = .8CRold
     The CPI is INCREASED by 30%
     CPInew = 1.3 CPIold
     The IC is DECREASED by 40%
     ICnew = .6ICold

     So the equation is:

      ICold        CPIold    .8CRold      .8        .8
     -------  *  --------- * ------- = -------- = ----- = 1.02
     .6ICold     1.3CPIold    CRold    .6 * 1.3    .78


2.2  The question uses percentages of the execution time.
     Amdahls law uses execution time.

     Unoptimized Percentages of execution time:
     ALU = 50%
     Loads = 20%
     Other = 30%

     Optimized Percentages of Execution Time:
     ALU = 25%
     Loads = 10%
     Other = 65%  ==>  This is the remaining time not used
                       in the other instructions already listed.
                       since this is a percentage of program
                       run-time.

     Using the picture method of Amdahl's Law, we get the total
     execution time of the unmodifed process to be "A", and the 
     total execution time of the Modified process to be "B".
     Also the Unmodified potrion was the portion that did not 
     change.   Based on this information, the following is true:

               x             x
    .65 = ---  ;  .30 = ---
               B             A

     The question asks for effective speed up, which is A/B.
     well above we have enough information to figure out what
     A and B are.  Minor Algebra makes the Above equations trivial:

    .65B = x ;  .30A = x;   .65B = .30A;  By this we can say:

        A   .65
        - = ---  This is the effective Speed up.
        B   .30


2.3  This is the Normalization question.  The setup is simple:

     Original program:
     ALU ops:  40%
     Loads:    20%
     Stores:   15%
     All others: 25% ==> Not given but not all instructions are of the
                         Types above
    
     With the updates, 1/4 of the ALU ops do not have loads, so 10%
     of the loads disappear, 10% of the ALU ops disappear, but 10% of
     the instructions become ALU-L ops.  The rest are unaffected.  So
     the new IC breakdown looks like this:

     Upgraded Program:
     ALU ops:   30
     ALU-L ops:  10
     Loads:     10
     Stores:    15   ==>  No Changes
     Other:     25   ==>  No Changes

     This breakdown, however, does not add up to 100%.  So it must be 
     normalized.  The total of the the above is 90, not 100.  So it 
     must be divides out on each of the catagories.

     The answer to the question, then, is:

     ALU ops:     30/.9 = 33.3%
     ALU-L ops:   10/.9 = 11.1%
     Loads:       10/.9 = 11.1%
     Stores:      15/.9 = 16.7%
     Other:       25/.9 = 27.8%  ==> Not asked for, but helps to make
                                     sure Total is 100%

Problem 3:  Trading Places:
---------------------------
3.1  This question is looking for the type of architechure in terms
     of the GPR measure.  DLX is  (0,3) architechture, and since the
     only change was in the Loads, and stores, not in the ALU 
     instructions, it too is a (0,3) as well.

3.2  This is coding in the new architechture.  The only difference
     between this and the original DLX is the memory references.
     So here is one version of the code:


     ADDI R1, R0, #100
     LW R2, (R1)
     ADDI R1, R1, #4
     LW R3, (R1)     
     ADDI R1, R1, #4
     LW R4, (R1)
     ADDI R1, R1, #4
     LW R5, (R1)
     ADD R2, R2, R3
     ADD R5, R5, R4
     ADDI R1, R0, #200
     ADD R2, R2, R5
     SW (R1), R2

3.3  Because you need to compute whether or not the branch condition
     is true or false to determine whether the branch is taken, you
     either need to wait til the EX2 phase is complete, to use the 
     ALU hardware, or put in new hardware somewhere in the Mem1 phase.
     putting condition checking hardware in the MEM1 phase would allow
     use of the PC after the clock pulse of Mem1.  If you use the 
     existing hardware the PC would be ready after EX2, or perhaps EX1.

3.4  The hazards are similar to the original DLX model.  The potential 
     hazards are as follows:
     Control Hazards
     Data Hazards
       RAW
       WAW
     Structural Hazards (Resource Contention)

3.5  CPI will vary in this architechture.  Depending on the type of
     instructions executed.
     IC will be higher in the DLX-LAX, since you do not have the
     use of the offset.  So instead of using an immediate offset, you
     must have an additional instruction to initiate a register with
     the offset.
     Clock Cycle time will be shorter due to the splitting of the
     memory stage.


Problem 4: Forever Young:
-------------------------
4.1  The problem is looking fo rthe Consumer-Producer pairs.  Those
     pairs are:

     Pair           Code Fragment
     (Ex2, Mem1)    ADD R1, R0, R3
                    SW (R2), R1
                    3 stalls neccessary.
                

     (Ex2, Ex1)     This doesnt happen in this arcitechture
                    since the ALU Writeback is finished
                    in time for the Decode phase.

     (Mem2, Mem1)   LW R2, (R1)
                    SW (R3), R2
                    One Stall Neccessary, before entering
                    Mem1 stage

     (Mem2, Ex1)    LW R2, (R1)
                    ADD R2, R2, R3
                    No Stalls Neccessary.

4.2  This question is fairly simple in terms of the hints given.
     With a sequential machine, the clock cycle time is the total of
     all the steps, if it is to complete in one clock cycle anyway.
     The clock cycle of the piped version is the length of the longest
     cycle, plus 1 ns for overhead.  Therefore the clock cycle times
     for each machine are as follows:

     CCseq = 44 ns
     CCpipe = 9 ns  (ID phase takes 8 ns, plus 1 for overhead)

     Next is the execution time.  Well the equation for execution time
     is the same for each machine, and is:

     EQseq = IC * CCseq * CPIseq
     EQpipe = IC * CCpipe * CPIpipe

     In the hint, it says that the CPI for the sequential machine is 1,
     and in the piped version, the CPI approaches its ideal value,
     which is also 1.  So as the CPI of the piped machine approaches
     its ideal state, and with the above info, we get teh following:

     CCseq = 44 ns  ==> From above
     CCpipe = 9 ns  ==> From above

     CPIseq = 1
     CPIpipe = 1  ==>  it is approaching this value

     ICseq = ICpipe  ==>  There is no change in the language, just
                          the machine platform.

     With this information, and given the rules for effective speed up,
     the following is the answer to the question:

     EQseq     IC    CCseq    CPIseq     1   44   1   44
     ------  = --  * ------ * ------- =  - * -- * - = --
     EQpipe    IC    CCpipe   CPIpipe    1    9   1    9
Back