CMSC 411 Spring 1997 Final Solutions

Problem 1

1.1 
	MIPS and MFLOPS depend on both IC and exe time. After
optimization, IC is decreased, performance increases, while MIPS and
MFLOPS rating decreases. Average exe time measure how fast the CPU is,
regardless of IC. And sometimes program may have a low floating point
operation count, and thus a low MFLOPS rating, while the performance might
be quite high. MIPS and MFLOPS rating depend on the program. The same
machine can have quite different MIPS and MFLOPS ratings on different
programs. 

1.2
	A good benchmark should profile the typical workload of the
machine. Benchmarks that consist of programs that you don^Òt usually run on
the machine will not measure how good the expected performance will be. So
to construct an appropriate benchmark, you should use the list of programs
with their frequencies. 

1.3
	The 10% of code is hard to optimize. This is the core of the
program and may not need to run frequently. The rule of thumb is to make
common case fast. But the limit of speed up is less than 1/(1-FE). The FE
may be too small to begin with and the overall speed up is hence limited. 

1.4
	By doing loop unrolling, register renaming, scheduling, compilers
can minimize RAW, WAR, and WAW hazards and reduce branch delay penalities.
But compilers can not resolve structural hazards, which is an issue for
hardware. Limitation of compilers may be it not being able to find
appropriate instructions to issue to keep the pipeline full. 

1.5
	CPU = CPI * IC * CC
	IC and CC are easy to compute since they can easily counted or
measured. CPI is hard to compute since it^Òs the average of cycles for an
instruction. We have to find the cycles and frequencies for each type of
instruction and sum them over.

Problem 2

2.
 A. T
 B. F
 C. F
 D. ?
 E. F
 F. T
 G. F

2.1 False. The dirty bit is used to indicate whether the block has been
modified.  

2.2 False. Dynamic scheduling is use to issue and execute instructions out
of order. 

2.3 False. Compiler writers, instead of programmers, are required to
decide what instructions can be issued simultaneously. 

2.4 False. The principal difference is that RISC architecture is
pipelined, while CISC is not. In addition, RISC uses simpler addressing
modes than CISC.

2.5 False. The code being executed has to be taken into consideration. For
some codes, they may not take advantage of multiple issue, may not be able
to find enough instructions to issue simultaneously and cannot keep the
multiple machine full and result in the same performance as single issue
machines.

problem

3.1.1
1   2   3   4   5   6   7   8   9  10   11   12   13   14   15   16   17   18
F   D   X   M   W
    F   D   X1 X2  X3  X4  X5  X6  X7    M    W
        F   D   S   S   S   S   S   S   X1   X2   X3    X4   M    W
            F   S   S   S   S   S   S    D    X    S     S   M    W
                        F   D   S   S    X    M    W
                            F   S   S    D    X    M     W

    CPI = 19/6    (plus one for branch penalty)

3.1.2 
LOOP:   LD F0, 0(R1)
        LD F4, -8(R1)
        LD F6, -16(R1)
        LD F8, -24(R1)
        LD F10, -32(R1)
        LD F12, -40(R1)
        LD F14, -48(R1)
                
        MULTD F0, F0, F2
        MULTD F4, F4, F2
        MULTD F6, F6, F2
        MULTD F8, F8, F2
        MULTD F10, F10, F2
        MULTD F12, F12, F2
        MULTD F14, F14, F2

        SUBD F0, F0, F2
        SUBD F4, F4, F2
        SUBD F6, F6, F2
        SUBD F8, F8, F2
        SUBD F10, F10, F2
        SUBD F12, F12, F2
        SUBD F14, F14, F2

        SD 0(R1), F0
        SD -8(R1), F4
        SD -16(R1), F6
        SD -24(R1), F8
        SD -32(R1), F10
        SD -40(R1), F12
        SD -48(R1), F14

        SUBI R1, R1, #56
        BNEZ R1, LOOP
        SD 8(R1), F14

    CPI = 1

Problem 4

4.1 
    Overall speedup = 1 / ( 1 ^Ö FE + FE/speedup of FE ) 
    Overall speedup is less than 1 / ( 1 ^Ö FE). When the answer of
speedup is negative, it means the proposed overall speedup is not
possible, namely, greater than its limit. Or the negative speedup
indicates the proposal will result a slow down in performance. 

4.2
    Having separate instruction and data memory eliminate structural
hazards when IF and MEM stages happen at the same time. Potential
drawbacks may be we need more memory ports, one read for IF, one
read/write for MEM and higher bandwidth for data transfers. 

4.3
    Branches make up about 15% of IC. Not being able to handle control
hazard efficiently will significantly increase CPI. To predict branch
behavior more accurately will reduce stalls. Misprediction will result in
refetch, sometimes flushing pipeline and often stalls. There are many ways
to reduce control hazards: 1) sacrifice three stalls for each branch. 2)
predict branch taken or not taken 3) fill branch delay slot with
appropriate instructions. 4) scheduling. The best strategy is to use 3 and
4 together. 

4.4
    PowerPC 620 has four stages: fetch, issue, execute and commit,
with instruction buffer between fetch and issue, reservations stations
between issue and execute, reorder buffer between issue and commit, and
rename registers between execute and commit. The buffers allow one stage
to slip with respect to another. The buffers also limit slippage. If the
buffer is full, the instruction filling it will be stalled; if the buffer
is empty, the instruction emptying it will be stalled. It^Òs importation to
keep the instruction buffer to take advantage of the multiple function
units. 

4.5
    Multiprocessors may modify memory at different times and result in
stale data used and incorrect computations. While one processor has not
yet written to the memory, another processor reads the stale data or
writes to it first. Or while one processor is currently writing and the
other processor grabs the data, which could be either the old or the new
value.

Back