CMSC 411 Spring 1997 Midterm Solutions

Problem 1

1.1
  Amdahl's Law shows that overall speedup is achieved by speeding up some
part (lets call that part y) by some amount (lets call that k).  The whole
program is x+y, so speedup is 1/(x+y/k).  Note that the more frequent a bit
of code is, it's corresponding y (or x) is bigger, therefore, assuming the
"common case" is y, making the common case fast really improves the program.

1.2
  YES!  R0 is not a "write to" register, so the compiler should be smart
enough to never use R0 as a destination, so there will not be any physical
errors involving R0.  RAR is not a hazard.  RAW is not a hazard bacause
there is no writing to R0, WAR is not a hazard for the same reason, and WAW
is also not a hazard for the same reason.  There can be a control hazard if
BNEZ loop, R0 is executed and we're assuming taken or BEQZ loop, R0 assuming
not taken. 

1.3
Advantage:
  Using offset based addressing allows a rudimentary stack interface (if you
want one) or array indexing. Of course, you are simulating these features
through code, but you get them.  And, you don't need to store an
address in the instruction--merely a smaller offset to the PC address.

Disadvantage:
  It's possible to have an alignment error. And, you are limited in
the range of addresses that can be referenced with offsets by the
length of the offset.

1.4
  NO!  I can write a microcoded turing machine that will implement any
instruction set.  More to the point, the pipeline need not be used for the
DLX ISA.  Therefore we know two hardware interfaces that implement DLX, so
DLX doesn't uniquely determine it's hardware, and DLX is a valid ISA. 

1.5
  Accumulator machines use one register to do all of their work, so it must
be at least (1,2) to get any useful work done (GPR machines can do useful
work in (0,2). (NOTE: the (m,n) notation requires more than 1
register. The accumulator only has one register; so, it's not a GPR!!)
  Make the common case fast!  If there is one register, and registers are
fast, and all work must be done in registers, and memory is slower than the
register (virtually guaranteed), the common case isn't fast for an
accumulator.  GPR machines do work in the fast registers. 

1.6
  Adding forwarding and bypassing allows some stalls to be removed.  On
average, at least 1 stall will be removed, so the CPI should be lower
(Sp(overall) = pipe depth / (1 + stalls/instruction)). 

1.7
  Both program A and B run faster.  Assuming overall speedup, A runs 2.5
times faster, and B runs 3.14 times faster.  Therefore the execution times
are SHORTER.  However, no information about the relationship between
the execution times of A and B can be inferred from this result.

Problem 2

2.1
  It doesn't matter what K is, speedup can NOT be 4.  If you're only
enhancing half of the instructions, the upper bound of the speedup is 2. 

2.2
  time new == (x + y/K)
Since x is the time spent in the unenhancable part and y is the part that
can be improved, and K is the enhancement on y.

2.3
  Speedup overall = T/Tho = x+y / (x + y/K)
This differs from the usual Amdahl's Law equation since x+y is not yet
normalized to be 1.  let w = x/(x+y) and z = y/x+y), then 
SPoverall = 1/(w+z/K).

2.4
  let x be unenhanceable time
  let 10y be old floating time
  let y be new time

  x + 10y = 50
  x +   y = 10
  
  subtracting
       9y = 40
  so y = 40/9       <--- answer

  Picture:
           __________________________ 
  before :| x  |       10y          | == 50
          ________
   after :| x  |y | == 10 

 by  the way, x = 50/9, and 50/9 + 40/9 = 90/9 = 10
                  50/9  + 400/9 = 450/9 = 50 so it checks too!

Problem 3

I don't think problem 1 is actually solvable since there's no way to build 0.5 without divide or load float immediate. The shifting won't work unless it's a really wierd FP format, it wont work on IEEE 754 --- tonyj

Unless you realize that you can actually hard code "0.5" into a memory location and then load it into an FP register, then tony is right. you can't divide by two, or, rather, multiply by one-half. --- mmh

3.1
  Assume location of B is at R1;
  Assume location of C is at R2;
  Assume location of future X1 is at R3;
  Assume location of future X2 is at R4;
  Assume single precision Floating Point;
  Assume sqrt takes F0, sqrts it,and returns to Floating Point Register F0
         as single precision;
  Assume LFI exists because we can't do a shift on a float and expect
         meaningful results, and division just plain doesn't work, and I
         can't come up with a way to generate 0.5.  We need 0.5 to do
         division by 2.

	LF	F1, 0(R1)	// Get B, save to F1
	LF	F2, 0(R2)	// Get C, save to F2
	ADDI	R10, R0, #-4	// Load -4 to R10
	MOVI2FP	F0, R10		//   then put it in the FP regs
	CVTI2F	F0, F0		//   and finally turn it into a float 
	MULTF	F2, F0, F2	// C = -4C by using -4 calculated before
	MULTF	F3, F1, F1	// Calculate B squared, save to F3
	ADDF	F0, F3, F2	// Add F3 and F2 aka B^2 + (-4C), save to F0
	MOVI2FP	F10, R0		// Fetch (int) 0 from the integer regs
	CVTI2F	F10, F10	//   and make it (float) 0
	LTF	F0, F10		// Compare F0 to F10, save result in
				//   magical status register (aka is F0
				//   negative)
	BFPT	noroots		// if compare is true, then there's no roots
	JAL	sqrt		// Procedure call sqrt (see assumption)
				// Note that PC + 4 is saves in R31 for return
	ADDI	R10, R0, #-1	// Load (int) -1 into R10
	MOVI2FP F10, R10	//   then put it in a FP reg.
	CVTI2F	F10, F10	//   and make it (float) -1
	MULTF	F1, F1, F10	// make B = (-1)*B
	ADDF	F11, F1, F0	// make + numerator in F11, aka 2X1
	SUBF	F12, F1, F0	// make - numerator in F12, aka 2X2
	LFI	F10, #0.5	// See comment above
	MULTF	F11, F11, F10	// F11 = F11 * 1/2, so F11 = X1
	MULTF	F12, F12, F10	// F12 = F12 * 1/2, so F12 = X2
noroots	SF	0(R3), F11	// write back X1
	SF	0(R4), F12	// write back X2
				// Note that garbage was written back when
				// there were no roots.  User's fault, he
				// should've checked
				// DONE!

3.2
	LB	R1, 64(R0)	// load 1st 8 bits from 64
	LB	R2, 65(R0)	// load 2nd 8 bits from 65
	LB	R3, 66(R0)	// load 3rd 8 bits from 66
	LB	R4, 67(R0)	// load 4th 8 bits from 67
	LB	R5, 68(R0)	// load 5th 8 bits from 68
	LB	R6, 69(R0)	// load 6th 8 bits from 69
	LB	R7, 70(R0)	// load 7th 8 bits from 70
	LB	R8, 71(R0)	// load 8th 8 bits from 71
	SB	R8, 64(R0)	// store in reverse order
	SB	R7, 65(R0)	//          " 
	SB	R6, 66(R0)	//          "
	SB	R5, 67(R0)	//          "
	SB	R4, 68(R0)	//          "
	SB	R3, 69(R0)	//          "
	SB	R2, 70(R0)	//          "
	SB	R1, 71(R0)	//          "
				// DONE!

3.3
1	LW	R5, 8(R0)	// Can't use 10 due to byte alignment error

2	ADD	R4, R0, R0	// Can't use Double Precision FP Add here

3	ADD	R3, R4, R5	// Doesn't make sense to put displacement
				// here, ALU can only do one thing at a time

4	ADDI	R5, R5, #1	// R0 can't be dest.,  ADDI requires an 
				// immediate value

5	SW	104(R4), R5	// we don't have indexed mode.

6	correct

7	SW	100(R4), R5	// can't have 103 due to byte alignment

8	ADDI	R5, R5, #-32760	// Immediate too big for 16 bits

9	SW	108(R4), R5	// Dest. comes before source

Problem 4

Here's my attempt at problem 4. I'd like a few comments on it... --- tonyj

NOTE: This is a badly specified problem. Any one you see will have more information, and will require less speculation than this.

4.1
  variables used:  CPI = Average Clock Cycle Count Per instruction
                   IC  = Instruction count
                   CC  = Time for 1 clock cycle

  a) MFLOPS = Floating IC
              ---------------------
              Execution time * 10^6		

     assume a pure floating point instruction mix

     T = Execution Time = Floating IC
                          -------------
                          MFLOPS * 10^6

  b)  MIPS = Total IC
             ---------------------
             Execution time * 10^6

      T = Execution Time = Total IC
                           -----------
                           MIPS * 10^6


4.2
  Assume CPI of ALU, Load/Store, Branches == 1

a)
CPU time = IC x CPI x 1 / CR 

CR = Clock rate, measurec in cycles per second (Hz)

IC for compiler O  = 5+1+1 = 7 * 10^6
IC for compiler HO = 10+1+1 = 12 * 10^6

CPI is unchanged, assume = 1

CR is unchanged, given as 100 MHz = 100 * 10^6 cycles / sec


b)  MIPS = IC / (CPU time * 10^6)
         = IC / (IC x CPI x 1/CR * 10^6)
         = 1 / (CPI x 1/CR * 10^6)            // CPI and CR are constant
       	 = 1 / (1 x 1/(100 x 10^6) x 10^6)
         = 1 / (1/100)
         = 100

c) MFLOPS assumes that there's a floating point insruction, which there
   isn't, so MFLOPS is undefined
-----------------------------------------------
		Compiler O	Compiler HO
a) CPUtime:	  .007s		   .012 s
b) MIPS:	   100 	 	    100
c) MFLOPS:	undefined 	 undefined


4.3
  The new compiler is bad, it produces longer code.  The branches are not
reduced, the loads are unchanged, but the ALU instructions doubled.
That's bad.

4.4
  If MIPS of O is higher than HO, I can assume that CPI != 1 for all
instuctions, and that the code produced by O have more instructions with
lower CPI (for example, software based floating point instead of hardware
FP).  The reversal of MFLOPS supports the theory above, implying that
there are much fewer FP instruction in the O code.

Based on these observations, MFLOPS and MIPS are horrible for use as
predictors of execution time, since the FP mix is not constant.

Problem 5

Here Problem 5 (whew, I'm done) (you need to read this at more than 80 columns)

5.1
  Assume read and write on registers during 1 clock is allowed, write
     first, read second.
  Assume 1 read/write port on memory (so resource contention does exist)

  Control   : None (there are no branches)
  Data      : RAW hazard on line 3:

LW R5, 20(R0)  IF  D   EX  MEM WB
ADD R4, R0, R0 --  IF  D   EX  MEM WB
SW 100(R4), R5 --  --  IF  D   st  st  EX   <-- RAW hazard

       solve by forwarding from LW R5, 20(R0)'s MEM to SW's EX

 
  Structural : Hazard on line 4, can't read and write at the same time. 
               -  Solve by separate memories or multiple ports
               contention for instruction and data memory on line 6.
               Error on line 7

1 LW R5, 20(R0)  IF  D   EX  MEM WB
2 ADD R4, R0, R0 --  IF  D   EX  MEM WB
3 SW 100(R4), R5 --  --  IF  D   st  st  EX  MEM WB
4 ADD R5, R5, R5 --  --  --  IF  st  st  ID  EX  MEM WB 
5 SW 104(R4), R5 --  --  --  --  --  --  IF  ID  st  st  EX  MEM WB
6 ADD R5, R5, R5 --  --  --  --  --  --  --  IF  st  st  ID
7 SW 106(R4), R5  <-------------------------- can't load word from location 106, structural hazard
8 ADD R5, R5, R5
9 SW 108(R4), R5


5.2
    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
1  IF  D   EX  MEM WB
2  --  IF  D   EX  mem WB
3  --  --  IF  D   st  st  EX  MEM wb
4  --  --  --  IF  st  st  ID  EX  mem WB 
5  --  --  --  --  --  --  IF  ID  st  st  EX  MEM wb
6  --  --  --  --  --  --  --  IF  st  st  ID  EX  mem WB
7  --  --  --  --  --  --  --  --  --  --  IF  ID  st  st  EX  MEM wb
8  --  --  --  --  --  --  --  --  --  --  --  IF  st  st  ID  EX  mem WB
9  --  --  --  --  --  --  --  --  --  --  --  --  --  --  IF  ID  st  st  EX  MEM wb

NOTE: I modified the code in lines 7 and 9 to make it legal

CPI = clocks / instrutions == 21/9  == 2.333


5.3
                  1   2   3   4   5   6   7   8   9  10  11  12  13
1 LW R5, 20(R0)  IF  D   EX  MEM WB
2 ADD R4, R0, R0 --  IF  D   EX  mem WB
3 SW 100(R4), R5 --  --  IF  D   EX  MEM wb                            (forward from 2.EX to 3.EX)
4 ADD R5, R5, R5 --  --  --  IF  ID  EX  mem WB                        (forward from 1.MEM to 4.EX)
5 SW 104(R4), R5 --  --  --  --  IF  ID  EX  MEM wb                    (forward from 4.EX to 5.MEM)
6 ADD R5, R5, R5 --  --  --  --  --  IF  ID  EX  mem WB                (forward from 4.EX to 5.EX)
7 SW 108(R4), R5 --  --  --  --  --  --  IF  ID  EX  MEM wb            (forward from 6.EX to 7.MEM)
8 ADD R5, R5, R5 --  --  --  --  --  --  --  IF  ID  EX  mem WB        (forward from 6.EX to 8.EX)
9 SW 112(R4), R5 --  --  --  --  --  --  --  --  IF  ID  EX  MEM wb    (forward from 8.EX to 9.MEM)

CPI = 13/9 == 1.44

No data hazards.

Back