Midterm Solution
________________________________________________________________________
 SOLUTION TO PROBLEM 1
 

A)False. The PowerPC use (0,3) RISC ISA.

B) False. The term MIPS means million instructions per second. The term
MFLOPS means million float points instructions per second. The average
throughput means average tasks completed per unit time.

C) False. Not all GPR machines are RISC architectures.

D) False. Make the common case fast means that we should improve the frequent
case rather than the rare case.

E) False. When we talked about (1,1) machine, we talked about GPR
machine. The accumulator machine has only 1 register. So, it isn't a
GPR machine.

F) False. According to Amdahl's law,
                                          1
Speedup_overall    =    ---------------
                                (1-FE)+ FE/SE   , FE=Fraction of Enhancement,

SE=Speedup of the Enhancement.  More computation needs to be done to
compute the amount of time saved, and it is not proportional to 1/FE
or 1/SE, and there is no possible CONSTANT k.

G) False. The instruction set architecture is the key factor in determining the
average number of clocks per instruction.

H) False. Cache is small fast memory close to CPU which stored the recently used
code or data, it is introduced because the principle of locality of reference
which is the data used recently is more likely used in the near future, also because
of " Make the common case fast.

I) False. Stack machine require no operand in their ALU instruction.

J) False. The principle difference between RISC and CISC architectures is the ISA.
From the name, RISC, reduced instruction set computer, CISC, complex instruction
set computer.

K) False. Because of the pipeline hazard.

L) False. Pipeline is designed to make the CPU fast.
 

________________________________________________________________________
 SOLUTION TO PROBLEM 2.1
 

We know,
                        Ins Count          Clock Rate
        MIPS =  -------------  =  ------------
                     XEQ * 10^6       CPI * 10^6
MIPS is easy to understand, however it does not always mesaure the
computer performance accurately. MIPS is dependent on ins. set, making it
difficult to compare MIPS of computers with diff ins. set. More
importantly, MIPS can vary inversely to performance. On the other hand,
MFLOPS is not of much use when there is no floating point operation.
Execution time of CPU is the real measure of performance.

________________________________________________________________________
 SOLUTION TO PROBLEM 2.2
 

Structural Hazard--     Since we have only one memory port, we will have
        the problem of our limited resource when both "IF" and "MEM" stage
        tries to access the memory at the same time.

Date Hazard--           When some instruction code line is trying to
        access the value of a specific register, which is supposed to be
        written after some previous instruction is executed, BUT not yet
        written at that stage, Data Hazard will occur. Cause the current
        instruction will either read some wrong value or may be
        stalled.
 

Control Hazard--        If there is a jump or branch instruction, but
        the previous instruction is yet to set the value of PC register
        or other appropriate registers, execution may stall and Control
        Hazard occurs.

In general, if hazards are ignored, you get the wrong answers,
rendering your machine useless and untrustworthy.
________________________________________________________________________
 SOLUTION TO PROBLEM 2.3
 

An object of byte S at address A is byte aligned if

        A mod S = 0.

In DLX words should be byte aligned. Memory is usually aligned on a word
or double-word boundary. A misaligned memory access will take more
aligned memory references. So, misalignment will slow the program
execution. And misalignment causes hardware complications.  It may
also cause the system to crash, or for you to get wrong information.
 

________________________________________________________________________
 SOLUTION TO PROBLEM 2.4
We know,
        CPU = CPI * IC * CCT

When in design, clock cycle time is the most difficult to get, since it is very hard to  get the clock cycle time when the design is not completed. Once the computer  is built, it is very easy to get the clock cycle time.  The average CPI is also hard to get because it depends on the detailed processor, it is even harder for modern processor using pipelining and memory hierarchies. The instruction count is the easiest, because at this stage, it is usuallysufficient to focus on static IC. The static IC can be computed using code analysis or simulations.

In an existing system, CCT is easiest, as the clock is there already. CPI can be estimated from running benchmarks, or simply code that exercises each instruction.   The problem in an existing system is to predict the average dynamic code size to get the run-time IC.

________________________________________________________________________
 SOLUTION TO PROBLEM 2.5
 

Lets assume, CPI, IC, and CCT stands for the original (initial) values of CPI, Ins. Count and Clock Cycle Time.

We know,
        CPU TIME (old) = CPI * IC * CCT
 

Now, CPI increased 5%, IC decreased by 20% and CLOCK rate is twice means Clock Cycle time is half of old. Remember CCT is inverse of Clock rate.

So we have,

        CPU TIME (new) = (CPI + .5*CPI) * (IC- .2*IC) * (.5* CCT)

and we know,
                   XEQ (old)         CPI * IC * CCT
        SpeedUp = ----------- = ---------------------------
                   XEQ (new)    (1.05*CPI) * (.8*IC) * (.5*CCT)

Solving the equation yields,
        SpeedUp = 1/.42 = 2.580
 
 

________________________________________________________________________
 SOLUTION TO PROBLEM 3.1
 

Clock Cycle Time (CCT) increased 10%. This also means Clock Rate decreased. Remember Clock Cycle Time is inverse of Clock Rate. When clock rate decreased, Computer speed will decrease. The computer will seem slower than before.

Again, CPU = CPI * IC * CCT  and  CPI = (Clock Cycles/Ins Count) An increase in CCT will increase CPU Time. Increase in Clock Cycle Time will take more time per Clock Cycle now. So, it should take less cycle to run a program, which will decrease CPI.

But after all, the performance may deteriorate because CPU time is increasing.

_______________________________________________________________________
SOLUTION TO PROBLEM 3.2

Still under debate--but there will be no such problem on the final--nothing as bizarre as this.

_______________________________________________________________________

Problem 4

4.1.1 Initiate stack
Assume: 1)4000 is the bottom of the stack address.
        2)5000 is the start of the stack address.
        3)R31 point to the top of

               ADDI R31,R0,#5000       ;R31 <- #5000
LOOP:    ADDI R10,R0,#5000       ;R10 <- #5000
                LW 0(R10), R0           ;PUT 0 INTO STACK
                SUBI R10,R10,#4         ;CHANGE POINTER TO THE NEXT STACK
                SUBI R11,R10,#4000      ;IS IT THE BOTTOM OF THE STACK?
                BNEZ R11, LOOP          ;

4.1.2

ADDI R1,R31,R0          ;Get the pointer to the stack
SUBI R2,R1,#4000        ;If the stack is full?
BENZ R2,STOP            ;

LW R2,0(R30)            ;Get the value to be pushed
SW 0(R1),R2             ;push
SUBI R1,R1,#4           ;changed the pointer
ADD  R31,R1,R0          ;store the pointer value in R31
STOP

4.1.3

Stack is used to store machine state during interrupts and procedure calls--where the machine state includes the contents of registers, and perhaps, for exception handling, portions of the pipeline stage
registers, and other internal data. Procedure calls in DLX jump to the correct subroutine address (using JAL for jump and link). Register R31 holds the return address of the subroutine, which is typically dumped on a stack so that the called routine can have access to all the registers.

4.3

4.3.1

                               1  2  3  4  5    6    7  8  910 11 12 13 14 15 16 17 18 19 20 21 22 23
loop:
LW R2,100(R1)   |  F D X M W
LW R3,500(R1)   |      F D X  M  W
SUBI R5,R3,R2   |          F  S   S   D  X  M  W
ADD R5,R5,R4    |                         F  S   S    D   X   M  W
SW 1000(R1),R5  |                                        F    S    S   D  X   M   W
ADDI R1,R1,#4   |                                                           F   D  X   M  W
SUBI R5,R1,#400 |                                                               F  S   S  D  X  M  W
BNEZ R5,Loop    |                                                                              F  D  X  M  W
-----------------------------------------------------------------------------
LW R2,100(R1)   |                                                                                    F  D  X  M  W
 
1) For one iteration, if ignoring BENZ command, one iteration takes 19 clock
cycles, and execute 7 instructions, so
 
CPI=19/7=2.71
 
2) The total CPI. if ignoring the contribution of the branch, it takes 17
cycle for one iteration if the branch is untaken, all branches are untaken
except the last one. The last iteration will take 20 clock cycle.

Also for first iteration

ADDI R1,R0,R0     F D X M W
LW R2,100(R1)       F S S D X M W

There need an extra 3 clock cycle to start the loop.
So, the total clock cycle= 17*99+20+3=1706
The total instruction number is 1+100*8=801

Thus we got the total average CPI=1706/801=2.13

4.3.2

1) Rewite the code operate on five pairs of array operands during loop body.
 
ADD R1,R0,R0            ;init counter to 0
Loop:
LW R2,100(R1)           ;R1 has offset of first array element
LW R3,500(R1)           ;R3 holds another array element

LW R4,104(R1)           ;get the second element
LW R5,504(R1)

LW R6,108(R1)           ;get the third element
LW R7,508(R1)

LW R8,112(R1)           ;get the fouth element
LW R9,512(R1)
 
LW R10,116(R1)          ;get the fifth element
LW R11,516(R1)

SUBI R12,R3,R2          ;perform substraction
SUBI R13,R5,R4
SUBI R14,R7,R6
SUBI R15,R9,R8
SUBI R16,R11,R10

ADD R12,R12,R4          ;perform addition of constant
ADD R13,R13,R4
ADD R14,R14,R4
ADD R15,R15,R4
ADD R16,R16,R4

SW 1000(R1),R12         ;store the result
SW 1004(R1),R13
SW 1008(R1),R14
SW 1012(R1),R15
SW 1016(R1),R16

ADDI R1,R1,#20          ;increment pointer
SUBI R20,R1,#400        ;check pointer
BENZ R20,LOOP           ;branch while not done
STOP

2) Because in this case there is no data hazard, inside the loop, if not considering
control hard, the average CPI would be 1.
 
The overall execution time is 20*28+1=561 clock time.
ADD R1,R0,R0    |
Loop:           |
LW R2,100(R1)   |F D X M W
LW R3,500(R1)   |  F D X M W
LW R4,104(R1)   |    F D X M W
LW R5,504(R1)   |      F D X M W
LW R6,108(R1)   |        F D X M W
LW R7,508(R1)   |          F D X M W
LW R8,112(R1)   |            F D X M W
LW R9,512(R1)   |              F D X M W
LW R10,116(R1)  |                F D X M W
LW R11,516(R1)  |                  F D X M W
ADDI R1,R1,#20  |                    F D X M W
SUBI R12,R3,R2  |                      F D X M W
SUBI R13,R5,R4  |                        F D X M W
SUBI R14,R7,R6  |                          F D X M W
SUBI R15,R9,R8  |                            F D X M W
SUBI R16,R11,R10|                              F D X M W
SUBI R20,R1,#400|                                 F D X M W
ADD R12,R12,R4  |                                   F D X M W
ADD R13,R13,R4  |                                     F D X M W
ADD R14,R14,R4  |                                       F D X M W
ADD R15,R15,R4  |                                         F D X M W
ADD R16,R16,R4  |                                           F D X M W
SW 980(R1),R12  |                                             F D X M W
SW 984(R1),R13  |                                               F D X M W
SW 988(R1),R14  |                                                 F D X M W
SW 992(R1),R15  |                                                   F D X M W
BNEZ R20,Loop   |                                                     F D X M W
SW 994(R1),R16  |                                                       F D X M W
STOP            |