\newpage
\subsection*{Problem 3: Hands ON the Pipe-Like Pipe}


\begin{description}

\item[3.1] What do I mean by ``pipe-like pipe''? That is, why do I
      constantly refer to it that way, instead of merely ``the pipeline''?

\vspace{2in}

\item [3.2]  Compute the speedup due to bypassing and forwarding for the
      extremely strange code fragment below, assuming that the
{\em raw\/} 5-stage pipe-like pipe is used. By raw, I mean the naive version
that we discussed in class on Tuesday, before we'd ever heard of {\em structural
hazards\/}. There is no split instruction and data memory, and register
reads and writes conflict before and after bypassing and forwarding are added.

{\em Don't panic, see the steps below the code fragment.}


\begin{centering}
\begin{verbatim}
(1) LW R1, 100(R1);
(2) LW R2, 200(R1);
(3) DADDI R1, R2, #3000 ;
(4) DSUB R2, R4, R1;
(5) AND R2, R0, R2;
(6) SW R2,400(R2); 
(7) LW R3,100(R0);
(8) XOR R3, R3, R3;
(9) SW R3, 0(R3);
(10) LW R3, 0(R2);
(11) SW  R3,100(R3);
\end{verbatim}
\end{centering}


{\bf Execution Profile: Struct Hazards w/o Bypassing, Forwarding or Load Interlocks}
\begin{verbatim}
    
     1       5       9       13      17      21     25 26    30     34  35     39
(1) F D X M W
(2)   F S S S D X M W
(3)           F S S S D X M W
(4)                   F S S S D X M W
(5)                           F S S S D X M W
(6)                                   F S S S D X M W
(7)                                           F D X M W
(8)                                             F S S S D X M W
(9)                                                     F S S S D X M W
(10)                                                            F D X M W
(11)                                                              F S S S D X M W
\end{verbatim}


{\bf Execution Profile: Struct Hazards with  Bypassing, Forwarding and Load Interlocks}

Remember, cannot read instruction and data from memory during same clock cycle,
nor read and write to registers during the same clock.

\begin{verbatim}
    1       5   7     10 11 12   16 17 19 21 23 25

(1) F D X M W
(2)   F D S X M W
(3)     F S S D S X M W
(4)             F D X M W
(5)               F D X M W
(6)                 F S S S D X M W
(7)                         F D X M W
(8)                           F D S X M W
(9)                               F S D X M W
(10)                                  F S D X M W
(11)                                        F D X M W
\end{verbatim}


{\bf Interpretation:}
\begin{enumerate}
 \item  Compute  ${\rm CPI}_{\rm old}$  by tracking execution of the
       fragment without bypassing and forwarding.  {\bf Answer: 39/11}
 \item  Compute  ${\rm CPI}_{\rm new}$  by tracking execution of the
       fragment assuming bypassing and forwarding. {\bf Answer: 25/11}
 \item  Speedup is  ${\rm CPI}_{\rm old}$ divided by ${\rm CPI}_{\rm new}$. Why?
       Make sure you know! {\bf Answer: 39/25}
\end{enumerate}
\vspace{.5in}


\vspace{.5in}

\end{description}
\newpage
\subsection*{Problem 4: Pass ON That Pipe-Like Pipe}
\begin{description}

\item [4.1] Repeat the last part of problem 3, except this time, assume that
split instruction and data memory are used, and that
registers are written in the first half of the clock cycle, and read during
the second  throughout the entire problem. That is, what is the speedup due to
bypassing and forwarding when all the structural hazards have been removed?


\begin{centering}
\begin{verbatim}
(1) LW R1, 100(R1);
(2) LW R2, 200(R1);
(3) DADDI R1, R2, #3000 ;
(4) DSUB R2, R4, R1;
(5) AND R2, R0, R2;
(6) SW R2,400(R2); 
(7) LW R3,100(R0);
(8) XOR R3, R3, R3;
(9) SW R3, 0(R3);
(10) LW R3, 0(R2);
(11) SW  R3,100(R3);
\end{verbatim}
\end{centering}


{\bf Execution Profile: No Struct Hazards and No Forwarding, Bypassing or Load Interlocks}

\begin{verbatim}
    1       5     8    11    14   17   20 21   23  26 27  30
(1) F D X M W
(2)   F S S D X M W
(3)         F S S D X M W
(4)               F S S D X M W
(5)                     F S S D X M W
(6)                           F S S D X M W
(7)                                 F D X M W
(8)                                   F S S D X M W
(9)                                         F S S D X M W
(10)                                              F D X M W
(11)                                                F S S D X M W
\end{verbatim}

{\bf CPI = 30/11}


{\bf Execution Profile: No Struct Hazards with Forwarding, Bypassing and Load Interlocks}

\begin{verbatim}
            5   6   7 8  9             18
(1) F D X M W
(2)   F D S X M W
(3)     F S D S X M W
(4)         F S D X M W
(5)             F D X M W
(6)               F D X M W
(7)                 F D X M W
(8)                   F D S X M W
(9)                     F S D X M W
(10)                        F D X M W
(11)                          F D S X M W
\end{verbatim}

{\bf CPI = 18/11}
{\bf Speedup = 30/18}

\item[4.2] Okay, now, just for grins and giggles, see if you can play
      optimizing compiler and accomplish whatever it is our friendly code
      fragment with fewer instructions, if possible, and try to
     minimize the   CPI, essentially repeat 3.2 and
      4.1 on your modified code fragment. The point is for  you to become
      comfortable with the way code executes on the pipeline. Why? Because
      we'll be doing a lot of that in the next two weeks.
   
      And, while we are at it, the moral of the story is that no one cares
      what the actual machine code looks like as long as it has the
      identical effect as your source would have. That is, it's got to be
      transparent to the user.


\vspace{.3in}

\begin{centering}
\begin{verbatim}
(1) LW R1, 100(R1);
(2) LW R2, 200(R1);
(3) DADDI R1, R2, #3000 ;
(4) DSUB R2, R4, R1;
(5) AND R2, R0, R2;
(6) SW R2,400(R2); 
(7) LW R3,100(R0);
(8) XOR R3, R3, R3;
(9) SW R3, 0(R3);
(10) LW R3, 0(R2);
(11) SW  R3,100(R3);
\end{verbatim}
\end{centering}

{\bf Analysis:} Let's look at this code fragment carefully and see what
is really altered after the code has executed.

Lines (1) and (2) read some values from memory, that are manipulated in
(3), and (4). However, in (5), we zero out R2, and store the result in
memory in (6). That's the first modification to memory--effective address
400 gets a zero stored in it.

Now, line (7) reads some other value from memory  into R3. But, R3 becomes
zero in line (8), and that zero is stored in address 0 in line (9). Note that
R2 still has a zero in it from instruction (5). So, instruction (10) reads
that zero stored at effective address 0, and stores yet another zero at
effective address 100.

The bottom line? Here's the equivalent code, reorganized to use memory in
increasing address order.
And, could a smart compiler do this? I have no idea. But, it should
certainly give you an idea of how complex-looking code can be simplified.

\begin{centering}
\begin{verbatim}
(1) SW R0, 0(R0);
(2) SW R0, 100(R0);
(3) SW R0, 400(R0);
\end{verbatim}
\end{centering}


\item[4.3] Bored yet? Well, here's my favorite question of all. There's
      this formula on page A-13, on your cheat sheet, and in Bill's
      Pipelining Handout.

It says that the speedup from pipelining is


\[ \frac{1}{1+ {\rm Pipeline\;\;stall\;\;cycles\;\; per\; instruction}} \times 
{\rm Pipeline\;\; depth}.\]


Why do I {\em just adore\/}  this formula? NOT!
Go back and look at your problem 3.1 and 4.1 execution profiles (my
term for mapping out what happens while executing on the pipe-like
pipe) for the unoptimized code fragment.  Then, apply this formula. And,
then, tell me. I have a real problem applying this one. Why? Can you
compute the number of stalls per instruction, independent of instruction order?

{\bf Answer:} Essentially, this is one of those formulas that make more
sense in the general case, where we assume that some poor soul has actually
done some measurements and computations over large code sets, and come up
with, say ALU instructions having a certain percentage of stalls on
average, and Load instructions havin another percentage, and so on.
 It's not meant for use with a code fragment--at least, I don't think so.


\end{description}


\end{document}