\newpage \subsection*{Problem 3: Hands ON the Pipe-Like Pipe} \begin{description} \item[3.1] What do I mean by ``pipe-like pipe''? That is, why do I constantly refer to it that way, instead of merely ``the pipeline''? \vspace{2in} \item [3.2] Compute the speedup due to bypassing and forwarding for the extremely strange code fragment below, assuming that the {\em raw\/} 5-stage pipe-like pipe is used. By raw, I mean the naive version that we discussed in class on Tuesday, before we'd ever heard of {\em structural hazards\/}. There is no split instruction and data memory, and register reads and writes conflict before and after bypassing and forwarding are added. {\em Don't panic, see the steps below the code fragment.} \begin{centering} \begin{verbatim} (1) LW R1, 100(R1); (2) LW R2, 200(R1); (3) DADDI R1, R2, #3000 ; (4) DSUB R2, R4, R1; (5) AND R2, R0, R2; (6) SW R2,400(R2); (7) LW R3,100(R0); (8) XOR R3, R3, R3; (9) SW R3, 0(R3); (10) LW R3, 0(R2); (11) SW R3,100(R3); \end{verbatim} \end{centering} {\bf Execution Profile: Struct Hazards w/o Bypassing, Forwarding or Load Interlocks} \begin{verbatim} 1 5 9 13 17 21 25 26 30 34 35 39 (1) F D X M W (2) F S S S D X M W (3) F S S S D X M W (4) F S S S D X M W (5) F S S S D X M W (6) F S S S D X M W (7) F D X M W (8) F S S S D X M W (9) F S S S D X M W (10) F D X M W (11) F S S S D X M W \end{verbatim} {\bf Execution Profile: Struct Hazards with Bypassing, Forwarding and Load Interlocks} Remember, cannot read instruction and data from memory during same clock cycle, nor read and write to registers during the same clock. \begin{verbatim} 1 5 7 10 11 12 16 17 19 21 23 25 (1) F D X M W (2) F D S X M W (3) F S S D S X M W (4) F D X M W (5) F D X M W (6) F S S S D X M W (7) F D X M W (8) F D S X M W (9) F S D X M W (10) F S D X M W (11) F D X M W \end{verbatim} {\bf Interpretation:} \begin{enumerate} \item Compute ${\rm CPI}_{\rm old}$ by tracking execution of the fragment without bypassing and forwarding. {\bf Answer: 39/11} \item Compute ${\rm CPI}_{\rm new}$ by tracking execution of the fragment assuming bypassing and forwarding. {\bf Answer: 25/11} \item Speedup is ${\rm CPI}_{\rm old}$ divided by ${\rm CPI}_{\rm new}$. Why? Make sure you know! {\bf Answer: 39/25} \end{enumerate} \vspace{.5in} \vspace{.5in} \end{description} \newpage \subsection*{Problem 4: Pass ON That Pipe-Like Pipe} \begin{description} \item [4.1] Repeat the last part of problem 3, except this time, assume that split instruction and data memory are used, and that registers are written in the first half of the clock cycle, and read during the second throughout the entire problem. That is, what is the speedup due to bypassing and forwarding when all the structural hazards have been removed? \begin{centering} \begin{verbatim} (1) LW R1, 100(R1); (2) LW R2, 200(R1); (3) DADDI R1, R2, #3000 ; (4) DSUB R2, R4, R1; (5) AND R2, R0, R2; (6) SW R2,400(R2); (7) LW R3,100(R0); (8) XOR R3, R3, R3; (9) SW R3, 0(R3); (10) LW R3, 0(R2); (11) SW R3,100(R3); \end{verbatim} \end{centering} {\bf Execution Profile: No Struct Hazards and No Forwarding, Bypassing or Load Interlocks} \begin{verbatim} 1 5 8 11 14 17 20 21 23 26 27 30 (1) F D X M W (2) F S S D X M W (3) F S S D X M W (4) F S S D X M W (5) F S S D X M W (6) F S S D X M W (7) F D X M W (8) F S S D X M W (9) F S S D X M W (10) F D X M W (11) F S S D X M W \end{verbatim} {\bf CPI = 30/11} {\bf Execution Profile: No Struct Hazards with Forwarding, Bypassing and Load Interlocks} \begin{verbatim} 5 6 7 8 9 18 (1) F D X M W (2) F D S X M W (3) F S D S X M W (4) F S D X M W (5) F D X M W (6) F D X M W (7) F D X M W (8) F D S X M W (9) F S D X M W (10) F D X M W (11) F D S X M W \end{verbatim} {\bf CPI = 18/11} {\bf Speedup = 30/18} \item[4.2] Okay, now, just for grins and giggles, see if you can play optimizing compiler and accomplish whatever it is our friendly code fragment with fewer instructions, if possible, and try to minimize the CPI, essentially repeat 3.2 and 4.1 on your modified code fragment. The point is for you to become comfortable with the way code executes on the pipeline. Why? Because we'll be doing a lot of that in the next two weeks. And, while we are at it, the moral of the story is that no one cares what the actual machine code looks like as long as it has the identical effect as your source would have. That is, it's got to be transparent to the user. \vspace{.3in} \begin{centering} \begin{verbatim} (1) LW R1, 100(R1); (2) LW R2, 200(R1); (3) DADDI R1, R2, #3000 ; (4) DSUB R2, R4, R1; (5) AND R2, R0, R2; (6) SW R2,400(R2); (7) LW R3,100(R0); (8) XOR R3, R3, R3; (9) SW R3, 0(R3); (10) LW R3, 0(R2); (11) SW R3,100(R3); \end{verbatim} \end{centering} {\bf Analysis:} Let's look at this code fragment carefully and see what is really altered after the code has executed. Lines (1) and (2) read some values from memory, that are manipulated in (3), and (4). However, in (5), we zero out R2, and store the result in memory in (6). That's the first modification to memory--effective address 400 gets a zero stored in it. Now, line (7) reads some other value from memory into R3. But, R3 becomes zero in line (8), and that zero is stored in address 0 in line (9). Note that R2 still has a zero in it from instruction (5). So, instruction (10) reads that zero stored at effective address 0, and stores yet another zero at effective address 100. The bottom line? Here's the equivalent code, reorganized to use memory in increasing address order. And, could a smart compiler do this? I have no idea. But, it should certainly give you an idea of how complex-looking code can be simplified. \begin{centering} \begin{verbatim} (1) SW R0, 0(R0); (2) SW R0, 100(R0); (3) SW R0, 400(R0); \end{verbatim} \end{centering} \item[4.3] Bored yet? Well, here's my favorite question of all. There's this formula on page A-13, on your cheat sheet, and in Bill's Pipelining Handout. It says that the speedup from pipelining is \[ \frac{1}{1+ {\rm Pipeline\;\;stall\;\;cycles\;\; per\; instruction}} \times {\rm Pipeline\;\; depth}.\] Why do I {\em just adore\/} this formula? NOT! Go back and look at your problem 3.1 and 4.1 execution profiles (my term for mapping out what happens while executing on the pipe-like pipe) for the unoptimized code fragment. Then, apply this formula. And, then, tell me. I have a real problem applying this one. Why? Can you compute the number of stalls per instruction, independent of instruction order? {\bf Answer:} Essentially, this is one of those formulas that make more sense in the general case, where we assume that some poor soul has actually done some measurements and computations over large code sets, and come up with, say ALU instructions having a certain percentage of stalls on average, and Load instructions havin another percentage, and so on. It's not meant for use with a code fragment--at least, I don't think so. \end{description} \end{document}