What we already know about pipelining

- We need to avoid structural hazards, data hazards and control hazards in order to get optimal performance from the pipeline.
  - Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
- Accomplish this by techniques such as:
  - instruction reordering
  - hardware modifications to detect branches earlier
  - compiler approaches to reduce branch delays
- Each technique reduces one or more of the stall components of the Pipeline CPI

What's new in Units 4 & 5

- Some instructions can be executed independently of others. In fact, they could be executed in parallel if there was the hardware to do it.
- Idea is to take advantage of this instruction level parallelism to do major rearrangements to how compilers generate code (Unit 5) and how the MIPS (and other modern processors) pipeline executes code (Unit 4)

New techniques

- Hardware:
  - dynamic pipeline scheduling
    - with scoreboarding
    - with register renaming (Tomasulo)
  - dynamic branch prediction
  - issuing multiple instructions per cycle
  - speculation
  - dynamic memory disambiguation
- Software (compiler):
  - loop unrolling
  - compiler dependence analysis
  - software pipelining and trace scheduling
  - compiler speculation

Data dependences (background)

- If 2 instructions are parallel, can execute simultaneously in pipeline without stalls – assuming no structural hazards
- If 2 instructions are dependent, must be executed in order, but may sometimes be partially overlapped

Data dependences (cont.)

- Instruction $j$ is data dependent on instruction $i$ if either:
  - instruction $i$ produces a result that may be used by instruction $j$, or
  - instruction $j$ is dependent on instruction $k$, and instruction $k$ is data dependent on instruction $l$ (transitivity)
- Dependence implies a chain of one or more data hazards between the 2 instructions – causing a pipelined processor to stall
- Dependences are properties of programs – and whether one causes a stall depends on the properties of the pipeline organization
Data dependences (cont.)

- A dependence
  - indicates the possibility of a hazard
  - determines the order results must be produced
  - sets an upper bound on available parallelism
- Ways to overcome dependences
  - maintain the dependence, but avoid the hazard
    (schedule the code – hardware or software)
  - eliminate the dependence – by transforming the code (software)

Name dependences

- When 2 instructions use the same register or memory location (harder to detect), called a *name*, but no flow of data between them
- 2 types
  - antidependence between instruction *i* and instruction *j* when *j* writes a register or memory location that *i* reads
  - output dependence occurs when *i* and *j* write the same register or memory location
- In both cases, the instructions can execute at the same time, or be reordered, if the *name* is changed so the instructions don’t conflict
  - statically by compiler or dynamically by hardware (usually only for registers – why?)

Control dependences

- To determine the ordering of an instruction, *j*, with respect to a branch instruction so that it is executed in correct program order, and only when it should be
  - e.g., the statements in *then* part of an *if* statement are control dependent on the branch

Control dependences (cont.)

- 2 constraints from control dependences
  - an instruction control dependent on a branch cannot be moved *before* the branch so that its execution is no longer controlled by the branch
  - an instruction not control dependent on a branch cannot be moved *after* the branch so that its execution is controlled by the branch
- But, can violate control dependences if don’t affect the correctness of the program
  - by preserving exception behavior and data flow
  - implemented by control hazard detection that causes control stalls, which can be reduced by various techniques (e.g., delayed branch)
Administrivia

• Quiz returned today
  – answers posted on web page
  – Average: 56
  – Median: 55  25%: 43  75%: 73
• HW #3 due today – posted tomorrow
• HW #4 posted tomorrow – on Ch. 3
• No quiz Thursday – moved back to March 13

Last time

• MIPS R4000 pipeline
  – 8 stages (3 extra from data/instruction memory access)
  – 2 cycle load delay, 3 cycle branch delay (1 delay slot)
  – lots of forwarding possibilities
• What limits instruction-level parallelism?
  – true data dependences:
    DADD R3, R1, R2
    SD R3, 0(R4)
  – antidependences (name dependences)
    DADD R3, R1, R2
    LD R1, 0(R4)

Last time (cont.)

• What limits instruction-level parallelism?
  – output dependences (name dependences too)
    SD R1, 0(R4)
    SD R2, 0(R4)
  – control dependences
    if (x) then y else x
    • need to preserve exception behavior and data flow

Dynamic pipeline scheduling

• A hardware technique that can do a good job of scheduling, since it has perfect information about hazards
• Basic idea is to start each instruction as early as possible
  – Analogy: waiting for a table at a restaurant
• Means have to deal with instructions that complete out-of-order
• There is a lot to keep track of, and two main schemes to do the bookkeeping
  – scoreboard method
  – Tomasulo’s method

Scoreboard method

• Origins: CDC 6600, late 1960s
• Scoreboard controls the progress of each instruction to ensure that hazards such as RAW, etc. are avoided
• ID basic pipeline stage split into 2 stages
  – Issue – decode instructions, check for structural hazards
  – Read operands – wait until no data hazards, then read operands
• Distinguish when instruction begins execution and when complete execution – in between it is in execution
• All instructions pass through issue stage in order, but can stall or bypass each other in second stage
  – so can get WAR hazards when instructions execute out of order
  – to have multiple instructions in EX stage simultaneously, use multiple functional units, or pipelined units, or both – equivalent for purposes of pipeline control

Functional units

• Processor has various functional units, perhaps
  – 1 ALU,
  – 2 multiply units,
  – 1 floating point adder,
  – and 1 divide unit
• Idea is to try to keep all of them busy
Scoreboard (cont.)

• One segment of the scoreboard keeps track of which pipe stage each instruction is in – Fig. A.52.

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Issue</th>
<th>Read operands</th>
<th>Execution complete</th>
<th>Write result</th>
</tr>
</thead>
<tbody>
<tr>
<td>L.D F6,34(R2)</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>L.D F2, 45(R3)</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>MUL.D F0,F2,F4</td>
<td>x</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>SUB.D F8,F6,F2</td>
<td>x</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>DIV.D F10,F0,F6</td>
<td>x</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADD.D F6,F8,F2</td>
<td>x</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Functional units

• One segment of the scoreboard keeps track of, for each functional unit:
  – Busy (yes or no)
  – Op: operation to perform
  – Fi: name of destination register
  – Fj, Fk: names of source registers
  – Qj, Qk: functional units producing the sources
  – Rj, Rk: ready flags for Fj and Fk:
    • "yes" if the source is ready and still need it
    • "no" if the source is not ready, or if no longer need it

Scoreboard for functional units

<table>
<thead>
<tr>
<th>Name</th>
<th>Busy</th>
<th>Op</th>
<th>Fi</th>
<th>Fj</th>
<th>Fk</th>
<th>Qj</th>
<th>Qk</th>
<th>Rj</th>
<th>Rk</th>
</tr>
</thead>
<tbody>
<tr>
<td>Integer</td>
<td>Yes</td>
<td>Load</td>
<td>F2</td>
<td>R3</td>
<td></td>
<td></td>
<td></td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>Mult1</td>
<td>Yes</td>
<td>Mult</td>
<td>F0</td>
<td>F2</td>
<td>F4</td>
<td>Integer</td>
<td>No</td>
<td>Yes</td>
<td></td>
</tr>
<tr>
<td>Mult2</td>
<td>No</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add</td>
<td>Yes</td>
<td>Sub</td>
<td>F8</td>
<td>F6</td>
<td>F2</td>
<td>Integer</td>
<td>Yes</td>
<td>No</td>
<td></td>
</tr>
<tr>
<td>Divide</td>
<td>Yes</td>
<td>Div</td>
<td>F10</td>
<td>F0</td>
<td>F6</td>
<td>Mult1</td>
<td>No</td>
<td>Yes</td>
<td></td>
</tr>
</tbody>
</table>

Register status

• One segment of the scoreboard keeps track of which registers are expecting outputs from which functional units

<table>
<thead>
<tr>
<th>FU</th>
<th>Mult1</th>
<th>Integer</th>
<th>Add</th>
<th>Divide</th>
</tr>
</thead>
<tbody>
<tr>
<td>F0</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>F1</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>F2</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>F3</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
</tbody>
</table>

• Next, an example of the scoreboard method

The bookkeeping

Each functional unit keeps track of:

<table>
<thead>
<tr>
<th>Unit</th>
<th>Opm</th>
<th>Fj</th>
<th>Fk</th>
<th>Qj</th>
<th>Qk</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Op</td>
<td>F1</td>
<td>F2</td>
<td>F3</td>
<td>F4</td>
</tr>
<tr>
<td></td>
<td></td>
<td>F5</td>
<td>F6</td>
<td>F7</td>
<td>F8</td>
</tr>
<tr>
<td></td>
<td></td>
<td>F9</td>
<td>F10</td>
<td>F11</td>
<td>F12</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>F13</td>
</tr>
</tbody>
</table>

And we’ll assume these initial register contents and flags:

<table>
<thead>
<tr>
<th>F0</th>
<th>F2</th>
<th>F4</th>
<th>F6</th>
<th>F8</th>
<th>F10</th>
<th>F12</th>
<th>F14</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>3</td>
<td>4</td>
<td>6</td>
<td>8</td>
<td>10</td>
<td>12</td>
<td>14</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

The bookkeeping

<table>
<thead>
<tr>
<th>Mult1</th>
<th>x</th>
<th>x</th>
<th>x</th>
<th>x</th>
<th>x</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>x</td>
<td>x</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add1</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td></td>
<td>x</td>
<td>x</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add2</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>Mult2</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
</tbody>
</table>

And we’ll assume these initial register contents and flags:

<table>
<thead>
<tr>
<th>F0</th>
<th>F2</th>
<th>F4</th>
<th>F6</th>
<th>F8</th>
<th>F10</th>
<th>F12</th>
<th>F14</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>3</td>
<td>4</td>
<td>6</td>
<td>8</td>
<td>10</td>
<td>12</td>
<td>14</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
Time 0

S1: ADD.D F0, F2, F4  Assume 2 cycle add

Time 1

S2: MULT.D F2, F6, F8  Assume 4 cycle mult

Time 2

S3: MULT.D F10, F0, F2

Time 3

S4: ADD.D F0, F8, F6

Time 4

S5: ADD.D F4, F6, F6

Time 5

S6: ADD.D F2, F2, F8  stalled

NOT STALLED! F0's contents have already been passed to all units that need it!

To m2, but F0 not really nec.
### Computer Systems Architecture

**CMSC 411**  
**Unit 4 – Instruction-level parallelism**

Alan Sussman  
March 6, 2003

### Last time

- **Dynamic pipeline scheduling**  
  - start each instruction as early as possible  
  - so have to deal with out-of-order completion  
  - main problem is bookkeeping  
  - scoreboard  
  - Tomasulo algorithm

- **Scoreboard**  
  - a centralized mechanism to avoid hazards  
  - split ID pipe stage into issue and read operands  
  - instructions issue in order, can stall/bypass in read operands stage  
  - use multiple functional units for examples

### Administrivia

- HW #3 solutions posted  
- HW #4 posted – on Ch. 3 and scoreboard  
  - 6 problems, including 1 from Appendix A on scoreboard (this is a long chapter)  
- Quiz next Thursday, March 13  
  - on Unit 3 (basic pipelining), and a little on scoreboard

### Last time (cont.)

- **Scoreboard (cont.)**  
  - 1 part for each issued instruction  
    - what pipe stages it has passed through  
  - 1 part for each functional unit (FU)  
    - what it’s doing, and where inputs and outputs come from (reg. or functional unit) and go to (reg.), plus some flags  
  - 1 part for register status  
    - which ones are expecting outputs from which FU
**Limits to the Scoreboard method**

- Number and types of functional units
  - for structural hazards
- Number of entries in the scoreboard table (how many instructions can be issued simultaneously)
  - determines how far ahead pipeline can look for independent instructions (the window)
  - in our discussions, never goes beyond a branch
- Output dependences and antidependences
  - that lead to WAW and WAR stalls
- Parallelism available in the program
  - to find independent instructions to execute

**Tomasulo’s dynamic scheduling method**

- Origins: IBM 360/91, 3 years after CDC 6600
- Scoreboard controls the progress of each instruction and makes sure that hazards such as RAW, etc. are avoided
- In Tomasulo's method, these functions are decentralized in reservation stations

**Tomasulo’s method**

- Main differences with scoreboard method:
  - decentralized execution control
  - decentralized hazard detection
  - register renaming by having operands passed to reservation stations
    - eliminates WAW and WAR hazards

**Instruction status**

- Only 3 kinds of pipeline stages, not 4:
  - Issue: send instruction to an empty reservation station, in FIFO order, along with any register contents it needs
    - also renames registers, removing WAR and WAW hazards
  - Execute: wait for all operands, check for RAW hazards, then execute the instruction
    - if multiple instructions ready in same cycle for same functional unit, for f.p. units choose one arbitrarily
    - for loads/stores, compute effective address first, execute loads from load buffer as soon as memory unit available, for stores wait for value to be stored before send to memory unit
  - Write result: put result on the common data bus (CDB) to get it back to a register and to any other reservation stations that need it
    - stores write to memory here too

**Instruction status (cont.)**

- Note differences from scoreboard:
  - no check for WAW and WAR hazards, since renaming prevents them
  - CDB broadcasts results, so don't have to wait for registers to be filled
  - Load and Store are also treated as functional units

**Advantages over scoreboard**

- Distribution of hazard detection logic
  - from distributed reservation stations and CDB
  - can release multiple instructions at once from single result (if other operands available)
- Eliminates stalls for WAW and WAR hazards
  - from renaming registers using reservation stations
  - and from storing operands into reservation station as soon as they are available
Bookkeeping for Tomasulo method

- Each reservation station keeps track of
  - Busy flag
  - Op: the operation to be performed
  - Qj, Qk: sources of operands
    - already present
    - or to come from another reservation station
  - Vj, Vk: values of the operands
  - A: the memory address info for a load or store (eventually the effective address)

Bookkeeping (cont.)

- Register hardware keeps track of which reservation station will fill each register
  - Qi – number of reservation station containing the operations whose result will be stored into the register
- The load and store buffers have
  - busy flag
  - value to be stored
  - A: effective address

MIPS FP with Tomasulo’s algorithm

![MIPS FP with Tomasulo’s algorithm](image)

Instructions (Fig. 3.3)

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Issue</th>
<th>Execute</th>
<th>Write Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>LD F6,34(R2)</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
<tr>
<td>LD F2, 45(R3)</td>
<td>x</td>
<td>x</td>
<td></td>
</tr>
<tr>
<td>MUL D F0,F2,F4</td>
<td>x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>SUB D F8,F2,F6</td>
<td>x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>DIV D F10,F0,F6</td>
<td>x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>ADD D F6,F8,F2</td>
<td>x</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Reservation Stations (Fig. 3.3)

<table>
<thead>
<tr>
<th>Name</th>
<th>Busy</th>
<th>Op</th>
<th>Vj</th>
<th>Vk</th>
<th>Qj</th>
<th>Qk</th>
<th>A</th>
</tr>
</thead>
<tbody>
<tr>
<td>Load1</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Load2</td>
<td>yes</td>
<td>Load</td>
<td></td>
<td></td>
<td>45+R[R3]</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add1</td>
<td>yes</td>
<td>SUB</td>
<td>Mem[34+R[R2]]</td>
<td>Load2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add2</td>
<td>yes</td>
<td>ADD</td>
<td>Add1</td>
<td>Load2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add3</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mult1</td>
<td>yes</td>
<td>MUL</td>
<td>R[F4]</td>
<td>Load2</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mult2</td>
<td>yes</td>
<td>DIV</td>
<td>Mem[34+R[R2]]</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Register Status (Fig. 3.3)

<table>
<thead>
<tr>
<th>Field</th>
<th>F0</th>
<th>F2</th>
<th>F4</th>
<th>F6</th>
<th>F8</th>
<th>F10</th>
<th>F12</th>
<th>…</th>
<th>F30</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qi</td>
<td>Mult1</td>
<td>Load2</td>
<td>Add2</td>
<td>Add1</td>
<td>Mult2</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Figure 3.2

Figure 3.3
An example of the Tomasulo method

The bookkeeping
Each functional unit keeps track of:
- Unit
- Op
- Vj
- Qj
- Qk

For our convenience, we’ll also indicate:
- Inst
- Stage

And we’ll assume these initial register contents and flags:

<table>
<thead>
<tr>
<th></th>
<th>F0</th>
<th>F2</th>
<th>F4</th>
<th>F6</th>
<th>F8</th>
<th>F10</th>
<th>F12</th>
<th>F14</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>0</td>
<td>3</td>
<td>4</td>
<td>6</td>
<td>8</td>
<td>10</td>
<td>12</td>
<td>14</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>

The bookkeeping

Time 0

S1: ADD D F0, F2, F4

Assume 2 cycle add

Time 1

S2: MULT D F2, F6, F8

Assume 4 cycle multi.

Time 2

S3: MULT D F10, F0, F2
NOT STALLED - F0’s contents already passed to all units that need it!

To m2, but F0 not really nec.
Example (Fig. 3.6)

- Assume all instructions issued in two successive iterations of loop, but none of f.p. loads/stores or operations has completed
  - don’t show integer ALU op, assume branch predicted as taken
- Once system reaches this state, 2 copies of loop could be sustained, with a CPI close to 1.0
  - if multiplies take less than 4 cycles to complete

Instruction status (Fig. 3.6)

<table>
<thead>
<tr>
<th>Instruction</th>
<th>From iteration</th>
<th>Issue</th>
<th>Execute</th>
<th>Write Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>L.D F0,(R1)</td>
<td>1</td>
<td>x</td>
<td>x</td>
<td></td>
</tr>
<tr>
<td>MUL D F4,F0,F2</td>
<td>1</td>
<td></td>
<td>x</td>
<td></td>
</tr>
<tr>
<td>S.D F4,(R1)</td>
<td>1</td>
<td>x</td>
<td></td>
<td></td>
</tr>
<tr>
<td>L.D F0,(R1)</td>
<td>2</td>
<td>x</td>
<td>x</td>
<td></td>
</tr>
<tr>
<td>MUL D F4,F0,F2</td>
<td>2</td>
<td></td>
<td>x</td>
<td></td>
</tr>
<tr>
<td>S.D F4,(R1)</td>
<td>2</td>
<td></td>
<td>x</td>
<td></td>
</tr>
</tbody>
</table>

Reservation stations (Fig. 3.6)

<table>
<thead>
<tr>
<th>Name</th>
<th>Busy</th>
<th>Op</th>
<th>Vj</th>
<th>Vk</th>
<th>Qj</th>
<th>Qk</th>
<th>A</th>
</tr>
</thead>
<tbody>
<tr>
<td>Load1</td>
<td>yes</td>
<td>Load</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>R(R1)=0</td>
</tr>
<tr>
<td>Load2</td>
<td>yes</td>
<td>Load</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>R(R1)=8</td>
</tr>
<tr>
<td>Add1</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add2</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add3</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mul1</td>
<td>yes</td>
<td>Mul</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>R[F2] Load1</td>
</tr>
<tr>
<td>Mul2</td>
<td>yes</td>
<td>Mul</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>R[F2] Load2</td>
</tr>
<tr>
<td>Store1</td>
<td>yes</td>
<td>Store</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>R[R1] Mult1</td>
</tr>
<tr>
<td>Store2</td>
<td>yes</td>
<td>Store</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>R[R1]=8 Mult2</td>
</tr>
</tbody>
</table>

Register status (Fig. 3.6)

<table>
<thead>
<tr>
<th>Field</th>
<th>F0</th>
<th>F2</th>
<th>F4</th>
<th>F6</th>
<th>F8</th>
<th>F10</th>
<th>F12</th>
<th>...</th>
<th>F30</th>
</tr>
</thead>
<tbody>
<tr>
<td>Qi</td>
<td>Load2</td>
<td>Mult2</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

More on Loads/Stores

- Can be done in different order, if access different addresses
- If load and store access same address, either
  - load is before store in program order, so after interchange get WAR hazard
  - store is before load in program order, so after interchange get RAW hazard
- Interchanging 2 stores to same address results in WAW hazard

Loads / Stores (cont.)

- To decide whether a load can be executed at a given time, check if any uncompleted store (that precedes the load in program order) shares same memory address
  - similarly, for stores, wait until no earlier loads or stores share same address
- To find the address conflicts
  - for a load, after effective address computed, check against A field of all active store buffers
    - if a match, send to load buffer after conflicting store completes
  - for a store, check for conflicts in both load and store buffers
Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level parallelism

Alan Sussman
March 11, 2003

Administrivia

• Quiz Thursday
  – on Unit 3 (basic pipelining), and maybe a short question on scoreboard
• HW #4 due next Thursday, March 20
• Office hours?

Last time

• Tomasulo dynamic scheduling
  – reservation stations for decentralized control, hazard detection
  – register renaming (use CPU internal registers) to eliminate WAW and WAR hazards
  – 3 pipe stages – issue, execute, write result to CDB
  – load and store buffers
  – a loop-based example with loop unrolling showed power of the algorithm
    • can keep functional units busy with operations from multiple loop iterations

Last time (cont.)

• Loads/Stores
  – can reorder with Tomasulo’s algorithm, subject to constraints
  – must maintain program order if accesses are to same memory location, to avoid hazards
  – use load/store buffers to either decide whether to wait or to forward values
    • check the buffer to see if there is a conflict with an outstanding load or store

Dynamic branch prediction

• Dynamic scheduling techniques (scoreboard and Tomasulo algorithm) reduce data hazards

• Now: ways to reduce branch costs from control hazards

Branch-prediction buffers

• Simplest dynamic branch-prediction scheme
  – also called branch history table
• Records how frequently a branch has been taken in the past
• What is the buffer?
  – a small memory/cache area
• If there are more branches than locations in the buffer:
  – then some locations are used for more than one branch instruction, which can reduce the accuracy of predictions
• How to find the right place in the buffer:
  – from the low order bits of the address of the branch instruction
Branch-prediction buffers (cont.)

• When to access the buffer:
  – fetch the contents during the ID pipeline cycle
  – update the contents after the branch is resolved

• What is stored in the buffer:
  – perhaps one bit per branch, indicating whether the branch was last taken or not
    • predict taken if it was taken last time.
    • otherwise predict not taken

  Example: If the branch is for an inner loop, the prediction will be wrong on the 1st and last iterations, each time the loop is executed.

Branch-prediction buffers (cont.)

• What is stored in the buffer (cont.):
  – perhaps a counter for each branch. If have 3 bits, for example, start the count at 4
    • add one every time the branch is taken
    • subtract one every time the branch is not taken
    • predict taken if the count is 4 or more
    • predict not taken if the count is 3 or less.
  – perhaps a correlating predictor that counts for this branch and for 1 or more earlier ones

How good are they? – Fig. 3.8

Correlating branch predictors

• To improve prediction accuracy, look at behavior of recent other branches than the one trying to predict
• Take advantage of correlation between behavior of different branches
  – also called two-level predictor

Simple example:

```mips
if (d == 0)
  d = 1;
if (d == 1)
  ...
```

MIPS code, with d in R1:

```mips
BNEZ R1,L1  ; branch b1
DADDIU R1,R0,#1
L1: DADDIU R3,R1,#-1
BNEZ R3,L2  ; branch b2
...
L2:
```

Example (cont.) – Fig. 3.10

<table>
<thead>
<tr>
<th>Initial value of d</th>
<th>d==0?</th>
<th>b1 Value of d before b2</th>
<th>d==1?</th>
<th>b2</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>yes</td>
<td>not taken</td>
<td>1</td>
<td>yes not taken</td>
</tr>
<tr>
<td>1</td>
<td>no</td>
<td>taken</td>
<td>1</td>
<td>yes not taken</td>
</tr>
<tr>
<td>2</td>
<td>no</td>
<td>taken</td>
<td>2</td>
<td>no not taken</td>
</tr>
</tbody>
</table>

b1 not taken → b2 not taken

Example – 1 bit predictor

Predictor initialized to not taken – Fig. 3.11

<table>
<thead>
<tr>
<th>d==?</th>
<th>b1 prediction</th>
<th>b1 action</th>
<th>New b1 prediction</th>
<th>b2 prediction</th>
<th>b2 action</th>
<th>New b2 prediction</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>NT</td>
<td>T</td>
<td>T</td>
<td>NT</td>
<td>T</td>
<td>T</td>
</tr>
<tr>
<td>0</td>
<td>T</td>
<td>NT</td>
<td>NT</td>
<td>T</td>
<td>NT</td>
<td>NT</td>
</tr>
<tr>
<td>2</td>
<td>NT</td>
<td>T</td>
<td>NT</td>
<td>T</td>
<td>T</td>
<td>T</td>
</tr>
<tr>
<td>0</td>
<td>T</td>
<td>NT</td>
<td>NT</td>
<td>T</td>
<td>NT</td>
<td>NT</td>
</tr>
</tbody>
</table>

All branches mispredicted!
1-bit prediction with 1 bit correlation

Initialized to not taken/not taken – Fig. 3.13

<table>
<thead>
<tr>
<th>d=?</th>
<th>b1 prediction</th>
<th>b1 action</th>
<th>New b1 prediction</th>
<th>b2 prediction</th>
<th>b2 action</th>
<th>New b2 prediction</th>
</tr>
</thead>
<tbody>
<tr>
<td>2</td>
<td>NT/NT</td>
<td>T</td>
<td>T/NT</td>
<td>NT/NT</td>
<td>T</td>
<td>NT/T</td>
</tr>
<tr>
<td>0</td>
<td>NT/NT</td>
<td>NT</td>
<td>NT/NT</td>
<td>NT/T</td>
<td>NT</td>
<td>NT/T</td>
</tr>
<tr>
<td>2</td>
<td>T/NT</td>
<td>NT</td>
<td>NT/NT</td>
<td>NT/T</td>
<td>T</td>
<td>NT/T</td>
</tr>
<tr>
<td>0</td>
<td>T/NT</td>
<td>NT</td>
<td>NT/NT</td>
<td>NT/T</td>
<td>NT</td>
<td>NT/T</td>
</tr>
</tbody>
</table>

Only misprediction is on first iteration/row

Correlating predictors (cont.)

- Predictor on last slide is a (1,1) predictor
  - uses behavior of last branch to choose from among a pair of 1-bit branch predictors
- In general, an (m,n) predictor uses behavior of last m branches to choose from 2^m branch predictors
  - each is an n-bit predictor for a single branch
- simple hardware to do this – can keep global history of most recent m branches in an m-bit shift register, and each bit records whether branch taken/not taken
  - and index branch prediction buffer using low-order bits of branch address with m-bit global history

Comparing predictors

- Correlating vs. standard 2-bit scheme
  - using same number of state bits
- Number of state bits for (m,n) predictor is:
  - 2^m x n x # prediction entries to select from with the branch address
- 2-bit predictor with no global history is a (0,2) predictor
  - (0,2) predictor from Fig. 3.8 had 4K entries selected by branch address
    - total number of bits is 2^2 x 4K = 8K bits
- (2,2) predictor from Fig. 3.14 has 64 entries, with 4 entries per branch address
  - total number of bits is 2^2 x 16 = 128 bits

Comparing 2-bit predictors

Fig. 3.15

Combining local and global predictors

- Tournament predictors use multiple predictors, usually one global and one local, combined with a selector
  - a form of multilevel branch prediction
  - selector is often just a 2-bit saturating counter per branch to choose from 2 predictors that is used with a finite state machine (see Fig. 3.16)
    - increment counter when “predicted” predictor is correct and other predictor is incorrect, decrement in opposite situation

Misprediction rates

Fig. 3.17

on SPEC89 benchmarks
Branch target buffers

- Another branch penalty reduction method
- BTB stores:
  - branch prediction information
  - predicted location of the instruction following the branch
- What is the buffer:
  - a small memory area, accessed with the address of the PC of the instruction fetched
- If there are more branches than locations in the buffer:
  - then some locations are used for more than one branch instruction, limiting the amount of information stored

Branch target buffers (cont.)

- When to access the buffer:
  - fetch the contents during the IF pipeline cycle
  - update the contents after the branch is resolved
- How to find the right entry in the buffer:
  - look for an entry that matches the PC
- What is stored in the buffer:
  - addresses of branch instructions
  - predicted target of each branch
  - branch prediction information
- Allows branch folding for unconditional branches (jumps)
  - zero cycle instruction!
  - by having BTB replace jump instruction in pipeline with branch target instruction returned from BTB

Branch target buffer – Fig. 3.19

BTB algorithm – Fig. 3.20

BTB penalties – Fig. 3.21

<table>
<thead>
<tr>
<th>Instruction in buffer</th>
<th>Prediction</th>
<th>Actual branch</th>
<th>Penalty cycles</th>
</tr>
</thead>
<tbody>
<tr>
<td>yes</td>
<td>taken</td>
<td>taken</td>
<td>0</td>
</tr>
<tr>
<td>yes</td>
<td>taken</td>
<td>not taken</td>
<td>2</td>
</tr>
<tr>
<td>no</td>
<td>taken</td>
<td></td>
<td>2</td>
</tr>
<tr>
<td>no</td>
<td>not taken</td>
<td></td>
<td>0</td>
</tr>
</tbody>
</table>

Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level parallelism

Alan Sussman
March 13, 2003
Administrivia

• Quiz today
  – questions?
• HW #4 due next Thursday, March 20

Last time

• Branch prediction buffer – record how frequently each branch taken in the past
  – if one bit, predict taken if taken last time
  – if m-bit counter, use counter value (records last 2m branches) for prediction
  – correlating predictor counts not just this branch, but earlier executed ones, whatever they were – two-level predictors
  – (m,n) correlating predictor uses behavior of last m branches to choose from 2m predictors, each an n-bit predictor for 1 branch

Last time

• Branch target buffer
  – store prediction information
  – and predicted location of next instruction
  – match on address of branch instruction
  • if not there, either instruction is predicted not a branch or a not-taken branch, so fetch next instruction in sequence

Issuing multiple instructions at one time

• Scoreboard and Tomasulo’s method reduce stalls from data hazards
• Branch buffers reduce stalls from control hazards
• Now talk about getting more parallelism by issuing several instructions at once
• Two currently used methods:
  – superscalar processors
  – VLIW (very long instruction word) processors

5 approaches – Fig. 3.23

<table>
<thead>
<tr>
<th>Common name</th>
<th>Issue structure</th>
<th>Hazard detection</th>
<th>Scheduling</th>
<th>Interesting feature</th>
<th>Examples</th>
</tr>
</thead>
<tbody>
<tr>
<td>Superscalar (static)</td>
<td>dynamic</td>
<td>hardware</td>
<td>static</td>
<td>in-order execution</td>
<td>Sun UltraSparc II/III</td>
</tr>
<tr>
<td>Superscalar (dynamic)</td>
<td>dynamic</td>
<td>hardware</td>
<td>dynamic</td>
<td>some o-o-o execution</td>
<td>IBM Power2</td>
</tr>
<tr>
<td>Superscalar (speculative)</td>
<td>dynamic</td>
<td>hardware</td>
<td>dynamic, with spec.</td>
<td>o-o-o, with speculation</td>
<td>PIII/4, R10K, Alpha 21264</td>
</tr>
<tr>
<td>VLIW/LIW</td>
<td>static</td>
<td>software</td>
<td>static</td>
<td>no hazards bet. packets</td>
<td>Trimedia, i860</td>
</tr>
<tr>
<td>EPIC</td>
<td>static</td>
<td>mostly software</td>
<td>mostly static</td>
<td>explicit dep. by compiler</td>
<td>Itanium</td>
</tr>
</tbody>
</table>

Superscalar MIPS

• Idea: issue several non-interfering instructions at each clock cycle
• The hardware has the burden of detecting interfering instructions
• Can be statically scheduled (using compiler techniques), or dynamically scheduled (using scoreboard or Tomasulo’s algorithm)
• For example, look at a superscalar MIPS processor that can issue two instructions at once:
  – one floating point instruction
  – one non-floating point (integer or branch) instruction
Superscalar MIPS (cont.)

- 3 steps in instruction fetch and issue:
  - fetch 2 instructions from cache
  - determine whether 0, 1, or 2 can issue
    - look at hazards from earlier instructions issued, and from within the 2 fetched instructions
    - since 1 integer and 1 FP instruction can issue, mostly just look at opcodes for hazards within fetched pair, unless integer instruction is FP load/store/move – possible RAW hazard
    - also need another read/write port on FP register file, to allow loads/stores to issue with FP operations, and (lots) more bypass paths
  - issue the instructions to the correct functional unit

Example

\( x \) is an array of 1000 double precision numbers, indexed from 1 to 1000
for \( i = 1000, 999, \ldots, 1 \)
\( x[i] = x[i] + s; \)

Unroll the loop – execute 4 iterations before branch (for now don’t worry about loop test)

Loop:

- LD \( F0, 0(R1) \)
- ADD.D \( F4, F0, F2 \)
- S.D \( F4, 0(R1) \)
- LD \( F6, -8(R1) \)
- ADD.D \( F8, F6, F2 \)
- S.D \( F8, -8(R1) \)
- LD \( F10, -16(R1) \)
- ADD.D \( F12, F10, F2 \)
- S.D \( F12, -16(R1) \)
- LD \( F14, -24(R1) \)
- ADD.D \( F16, F14, F2 \)
- S.D \( F16, -24(R1) \)
- DSUBI \( R1, R1, #32 \)
- BNEZ \( R1, Loop \)

Example (cont.)

On single issue MIPS pipeline, takes 14 cycles

<table>
<thead>
<tr>
<th>Cycle</th>
<th>FP instruction</th>
<th>non-FP instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>LD ( F0, 0(R1) )</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>LD ( F6, -8(R1) )</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>ADD.D ( F4, F0, F2 )</td>
<td>( F10, -16(R1) )</td>
</tr>
<tr>
<td>4</td>
<td>ADD.D ( F8, F6, F2 )</td>
<td>( F14, -24(R1) )</td>
</tr>
<tr>
<td>5</td>
<td>ADD.D ( F12, F10, F2 )</td>
<td>DSUBI ( R1, R1, 32 )</td>
</tr>
<tr>
<td>6</td>
<td>ADD.D ( F16, F14, F2 )</td>
<td>S.D ( F4, 32(R1) )</td>
</tr>
<tr>
<td>7</td>
<td>S.D ( F8, 24(R1) )</td>
<td></td>
</tr>
<tr>
<td>8</td>
<td>S.D ( F12, 16(R1) )</td>
<td></td>
</tr>
<tr>
<td>9</td>
<td>BNEZ ( R1, Loop )</td>
<td>(delayed branch)</td>
</tr>
<tr>
<td>10</td>
<td>S.D ( F16, 8(R1) )</td>
<td></td>
</tr>
</tbody>
</table>

Example (cont.)

- Need to use care not to exceed register traffic bounds
- It may be impossible to load an F register at the same time that the floating point unit is accessing/reading another F register – a structural hazard

VLIW MIPS

- Idea: have several non-interfering instructions stored in each instruction fetched
- Then the software (the compiler) has the burden of detecting interfering instructions
- Must be statically scheduled by compiler

Example

- Assume a VLIW instruction can contain 2 memory references, 2 floating point operations, and 1 other operation each clock cycle
- Then the example loop takes 6 cycles, because latencies still limit performance
Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level parallelism

Alan Sussman
March 18, 2003

Example (cont.)

<table>
<thead>
<tr>
<th>Cycle</th>
<th>Memory</th>
<th>Memory</th>
<th>FP</th>
<th>FP</th>
<th>Other</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>L.D F0</td>
<td>L.D F6</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>L.D F10</td>
<td>L.D F14</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>ADD.D F4</td>
<td>ADD.D F8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>ADD.D F12</td>
<td>ADD.D F16</td>
<td>DSUBI</td>
<td></td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>S.D F4</td>
<td>S.D F8</td>
<td></td>
<td></td>
<td>BNEZ</td>
</tr>
<tr>
<td>6</td>
<td>S.D F12</td>
<td>S.D F16</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Limitations on multiple issue

- Programs have limited parallelism and cannot keep all functional units busy – still have control hazards
- CPU hardware gets very complicated for issuing and executing
- Memory bandwidth requirements increase
- Superscalar:
  - many decisions to be made very fast
- VLIW:
  - code length increases, because of loop unrolling and many NOP instructions/operations
- Vector processors (Appendix G online) generally cheaper and faster.
- ... but recently, VLIW becoming more popular again (Intel/HP EPIC)

Administrivia

- HW #4 due Thursday
- I’m out of town Thursday
  - TA will return quizzes and collect homework
  - And go over quiz answers
  - no office hours today or tomorrow either
- Midterm Thursday, 4/3
  - on everything up to then (including hardware ILP)
- On Tuesday, 4/1, finish up hardware ILP, then review/questions for midterm

Last time

- Multiple instruction issue
  - Superscalar (static or dynamic) or VLIW (static)
  - main overall problems are control hazards, complex CPU hardware, memory bandwidth
- Superscalar MIPS
  - static (compiler) or dynamic (Tomasulo) scheduling
  - 3 fetch/issue steps – fetch insts, decide whether to issue 0, 1, or 2 (hazards), issue to functional units
  - main problem is very fast, complex decision-making
- VLIW MIPS
  - static scheduling, by compiler
  - main problem is code length from loop unrolling and NOPS

Hardware speculation

- A method for overcoming control dependences/hazards
- Branch prediction reduces stalls, but still want to execute instructions past a branch to get more instruction level parallelism, esp. with multiple issue
- Speculate on the outcome of branches, and execute the program as if the guesses are correct
  - more than just dynamic scheduling plus branch prediction
  - instructions are fetched, issued, and executed
  - then fix things up if the speculation was incorrect
Hardware speculation (cont.)

- 3 key ideas
  - **dynamic branch prediction**, to choose which instructions to execute
  - **speculation**, to allow executing instructions before branches resolved (and effects undone if speculation was wrong)
  - **dynamic scheduling**, to deal with scheduling combinations of basic blocks

- Essentially a *data flow execution*
  - operations execute as soon as operands available

- We’ll use Tomasulo’s algorithm for dynamic scheduling, and just for FP unit

Extensions to Tomasulo’s algorithm

- Separate bypassing of results between instructions from actual completion of an instruction
  - so that an instruction can execute and produce results for other instructions, without performing any updates that can’t be undone – until know that the instruction should have been executed (control dependences resolved)
  - when instruction is no longer speculative:
    - update the register file, or memory
    - call this **instruction commit**

Implementing speculation

- Allow instructions to execute out-of-order, but make them commit *in order*
  - and prevent any action that can’t be undone (update state, take an exception, etc.) until instruction commits

- Requires some changes to standard pipeline sequence, and hardware buffers to hold results of instructions that have completed execution, but have not committed yet
  - a reorder buffer
  - also used to pass results between speculated instructions

Reorder buffer (ROB)

- Provides registers, like reservation stations
- Holds instruction result from time operation completes until instruction commits
- But no writeback to register file until instruction commits
  - so ROB supplies operands in this time interval
- Similar to store buffer in Tomasulo’s algorithm
  - in design shown, store buffer integrated into ROB

Reorder buffer (cont.)

- 4 fields in an ROB entry
  - **instruction type** – branch, store, register op (ALU or load)
  - **destination** – register number or memory address
  - **value** – result of instruction
  - **ready** – set when instruction completes execution, and value is ready

MIPS with Tomasulo + speculation

- Fig. 3.29
Instruction execution – 4 steps

• Issue
  – get instruction from instruction queue
  – issue if both reservation station and empty ROB slot available
  – send operands to reservation station if available in either registers or ROB
  – send ROB entry number to reservation station for writing result
  – if reservations stations or ROB full, instruction stalls

Instruction execution (cont.)

• Execute
  – If an operand is not yet available, monitor CDB until it appears
  – When all operands available at reservation station, execute the operation
  – Instructions can take multiple cycles in this stage, loads require 2 steps (compute effective address and memory access), stores only need to compute effective address

Instruction execution (cont.)

• Write result
  – When result available, write on CDB (with ROB tag set in issue step), and from CDB to ROB, plus any reservation stations waiting
  – Mark reservation station available
  – For stores, if value to be stored is available, write into Value field of ROB entry for store
    • otherwise, wait until it appears on CDB, then update Value file of ROB entry

Instruction execution (cont.)

• Commit
  – For “normal” case (instruction reaches head of ROB and its result is in the buffer), update the register with the result and remove instruction from ROB
  – For a store, update memory instead of result register
  – For an incorrectly predicted branch, the speculation was wrong – flush the ROB, and restart execution at correct branch target
    • correctly predicted branch is just finished

Example – Fig. 3.30

L.D  F6,34(R2)
L.D  F2,45(R3)
MUL.D F0,F2,F4
SUB.D F8,F6,F2
DIV.D F10,F0,F6
ADD.D F6,F8,F2

• Assume 2 cycle add, 10 cycle multiply, 40 cycle divide

Example (cont.)

<table>
<thead>
<tr>
<th>Name</th>
<th>Busy</th>
<th>Op</th>
<th>Vj</th>
<th>Vk</th>
<th>Qj</th>
<th>Qk</th>
<th>Dest</th>
<th>A</th>
</tr>
</thead>
<tbody>
<tr>
<td>Load1</td>
<td>no</td>
<td></td>
<td>Vj</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Load2</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add1</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add2</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Add3</td>
<td>no</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mul1</td>
<td>no</td>
<td>MUL</td>
<td>Vj</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Mul2</td>
<td>yes</td>
<td>DIV</td>
<td>Vj</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
### Example (cont.)

#### Reorder buffer

<table>
<thead>
<tr>
<th>Entry</th>
<th>Busy</th>
<th>Instruction</th>
<th>State</th>
<th>Dest.</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>no</td>
<td>L.D</td>
<td>Commit</td>
<td>F6</td>
<td>Mem[34+R[R2]]</td>
</tr>
<tr>
<td>2</td>
<td>no</td>
<td>L.D</td>
<td>Commit</td>
<td>F2</td>
<td>Mem[45+R[R3]]</td>
</tr>
<tr>
<td>3</td>
<td>yes</td>
<td>MUL.D</td>
<td>Write result</td>
<td>F0</td>
<td>F2 × R[F4]</td>
</tr>
<tr>
<td>4</td>
<td>yes</td>
<td>SUB.D</td>
<td>Write result</td>
<td>F8</td>
<td>#1 - #2</td>
</tr>
<tr>
<td>5</td>
<td>yes</td>
<td>DIV.D</td>
<td>Execute</td>
<td>F10</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>yes</td>
<td>ADD.D</td>
<td>Write result</td>
<td>F6</td>
<td>#4 + #2</td>
</tr>
</tbody>
</table>

### Example (cont.)

#### FP register status

<table>
<thead>
<tr>
<th>Field</th>
<th>F0</th>
<th>F1</th>
<th>F2</th>
<th>F3</th>
<th>F4</th>
<th>F5</th>
<th>F6</th>
<th>F7</th>
<th>F8</th>
<th>F10</th>
</tr>
</thead>
<tbody>
<tr>
<td>ROB #</td>
<td>6</td>
<td>4</td>
<td>5</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Busy</td>
<td>yes</td>
<td>no</td>
<td>no</td>
<td>no</td>
<td>no</td>
<td>yes</td>
<td>no</td>
<td>yes</td>
<td>yes</td>
<td>yes</td>
</tr>
</tbody>
</table>

### Exceptions

- Processor with ROB can provide precise exceptions/interrupts
  - don’t take exception caused by program until commit
  - at that point, flush any pending instructions
  - works because commit happens in program order

- Much harder to do precise exceptions with basic Tomasulo algorithm
  - instructions can complete before earlier issued instructions, with register or memory writes that can’t be undone – imprecise exceptions
  - makes some kinds of exceptions (e.g., page faults) hard to handle

### Loop example – Fig. 3.31

#### Loop example (cont.)

- Assume all instructions in loop issued twice
- L.D and MUL.D from iteration 1 committed, all other instructions finished executing
- Effective address for stores already computed, and have left ROB

#### Reorder buffer

<table>
<thead>
<tr>
<th>Entry</th>
<th>Busy</th>
<th>Inst.</th>
<th>State</th>
<th>Dest.</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>no</td>
<td>L.D</td>
<td>Commit</td>
<td>F0</td>
<td>Mem[0+R[R1]]</td>
</tr>
<tr>
<td>2</td>
<td>no</td>
<td>MUL.D</td>
<td>Commit</td>
<td>F4</td>
<td>#1 × R[F2]</td>
</tr>
<tr>
<td>3</td>
<td>yes</td>
<td>S.D</td>
<td>Write result</td>
<td>F0</td>
<td>0+R[R1]</td>
</tr>
<tr>
<td>4</td>
<td>yes</td>
<td>DADDU</td>
<td>Write result</td>
<td>R1</td>
<td>R[R1] - 8</td>
</tr>
<tr>
<td>5</td>
<td>yes</td>
<td>BNE</td>
<td>Write result</td>
<td>F4</td>
<td>0+R[R1]</td>
</tr>
<tr>
<td>6</td>
<td>yes</td>
<td>L.D</td>
<td>Write result</td>
<td>F0</td>
<td>Mem[0+R1]</td>
</tr>
<tr>
<td>7</td>
<td>yes</td>
<td>MUL.D</td>
<td>Write result</td>
<td>F4</td>
<td>0+R[F2]</td>
</tr>
<tr>
<td>8</td>
<td>yes</td>
<td>S.D</td>
<td>Write result</td>
<td>F4</td>
<td>#7</td>
</tr>
<tr>
<td>9</td>
<td>yes</td>
<td>DADDU</td>
<td>Write result</td>
<td>R1</td>
<td>#4 - 8</td>
</tr>
<tr>
<td>10</td>
<td>yes</td>
<td>BNE</td>
<td>Write result</td>
<td>F4</td>
<td>0+R[R1]</td>
</tr>
</tbody>
</table>

### FP register status

<table>
<thead>
<tr>
<th>Field</th>
<th>F0</th>
<th>F1</th>
<th>F2</th>
<th>F3</th>
<th>F4</th>
<th>F5</th>
<th>F6</th>
<th>F7</th>
<th>F8</th>
</tr>
</thead>
<tbody>
<tr>
<td>ROB #</td>
<td>6</td>
<td>7</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Busy</td>
<td>yes</td>
<td>no</td>
<td>no</td>
<td>no</td>
<td>yes</td>
<td>no</td>
<td>no</td>
<td>no</td>
<td>no</td>
</tr>
</tbody>
</table>

- If branch mispredicted, all instructions prior to branch commit, and when branch commits, ROB cleared and processor starts fetching from other path
- this is expensive, but shouldn’t happen very often
Speculation with multiple issue

- Similar to extending Tomasulo algorithm for multiple issue
  - process multiple instructions per clock, assign reservation stations and reorder buffer entries
- Major challenges are instruction issue and monitoring CDBs for instruction completion
  - and must be able to handle multiple instruction commits per cycle
- As example in book shows (Figs. 3.33 and 3.34), speculation helps when there are data dependent branches that limit performance
  - requires good branch prediction
  - incorrect speculation could hurt performance

Design considerations

- Register renaming
  - an alternative to ROB
  - employ a larger set of physical registers than the ones visible in the architecture (R and F)
  - replace the ones in the reservations stations and ROB
  - renaming happens at instruction issue
  - renaming dest register removes WAW and WAR hazards
  - speculation recovery easy since physical register holding instruction dest doesn’t become architectural register until commit
  - hard part is knowing when deallocation is OK
    - when no longer corresponds to a physical register, and nothing will use it

Design considerations (cont.)

- How much to speculate
  - cost comes from incorrect speculation of an expensive event (e.g., a 2nd level cache miss)
  - solution is to only allow low-cost exceptional events in speculative mode (e.g., 1st level cache miss)
  - processor waits until instruction is not speculative before handling expensive exceptional event

Design considerations (cont.)

- Speculating through multiple branches
  - when is this beneficial?
    - very high branch frequencies
    - clustering of branches
    - long delays in functional units
  - really just makes recovery more complicated
  - IBM Power2 can even predict and speculate on more than 1 branch per cycle!

Limitations of ILP

- Question is how much ILP is available in real programs
  - limits the amount of parallelism, no matter how much hardware is used
  - tells you how much can enhance performance from architectural decisions, on top of increases in circuit speeds
- Results that follow are mostly from a study by Wall in 1993 on SPEC92 benchmarks

Hardware model used

- Start from an ideal processor
  - all artificial constraints on ILP removed
  - only limits on ILP are from actual data flows through registers or memory
- Assumptions
  1. register renaming – infinite number of virtual registers available (no WAW or WAR hazards) and unbounded number of insts issued simultaneously
  2. branch prediction – all perfectly predicted
  3. jump prediction – all perfectly predicted – combined with 2, same as perfect speculation and unbounded window
  4. memory address alias analysis – all addresses know exactly, so load can be moved before store if addresses not identical
- 2 and 3 eliminate all control dependences
- 1 and 4 eliminate all but true data dependences
Hardware model (cont.)

- Unlimited instruction issue, with no restrictions on lookahead or on types of instructions issued
- All functional units have 1 cycle latency
- Perfect caches, so all loads/stores complete in 1 cycle
- This processor is pretty much unrealizable, but is where we start

Administrivia

- Midterm Thursday
  - on everything up to then (including hardware ILP)
- Today, finish up hardware ILP, then review/questions for midterm

Limitations of ILP

- Question is how much ILP is available in real programs
  - limits the amount of parallelism, no matter how much hardware is used
  - tells you how much can enhance performance from architectural decisions, on top of increases in circuit speeds
- Results that follow are mostly from a study by Wall in 1993 on SPEC92 benchmarks

Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level parallelism

Alan Sussman
April 1, 2003

Last time

- Hardware speculation
  - to overcome control dependences, and execute instructions past a branch
  - separate bypassing of results from insts from instruction commit/completion
  - insts execute out-of-order, commit in order
  - use a reorder buffer (ROB)
  - no writeback to registers or memory until instruction commits
  - good for precise exceptions
  - if branch mispredicted, clear ROB and start on other path
  - an alternative to ROB is register renaming

Hardware model used

- Start from an ideal processor
  - all artificial constraints on ILP removed
  - only limits on ILP are from actual data flows through registers or memory
- Assumptions
  1. register renaming – infinite number of virtual registers available (no WAW or WAR hazards) and unbounded number of insts issued simultaneously
  2. branch prediction – all perfectly predicted
  3. jump prediction – all perfectly predicted – combined with 2, same as perfect speculation and unbounded window
  4. memory address alias analysis – all addresses know exactly, so load can be moved before store if addresses not identical
- 2 and 3 eliminate all control dependences
- 1 and 4 eliminate all but true data dependences
Hardware model (cont.)

- Unlimited instruction issue, with no restrictions on lookahead or on types of instructions issued
- All functional units have 1 cycle latency
- Perfect caches, so all loads/stores complete in 1 cycle
- This processor is pretty much unrealizable, but is where we start

Available ILP – Fig. 3.35

First restriction study

- Limit window size and maximum issue count
  - window size is the number of instructions looked at for simultaneous execution (the number in-flight)
  - current window sizes max out around 64-128
    - but real functional units pipelined, not single cycle
    - and real processors need window to hold outstanding memory references (cache misses)

Restricting window size – Fig. 3.36

Second restriction study

- Realistic branch and jump prediction
  - fix window size at 2K and issue at most 64 instructions per clock
  - 5 levels of branch prediction
    - perfect – for branches and jumps
    - tournament-based – correlating 2-bit, noncorrelating 2-bit, and selector to choose best one for each branch, 48k bits total, 97% accurate on SPEC92, plus essentially perfect jump predictor
    - standard 2-bit with 512 2-bit entries, plus 16 entry procedure return predictor
    - static – based on profile history of program
    - none – only predict jumps – parallelism only within basic block

Effect of branch prediction

Again, f.p. programs show more available parallelism
Branch prediction accuracy

- Fig. 3.40
- Accuracy better for f.p. programs, and when prediction not very accurate, misprediction penalty is very high.

Third restriction study

- Effects of finite registers
  - huge tournament predictor (150K bits) plus large jump and return predictors
  - now reduce number of registers for renaming from infinite in ideal processor – start from a base of 32 GP and 32 FP

Finite registers – Fig. 3.41

- For f.p. programs, many registers needed to evaluate independent threads.

Fourth restriction study

- Imperfect alias analysis
  - 3 other memory models, instead of perfect
    - global/stack perfect – and assumes all heap references conflict (idealized version of best a compiler could do currently)
    - inspection – compile time interference checks, just looking for obvious memory conflicts
    - none – all memory references conflict
  - first 2 same for Fortran programs

Imperfect alias analysis

- Fig. 3.43
- Shows compiler or dynamic memory disambiguation needed.

Example: P6 Microarchitecture
P6 microarchitecture

- Dynamically scheduled
- Translates each IA-32 instruction into series of micro-operations (uops) to be executed in pipeline
  - each uop is like a MIPS instruction
- Up to 3 IA-32 instructions fetched, decoded, and translated per cycle, max of 6 uops generated per cycle
- Pipeline is out-of-order and speculative, with register renaming and a reorder buffer (ROB)

P6 uops – Fig. 3.48

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Pipeline stages</th>
<th>Repeat rate</th>
</tr>
</thead>
<tbody>
<tr>
<td>Integer ALU</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Integer load</td>
<td>3</td>
<td>1</td>
</tr>
<tr>
<td>Integer multiply</td>
<td>4</td>
<td>1</td>
</tr>
<tr>
<td>FP add</td>
<td>3</td>
<td>1</td>
</tr>
<tr>
<td>FP multiply</td>
<td>5</td>
<td>2</td>
</tr>
<tr>
<td>FP divide (64-bit)</td>
<td>32</td>
<td>32</td>
</tr>
</tbody>
</table>

Pipeline stalls

- Lead to processor not committing as many instructions as possible in a cycle
  - fewer than 3 insts fetched – inst cache miss
  - fewer than 3 insts issued – too many uops
  - not all uops generated can issue – lack of reservation stations or ROB entries
  - data dependences everywhere – reservation stations and ROBs
  - data cache miss – all res. stations and ROBs waiting
  - branch mispredicts – pipeline flushes and refills

Decode stalls – Fig. 3.51

Data cache misses – Fig. 3.53

hard to read, but L2 misses are the problem

uop completions per cycle

Fig. 3.56

shows benefit of a dyn. scheduled pipeline
Fallacies

- Processors with lower CPIs will always be faster
- Processors with faster clock rates will always be faster
- Sophisticated pipelines with slower clocks sometimes win from using more ILP

Pitfalls

- Emphasizing improved CPI by increasing issue rate, but sacrificing clock rate, can lead to lower performance
  - Complexity introduced to improve CPI can slow clock rate
- Improving only one part of a multiple-issue processor may not improve overall performance
  - Need to find the bottleneck (i.e. Amdahl's law)
- Sometimes bigger and dumber is better
  - Different applications can produce different behaviors