Notes on Chapter 4.3 -- Reducing Branch Penalties with Dynamic Hardware Prediction

 

Branch prediction buffers

Correlating predictors

Branch prediction cache

Storing target instructions

 


Reducing Branch Penalties

Branch prediction buffer: small memory indexed by lower portion of address of branch instruction

memory contains bit which provides branch prediction

fetching begins in the predicted direction

How to predict branch direction?

single bit scheme -- bit specifies whether branch was recently taken

two-bit scheme -- branch must miss twice before prediction is changed. Note finite state diagram for two bit scheme in Figure 4.13

n-bit counter -- counter can take on values from 0 to 2**n-1, when counter is greater or equal to half maximum value, branch is predicted as taken

Why more than one bit?

for(i=0;i<n;i++) {

for(j=0;j<m;j++){

...statements.... }}

For i>0, the branch associated with the j loop will be mispredicted twice per i loop iteration. (when j=0 and j=m)

Two bit scheme - will always predict branch taken. When j=m, prediction will be wrong, but when the next iteration of the i loop rolls around, prediction will be correct.

Implement as a small cache accessed with instruction address during IF pipe stage, or as a pair of bits attached to each block in the instruction cache and fetched with the instruction

Accuracy of branch prediction buffer with two bits per entry - 82-99% on SPEC89

Scheme (used on its own) helps if we know target PC before we know whether branch is taken.

 


Correlating predictors - use behavior of other branches to make a prediction. For instance:

if(d==0) {d=1;}

if(d==1)

 

DLX Version:

bnez r1,L1 ; branch b1 (r1!=0)

addi r1,r0,#1 ; set r1 to 1 as r1==0

L1: subi r1,r1,#1; subtract 1 from r1

bnez r1,L2; branch b2 (r1!=0)

.....

L2:

Assume that this code sequence is encountered repeatedly with r1 equal to 2 then 0 then 2 then

0 etc.

r1=2 -- branch b1 taken, branch b2 taken

r1=0 -- branch b1 not taken, branch b2 not taken

Assume that we have one bit predictor initialized to not taken. One bit predictor will always

be wrong. Two bit predictor will also always be wrong.

For each conditional, predict based on the outcome of the previously executed conditional

Assume each branch has a 1 bit preditor that uses one bit of correlation:

Notation - consider loop x: NT/NT means that if the previous loop was not taken, we predict loop x as not taken and if the previous loop was taken, we also predict loop x as not taken

Initial prediction for branch b1 and branch b2 is NT/NT. We assume that the last branch was not taken.

r1 = 2 -- branch b1 taken

Prediction NT:: assume last branch not taken - (state b1: NT/NT )

revised b1 state T/NT

branch b2 taken

Prediction NT:: last branch taken - [state b2: NT/NT}

revised b2 state NT/T

r1 = 0 - branch b2 not taken

Prediction NT :: last branch taken - [state b1:T/NT]

revised b1 state T/NT

branch b2 not taken

Prediction NT:: last branch not taken - [state b2: NT/T]

revised b2 state NT/T

r1 = 2 - branch b1 taken

Prediction T:: last branch not taken [state b1:T/NT]

revised b1 state T/NT

branch b2 taken

Prediction T:: last branch taken [state b2:NT/T]

revised b2 state NT/T

 

A (1,1) predictor uses the behavior of the last branch to choose from among a pair of one-bit branch predictors. An (m,n) predictor uses the

behavior of the last m branches to choose from 2**m branch predictors, each of which is an n-bit predictor for a single branch.

Global history of last m branches can be recorded in m bit shift register

Branch prediction buffer can be indexed by concatenating low-order bits from the branch address with the m-bit global history.

 


 

Branch prediction cache that stores predicted address for the next instruction after a branch is called a branch-target buffer or branch target cache

We are predicting the next instruction address before we find out whether a given instruction is a branch. Thus, we must also find out early whether the instruction is actually a branch. If we have seen a given branch before, the instruction will live at a specific address. (Assume we do not allow self-modifying code). If PC of fetched instruction matches PC in branch-target buffer, predicted PC is used as next PC (see Figure 4.22).

For one bit predictor, just store predicted-taken branches.

For two (or more) bit predictors, use both target buffer and prediction buffer.

Study Figure 4.23


Store target instructions instead of, or in addition to, target address.

branch folding --- substitute the instruction from the branch target buffer in place of the instruction returned from the cache - in other words, switch instructions during IF cycle