Problems

The Problems for the Problem of Branching...


1) The CPI of Loops

Loop:
ADDI R1, R1, #8
ADD R2, R2, R1
SW R4, 0(R2)
SUBI R8, R8, #1
ADD R3,R1,R8
BNEZ R8, Loop

Given the above code fragment, and the following knowledge: R8 is 6 upon entering the loop, R8 is always a multiple of 3 before entering the loop, the computer architecture is DLX with forwarding, branch target addresses are calculated in the IF stage, and branch outcomes are decided in the ID stage, what CPI will be produced when the program is executed using each of the following methods to deal with control hazards, freezing, predict branch-taken, prediction using a 1-bit predictor which enters the loop predicting taken, loop unrolling with predict branch not-taken (do not reschedule the code), and branch delay moving SW R4, 0(R2) into the branch delay slot?


2) The Joy of Unrolling

Unroll the following code fragment so that all stalls are removed from inside the loop (except any possible stalls caused by the choice taken at the branch statement), and any unnecssary statements are removed. There are 64 available registers. You can assume that any registers that need to be a certaint multiple works out fine.

Loop:
LW R1, 0(R4)
ADD R2, R2, R1
ADD R4, R4, #16
SUBI R8, R8, #8
BNEZ R8, Loop


3) The Amazing Predict-o-bit

Given that R1 is 4 and R2 is 2 when entering the following code fragment, show how a 1-bit predictor (which is 0 upon entering the loop, predicting not-taken) would cause the following code fragement to be executed. Assume that branch instructions are completed in the ID stage, and branch targets are computed in the IF stage.

Loop1:
BEQZ R2, Loop2
SUBI R2, R2, #1
Loop2:
SUBI R1, R1, #1
BNEZ R1, Loop1
ADD R2, R2, R1


 

Answers:


 

1.1) Freezing

For each iteration through the loop, we get a stall following the ADDI as we wait for R1 to be available. We get 1 stall waiting for R2 follwing the ADD, and 1 stall waiting for R8 following the SUBI. So, with three stalls per iteration through data dependencies, we are left with 2 stalls for freezing (since we are assuming we wil know the outcome of the branch by the ID stage and have forwarding). So, the instructions start at time, 1, 3, 5, 6, 8, 9, respectively. With the last instruction (being the branch) finishing at time 13, 1 cycle after the branch taget starts to execute. So, going through the entire loop 6 times takes us to time 67. For a CPI of 67/36 = 1.86

1.2) Predict branch taken

We still have the 3 stalls for each iteration because of data dependencies. So, the instructions still start at the same times. The only change is that the branch target will begin executing at time 11. So we are simply cutting out 1 stall for each cycle through the loop, except for the last cycle, so we have removed 5 stalls. So, the new CPI will be 62/36 = 1.72

1.3) Prediction using a 1-bit predictor

When we enter the loop, our 1-bit predictor is predicting taken, so we will act like predict-taken. In fact, we will get the exact same outcome as predict-taken above! A CPI of 62/36. If we had instead entered the loop with out 1-bit predictor predicting not-taken, we would have gotten 1 more stall, from our incorrect first prediction. For a CPI of 63/36 = 1.75

1.4) Loop unrolling with predict branch not-taken

With loop unrolling, we will be able to cut out 2 BNEZ instructions. So for one run through the new super-loop (3 iterations of the normal loop) we have our instructions starting at times: 1, 3, 5, 6, 8, (1 sub-loop) 9, 11, 13, 14, 16, (2 sub-loop) 17, 19, 21, 22, 24, 25 (the first branch instruction). So, since we are predicting branch not-taken. We will have 2 stalls the first time through, and none the last time. So, we come out with our last branch instruction finishing at time 56, and our next instruction starting at time 53. So, our CPI will be about 56/32= 1.75, However, we are actually doing more work than this would indicate, since we are cutting out an instruction from each sub-loop.

1.5) Branch delay

By moving the SW instruction into the delay-slot, we are also getting rid of 1 data dependency, so we have our commands starting at times, 1, 3, 4, 6, 7 (branch), 8 (delay-slot). So, with 6 iterations through the loop, our final instruction ends at time 48. So we end up with a CPI of 48/36 = 1.33 and have the wonderful feeling that comes from knowing that our pipeline is pretty darn full.


2)

One possible solution is given by:

Loop:
SUBI R8, R8, #32
LW R1, 0(R4)
LW R9, 16(R4)
LW R17, 32(R4)
LW R25, 48(R4)
ADD R2, R2, R1
ADD R10, R10, R9
ADD R18, R18, R17
ADD R26, R26, R25
ADD R4
, R4, #64
BNEZ R8, Loop


3)

 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
R1
R2
bit
BEQZ R2,Loop2
IF
ID
EX
Mem
WB
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4
2
0
SUBI R2,R2,#1
 
IF
ID
EX
Mem
WB
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4
1
0
SUBI R1,R1,#1
 
 
IF
ID
EX
Mem
WB
 
 
 
 
 
 
 
 
 
 
 
 
 
3
1
0
BNEZ R1,Loop1
 
 
 
IF
ID
EX
Mem
WB
                        3
1
0
ADD R2,R2,R1
 
 
 
 
IF
n/a
n/a
n/a
n/a
 
 
 
 
 
 
 
 
 
 
 
3
1
1
BEQZ R2, Loop2
 
 
 
 
 
IF
ID
EX
Mem
WB
 
 
 
 
 
 
 
 
 
 
3
1
1
SUBI R1,R1,#1
 
 
 
 
 
 
IF
n/a
n/a
n/a
n/a
 
 
 
 
 
 
 
 
 
3
1
0
SUBI R2,R2,#1
 
 
 
 
 
 
 
IF
ID
EX
Mem
WB
 
 
 
 
 
 
 
 
3
0
0
SUBI R1,R1,#1
 
 
 
 
 
 
 
 
IF
ID
EX
Mem
WB
 
 
 
 
 
 
 
2
0
0
BNEZ R1,Loop1
 
 
 
 
 
 
 
 
 
IF
ID
EX
Mem
WB
 
 
 
 
 
 
2
0
0
ADD R2,R2,R1
 
 
 
 
 
 
 
 
 
 
IF
n/a
n/a
n/a
n/a
 
 
 
 
 
2
0
1
BEQZ R2,Loop2
 
 
 
 
 
 
 
 
 
 
 
IF
ID
EX
Mem
WB
 
 
 
 
2
0
1
SUBI R1,R1,#1
 
 
 
 
 
 
 
 
 
 
 
 
IF
ID
EX
Mem
WB
 
 
 
1
0
1
BNEZ R1,Loop1
 
 
 
 
 
 
 
 
 
 
 
 
 
IF
ID
EX
Mem
WB
 
 
1
0
1
BEQZ R2,Loop2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
IF
ID
EX
Mem
WB
 
1
0
1
SUBI R1,R1,#1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
IF
ID
EX
Mem
WB
0
0
1
BNEZ R1,Loop1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
IF
ID
EX
Mem
0
0
1
BEQZ R2,Loop2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
IF
n/a
n/a
0
0
0
ADD R2,R2,R1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
IF
ID
0
0
0

 

BACK: Unrolling
Back to Start