Delayed Branching

Just give him something to execute...

One way to maximize the use of the pipeline, is to find an instruction that can be safely exeucted whether the branch is taken or not, and execute that instruction. So, when a branch instruction is encountered, the hardware puts the instruction following the branch into the pipe and begins executing it, just as in predict-not-taken. However, unlike in predict-not-taken, we do not need to worry about whether the branch is taken or not, we do not need to clear the pipe because no matter whether the branch is taken or not, we know the instruction is safe to execute. How do we know? The compiler promised it would be safe.

When the program was compiled, the compiler looked at each branch instruction, and tried to find something that could be safely executed, whether we take the branch or not. In the example code we have been using up to now:
LOOP:
LW R1, 0(R2)
SUB R3, R3, R1
BNEZ R3 LOOP
SW R4, 0(R3)
we couldn't execute the SW R4, 0(R3) each time through the loop without causing any problems, because the value of R3 is changing in our loop. But, suppose instead we had the instruction SW R3, 0(R4) following the BNEZ instruction. Now, we could execute this instruction each time through the loop with no problems. We would simply be writing repeatedly to the same memory location, and once we finally do exit the loop, we will just continue on normally, no harm done. However, we are also still wasting a cycle for each time we DO go through the loop.

A better strategy is for the compiler to find an instruction either from inside the loop, or from before the loop, that can safely be rescheduled. For example, given the following, slightly modified code fragment:
LOOP:
SUB R3, R3, R1
SW R3, 0(R5)
BNEZ R3 LOOP
SW R4, 0(R6)
We could still safely execute the SW R4, 0(R3) each time through the loop, but that would provide marginal improvement. A better solution is to move SW R3, 0(R5) right after BNEZ. We will still execute it each time through the loop, because it is in the branch-delay slot, the final result will not be altered, and we get much better overall performance, whether we take the loop or not. See below for an example run of this code fragment.

So, if this strategy offers improvements irregardless of whether we take or do not take the branch, what is the problem? The problem is trying to find an instruction that can both be safely executed whether the branch is taken or not, and will still improve performance. This is the compiler's job, and so using a branch delay slot makes compilers more complex to program. Also, Hennesy and Patterson mention that using this option does cause one shortcoming, if the hardware is changed so that a delay-branch slot is no longer used, all the old programs will no longer work. C programs would have to be recompiled, and assembly language programs and routines would have to be re-written. So, this method does put a lot more work into the hands of the system programmers.

Branch-Delay Example:	LOOP:	SUB R3, R3, R1 BNEZ R3 LOOP SW R3, 0(R5) SW R4, 0(R6)
The code fragment above is the final output of the compiler, in actuality, we want our program to execute the first SW instruction each iteration through the loop. So the compiler put the instruction in the branch-delay slot. Below is the run for the program when we go through the loop twice.
		1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17
SUB R3,R3,R1		IF	ID	EX	Mem	WB
BNEZ R3,Loop			IF	ID	EX	Mem	WB
SW R3, 0(R5)				IF	ID	EX	Mem	WB
SUB R3,R3,R1					IF	ID	EX	Mem	WB
BNEZ R3,Loop						IF	ID	EX	Mem	WB
SW R3, 0(R5)							IF	ID	EX	Mem	WB
SW R4, 0(R6)								IF	ID	EX	Mem	WB

You can see from the above, that using this strategy we were able to eliminate all of the delays. Of course, this is assuming that the compiler is able to find a safe, relevant instruction. If the compiler cannot find anything relevant, then we may only get the same improvement as predict-taken or predict-not-taken. Or, if the compiler is unable to find any instruction that it is safe to execute, then it may fill the branch-delay slot with a no-op, and we are back to wasting one cycle through each iteration of the loop, just as in freezing. So, this method does provide for good use of the pipeline, but only in some cases. And it is dependent on the skill of the compiler to find those cases.

PREVIOUS: Prediction

NEXT: Unrolling

Delayed Branching

Just give him something to execute...

LOOP:

SUB R3, R3, R1 BNEZ R3 LOOP SW R3, 0(R5) SW R4, 0(R6)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

SUB R3,R3,R1

IF

ID

EX

Mem

WB

BNEZ R3,Loop

IF

ID

EX

Mem

WB

SW R3, 0(R5)

IF

ID

EX

Mem

WB

SUB R3,R3,R1

IF

ID

EX

Mem

WB

BNEZ R3,Loop

IF

ID

EX

Mem

WB

SW R3, 0(R5)

IF

ID

EX

Mem

WB

SW R4, 0(R6)

IF

ID

EX

Mem

WB

SUB R3, R3, R1
BNEZ R3 LOOP
SW R3, 0(R5)
SW R4, 0(R6)