Branch Delay Slot Explanation

Code for DLX, not scheduled for the pipeline (from Hennessy and Patterson):

Loop:   LD    F0, 0(R1)     ; F0 = array element
        ADDD  F4, F0, F2    ; add scalar in F2
        SD    0(R1), F4     ; store result
        SUBI  R1, R1, #8    ; decrement pointer by 8 bytes (per DW)
        BNEZ  R1, Loop      ; iterate if R1 is not zero

Now the unrolled loop after it has been rescheduled (assume we unroll it 4 times):

Loop:   LD    F0, 0(R1)
        LD    F6, -8(R1)
        LD    F10, -16(R1)
        LD    F14, -24(R1)
        ADDD  F4, F0, F2
        ADDD  F8, F6, F2
        ADDD  F12, F10, F2
        ADDD  F16, F14, F2
        SD    0(R1), F4
        SD    -8(R1), F8
        SD    -16(R1), F12
                            ; You might be tempted to do the next SD
                            ; at this point, but it is held back to fill
                            ; the branch delay slot and avoid a stall.
        SUBI  R1, R1, #32   ; here you update the counter (4 x 8 bytes)
        BNEZ  R1, Loop
        SD    8(R1), F16    ; branch delay slot; 8 - 32 = -24

The "SD 8(R1), F16" instruction is saved for after the branch because it legally fills the delay slot: it is executed whether or not the branch is taken, so it does not depend on the branch outcome. Because SUBI has already subtracted 32 from R1 by the time the store executes, its offset changes from -24 to 8 (8 - 32 = -24), so it still writes to the same address as before.
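The offset adjustment and the "executes either way" property can be checked with a small simulation. The following is a minimal sketch in Python, not DLX: the function names, the dictionary used as memory, and the starting value of R1 are all assumptions made for illustration. It assumes the element count is a multiple of 4 and that R1 starts at 8 * n, so the loop ends when R1 reaches zero.

```python
def make_array(n):
    """Hypothetical memory image: n doubles at byte addresses 8, 16, ..., 8*n."""
    return {8 * i: float(i) for i in range(1, n + 1)}


def original_loop(mem, r1, f2):
    """One element per iteration, mirroring the unscheduled loop."""
    while True:
        f0 = mem[r1]            # LD   F0, 0(R1)
        f4 = f0 + f2            # ADDD F4, F0, F2
        mem[r1] = f4            # SD   0(R1), F4
        r1 -= 8                 # SUBI R1, R1, #8
        if r1 == 0:             # BNEZ R1, Loop falls through when R1 is zero
            return mem


def unrolled_loop(mem, r1, f2):
    """Four elements per iteration; the last store sits in the delay slot."""
    while True:
        f0 = mem[r1]            # LD   F0, 0(R1)
        f6 = mem[r1 - 8]        # LD   F6, -8(R1)
        f10 = mem[r1 - 16]      # LD   F10, -16(R1)
        f14 = mem[r1 - 24]      # LD   F14, -24(R1)
        f4 = f0 + f2            # ADDD F4, F0, F2
        f8 = f6 + f2            # ADDD F8, F6, F2
        f12 = f10 + f2          # ADDD F12, F10, F2
        f16 = f14 + f2          # ADDD F16, F14, F2
        mem[r1] = f4            # SD   0(R1), F4
        mem[r1 - 8] = f8        # SD   -8(R1), F8
        mem[r1 - 16] = f12      # SD   -16(R1), F12
        r1 -= 32                # SUBI R1, R1, #32
        taken = r1 != 0         # BNEZ R1, Loop decides here
        mem[r1 + 8] = f16       # SD   8(R1), F16 runs whether or not taken
        if not taken:
            return mem


# Both versions should leave memory in the same state.
result_original = original_loop(make_array(8), 64, 10.0)
result_unrolled = unrolled_loop(make_array(8), 64, 10.0)
assert result_original == result_unrolled
```

Note that the delay-slot store is written before the `if not taken` check. That placement is what models "executed whether the branch was taken or not"; moving it inside a taken-only path would drop the final element on the last iteration.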