Branch Delay Slot Explanation

Code for DLX, not yet scheduled for the pipeline:
(from Hennessy and Patterson)
Loop:
	LD	F0, 0(R1)	; F0=array element
	ADDD	F4, F0, F2	; add scalar in F2
	SD	0(R1), F4	; store result
	SUBI	R1, R1, #8	; decrement pointer by 8 bytes (one doubleword)
	BNEZ	R1, Loop	; iterate if R1 not zero.

end.

Now the same loop after it has been unrolled and rescheduled:
(Assume it is unrolled 4 times, so R1 starts as a multiple of 32)

Loop:
	LD	F0, 0(R1)
	LD	F6, -8(R1)
	LD	F10, -16(R1)
	LD 	F14, -24(R1)
	ADDD	F4, F0, F2
	ADDD 	F8, F6, F2
	ADDD	F12, F10, F2
	ADDD 	F16, F14, F2
	SD	0(R1), F4
	SD 	-8(R1), F8
	SD	-16(R1), F12	; You might be tempted to do the last
				; SD at this point, but it is held back
				; to fill the branch delay slot
	SUBI	R1, R1, #32	; update the pointer (4 elements x 8 bytes)
	BNEZ	R1, Loop
	SD	8(R1), F16	; branch delay slot; 8-32 = -24

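The "SD 8(R1), F16" instruction is placed after the branch because it is a
legal choice for the delay slot.  It has to execute on every iteration whether
the branch is taken or not, and it does not modify R1, so the branch outcome
does not depend on it.  Because it now follows the SUBI, its offset changes
from -24 to 8 (8 - 32 = -24), so it still stores to the same address.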
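
As a sanity check on the reasoning above, here is a minimal Python sketch that
mimics both listings and compares the resulting memory.  It is not a DLX
simulator: memory is modeled as a dictionary keyed by byte address, and the
names run_original, run_unrolled, memory and scalar are made up for this
illustration.  It assumes the array occupies addresses 8 through 64, so R1
starts at 64, a multiple of 32.

```python
def run_original(mem, r1, f2):
    """Mimic the original loop: one array element per iteration."""
    mem = dict(mem)                # work on a copy
    while True:
        f0 = mem[r1]               # LD   F0, 0(R1)
        f4 = f0 + f2               # ADDD F4, F0, F2
        mem[r1] = f4               # SD   0(R1), F4
        r1 -= 8                    # SUBI R1, R1, #8
        if r1 == 0:                # BNEZ R1, Loop (falls through at zero)
            return mem


def run_unrolled(mem, r1, f2):
    """Mimic the unrolled loop, with the last store in the delay slot."""
    mem = dict(mem)                # work on a copy
    while True:
        f0 = mem[r1]               # LD   F0,  0(R1)
        f6 = mem[r1 - 8]           # LD   F6,  -8(R1)
        f10 = mem[r1 - 16]         # LD   F10, -16(R1)
        f14 = mem[r1 - 24]         # LD   F14, -24(R1)
        f4 = f0 + f2               # ADDD F4,  F0,  F2
        f8 = f6 + f2               # ADDD F8,  F6,  F2
        f12 = f10 + f2             # ADDD F12, F10, F2
        f16 = f14 + f2             # ADDD F16, F14, F2
        mem[r1] = f4               # SD   0(R1),   F4
        mem[r1 - 8] = f8           # SD   -8(R1),  F8
        mem[r1 - 16] = f12         # SD   -16(R1), F12
        r1 -= 32                   # SUBI R1, R1, #32
        taken = r1 != 0            # BNEZ R1, Loop decides here...
        mem[r1 + 8] = f16          # ...but SD 8(R1), F16 runs either way
        if not taken:
            return mem


# Hypothetical test data: 8 doublewords at byte addresses 8, 16, ..., 64.
memory = {8 * i: float(i) for i in range(1, 9)}
scalar = 2.5

original = run_original(memory, 64, scalar)
unrolled = run_unrolled(memory, 64, scalar)
print(original == unrolled)        # prints True
```

The store in the delay slot is written after the branch decision but before
the loop exit, which is the point of the example: it runs on the final,
not-taken pass as well, and with R1 already decremented the offset 8 reaches
the same address that -24 did before the SUBI.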