Loop Unrolling

It just keeps going... and going... and going...


Loop unrolling is used to minimize stalls that may be encountered inside loops, and also to get rid of the overhead of running unnecessary branch conditionals. Assume for a second that we knew we were going to run our loop a number of times that was a multiple of 4. What we could do, is instead of having one loop, we could have four copies of each line inside our loop, and just one big branch statement surrounding the whole thing. This requires more free registers, makes longer programs, but reduces overhead, and can sometimes allows statements inside loops to get around possible data stalls.

This is the idea behind loop unrolling. Now for an example, take the following code fragement:

Loop:
LD F0, 0(R1)
ADDD F2, F2, F0
MULTD F4, F2, F0
ADDD F6, F4, F2
SD 0(R1), F2
SUBI R1, R1, #8
BNEZ R1, Loop

Now let us see what this code fragment would look like when unrolled, remember we are assuming that we can be certaint that R1 will be a multiple of 32 upon entering our loop (32, since we are subtracting 8 each iteration, so 32 will allow for four iterations through the loop) So, unrolled, we have:

Loop:
LD F0, 0(R1)
ADDD F2, F2, F0
MULTD F4, F2, F0
ADDD F6, F4, F2
SD 0(R1), F2

LD F8, -8(R1)
ADDD F10, F10, F8
MULTD F12, F10, F8
ADDD F14, F12, F10
SD -8(R1), F10

LD F16, -16(R1)
ADDD F18, F18, F16
MULTD F20, F18, F16
ADDD F22, F20, F18
SD -16(R1), F18

LD F24, -24(R1)
ADDD F26, F26, F24
MULTD F28, F26, F24
ADDD F30, F28, F26
SD -24(R1), F26

SUBI R1, R1, #32
BNEZ R1, Loop

Which is quite a bit longer, and more difficult to understand from just looking at it, but does get the same job done with less overhead. After all, we have cut out 3 SUBI instructions, 3 BNEZ instructions, not to mention any possible stalls resulting from the BNEZ instructions. However, we can also make another improvement within the unrolled loop. If you look closely, you will notice that there are going to be stalls following each LD, ADDD, and MULTD instruction, because of RAW hazards. We can use the unrolled loop to get around this data hazard also, as shown below:

Loop:
LD F0, 0(R1)
LD F8, -8(R1)
LD F16, -16(R1)
LD F24, -24(R1)
ADDD F2, F2, F0
ADDD F10, F10, F8
ADDD F18, F18, F16
ADDD F26, F26, F24
MULTD F4, F2, F0
MULTD F12, F10, F8
MULTD F20, F18, F16
MULTD F28, F26, F24
ADDD F6, F4, F2
ADDD F14, F12, F10
ADDD F22, F20, F18
ADDD F30, F28, F26
SD 0(R1), F2
SD -8(R1), F10
SD -16(R1), F18
SD -24(R1), F26

SUBI R1, R1, #32
BNEZ R1, Loop

With this rescheduled version of the unrolled loop, we have not only eliminated six unecessary instructions, we have also eliminated the stalls caused by data dependencies. Of course, to do all this we had to use a lot of registers, and we have to be certaint that R1 will be a multiple of 32 coming into the loop. But, if these situations are satisfied, then we have much faster code through the use of loop unrolling.

That is the last way to get around control hazards that we will be covering in this document. There are other methods, many of them trying to improve on the prediction method's ability to correctly predict the branch outcome. But, now you should have a good undestanding of the major methods to avoid control hazards. There are a number of questions and execises that you may use to test your knowledge, you may reach these by clicking on the link below.

PREVIOUS: Delayed Branch
NEXT: The Problems