The objective of this module is to go through a clock level simulation for a sequence of instructions that undergo dynamic scheduling. We will look at how the data structures maintain the relevant information and how the hazards are handled.
We will first of all do a recap of dynamic scheduling. Dynamic Scheduling is a technique in which the hardware rearranges the instruction execution to reduce the stalls, while maintaining data flow and exception behavior. The advantages of dynamic scheduling are:
• It handles cases when dependences are unknown at compile time
– (e.g., because they may involve a memory reference)
• It simplifies the compiler
• It allows code compiled for one pipeline to run efficiently on a different pipeline
• Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling
In a dynamically scheduled pipeline, all instructions pass through the issue stage in order; however they can be stalled or bypass each other in the second stage and thus enter execution out of order. Instructions will also finish out-of-order.
In order to allow instructions to execute out-of-order, we need to do some changes to the decode stage. So far, we have assumed that during the decode stage, the MIPS pipeline decodes the instruction and also reads the operands from the register file. Now, with dynamic scheduling, we bring in a change to enable out-of-order execution. The decode stage or the issue stage only decodes the instructions and checks for structural hazards. After that, the instructions will wait until there are no data hazards, and then read operands. This will enable us to do an in order issue, out of order execution and out of order completion.
The entire book keeping is done in hardware. The hardware considers a set of instructions called the instruction window and tries to reschedule the execution of these instructions according to the availability of operands. The hardware maintains the status of each instruction and decides when each of the instructions will move from one stage to another. The dynamic scheduler introduces register renaming in hardware and eliminates WAW and WAR hazards.
In the Tomasulo’s approach, the register renaming is provided by reservation stations (RSs). Associated with every functional unit, we have a few reservation stations. When an instruction is issued, a reservation station is allocated to it. The reservation station stores information about the instruction and buffers the operand values (when available). So, the reservation station fetches and buffers an operand as soon as it becomes available (not necessarily involving register file). This helps in avoiding WAR hazards. If an operand is not available, it stores information about the instruction that supplies the operand. The renaming is done through the mapping between the registers and the reservation stations. When a functional unit finishes its operation, the result is broadcast on a result bus, called the common data bus (CDB). This value is written to the appropriate register and also the reservation station waiting for that data. When two instructions are to modify the same register, only the last output updates the register file, thus handling WAW hazards. Thus, the register specifiers are renamed with the reservation stations, which may be more than the registers. For load and store operations, we use load and store buffers, which contain data and addresses, and act like reservation stations.
The three steps in a dynamic scheduler are listed below
• Get next instruction from FIFO queue
• If available RS, issue the instruction to the RS with operand values if available
• If a RS is not available, it becomes a structural hazard and the instruction stalls
• If an earlier instruction is not issued, then subsequent instructions cannot be issued
• If operand values are not available, the instructions will wait in the RSs looking at CDBs for operands
• When operand becomes available on the CDB, store it in any reservation station waiting for it
• When all operands are ready, the instruction is executed by the respective functional unit
• Loads and store are maintained in program order through the effective address
• No instruction allowed to initiate execution until all branches that proceed it in program order have completed
• Write result
• Write result on CDB into reservation stations and store buffers
• Stores must wait until address and value are received
The dynamic scheduler maintains three data structures – the reservation station, a register result data structure that keeps of the instruction that will modify a register and an instruction status data structure. The third one is more for understanding purposes. The reservation station components are as shown below:
Name —Identifying the reservation station
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk—Value of Source operands
– Store buffers have V field, result to be stored
Qj, Qk—Reservation stations producing source registers (value to be written)
– Store buffers only have Qi for RS producing result
Busy—Indicates reservation station or FU is busy
Register result status—Indicates which functional unit will write each register, if one exists. It is blank when there are no pending instructions that will write that register. The instruction status gives the status of each instruction in the instruction window.
Figure 17.1 shows the organization of the Tomasulo’s dynamic scheduler. Instructions are taken from the instruction queue and issued. During the issue stage, the instruction is decoded and allocated an RS entry. The RS station also buffers the operands if available. Otherwise, the RS entry marks the pending RS value in the Q field. The results are passed through the CDB and go to the appropriate register, as dictated by the register result data structure, as well as the pending RSs.
Having looked at the basics of the Tomasulo’s dynamic scheduler, we shall now discuss how dynamic scheduling happens for a sequence of instructions. We are only considering a basic block without branches.
Figure 17.2 shows the sequence of instructions considered along with the three data structures maintained. We shall assume three Add reservation stations and two Mul reservation stations. There are also three Load buffers. Assume that Add has a latency of 2, Mul has a latency of 10 and Div has a latency of 40.
During the first clock cycle, Load1 is issued. A load buffer entry is allocated, its status is now busy and the contents of R2 needed for the effective address calculation and the value 34 is stored in the buffer. The register result status shows that register F6 is to be written by the instruction identified by Load buffer1.
During the second clock cycle, Load1 moves from the issue stage to the execution stage. The effective address (34+R2) is calculated and stored in the Load buffer1. Meanwhile, instruction 2, i.e. Load2 is issued. It is allocated Load buffer entry2, and the details of the effective address (45, R3) are stored there. Load buffer2 is now busy. The register result status shows that register F2 is to be written by the instruction identified by Load buffer2.
During the third clock cycle, Load1 does the memory access and completes execution. For Load2, the effective address (45+R3) is calculated and stored in the Load buffer2. Meanwhile, instruction 3, i.e. Mul is issued. It is allocated Mult1 entry, the contents of register F4 are fetched and stored in the place of Vk. The other operand F2 cannot be fetched now and has to wait for Load2 to finish. This is indicated in the Qj field as Load2. Mult1 reservation station is marked busy. The register result status shows that register F0 is to be written by the instruction identified by Mult1. This scenario is indicated in Figure 17.3.
During the fourth clock cycle, Load1 writes the result on the CDB and the corresponding Load1 entry is made free. This data on the CDB writes into F6. Load2 does the memory access and completes execution. Meanwhile, instruction 4, i.e. Sub is issued. It is allocated Add1 entry, the contents of register F6 are fetched and stored in Vj. The other operand F2 cannot be fetched now and has to wait for Load2 to finish. This is indicated in the Qk field as Load2. Add1 reservation station is marked busy. The register result status shows that register F8 is to be written by the instruction identified by Add1. Note that Mul is still waiting for the operand and is stalled.
During the fifth clock cycle, Load2 writes the result on the CDB and the corresponding Load2 entry is made free. This data on the CDB writes into F2. This data is taken up by the Mult1 reservation station and the Add1 reservation station that have been waiting. These two instructions, Sub and Mul are now ready for execution. Meanwhile, instruction 5, i.e. Div is issued. It is allocated Mult2 entry, the contents of register F6 are fetched and stored in Vk. The other operand F0 cannot be fetched now and has to wait for Mul to finish. This is indicated in the Qj field as Mult1. Mult2 reservation station is marked busy. The register result status shows that register F10 is to be written by the instruction identified by Mult2 (Div instruction).
During the sixth clock cycle, both the Mul instruction and the Sub instruction have gone through one execution cycle. Meanwhile, instruction 6, i.e. Add is issued. It is allocated Add2 entry, the contents of register F2 are fetched and stored in Vk. The other operand F8 cannot be fetched now and has to wait for Sub to finish. This is indicated in the Qj field as Add1. Add2 reservation station is marked busy. The register result status shows that register F6 is to be written by the instruction identified by Add2 (Add instruction). This scenario is depicted in Figure 17.4.
Note that there is a name dependency for F6 for the Add instruction with both the Div as well as the Sub instruction, which might lead to WAR hazards. However, since the operand F6 for both the Sub and Div instructions have already been read and buffered in the respective reservation stations, the name dependency has been resolved by register renaming and thus avoids the possible WAR hazards.
The Sub instruction finishes execution in the seventh clock cycle, since it has a latency of 2. Nothing else happens in this clock cycle. It writes into the CDB in the eighth clock cycle. The result of Sub is written into the register F8 and it also writes into the reservation station Add2 of the Add subtraction. The Add1 entry is now freed up.
Clock cycle nine – Add and Mul are still executing. During clock cycle ten, Add finishes execution and it writes into the CDB in clock cycle eleven. This data goes into register F6 and the Add2 reservation station entry is freed up. Note that even though there is a name dependency on register F6 with the Div instruction, this does not cause a problem. The Div instruction‘s reservation station has already got the original data of F6 through the Load1 instruction.
During clock cycle fifteen, Mul finishes execution (latency of 10 – clock cycle six to fifteen) and it writes the result in clock cycle sixteen. The result goes to register F0 and the reservation station Mult2 waiting for it (Div instruction). The Mult1 entry is freed up. Div starts executing in clock cycle seventeen and finishes execution in fifty six (latency of 40). It writes the result in fifty seven and frees up the Mult2 entry. The complete schedule is shown in Figure 17.5. Observe the instructions are issued in-order, execute out-of-order and complete execution out-of-order.
The advantages of the Tomasulo’s dynamic scheduling approach are as follows:
(1) The distribution of the hazard detection logic:
– Tomasulo’s approach uses distributed reservation stations and the CDB
– If multiple instructions are waiting on a single result, and each instruction has other operand, then instructions can be released simultaneously by broadcast on CDB
– If a centralized register file were used, the units would have to read their results from the registers when register buses are available.
(2) The elimination of stalls for WAW and WAR hazards
– The buffering of the operands in the reservation stations, as soon as they become available, does the mapping between the registers and the reservation stations. This renaming process avoids WAR hazards. Even if a subsequent instruction modifies the register file, there is no problem because the operand has already been read and buffered.
– When two or more instructions write into the same register, leading to WAW hazards, only the latest register information is maintained in the register result status. This renaming avoids WAW hazards.
However, this approach is not without drawbacks. The following are the drawbacks:
o The hardware becomes complicated with the book keeping done and the CDB and the associative compares
· Many associative stores (CDB) at high speed
· Performance limited by Common Data Bus
o Each CDB must go to multiple functional units. This leads to high capacitance and high wiring density
o With only one CDB, the number of functional units that can complete per cycle is limited to one!
§ Multiple CDBs will lead to more FU logic for parallel associative stores
· Non-precise interrupts!
o The out-of-order execution will lead to non-precise interrupts. We will address this later when we discuss speculation.
We have not discussed dependences with respect to memory locations. They will also have to be handled properly. A load and a store can safely be done out of order, provided they access different addresses. If a load and a store access the same address, then either
· the load is before the store in program order and interchanging them results in a WAR hazard, or
· the store is before the load in program order and interchanging them results in a RAW hazard.
Similarly, interchanging two stores to the same address results in a WAW hazard. Hence, to determine if a load can be executed at a given time, the processor can check whether any uncompleted store that precedes the load in program order shares the same data memory address as the load. Similarly, a store must wait until there are no unexecuted loads or stores that are earlier in program order and share the same data memory address.
To summarize, we have discussed an example of dynamic scheduling, wherein the hardware does a dynamic reorganization of code at run time. For a given sequence of instructions, we have looked at a clock level simulation. The book keeping done and the various steps used have been elaborated. We have not handled branches in this example.
|Web Links / Supporting Materials|
|Computer Organization and Design – The Hardware / Software Interface, David A. Patterson and John L. Hennessy, 4th Edition, Morgan Kaufmann, Elsevier, 2009.|
|Computer Architecture – A Quantitative Approach , John L. Hennessy and David A. Patterson, 5th Edition, Morgan Kaufmann, Elsevier, 2011.|