
Please Note:

 

The page numbers refer to pages in the textbook used in CMSC411 at the University of Maryland at College Park.

 

"Computer Architecture a Quantitative Approach", Hennessy & Patterson, 2nd Ed.

1996. Morgan Kaufmann Publishers, INC. San Francisco, California.

 

Credit for the True/False and short-answer questions goes to the above book and to the course instructor, Dr. Michelle Hugue.

 

 

A closer look at the project and its objectives:

 

The project deals mostly with Toy Benchmarks, specifically a C implementation of a game called "peg". The program (provided on this page) executes certain pieces of code, such as its loops, numerous times. The idea is to run the code and measure its runtime on different machines, then modify it, test it again, and finally compare the results.
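For measuring the runtime, something as simple as the standard C clock() routine wrapped around the code under test will do. Here is a minimal sketch; run_peg() below is a hypothetical stand-in for the actual peg code provided on this page:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in for the peg program's main loops. */
    static void run_peg(void)
    {
        volatile long sum = 0;
        long i;
        for (i = 0; i < 50000000L; i++)
            sum += i % 7;
    }

    int main(void)
    {
        clock_t start, end;

        start = clock();      /* CPU time consumed so far */
        run_peg();            /* the code being benchmarked */
        end = clock();

        printf("CPU time used: %.2f seconds\n",
               (double)(end - start) / CLOCKS_PER_SEC);
        return 0;
    }

Note that clock() measures CPU time rather than wall-clock time, which matters when other users are sharing the machine.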

 

But since considering only the code and the machines when comparing run times would be a naïve approach, we will also take other aspects into consideration, such as the architecture of the machine the code was tested on, the memory-versus-cache issue, the compiler version and its characteristics, the presence of other users on the network, etc.

 

We hope this experiment gives the reader insight into the runtime differences between architectures. It should also show how benchmarks can be biased and therefore trick you into buying a machine that doesn’t really perform as well as advertised.

 

Another important thing to know is that clock speed is not the only factor to consider when you buy a new computer. Why? Simply because there are many other important factors too. If someone tells you a certain machine is better than another only because it has a higher clock rate, you should doubt the claim and ask more specific questions about the actual architecture of that machine. There is no standard definition of efficiency in the land of computers: each machine can be efficient at one task and still do poorly on another. So pay attention to the details, and if you still haven’t read the beautiful "Computer Architecture: A Quantitative Approach" by Hennessy & Patterson (2nd Edition), this might be a good time to look at it and notice how they introduce the CPU time equation as a more accurate measure of the performance of a machine.

CPUTime = Instruction count * Clock cycles per instruction * Clock cycle time

(page 30)
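As a quick sanity check of the equation, here is a small C sketch; the instruction count, CPI, and cycle time below are made-up illustrative numbers, not figures from the book:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers only */
        double instruction_count = 2.0e9;   /* 2 billion instructions    */
        double cpi               = 1.5;     /* avg clock cycles per instr */
        double cycle_time        = 5.0e-9;  /* 5 ns cycle (200 MHz clock) */

        /* CPUTime = Instruction count * CPI * Clock cycle time */
        double cpu_time = instruction_count * cpi * cycle_time;

        printf("CPU time = %.2f seconds\n", cpu_time);  /* prints 15.00 */
        return 0;
    }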

 

Questions & Answers:

 

Q: What is a benchmark?

A: Benchmarks are programs run on a system in order to evaluate and predict the performance of that system and/or machine.

 

Q: Are there different types of benchmarks?

A: Yes: Toy benchmarks and Synthetic benchmarks.

Toy benchmarks are usually 10 to 100 lines of code and run on almost any computer. Quicksort is a classic example (a sketch follows below).

Synthetic benchmarks match "the average frequency of operations and operands of a large set of programs" (page 20). Whetstone and Dhrystone are among the most popular benchmarks of this type.
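As a concrete toy benchmark in the spirit of the above, here is a quicksort sketch in C (the array size and random data are arbitrary choices):

    #include <stdio.h>
    #include <stdlib.h>

    /* In-place recursive quicksort on an integer array. */
    static void quicksort(int *a, int lo, int hi)
    {
        int pivot, i, j, t;

        if (lo >= hi)
            return;
        pivot = a[(lo + hi) / 2];
        i = lo;
        j = hi;
        while (i <= j) {                    /* partition around the pivot */
            while (a[i] < pivot) i++;
            while (a[j] > pivot) j--;
            if (i <= j) {
                t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        quicksort(a, lo, j);                /* sort the two halves */
        quicksort(a, i, hi);
    }

    int main(void)
    {
        int n = 100000, i;
        int *a = malloc(n * sizeof(int));

        for (i = 0; i < n; i++)             /* pseudo-random input */
            a[i] = rand();
        quicksort(a, 0, n - 1);
        printf("first: %d  last: %d\n", a[0], a[n - 1]);
        free(a);
        return 0;
    }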

 

Q: Are there other types of programs used in evaluating systems’ performances ?

A: Yes. Real programs and Kernels. According to Hennessy and Patterson, Real programs and Kernels give even more accurate measurements. Examples of Real programs include C compilers, text-processing software, and CAD tools. Kernels are small, key pieces extracted from Real programs and used for evaluation. Unlike Real programs, users don’t run Kernels in everyday work; they are purely evaluation tools.

 

Q: What are benchmark suites?

A: Putting together Benchmark Suites is a trend in today’s industry. A suite is a collection of benchmarks that tries to measure the performance of processors on different applications. An example is the popular SPEC92 benchmark suite used in the workstation and server markets. (For more information see Figures 1.9 on page 22 and 1.18 on page 49 of H&P.)

 

Q: How useful are these benchmarks?

A: Benchmarks are useful if used within a proper framework. In other words, if complete descriptions of the machines, the compiler flags, etc. are provided, their evaluations can be unbiased and therefore useful; however, they can still be tricked. For example, SPEC89 had a small kernel (matrix300), which consisted of heavy matrix computations, and 99% of the execution time was spent in a single line! When the IBM Powerstation 550 used a technique called "blocking", its performance improved by a factor of 9.

Summary: the benchmark (SPEC89) ended up testing compiler performance, which is not "a good indication of overall performance, nor of this particular optimization." (page 48)
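For the curious, "blocking" (also called tiling) restructures a loop nest so that it works on sub-blocks small enough to stay in the cache. A minimal sketch on matrix multiply, assuming 300x300 matrices (as the name matrix300 suggests) and an arbitrary 32x32 tile size:

    #define N  300   /* matrix dimension */
    #define BS 32    /* block (tile) size -- a tuning parameter */

    /* Blocked matrix multiply, C = C + A*B. Working on BS x BS tiles
       keeps pieces of all three matrices in the cache while the three
       innermost loops reuse them, instead of streaming all of B
       through the cache for every row of A. */
    void matmul_blocked(double a[N][N], double b[N][N], double c[N][N])
    {
        int ii, jj, kk, i, j, k;

        for (ii = 0; ii < N; ii += BS)
            for (jj = 0; jj < N; jj += BS)
                for (kk = 0; kk < N; kk += BS)
                    for (i = ii; i < ii + BS && i < N; i++)
                        for (j = jj; j < jj + BS && j < N; j++)
                            for (k = kk; k < kk + BS && k < N; k++)
                                c[i][j] += a[i][k] * b[k][j];
    }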

 

Test Your Knowledge

True/False & Short Questions/Answers

 

Whenever a pipelined machine is mentioned, a 5-stage pipeline with forwarding is assumed, where writing to memory is done in the first half of a clock cycle and reading occurs in the second half (unless stated otherwise).

. Speedup is limited by the fraction of the time the faster mode can be used - True

. Addressing modes related to the Program Counter are PC-relative addressing modes (e.g. control instructions) - True

. If lower code size is desired, variable-length instructions are better than fixed-length ones - True

. High level optimizations are done in loops - False

** They are done on the source, and their output is fed to later optimization passes

. Register Allocation is a graph coloring problem with 16 nodes - True

. Adding to ALU instructions affects the total CPI - True

. The pipe’s depth is equal to: Clock cycle (unpipelined) / Clock cycle (pipelined) - True

. The instruction count and the total execution time of a program are independent - False

. The number of clock cycles a program takes on a pipelined machine depends on the order of its instructions - True

. Loop unrolling eliminates stalls - False

** Loop unrolling alone doesn’t necessarily eliminate stalls; it replicates the loop body so that each pass through the (now shorter) loop performs several iterations’ worth of work. The unrolled code must still be scheduled to remove stalls (a sketch follows this list).

. Instruction rescheduling reduces the number of stalls - True

. Structural hazards are resource contention problems - True

. A structural hazard can only happen when in a given clock cycle an instruction is being fetched and another instruction is accessing the memory for a read or write. - False

** We could have other structural hazards as well, for example two instructions being in the memory stage at the same time. This situation especially arises when the pipeline has floating-point functional units.

. If the clock cycle of an unpipelined machine is equal to that of a pipelined one, then the following holds:

Throughput(pipelined) = Throughput(unpipelined)

- True

. Pipelining increases the throughput but doesn’t improve the time it takes to execute one instruction. - True

. If there were no stalls, pipelining would improve the CPI by a factor equal to the depth of the pipeline. - True

. When dealing with branches, the deeper the pipeline, the worse the branch penalty in clock cycles. - True

. The instruction in a delayed branch’s delay slot is executed regardless of the outcome of the branch. - True

. Pipeline interlocks detect hazards and stall the pipeline. - True

. Data hazards occur when the pipeline changes the order of read/write accesses to operands relative to the order of the original program, resulting in wrong answers. - True

. Bypassing and short circuiting are two different ways of handling data hazards. - False

** They are two names for the same technique (also called forwarding), not two different ones

. In our 5-stage DLX pipeline, if an ALU instruction immediately follows a load and depends on it, even forwarding cannot eliminate the stall. - True

. Does the longer latency of floating point operations increase the number of RAW hazards? Yes. (page 190)

. What are the advantages of having two separate register files for integers and floating point? It doubles the number of registers, and more writes can happen during the write-back stage without adding to the number of ports.

. Scheduling can be done dynamically by the compiler. - False

** The compiler does schedule, but statically

. Scheduling can be done dynamically by the hardware. - True

. Scheduling depends on the amount of ILP (Instruction Level Parallelism) and latencies of the functional units. - True

. Flushing the pipe is deleting instructions in the pipeline. - False

** Flushing (or freezing) the pipe means holding or deleting any instruction after the branch until the branch destination is known.

. When dependent instructions can not be reordered they can cause stalls. - True

. Increasing the number of pipeline stages improves the performance. - False

** Not necessarily; the resulting clock cycle rate must be considered too. (page 210)

. Data dependences are a property of pipelines. - False

** According to H&P, they are properties of programs. (page 230)

. Scheduling prevents hazards caused by data dependences. - True

. Register renaming is a technique used in dealing with WAR and WAW hazards. - True

. Scoreboard is a dynamic scheduling technique. - True

. The scoreboard has a complex hardware implementation and has to deal with out-of-order execution of instructions. - True

. WAR hazards in a scoreboard are detected in the ID (issue) stage. - False

** They are detected at the Write Result stage

. A scoreboard is limited in eliminating stalls by the following:

. The amount of parallelism available among instructions

. The window size, i.e. the number of instructions it can look at

. The number and types of functional units

. The presence of anti- and output dependences

. In Tomasulo’s algorithm the hazard detection & execution control are centralized. - False

** They are centralized in the scoreboard algorithm and distributed in Tomasulo’s.

. Reservation Stations deal with register renaming in Tomasulo’s. - True

. Scoreboard uses the CDB (Common Data Bus) to transfer data between functional units and the reservation stations. - False

** The CDB is used in Tomasulo’s

. An independent instruction can bypass the stalled instructions in which algorithm? Tomasulo’s or Scoreboard? - Tomasulo’s

. If there are a limited number of registers and scheduling is hard, which of the two algorithms is the better choice? - Tomasulo’s

. Superscalar processors have dynamic issuing capabilities. - True

. VLIW processors have static issuing capabilities. - True

** In a VLIW machine the compiler packs operations into the long instructions, so issue is determined statically.

. The HP PA-7100 is a superscalar processor. - True

. Throughput is inversely proportional to time. - True

. Interleaved memory takes advantage of potential parallelism. - True

. Compare the following for split memory (instruction & data) and unified memory:

Miss rate - higher for split memory

Memory access time - lower for split memory

. The PowerPC 620 is the first dynamically scheduled machine. - False

** Dynamic scheduling goes back to the CDC 6600 scoreboard and the IBM 360/91 (Tomasulo’s algorithm); the PowerPC 620 is simply a recent dynamically scheduled machine.

. In the memory hierarchy introduced in chapter 5 (with L0 and L1 caches), bottom up - each level maps addresses from a smaller, faster memory to a bigger, slower memory. - False

** The reverse is true: each level maps addresses from a larger, slower memory to a smaller, faster one.

. Increasing associativity (with total cache size fixed) has the following impacts:

increases the number of blocks per set. - True

increases the size of the tag. - True

decreases the size of the index. - True (see the address-breakdown sketch after this list)

. Since we want to make the common case fast, caches should be made fast for reads. - True

. Why do writes to memory take more time than reads? Because writing a block cannot begin until the tag check confirms the address matches, whereas a read can proceed in parallel with its tag check. (page 379)

. With the write-through technique, read misses don’t result in writes to the lower level, and it is easier to implement. - True

. With the write-back technique, writes happen at cache speed and use less memory bandwidth. - True

. Direct mapped and 1-way set associative are the same thing. - True

. The miss rate of instruction caches is higher than that of data caches. - False

** The reverse holds. The miss rate of instruction caches is lower than that of data caches.

. Hit time affects the clock rate of the processor. - True

. A DRAM is inaccessible when it is refreshing itself. - True

. In an interleaved memory, the bank a word maps to is given by: address modulo 4, where 4 is the number of banks. - True

. An instruction set architecture determines the hardware. - False

. DLX can only be used with a 5 stage pipeline. - False

. MIPS and MFLOPS indicate the average execution time of the systems they describe. - False

** They are not accurate measurements of the average execution time

. RISC machines use variable length instructions. - False

** They use fixed-size instructions

. The instruction set architecture has no role in determining the average number of clocks per instruction. - False

** It does play a role

. ALU operations have no memory operands in Stack machines. - False

** ALU operations have no explicit operands at all in Stack machines; they implicitly operate on the top of the stack

. If program p1 has a lower execution time on machine A than on machine B, this implies that machine A has a faster clock than machine B. - False

** There are other factors to consider as well

. Main memory organization has no influence on the cache organization, the width of the data bus, or the handling of virtual memory. - False

. Multiple issue machines are faster than single-issue machines regardless of the code being executed. - False

** They do depend on the code, for example clock cycles might be longer!

. In VLIW systems, the programmer is required to write his code in chunks that can be processed independently but simultaneously by the CPU. - True

. The ideal CPI of a pipeline is the quotient of the number of pipeline stages over the average length of time an instruction spends in the pipe. - False

. All data hazards can be handled without causing stalls. - False
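The loop-unrolling sketch promised above: a simple array sum unrolled by a factor of 4 (the function name and unroll factor are arbitrary choices):

    /* Sum an array with the loop unrolled by 4. Assumes n is a multiple
       of 4; a real version would add a cleanup loop for the remainder. */
    double sum_unrolled(const double *a, int n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;

        for (i = 0; i < n; i += 4) {   /* one loop branch per 4 elements */
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        /* Separate accumulators keep the four adds independent, but note
           that unrolling by itself does not remove stalls -- the unrolled
           statements must also be scheduled appropriately. */
        return s0 + s1 + s2 + s3;
    }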

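And the address-breakdown sketch promised above: for a cache of fixed total size, this prints how the tag grows and the index shrinks as associativity increases (the 8 KB size, 32-byte blocks, and 32-bit addresses are illustrative parameters only):

    #include <stdio.h>

    int main(void)
    {
        unsigned cache_bytes = 8192;    /* 8 KB cache */
        unsigned block_bytes = 32;      /* 32-byte blocks -> 5 offset bits */
        unsigned offset = 5, assoc;

        for (assoc = 1; assoc <= 8; assoc *= 2) {
            unsigned sets = cache_bytes / (block_bytes * assoc);
            unsigned index = 0, s = sets;
            unsigned tag;

            while (s > 1) {             /* index bits = log2(#sets) */
                index++;
                s >>= 1;
            }
            tag = 32 - index - offset;  /* 32-bit addresses assumed */
            printf("%u-way: %3u sets, index = %u bits, tag = %u bits\n",
                   assoc, sets, index, tag);
        }
        return 0;
    }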