
System                     Run Time (seconds)  Compiler  Users  OS
Pentium 133, 32M RAM              133          gcc       1      BSDI
Sun Sparc 20, 32M RAM             234          gcc       1      SunOS
Sun Sparc 5, 32M RAM              287          gcc       1      SunOS
HP 712/100, 64M RAM               185          c89       1      HPUX
Pentium 75, 24M RAM               282          VC++      1      Win95
Intel 486/66, 16M RAM             618          gcc       1      Linux
Pentium 166, 32M RAM              132          VC++      1      BSDI
Pentium 133, 32M RAM              198          VC++      1      Win95
Pentium Pro 200, 32M RAM           82          VC++      1      Win95
Cyrix P166+, 32M RAM              181          VC++      1      Win95
Cyrix P166+, 32M RAM              158          gcc       1      Linux
Sun Sparc Ultra-1                 309          gcc       13     SunOS
Pentium 133, 32M RAM              208          TC++      1      Win95
Cyrix P166+, 32M RAM              167          TC++      1      Win95
DEC Alpha (cluster)               490          gcc       -      SunOS(?)


Discussion

Surprisingly, the benchmark reported slower run times on Sun SparcStations than on Intel Pentium systems. Does this mean that one should replace a network of Suns with Pentiums? Hardly. An analysis of the source code reveals that the benchmark deals only with integers. It says nothing about floating-point operations, device output, disk access, or memory access time. Since the integer operations are mainly incrementing and array indexing, it is likely that only the add and subtract hardware of each system is being exercised.

While the Pentium systems had faster run times, note that the operating systems differ, which can also have a profound impact on run time. The Cyrix run times illustrate this. The last two Cyrix entries were measured on the same computer, set up to dual-boot two operating systems. Under Windows 95, the Cyrix P166+ had a run time of 181 seconds, while under Linux it was 158 seconds. Since both runs used the same hardware, one might like to conclude that Windows (aka Windoze) is 15% slower than UNIX-based operating systems. That would be an unsafe statement. Even the narrower conclusion that Windows runs integer-intensive benchmarks 15% slower than UNIX-based operating systems would be incorrect. Why? Once again, these conclusions leave out crucial variables. What about TSRs (terminate-and-stay-resident programs) that may be running? Windows may have been running an anti-virus program that scanned the benchmark at launch, or even during the run, adding overhead to the final results. The same applies to the Sun systems: despite an equal user load, the Suns were run on a network, with many daemons in the background undoubtedly consuming CPU time, and with network traffic being broadcast and routed through some of these systems.

Perhaps the most critical variable not yet considered is the compiler. Note that on the Cyrix system the benchmark was compiled with gcc under Linux, but with Visual C++ under Windows. This crucial relationship between compilers and program run time will be discussed later on.

A fascinating point about the Cyrix system should be considered. Despite the name "P166+", the Cyrix system does not have a 166 MHz clock rate; according to the technical documents, the P166+ actually runs at 133 MHz. A comparison of the 133 MHz Cyrix with the 133 MHz Pentium (same OS, same compiler, same memory) shows the Cyrix run time to be 9% faster. Since the benchmark was compiled with the same compiler, under the same OS, with the same amount of memory, on systems with the same clock rate, one would expect the run times to be equal. But recall the CPU performance equation:

CPU Time = IC × CPI × Clock Cycle Time

We know the clock cycle times are equal, since the clock rates are equal and rate = 1 / cycle time. Since the same compiled program was run on both systems with the same input, we know the instruction count (IC) is also equal. Therefore, assuming equal background work loads, we may conclude that the clock cycles per instruction (CPI) differ between the two systems, which accounts for the difference in run time.


Cross-Platform Compiler Issues

A point worth noting is that the run times of an identical program compiled with different compilers are unequal. Recall the above data:

Pentium 133, 32M RAM              208          TC++      1      Win95
Cyrix P166+, 32M RAM              167          TC++      1      Win95
Pentium 133, 32M RAM              198          VC++      1      Win95
Cyrix P166+, 32M RAM              181          VC++      1      Win95

We established earlier that the Cyrix system runs the benchmark faster due to a lower CPI. Dividing the run time of the Pentium by the run time of the Cyrix (208/167), one may conclude that the Cyrix is 1.246 times faster, or, more specifically, that the Cyrix runs the benchmark 1.246 times faster than the Pentium. Both of these conclusions are wrong. Consider the last two run times: everything is identical except that the benchmark was compiled with Microsoft Visual C++. Since both programs were compiled with the same settings (standard compilation with no optimizations), we would expect the Cyrix to once again run 1.246 times faster, yet dividing 198 by 181 yields only 1.094! There is more than a 10% difference in relative run time between the two compilers across systems. Having already established a CPI difference between the two systems, this suggests that the compilers generate substantially different code.

This small example shows the importance of compilers to run time. When writing a program where run time is critical, a careful analysis of compilers should be made. It is also evident from these comparisons that benchmarks are vulnerable to bias. A company selling Cyrix processors could claim that the Cyrix shaves roughly 20% off the Pentium's run time, which is what the TC++ results suggest. Yet with the benchmark compiled under Visual C++, competitors could argue that the Cyrix is only 9% faster.


Compiler Optimizations

Perhaps the most extreme example of compilers and their impact on program run time is the use of optimizations. Consider the following data:

System                     Compiler  Unoptimized Run Time (s)  Optimized Run Time (s)  OS
Pentium 133, 32M RAM       VC++                198                      61              Win95
Cyrix P166+, 32M RAM       VC++                181                      84              Win95

For the above data, the benchmark was compiled using Microsoft Visual C++, first unoptimized, then optimized for maximum speed. Remarkably, the optimized version is 3.25 times faster on the Pentium; on the Cyrix processor the program runs 2.15 times faster. Interestingly, while the Cyrix consistently ran the benchmark faster in all previous tests, here it runs the optimized code slower! Once again, a company selling Pentium systems could compile the benchmark with speed optimization and claim that the Pentium is nearly 40% faster than the Cyrix (84/61 ≈ 1.38). Or, even worse, the company could compare the optimized Pentium run time to the unoptimized Cyrix run time (after all, it is still the same benchmark) and claim that the Pentium is nearly three times as fast (181/61 ≈ 2.97). Once again we see a great disparity in system run times that can easily be exploited by computer vendors.


Code and Hardware Issues

Aside from algorithm efficiency, coding style can have another large impact on run times. For example, in the benchmark, certain variables are manipulated hundreds of millions of times. In a computer, aside from the cache, the fastest storage locations are the registers. By preceding a variable declaration with the keyword "register", the programmer asks the compiler to place that variable in a general-purpose register. So if an integer is declared as "register int", and the compiler is indeed able to place the variable in a register, access time will be minimal. Knowing the speed difference between registers and memory, we can effectively force a variable into memory by declaring it wider than 32 bits, hence too large for a (32-bit) register. This can be accomplished by declaring the variable as a 64-bit integer.* Note the run time differences when the primary counter and index variables are changed:

System                     Compiler  Run Time - int64 (s)  Run Time - register (s)  OS
Pentium 133, 32M RAM       VC++               381                    198            Win95

Using register declarations, the benchmark runs nearly twice as fast as with 64-bit integers. Another technique is the use of inline functions, which reduces jumps and eliminates call overhead. Examine the data:

System                     Compiler  Run Time - function call (s)  Run Time - inline (s)  OS
Pentium 133, 32M RAM       VC++                  248                        198           Win95

The function-call/inline comparison also shows the importance of coding and how it affects run time. Algorithm efficiency is often stressed, and rightly so - in the case of sorts, run times can differ from O(n^2) to O(n log n) depending on the algorithm used. Considering the above data, however, a good programmer will not limit his or her considerations to algorithm design, but will weigh code and hardware aspects as well.

*Assuming that 64 bit integers are being placed in memory. It's possible they could be split between two registers, but regardless, increasing the size of the integer will cause significant overhead, as shown by the results.


Concluding Statements

From the data presented, one should be aware of the fallacy of using benchmarks as the sole determinant of system speed. It is clear how benchmark results can be tailored without even changing the code or the hardware. In the search for the "unbiased benchmark", it is necessary to include as much information as possible - the amount and type of memory, the type of cache, the processor, the operating system, background programs, the compiler and its version, and the compiler settings. All of these factors and more contribute to run time; the less information given, the more subject the results may be to bias. The DEC Alpha run times prove the point - the DEC Alpha is probably the fastest processor of all those tested, yet according to the results it is slower than the Pentium 75! This is undoubtedly due to very high user loads and other processes draining CPU time.

In purchasing a computer, one should carefully consider the type of work he or she will be doing. Will it be computationally heavy? Will it be graphics intensive? If possible, the buyer should examine how each system runs the applications that will be used most frequently. For example, a computer rated highly for graphics speed may be a very poor choice for a file server, or a waste of money if it will only be used for word processing.

In summary, it should be clear from this report that there is no perfect way of measuring system performance. A buyer should be skeptical of benchmark results, and should not hesitate to inquire about any unstated details behind them.


"There's lies, damn lies, statistics, and benchmarks."