The introduction of the MMX class of Pentiums marks the most interesting architectural change to the x86 since its introduction (and gives us the chance to watch commercials with lab technicians dancing to "Play that Funky Music"). In light of the changing nature of the computations being done by current software, Intel developed the MMX technology primarily to "make the common case fast", optimizing the performance of repetitive integer operations. It also didn't hurt for Intel to jump on the multimedia bandwagon, proclaiming that the new MMX chips would substantially improve the performance of multimedia applications (MMX is officially not an acronym, but it is generally accepted that it stands for MultiMedia eXtensions). The usefulness of the MMX technology has been a subject of much debate, and as will be explored below, the implementation was hampered by the need for backwards compatibility. Ultimately the technology is beneficial (if used properly), and the architectural issues show that, unlike the material in most of the CS classes you'll take here, this class material does appear in the real world.

Among the changes in the MMX architecture are the 57 new SIMD instructions and the packed datatypes they operate on, the aliasing of the MMX registers onto the floating point registers, saturation arithmetic, improved branch prediction through a branch target buffer, a doubled L1 cache, and a lengthened, more MMX-aware superscalar pipeline. Each is discussed below.


The 57 new instructions

The 57 new instructions added in the MMX chips are a response to the 90/10 rule - 10% of the code doing 90% of the work. In boring business applications, the only programs which processor manufacturers have traditionally taken seriously, all parts of the processor are used, with none being especially taxed. Typical applications have floating point and integer operations mixed, neither occurring in large groups. To improve the performance of such applications, chips only needed to undergo evolutionary changes and improvements - nothing too fancy was required. In the past few years multimedia applications have proliferated, as the graphics capabilities of the PC finally caught up to the performance the Amiga and Macintosh offered 7 years earlier. With this rush of multimedia applications the typical structure of programs changed. No longer are instructions evenly mixed for the majority of the code; programs now spend most of their time working on small integers in highly repetitive loops, using many multiplies and accumulates in numerically intensive calculations (things like the DCT in JPEG decompression, bilinear filtering in games, pixel prediction in MPEG decoding, and many more).
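
To make the pattern concrete, here is a small sketch in C (the function and array names are invented for illustration) of the kind of multiply-and-accumulate inner loop that shows up in a filter kernel or a DCT:

    #include <stdint.h>

    /* A made-up 8-tap filter: the same small-integer multiply/accumulate
       pattern is repeated for every output sample. */
    int32_t filter_sample(const int16_t *samples, const int16_t *coeffs)
    {
        int32_t acc = 0;
        for (int i = 0; i < 8; i++)
            acc += (int32_t)samples[i] * coeffs[i];   /* multiply, then accumulate */
        return acc;
    }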

Processor performance could be improved via normal methods, but the benefits would not justify the cost - it would take a very large improvement to make any kind of noticeable difference in these types of operations. What Intel, like several chip manufacturers before them, has done is introduce an array of SIMD instructions (57 to be exact) which specialize in the types of operations most prevalent in modern programs, capitalizing on their inherent parallelism.

What is a SIMD instruction?

It is nearly self-explanatory - Single Instruction, Multiple Data. A SIMD instruction is a bit like parallel processing in that it works on several pieces of data at once. Consider a situation where you have to add 64 1-byte integers. On a normal sequential execution machine you will have to perform 63 additions. If, as in the MMX Pentiums, you can pack eight 1-byte values into a single 64-bit operand, the same 64 integers can be summed with about 14 additions: 7 packed additions collapse the eight operands into one register holding 8 partial sums, and 7 ordinary additions combine those partial sums. With this type of SIMD instruction the number of additions is cut by nearly 80% - a tremendous savings.
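
As a rough sketch of how this looks in practice - using the MMX intrinsics a compiler exposes through <mmintrin.h> rather than hand-written assembly, a running packed accumulator, and 16-bit packed words so the partial sums cannot overflow a byte - summing 64 small integers might be written like this:

    #include <stdint.h>
    #include <string.h>
    #include <mmintrin.h>   /* MMX intrinsics */

    /* Sum 64 small integers, four at a time, with packed-word additions:
       16 packed adds plus 3 ordinary adds instead of 63 ordinary adds. */
    int sum64(const int16_t v[64])
    {
        __m64 acc = _mm_setzero_si64();          /* four 16-bit partial sums */
        for (int i = 0; i < 64; i += 4) {
            __m64 chunk;
            memcpy(&chunk, &v[i], sizeof chunk); /* load four values as one 64-bit operand */
            acc = _mm_add_pi16(acc, chunk);      /* one PADDW adds all four lanes at once */
        }

        int16_t lanes[4];
        memcpy(lanes, &acc, sizeof lanes);       /* combine the four partial sums */
        int total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

        _mm_empty();                             /* EMMS: clear MMX state before any FP code */
        return total;
    }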

The SIMD instructions in the MMX chips work on 4 new datatypes, all 64 bits wide: the packed byte (eight 8-bit values), the packed word (four 16-bit values), the packed doubleword (two 32-bit values), and the quadword (a single 64-bit value).

In the Packed Byte case, for example, eight 8-bit numbers are packed into one of the "new" MMX registers, where they are worked on in parallel by a single MMX instruction. As mentioned, there are 57 new MMX instructions, falling into 7 basic categories: arithmetic, comparison, conversion (pack/unpack), logical, shift, data transfer, and the EMMS state-management instruction.

For more information on these instructions feel free to visit this place.
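
As one illustration of the arithmetic category, the PMADDWD instruction performs four 16-bit multiplies and two pairwise additions in a single instruction, which maps directly onto the multiply-and-accumulate loop sketched earlier. The following is only a sketch, assuming a compiler that provides the MMX intrinsics in <mmintrin.h>; it computes the same made-up 8-tap filter as the scalar version above:

    #include <stdint.h>
    #include <string.h>
    #include <mmintrin.h>

    /* 8-tap filter using PMADDWD: each _mm_madd_pi16 does four 16-bit
       multiplies and two pairwise additions at once. */
    int32_t filter_sample_mmx(const int16_t *samples, const int16_t *coeffs)
    {
        __m64 s0, s1, c0, c1;
        memcpy(&s0, &samples[0], sizeof s0);
        memcpy(&s1, &samples[4], sizeof s1);
        memcpy(&c0, &coeffs[0],  sizeof c0);
        memcpy(&c1, &coeffs[4],  sizeof c1);

        /* Two packed multiply-adds, each producing two 32-bit partial sums. */
        __m64 acc = _mm_add_pi32(_mm_madd_pi16(s0, c0), _mm_madd_pi16(s1, c1));

        int32_t halves[2];
        memcpy(halves, &acc, sizeof halves);
        int32_t result = halves[0] + halves[1];

        _mm_empty();                     /* leave the FP/MMX registers clean */
        return result;
    }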


The MMX registers

Several times the "new" MMX registers have been mentioned. There are 8 MMX registers, but no new register state was actually added - a matter of compatibility with existing operating systems, which would otherwise have to be modified to save the extra state on a context switch. Instead, the 8 64-bit MMX registers are aliased onto the 8 80-bit floating point registers: an MMX value occupies the 64 mantissa bits, and the exponent bits are filled with 1's, so that an MMX value mistakenly left behind looks like NaN (Not a Number) when viewed in floating point mode. This aliasing has quite serious ramifications.

Before an MMX value can be loaded (the first stage in an MMX calculation), the current state of the FP registers must be saved. FP and MMX values cannot occupy the registers simultaneously - an FP number in F0 and an MMX value in M1 (aliased to F1), for example, is not allowed. Some overhead is incurred in this state saving. Everything proceeds efficiently, in both MMX and non-MMX integer computations, until the code needs to switch back to FP operations. To switch back to FP mode safely, the EMMS (Empty MMX State) instruction must be issued. If the EMMS instruction is not issued, any of the following are likely to happen: the FP tag word still marks the registers as in use, so subsequent FP instructions can fault with a floating point stack overflow; FP code can operate on leftover MMX bit patterns, which look like NaNs and produce garbage results; or the program can simply crash.

Obviously, none of these can be allowed to happen. The EMMS instruction marks the FP registers as empty, preparing for the FP state to be restored.
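
A minimal sketch of the intended discipline, again via the <mmintrin.h> intrinsics (where _mm_empty() emits the EMMS instruction) and with invented function and variable names: do the MMX work in one batch, issue EMMS once, and only then touch floating point.

    #include <stdint.h>
    #include <string.h>
    #include <mmintrin.h>

    /* Hypothetical routine: all the MMX work for a frame is done in one
       batch, EMMS is issued exactly once, and only then does FP code run.
       Assumes n is a multiple of 4. */
    void process_frame(const int16_t *pixels, int n, double *average)
    {
        __m64 acc = _mm_setzero_si64();
        for (int i = 0; i < n; i += 4) {
            __m64 chunk;
            memcpy(&chunk, &pixels[i], sizeof chunk);
            acc = _mm_add_pi16(acc, chunk);       /* grouped MMX work: no FP in here */
        }

        int16_t lanes[4];
        memcpy(lanes, &acc, sizeof lanes);

        _mm_empty();    /* EMMS: mark the FP registers empty - pay the cost once */

        /* Floating point is safe again only after EMMS. */
        *average = (lanes[0] + lanes[1] + lanes[2] + lanes[3]) / (double)n;
    }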

The Cycle Sinkhole

Unfortunately, there is a major problem with the EMMS instruction: executing it causes a mandatory 53 cycle stall. In boring applications, a 53 cycle stall would be meaningless - if it occurs in a spreadsheet, your calculation takes a tiny bit longer and you don't even notice. In multimedia and games, the main reason for MMX, this 53 cycle stall is enormous. In the games world, one might have a graphics engine pushing the system to its limits, drawing the screen 60 times a second. To get this kind of performance, games are painstakingly optimized - even the 41 cycles of a Pentium divide are too much, so lookup tables and quadratic approximations are used to save cycles. Sinking 53 cycles into the EMMS instruction may be enough to cause slowdown in the game, a totally unacceptable situation to a graphics engine programmer who has spent hundreds of hours writing tens of thousands of lines of optimized assembly language by hand.

Naturally, Intel has a solution to this problem - "Do not mix MMX code and floating point code at the instruction level". The programmer must group MMX instructions as much as possible, thereby increasing the number of MMX instructions executed per EMMS instruction stall. Intel claims that this is an acceptable solution, as no applications require closely interleaved FP and MMX instructions. Unfortunately, that is simply not the case, especially with games. One of the most attractive uses of MMX instructions in games is in computing texture coordinates with perspective correction. Without MMX instructions, the best way to perform this requires among other things two divides per pixel. With MMX instructions, quadratic interpolation is quick and accurate enough that it can be done in place of the divides, saving quite a few cycles. As the triangle decreases in size or becomes narrow, something which happens commonly in the dynamic game environment, the speed advantage of the MMX enabled technique drops due to edge effects. Eventually the MMX enabled technique is hampered so much that it becomes significantly more efficient to go back to the 2 divides per pixel method. If it were not for the 53 cycle penalty incurred when switching modes, the program could use the most efficient method in each situation. Alas, the stall is present, so the programmer must choose one method and stick with it - the penalty for switching is too high, and there is no way to group MMX instructions together due to the diverse size of the triangles on screen. Ultimately it is best to go ahead and use MMX when possible, but its benefits are much less than they could and should be.
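
For a feel for the trade-off being described, the following sketch contrasts the two approaches in plain C (invented variable names; floats are used for readability, whereas a real MMX version would work on fixed-point packed words). The exact method spends a divide on every pixel of a span; the quadratic method spends three divides up front and then only additions per pixel, which is why it wins on long spans but loses on the small, narrow triangles described above.

    /* Exact: perspective-correct texture coordinate, one divide per pixel.
       u0,du step u/z across the span; w0,dw step 1/z across the span. */
    void span_exact(float u0, float du, float w0, float dw, int n, float *u_out)
    {
        for (int x = 0; x < n; x++)
            u_out[x] = (u0 + du * x) / (w0 + dw * x);
    }

    /* Approximate: sample u exactly at the start, middle and end of the span,
       fit a quadratic through those three points, and evaluate it with
       forward differences - two adds per pixel, no divides in the loop.
       Assumes n >= 3. */
    void span_quadratic(float u0, float du, float w0, float dw, int n, float *u_out)
    {
        float m = (float)(n / 2), e = (float)(n - 1);
        float ua = u0 / w0;
        float ub = (u0 + du * m) / (w0 + dw * m);
        float uc = (u0 + du * e) / (w0 + dw * e);

        /* Quadratic u(x) = A*x^2 + B*x + C through (0,ua), (m,ub), (e,uc). */
        float A = ((ub - ua) / m - (uc - ua) / e) / (m - e);
        float B = (ub - ua) / m - A * m;
        float C = ua;

        float val = C;
        float d1 = A + B;        /* first forward difference, u(1) - u(0) */
        float d2 = 2.0f * A;     /* constant second difference            */
        for (int x = 0; x < n; x++) {
            u_out[x] = val;
            val += d1;
            d1  += d2;
        }
    }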


Saturation Arithmetic

Saturation arithmetic is something which would seem not to be of much use, but can be a great service to the programmer. Essentially it eliminates the classic problem of overflow and wraparound. You might have seen this problem in arcade games of the early '80s. These games were 8-bit machines, so most quantities were represented with 8-bit numbers. One such quantity was usually the level number. All would be fine through level 255. But if you finished level 255, what was next? Level 256? Not if an 8-bit number was used. The level count would increment to the next number, which was 0. But since there was no such thing as a Level 0, the game would crash. While interesting, it was undesirable (And by now you're asking "Why didn't they see this was going to happen and prevent it?" Basically, the programmers figured no one would ever get that far, so they didn't bother to do anything about it). Click HERE to see what happened when Pac-Man fell victim to wraparound.

The same type of thing can happen today with digital video and graphics effects. Say you have a bright red object on the screen, with an RGB value of (200,0,0) on a 0-255 scale, where 255 is the brightest. Now for some reason you decide to render a translucent purple bottle, with color (100,0,100), and lay it over top of the red object. You would expect the place where you can see the red object through the bottle to be an even brighter red with a hint of purple. But if you simply add the color components, that spot ends up as (44,0,100) - a dim purple. What happened was that 200+100=300, which wrapped around to 44. Normally you have to do extra work to make sure this sort of thing does not happen. Saturation arithmetic takes away the need for that extra work: if overflow occurs, instead of wrapping around, the value saturates. 200+100 equals 255, the highest valid value, and 255+255 is still 255 (underflow likewise clamps at the minimum instead of wrapping). While perhaps not an intuitively obvious change, it can save several cycles of work per pixel in graphical applications.
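
In MMX terms, the difference is simply the choice between the wraparound and saturating forms of the packed add. A tiny sketch, assuming the <mmintrin.h> intrinsics, using the pixel values from the example above:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <mmintrin.h>

    int main(void)
    {
        /* One red component per byte lane: object = 200, bottle = 100. */
        __m64 object = _mm_set1_pi8((char)200);
        __m64 bottle = _mm_set1_pi8((char)100);

        __m64 wrapped   = _mm_add_pi8(object, bottle);   /* PADDB:   200+100 wraps to 44   */
        __m64 saturated = _mm_adds_pu8(object, bottle);  /* PADDUSB: 200+100 clamps to 255 */

        uint8_t w[8], s[8];
        memcpy(w, &wrapped,   sizeof w);
        memcpy(s, &saturated, sizeof s);
        _mm_empty();                                     /* clear MMX state before returning */

        printf("wraparound: %u  saturation: %u\n", w[0], s[0]);  /* prints 44 and 255 */
        return 0;
    }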


Branch prediction

As you should know by now, the control hazards faced when a branch is encountered in code can detract significantly from performance. Techniques such as loop unrolling can reduce the number of branches in the code, and thus reduce the amount of overhead, but not all control hazards can be resolved so easily. When a branch is encountered, the standard approach is to predict where the program counter should go, go there for the next instruction, and keep the pipeline full - in loops you most likely will branch back to the beginning of the loop, so go ahead and begin another loop iteration instead of waiting. If the prediction is wrong, a cost must be incurred in flushing the pipe, but the amortized performance is significantly better than always waiting to ensure the correct branch is taken.

With a simple pipe, a simple branch prediction scheme is acceptable. With DLX's simple 5 stage pipe, there is little to flush when the pipe must be emptied. In a chip with a more complex pipeline, like the Pentium Pro's 12 stage pipe, an incorrectly predicted branch can be much more costly. Realizing this, Intel used a branch target buffer in the Pentium Pro, and has now brought that technique to the MMX Pentiums. As alluded to in class, a branch target buffer essentially allows multiple predictions to be made by storing several previously taken branches. In the MMX Pentium's case, the branch target buffer supports up to 4 predicted branches. When the branch destination computation is finished, the appropriate entry in the BTB is called upon.
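
As a rough illustration - a toy model, not the actual Pentium hardware organization - a branch target buffer can be thought of as a small table that maps a branch instruction's address to its last target, plus a little history used to decide whether to predict it taken:

    #include <stdint.h>
    #include <stdbool.h>

    /* Toy 4-entry branch target buffer: each entry remembers a branch's
       address, its last target, and a 2-bit saturating taken/not-taken counter. */
    #define BTB_ENTRIES 4

    typedef struct {
        uint32_t branch_pc;   /* address of the branch instruction */
        uint32_t target;      /* where it went last time           */
        uint8_t  counter;     /* 0-3: >= 2 means "predict taken"   */
        bool     valid;
    } btb_entry;

    static btb_entry btb[BTB_ENTRIES];

    /* Look up a branch; on a hit, return a predicted target so fetch can continue. */
    bool btb_predict(uint32_t pc, uint32_t *predicted_target)
    {
        btb_entry *e = &btb[pc % BTB_ENTRIES];          /* trivial direct-mapped index */
        if (e->valid && e->branch_pc == pc && e->counter >= 2) {
            *predicted_target = e->target;
            return true;
        }
        return false;                                   /* predict not taken / fall through */
    }

    /* Once the branch actually resolves, update the entry. */
    void btb_update(uint32_t pc, bool taken, uint32_t actual_target)
    {
        btb_entry *e = &btb[pc % BTB_ENTRIES];
        if (!e->valid || e->branch_pc != pc) {
            e->valid = true; e->branch_pc = pc; e->counter = taken ? 2 : 1;
        } else if (taken && e->counter < 3) {
            e->counter++;
        } else if (!taken && e->counter > 0) {
            e->counter--;
        }
        if (taken) e->target = actual_target;
    }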


Doubled cache

Previous Pentiums had 16K total of L1 cache. The MMX Pentiums increase this to 32K: 16K for instructions and 16K for data. This split should look very familiar. A write-back strategy for updating memory is standard, but write-through can be selected on a line-by-line basis if desired. The data cache is dual ported to allow two simultaneous references (keeping both execution pipes fed), and uses a deeper write-back buffer than previous Pentiums to prevent some annoying stalls.
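
The difference between the two write policies can be shown with a toy model of a single cache line (purely illustrative; the real hardware does this in silicon, not C):

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    /* Toy model of one 32-byte cache line under the two write policies. */
    typedef struct {
        uint8_t data[32];
        bool    dirty;           /* only meaningful for write-back          */
        bool    write_through;   /* per-line policy, as on the MMX Pentium  */
    } cache_line;

    static uint8_t memory[32];   /* pretend "main memory" backing the line */

    void cache_write(cache_line *line, int offset, uint8_t value)
    {
        line->data[offset] = value;
        if (line->write_through)
            memory[offset] = value;      /* write-through: memory updated immediately */
        else
            line->dirty = true;          /* write-back: just mark the line dirty      */
    }

    /* On eviction, a dirty write-back line must be flushed to memory. */
    void cache_evict(cache_line *line)
    {
        if (!line->write_through && line->dirty) {
            memcpy(memory, line->data, sizeof memory);
            line->dirty = false;
        }
    }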


Pipelining and Superscalar capabilities

The pipeline was extended by one additional stage in the MMX chips. Superscalar execution extends to the new instructions as well: the chip can issue either 2 integer instructions, 1 integer and 1 MMX instruction, or 2 MMX instructions simultaneously.