x86: Evolution of an Architecture


Here is a (sometimes) brief description and history of the Intel 80x86 architecture.

Q: Why do we care about Intel?
Q: Anything else I should know?
Q: Can you tell me what the 8088 architecture was like?
Q: What about floating point computations?
Q: Well, my Pentium seems to run better than all of this suggests, how does it do it?
Q: Any moral to the story thus far?
Q: Has Intel ever FUNKED anything up?
Q: Hey, can I quiz myself with some questions and answers on this stuff?
Q: I'm just so interested in this - where can I learn more?
Q: Who should I hold responsible for these pages?


Q: Why do we care about Intel?

A: Three reasons, really.


Q: Anything else I should know?

A: If you can remember that far back, you may recall that the early '80s saw quite a few personal computers introduced to the marketplace. While most of them were roughly similar in terms of capabilities, the only thing that the average consumer knew about any of them was that the software for one would not run on another. This was a scary thought to someone who had probably just finished selling their old 8-tracks to afford the down payment on a Beta VCR.

Compatibility is also a major concept to keep in the back of your mind as you attempt to unravel the 80x86 architecture. In order to avoid having to tell their customers that their old software was no longer useful, Intel expended a great deal of effort to ensure that code written for the 8088 would run on a Pentium Pro. (For those with really good memories, you may recall that Apple can't make a similar claim.) For the remainder of this history, we will be tracing this effort.


Q: Can you tell me what the 8088 architecture was like?

A: Sort of.

First, the 8088 was the CPU for the original IBM PC. The 8088 was a scaled-down version of the 8086 (the flagship of the 80x86 line). The problem was that the 8086 was a bit too expensive for the market at the time, so Intel cut costs (without significantly cutting performance for the single user).

The 8088 was internally a 16-bit chip. It had 16-bit registers, but an 8-bit external bus. At the high end, the 8086 was a fully 16-bit chip. Twenty bits were used for memory addressing (allowing for 1MB of addressable memory, for those following along at home).

The 8088 introduced Intel's segmentation scheme to the memory organization of the x86 family. The scheme operated in what was (and still is) called Real Mode addressing, and allowed memory to be addressed in terms of 64KB segments (plus an offset within the segment). This addressing scheme allowed memory to be logically managed in terms of active and inactive segments.

While the details of memory segmentation are beyond the scope of this page, the basic idea is that memory is addressable as a general location (the segment) plus a specific offset within it. However, you need to keep in mind that there is a distinction between the physical address of an item in memory (the machine address of the memory slot) and the logical or segmented address (the address used by the compiler and the CPU).
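
To make this concrete, here is a minimal sketch in C (invented for this page, not anything Intel shipped) of the Real Mode translation: the 16-bit segment value is shifted left four bits and the 16-bit offset is added, yielding a 20-bit physical address. Note that many different segment:offset pairs can name the same physical byte.

    #include <stdio.h>
    #include <stdint.h>

    /* Real Mode address translation: physical = (segment << 4) + offset.
       Both segment and offset are 16-bit values; the result fits in
       20 bits (hence the 1MB limit). */
    uint32_t physical(uint16_t segment, uint16_t offset)
    {
        return ((uint32_t)segment << 4) + offset;
    }

    int main(void)
    {
        /* Two different logical addresses, one physical location. */
        printf("%05X\n", (unsigned)physical(0x1234, 0x0010)); /* 12350 */
        printf("%05X\n", (unsigned)physical(0x1235, 0x0000)); /* 12350 */
        return 0;
    }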

While the memory addressing of the 8088 allowed for segmentation, the instructions themselves allowed for seven data addressing modes: absolute, register indirect, based, indexed, based indexed with displacement, based with scaled index, and based with scaled index and displacement. (The scaled-index forms actually arrived later, with the 386's 32-bit addressing.) While a couple of these modes are listed on page 75 of Hennessy and Patterson, several aren't.

Most notable are the based modes. These correspond to the Real Mode addressing referred to above, and are the source of much of the complexity associated with the architecture as a whole. The most important consequence of all of these addressing modes (as was hinted at in H&P) is that there is no practical way to force the 8088 instruction set into instructions of a fixed length. The shortest instruction is a single byte, but the longest possible is seventeen bytes!

At this point it may be illustrative to consider this range in instruction lengths as compared to the RISC model. What we are used to dealing with from class is instruction sets with nice, uniform-length instructions that lend themselves well to pipelining and superscalar execution. Unfortunately, the x86 architecture is not as pretty. While there is a good deal of pipelining and some superscalar machinery (especially as the family progresses to the 486 and the Pentium), there is a good deal of fetch/decode/store overhead before the actual issue of the instruction can occur. There isn't even a consistent model of an x86 instruction fetch!
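
To see why this hurts, consider the following hypothetical sketch in C (the helper names are invented; a real x86 length decoder must examine prefixes, the opcode, and addressing-mode bytes): with fixed-length instructions the fetcher can advance the PC blindly, while with variable-length encodings it cannot even locate the next instruction until the current one has been at least partially decoded.

    #include <stdio.h>

    /* Hypothetical stand-ins for this sketch only. */
    static unsigned decode_length(const unsigned char *insn)
    {
        return 1 + (insn[0] & 3); /* toy rule: pretend lengths of 1-4 bytes */
    }

    static void issue(const unsigned char *insn)
    {
        printf("issuing instruction starting with byte %02X\n", insn[0]);
    }

    /* RISC-style: every instruction is 4 bytes, so the next boundary is
       known immediately and fetch can run ahead of decode. */
    void fetch_fixed(const unsigned char *mem, unsigned pc, unsigned n)
    {
        for (unsigned i = 0; i < n; i++) {
            issue(mem + pc);
            pc += 4;                 /* next boundary is always pc + 4 */
        }
    }

    /* x86-style: the next boundary depends on the current instruction,
       so fetch is serialized behind (partial) decode. */
    void fetch_variable(const unsigned char *mem, unsigned pc, unsigned n)
    {
        for (unsigned i = 0; i < n; i++) {
            unsigned len = decode_length(mem + pc);
            issue(mem + pc);
            pc += len;               /* unknown until decode has run */
        }
    }

    int main(void)
    {
        unsigned char mem[16] = { 0x90, 0x03, 0xC3, 0x01,
                                  0x02, 0x03, 0x90, 0x00 };
        fetch_fixed(mem, 0, 2);
        fetch_variable(mem, 0, 2);
        return 0;
    }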

More on some of these issues later.


Q: What about floating point computations?

A: This is going to sound primitive.

The really interesting issues for Intel floating point begin with the 8087 coprocessor. For those of us who can't remember back that far (or at least to the late '80s), a coprocessor is a chip separate from the CPU which performs numerical calculations.

The 8087 was organized as a stack, with the two operands of a computation implicitly in the two registers at the top of the stack. However, the stack of the 8087 included several additional registers (eight in all) as scratch locations to avoid the cost of sending operands to and from memory.
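
A toy simulation in C of the stack discipline (invented names; the real 8087 has eight 80-bit registers plus tag and status words that this sketch ignores): an add implicitly pops the two topmost entries and pushes their sum, so the instruction encodes no register operands at all.

    #include <stdio.h>

    /* Toy model of a stack-based FPU: eight slots, like the 8087's
       ST(0)..ST(7), but without precision or exception handling. */
    static double st[8];
    static int top = 0;             /* st[top-1] is the top of stack */

    static void fld(double x)  { st[top++] = x; }    /* push, like FLD  */
    static double fstp(void)   { return st[--top]; } /* pop, like FSTP  */

    /* FADD-style operation: operands are implicit (the two topmost
       stack entries), so no register fields appear in the instruction. */
    static void fadd(void)
    {
        double b = fstp();
        double a = fstp();
        fld(a + b);
    }

    int main(void)
    {
        fld(2.5);
        fld(4.0);
        fadd();                      /* pops 2.5 and 4.0, pushes 6.5 */
        printf("%g\n", fstp());
        return 0;
    }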


Q: Well, my Pentium seems to run better than all of this suggests, how does it do it?

A: There is a long line of development from the 8088 to the Pentium. Here are some of the interesting points:

80286

Introduced in 1982, the 80286 (or 286) added additional memory functionality as well as expanded the external data bus to 16 bits.

While the 8088 had segmented memory accessed from Real Mode, the 286 added Protected Mode to the architecture's memory organization. Briefly, the new mode allowed the segment registers from the 8088 to be used as selectors into descriptor tables (think lookup tables) which in turn allowed access to memory via 24-bit base addresses (recall the bit on the 8088's memory from above). This scheme allowed for addressing 16MB of memory (both physical and virtual), but added yet another source of variability to the ISA's memory usage.
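
Here is a rough sketch in C of the Protected Mode idea (heavily simplified; real 286 descriptors also carry limits, access rights, and privilege levels, and a real selector encodes a table indicator in its low bits): the segment register selects a descriptor, and the descriptor's 24-bit base is added to the offset.

    #include <stdio.h>
    #include <stdint.h>

    /* Simplified 286-style descriptor: only the base matters here. */
    struct descriptor {
        uint32_t base;               /* 24-bit segment base address */
    };

    /* A toy descriptor table with made-up bases. */
    static struct descriptor gdt[] = {
        { 0x000000 },
        { 0x010000 },
        { 0x1F0000 },
    };

    /* Protected Mode: the segment register no longer contributes bits
       to the address directly; it selects a descriptor instead. */
    uint32_t physical(uint16_t selector, uint16_t offset)
    {
        return gdt[selector].base + offset;
    }

    int main(void)
    {
        /* Selector 2 -> base 0x1F0000, well beyond Real Mode's 1MB. */
        printf("%06X\n", (unsigned)physical(2, 0x0100)); /* 1F0100 */
        return 0;
    }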

As noted before, while there are advantages to allowing memory to be referenced in a number of ways (consider the restrictiveness of DLX's single displacement addressing mode), the more ways in which memory may be addressed, the more information is needed to drive the control path correctly. The difficulties may be slight for a processor like the 286 (primarily used as a single-user, single-task chip), but as commercial operating systems and applications have grown more and more complicated (as evidenced by Windows NT running on a Pentium), the problems have become greater.

80386

This chip was perhaps the single greatest leap forward in the history of the x86 family. The 386 introduced both a new logical memory organization and 32-bit processing to the personal computing world, yet did so in such a way as to retain compatibility with the 286 and 8088. This sleight of hand was accomplished by including a Virtual 8086 Mode in the architecture that allowed the lower half of the 32-bit registers to be used as the older model registers. More interesting to this course, however, was the 386's introduction of parallel execution to the family.

Six units were added to the chip organization, and under ideal circumstances the units allowed for one instruction to pass a stage in a single clock cycle. The six units were:

Bus Interface Unit
Code Prefetch Unit
Instruction Decode Unit
Execution Unit
Segmentation Unit
Paging Unit

If you seem to notice a similarity between the units and the DLX pipeline, you are both not alone and not really on the money. Because of the difficulties with instruction lengths and memory modes mentioned above, the x86 can really only be pipelined to a minimal degree. Parallel execution does occur, but not nearly as cleanly as in some of the architectures discussed in class. More on this topic in the discussion of the MMX technologies.

80486

The introduction of the 486 in 1989 saw the extension of the 386's parallel execution as well as the first inclusion of on-chip floating-point functions.

The instruction decode and execution units from the 386 were expanded into a five-stage pipeline which passed instructions from stage to stage at speeds that could approach one stage per cycle. This feat was accomplished by reducing the overhead from the memory references introduced by the 8088: by separating the decoding of an instruction into several stages, the ideal case finds all of the necessary values ready by the time the execution of the instruction needs them.

In addition to these improvements, the 486 added new cache support. An 8KB L1 cache was placed on chip, and the 486 allowed for off-chip L2 cache support. Thus, the improvements in parallel execution were due to the improved memory hierarchy as well as to the addition of the stages just discussed.
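
The payoff of such a hierarchy can be estimated with the average memory access time formula from H&P (AMAT = hit time + miss rate x miss penalty); the sketch below uses made-up cycle counts and miss rates purely for illustration.

    #include <stdio.h>

    /* AMAT = hit_time + miss_rate * miss_penalty, applied twice for a
       two-level hierarchy. All numbers are hypothetical. */
    int main(void)
    {
        double l1_hit       = 1.0;   /* cycles, made up */
        double l1_miss_rate = 0.05;
        double l2_hit       = 5.0;   /* off-chip L2, made up */
        double l2_miss_rate = 0.10;
        double mem_penalty  = 40.0;

        double l2_amat = l2_hit + l2_miss_rate * mem_penalty; /* 9.00 */
        double amat    = l1_hit + l1_miss_rate * l2_amat;     /* 1.45 */
        printf("AMAT = %.2f cycles\n", amat);
        return 0;
    }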

80586 (The Pentium... turns out you can't trademark a number)

For the purposes of this page, the development of the x86 family is essentially complete with the introduction of the Pentium in 1993. The rule of thumb for the Pentium is that it is basically two 486's in one.

The pipeline mentioned earlier is the general format for the Pentium execution path, but the 586 added a second pipeline to allow for superscalar execution. Both pipes are managed by what appears to be a scoreboard. That is, instructions are decoded by the Pentium into a set of smaller operations (do I smell RISC?) which are placed in an instruction buffer to await issue to the proper functional unit. Upon completion of the smaller operations, the results are held in a retirement unit which commits the result of the original instruction in the order in which it fell in the program. In this way, Intel hopes to (and frequently does) achieve execution at better than one instruction per clock.

Perhaps the lesson here is that the speed of the Pentium is a function of how well it undoes many of the earlier architecture choices of the Intel development teams.
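
The in-order retirement idea can be sketched as follows (a toy model with invented structures, nothing like Intel's actual hardware): micro-operations may finish out of order, but the retirement loop commits results strictly in program order, stalling at the oldest unfinished one.

    #include <stdio.h>
    #include <stdbool.h>

    /* Each decoded micro-op remembers its program order and whether a
       functional unit has finished it. */
    struct uop {
        int  seq;                    /* position in program order */
        bool done;                   /* finished executing? */
    };

    int main(void)
    {
        struct uop buf[3] = { {0, false}, {1, false}, {2, false} };
        int finish_order[3] = { 1, 2, 0 }; /* out-of-order completion */
        int head = 0;                /* oldest un-retired micro-op */

        for (int cycle = 0; cycle < 3; cycle++) {
            buf[finish_order[cycle]].done = true;

            /* Retire everything that is finished, oldest first, so
               visible state is always updated in program order. */
            while (head < 3 && buf[head].done) {
                printf("cycle %d: retired uop %d\n", cycle, buf[head].seq);
                head++;
            }
        }
        return 0;
    }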

On a less interesting note, the size of the on-chip cache was doubled to 16KB (with 8KB devoted to instructions and 8KB to data), and additional pins were added to allow for multiprocessor systems.


Q: Any moral to the story thus far?

A: As we have seen, the x86 family has grown over the space of 15 years into a fairly snappy pipelined processor with the heart of a 1970s stack machine. The improvements of the final generations of the family are a result of the chip performing many control tasks to "strip away" the CISC-ness of the architecture and allow the instructions to execute as if they were several small, general-purpose commands.

While I would in no way attempt to belittle the achievements of Intel in maintaining the viability of the x86 family, it would seem that the achievement serves well as a cautionary illustration: a chip can be built to perform many tasks, but as the heart of every task is identical, the greatest gains come from improvements to that heart.

Q: I'm just so interested in this - where can I learn more?

For more information on this material, try some of these links:

x86.org
Intel documentation for developers
mmx.com - Intel's MMX site
Intel's latest developer's literature

Q: Who should I hold responsible for these web pages?

Eugene Madlangbayan - Q&A section, printing out tons of Intel documents
Shane Shaffer - The MMX material, doing the web pages, wasting time making a dancing Pentium graphic
Randall Ward - The history/development stuff up to MMX