RISCy Business: Intel's New IA-64 Architecture

Over the past twenty years, Intel has enjoyed tremendous success with its x86 family of microprocessors. Though RISC-based processors often employ better designs and achieve higher clock rates, the CISC-based x86 architecture succeeded mainly because it ran common personal computer software faster. Within the past two years, Intel began designing a completely new architecture to replace the aging x86 line. It has teamed up with Hewlett-Packard to create what the two companies hope will be the first post-RISC processor to enter the personal computer mass market. The first such processor, Merced, is tentatively scheduled for release in the fourth quarter of 1999 or the first quarter of 2000.

Major Design Points

The IA-64 architecture will be a fully 64-bit architecture, though it will retain backward compatibility with older 32-bit architectures (such as the P6 architecture) and x86 architectures. The proposed improvements will come from three major design elements: EPIC, branch predication, and control speculation. Branch predication and control speculation will be illustrated later, but first, what do EPIC and ILP mean?

EPIC: Explicitly Parallel Instruction Computing. A design approach that stresses compiler-based optimization of code. The compiler parallelizes the code and forms instruction bundles.

ILP: Instruction-Level Parallelism. The overlapping of the execution of instructions. IA-64 will attempt to increase ILP using techniques known as branch predication and control speculation.

The IA-64 will focus on parallel execution of instructions and will provide many more resources than previous designs. IA-64 will include 128 integer registers, 128 FP registers, and 64 predicate registers, as well as multiple functional units.

Conventional RISC Designs and Their Limitations

The major limitations on current RISC designs stem from memory latency and control hazards. In a typical RISC processor, memory latency can cause miss penalties of up to 30 or 40 cycles. Caches and cache optimizations reduce the impact of memory access, but they cannot eliminate the penalties associated with fetching from main memory. Control dependencies in any architecture also significantly degrade performance. In most processors, techniques including branch delay slots and branch prediction (both static and dynamic) help to decrease the number of control dependencies that cause hazards. Unfortunately, no combination of methods can fully eliminate control hazards and their associated penalties.
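To see why caches reduce the penalty but cannot remove it, consider the average memory access time formula from Hennessy and Patterson: hit time plus miss rate times miss penalty. The sketch below is ours, not Intel's. The function name is hypothetical, and the 1-cycle hit time and 5% miss rate used in the note after it are assumed values chosen only for illustration; the 40-cycle penalty is the upper figure quoted above.

```cpp
#include <cassert>
#include <cmath>

// Average memory access time, in cycles:
//   hit_time + miss_rate * miss_penalty
// All three inputs are supplied by the caller; none are IA-64 measurements.
double average_access_cycles(double hit_time, double miss_rate,
                             double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}
```

With the assumed figures, average_access_cycles(1.0, 0.05, 40.0) comes to about 3 cycles, roughly three times the hit time, even though 95% of accesses hit the cache.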

Intel’s Solutions for the IA-64

Intel will introduce two new concepts to address these inherent limitations on parallel execution: branch predication and control speculation. Though they will not solve these problems outright, they will help to reduce the related penalties. These are not difficult topics to comprehend, but they do require a look at some examples.

Branch Predication: In essence, the IA-64 processors will approach a typical conditional block in a unique way. Rather than deciding which path to execute, the processor will simultaneously execute both paths and throw away results obtained from the path not needed. This is accomplished using predicates, tags assigned to each path which evaluate to true or false depending on the outcome of the branch. Instructions with a false predicate will be killed in favor of those with a true predicate.
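As a rough software model of this idea, the sketch below contrasts a conventional branch with a predicated version. It is plain C++, not IA-64 assembly; the function names are ours and the arithmetic on each path is an arbitrary placeholder. Both paths are computed, and only the result guarded by the true predicate is kept.

```cpp
#include <cassert>

// Conventional form: a conditional branch selects one path to execute.
int with_branch(int i, int limit) {
    if (i > limit) {
        i = i - limit;
    } else {
        i = i + 1;
    }
    return i;
}

// Predicated form (a software model only): the compare produces two
// complementary predicates, both paths are computed, and only the result
// guarded by the true predicate is committed.
int with_predicates(int i, int limit) {
    const bool p1 = (i > limit);  // guards the "then" path
    const bool p2 = !p1;          // guards the "else" path

    const int then_value = i - limit;  // computed regardless of p1
    const int else_value = i + 1;      // computed regardless of p2

    int result = i;
    if (p1) result = then_value;  // models an instruction guarded by p1
    if (p2) result = else_value;  // models an instruction guarded by p2
    return result;
}
```

The two functions return the same value for the same inputs (ignoring integer overflow on the discarded path). The predicated one has no point at which the processor must guess which path to fetch. This only works because neither path has side effects that cannot be thrown away.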

Control Speculation: This attempts to "hide" memory latency by loading operands before they are needed. In a simple basic block this is quite straightforward and not a new optimization. When branches are involved, however, the situation becomes more complicated. When the compiler encounters a branch, the necessary load is "hoisted", or moved before the branch condition statement. This can lead to undesired effects, especially exceptions if the address isn't valid. To prevent undesired behavior, the suspect load command is replaced with an instruction new to the architecture, ld.s. This command will carry out the memory fetch and detect any exceptions raised by this procedure. Rather than call the operating system if an exception is detected, it will write a flag to the target register and allow execution to continue. Following the branch statement, another command, check.s, examines the target register for the presence of an exception flag. If the flag is present, check.s makes an unconditional jump to fix-up code, which explicitly loads the correct value from memory. If no exception was detected, check.s allows execution to continue as normal. This effectively hides the latency inherent in a memory fetch when no exception is thrown. When ld.s causes an exception, the processor must branch to the fix-up code and thus incurs a penalty analogous to that of a cache miss.
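The ld.s and check.s sequence described above can be modeled in ordinary C++. This is a sketch under our own assumptions: the SpecReg struct, the function names, and the use of a std::vector as "memory" are all hypothetical, and an out-of-range index stands in for an invalid address.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// A register value plus a deferred-exception flag, standing in for the
// flag that ld.s writes to its target register.
struct SpecReg {
    int value;
    bool deferred_fault;
};

// Model of ld.s: attempt the load early. On an invalid address, record the
// fault in the register instead of raising it.
SpecReg speculative_load(const std::vector<int>& memory, std::size_t address) {
    if (address >= memory.size()) {
        return SpecReg{0, true};
    }
    return SpecReg{memory[address], false};
}

// Model of check.s plus fix-up code: if the flag is set, redo the load
// non-speculatively. In this model, at() throws std::out_of_range, which
// stands in for the real exception finally being delivered.
int checked_value(const SpecReg& reg, const std::vector<int>& memory,
                  std::size_t address) {
    if (reg.deferred_fault) {
        return memory.at(address);
    }
    return reg.value;
}

// Source form being modeled:
//   if (use_value) x = memory[address]; else x = fallback;
int hoisted_load(const std::vector<int>& memory, std::size_t address,
                 bool use_value, int fallback) {
    const SpecReg r = speculative_load(memory, address);  // hoisted above the branch
    if (use_value) {
        return checked_value(r, memory, address);  // check only where the value is used
    }
    return fallback;  // the flag is ignored on the path that never needs the value
}
```

Calling hoisted_load with use_value set to false and a bad address returns the fallback without any fault, which is exactly the case the deferred flag exists for.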

Using these two approaches, Intel engineers hope to remove most of the control hazards and memory stalls from code optimized for the IA-64.

Impact of the IA-64

The key to using the IA-64 architecture to its full potential lies in running code optimized for the processor. This means a good deal of emphasis will be placed on improving compiler technology and matching it to the specifications of IA-64. In the past, Intel has not stressed the importance of running optimized code on its new processors. When the Pentium Pro debuted, many were disappointed by how poorly it ran common software such as MS-Windows 9x. This arose because the P6 architecture was designed for fast execution of 32-bit code and did not perform as well on older 16-bit code.

The same situation will arise in the Merced, where 64-bit code will probably outperform 32- and 16-bit legacy code from the x86 line. Thus, the compiler will play a pivotal role in this new architecture, and the compiler technology has been significantly enhanced. The IA-64 compiler uses interprocedural compilation, which lets it parallelize code over larger regions than traditional compilers can. By explicitly parallelizing the code, the compiler improves processor utilization.

Furthermore, explicit parallelization by the compiler reduces the hardware required, since it eliminates the need for the reorder engine and dynamic scheduling capabilities common to most RISC processors. The compiler also attempts to initiate loads ahead of schedule to give the data more time to arrive without stalling the processor. Intel stresses the fact that legacy code will successfully run on the IA-64 architecture without modification, but the performance of such code is not expected to improve significantly. Intel will likely make a reference compiler available to industry before the release of Merced.

Comparisons with Other Processors

The IA-64 will have several basic similarities with the RISC processors discussed in Hennessy and Patterson. ALU instructions will be organized the same as in other RISC systems, with a (0,3) format (no memory operands out of three operands in total). Addressing modes will be simple as well, allowing only those analogous to DLX. The hardware will consist of many functional units and will offer higher bandwidth than previous processors. At this time, no announcement has been made regarding the memory hierarchy. There will be no dynamic scheduling employed in IA-64 because all parallelism will be generated by the compiler. This simplification will reduce the die size and create more room for the additional functional units.

It is difficult to classify IA-64 as either a VLIW or superscalar machine at this point, as it appears that this architecture will borrow concepts from both schools. It will execute native code in 128-bit instruction bundles while running x86 code in a manner similar to a superscalar design.

A significant portion of the hardware will be dedicated to converting x86 CISC code into efficient IA-64 code. This process will be carried out by at least three parallel decoders that will break down the x86 code into RISC-like micro-operations. The processor will then issue as many of these micro-ops as possible, based on the amount of ILP present in the code. Many of the capabilities of the decode portion of Merced will mirror those present in the Pentium Pro family. However, rather than simply port these functions over to IA-64, designers have rebuilt them from the ground up, arriving at a finished product resembling the P6 but with significantly different "guts". The overall effect will be backward compatibility with IA-32 software.

Exercises

1. Show how a compiler would break up the following code in a traditional architecture and for IA-64:

i = j++;
if ( i > inputval )
    i = sqrt(i);
else
    i++;
j = inputval + i;

2. Show how a compiler would break up the following code in a traditional architecture and for IA-64:

cin >> datavalue;
if ( datavalue > defaultvalue )
    b[i] = a[datavalue + i];
a[i]--;

Check your answers with ours.

Summary and References

The next generation of Intel processors will mark a departure from the classic CISC-based x86 architecture we have all become familiar with (for better or worse). The IA-64 architecture will employ many more RISC design techniques than previous Intel processors. It will be a VLIW and superscalar hybrid, which focuses on fast execution of native code while retaining compatibility with legacy code. The two major design methods introduced will be branch predication and control speculation. Branch predication attempts to handle branches by executing both outcomes and throwing away unneeded results. Control speculation capitalizes on the hoisting of instructions to allow memory references to be executed before their results are needed. This effectively hides the inherent memory latency that significantly degrades performance. This new architecture will rely heavily on compatible compilers capable of producing the level of ILP needed to achieve fast execution. Only time will tell whether this dependence on the compiler will prove to be IA-64’s major drawback.

"The IA-64 Architecture at Work", Carole Dulong. Computer Magazine, July 1998.
Computer Architecture: A Quantitative Approach. Hennessy & Patterson. (My favorite book).
Logic and Computer Design Fundamentals. Mano & Kime.
"IA-64 Architecture: Delivers New Levels of Performance and Scalability", http://developer.intel.com/solutions/archive/issue6/stories/IA64.htm
"HP’s Trimaran Compiler for IA-64", http://www.techweb.com/wire/story/TWB19980902S0008
"PC Webopaedia Definitions and Links", http://webopedia.internet.com/TERM/I/IA_64.html
"IA-64 Overview", http://www.hp.com/esy/technology/ia_64/overview/

Copyright 1998.
Pentium, Pentium Pro, Merced Copyrighted by Intel, Inc.
MS-Windows 9x Copyrighted by Microsoft, Corp.
Much props for Dr. Mount for his green squiggly graphic.
Created by Ken Bishop and Matt House.