as detailed in P. H. Worley and T. H. Dunigan, Jr., Early Performance Evaluation of the Cray X1 at Oak Ridge National Laboratory, Cray Users Group Conference, May 2003.
With conventional scalar processing, the processor fetches and executes several instructions for every element in the vector. In addition to the operation itself, it must increment the loop counter, test the loop bound, and jump back to the start of the loop before it can perform the next operation.
In vector processing, operands are operated on in batches. On the Cray X1, a single vector load instruction will load 64 elements into a 64-element vector register, and a single vector add instruction will add 64 pairs of elements, one pair per cycle. This reduces overhead and, more importantly, allows the functional units to be deeply pipelined for maximum cycle speed and maximum throughput.
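The contrast between the two styles can be sketched in plain C. This is only an illustrative model, not Cray code: the names `VLEN`, `add_scalar`, and `add_strip_mined` are hypothetical, and the inner loop of the second function stands in for what vector hardware would issue as a handful of vector instructions per 64-element strip.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 64  /* vector length quoted in the text for the Cray X1 */

/* Scalar style: increment, test, and branch are paid once per element. */
void add_scalar(double *c, const double *a, const double *b, size_t n)
{
    assert(c && a && b);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Strip-mined style: the outer loop runs once per strip of up to VLEN
 * elements. Returns the number of strips processed. */
size_t add_strip_mined(double *c, const double *a, const double *b, size_t n)
{
    assert(c && a && b);
    size_t strips = 0;
    for (size_t start = 0; start < n; start += VLEN) {
        /* The last strip may be shorter than VLEN. */
        size_t len = (n - start < VLEN) ? (n - start) : VLEN;
        /* Assumption: on vector hardware this inner loop would become
         * two vector loads, one vector add, and one vector store. */
        for (size_t j = 0; j < len; j++)
            c[start + j] = a[start + j] + b[start + j];
        strips++;
    }
    return strips;
}
```

For a 130-element array the strip-mined version makes three passes through its outer loop (64 + 64 + 2 elements), where the scalar version pays loop overhead 130 times.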
So why doesn't everybody use vector processing? The answer is memory bandwidth: if the processor is to sustain one three-operand operation per clock (c[i] = a[i] + b[i]), the memory must be able to transfer three operands (24 bytes) per clock. Nowadays processors are an order of magnitude faster than memory, so to keep a vector processor fed, you'd need an order of magnitude more memory banks operating in tandem. More memory banks + more wires to connect them = more expensive. Another reason is that your code has to be expressible in terms of vector operations to benefit from vector processing, and most "desktop" software isn't.
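The bandwidth arithmetic above can be written out as a small C sketch. The function names are hypothetical, and the clock rate in the usage note is a round illustrative number, not a figure taken from the slides.

```c
#include <assert.h>

/* Bytes memory must move per clock to sustain one operation per clock.
 * For c[i] = a[i] + b[i] there are three operands (two loads, one store). */
unsigned bytes_per_clock(unsigned operands_per_op, unsigned bytes_per_operand)
{
    return operands_per_op * bytes_per_operand;
}

/* Sustained memory bandwidth in bytes per second at a given clock rate. */
double required_bandwidth(unsigned operands_per_op,
                          unsigned bytes_per_operand,
                          double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)bytes_per_clock(operands_per_op, bytes_per_operand) * clock_hz;
}
```

With three 8-byte operands, `bytes_per_clock(3, 8)` gives the 24 bytes per clock quoted above. At an assumed 1 GHz clock, `required_bandwidth(3, 8, 1e9)` works out to 24 GB/s for a single processor.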
There is one exception: "multimedia" applications. Modern PC architectures have adopted short vector processing, operating on two or four operands at a time, to tackle repetitive operations in video, graphics, and audio. This takes the form of Intel's SSE (Streaming SIMD Extensions) and Apple's AltiVec. They rely on data reuse and fast caches to keep the processor fed with data.
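A rough, portable model of four-wide short vector processing is sketched below. It does not use the real SSE or AltiVec intrinsics: the `vec4` type and the `vec4_load`, `vec4_add`, `vec4_store`, and `add_short_vector` names are hypothetical stand-ins for one 128-bit register holding four single-precision values. Real code would use the compiler's intrinsics or rely on auto-vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a short vector register: four 32-bit floats. */
typedef struct { float lane[4]; } vec4;

static vec4 vec4_load(const float *p)
{
    vec4 v;
    for (int k = 0; k < 4; k++) v.lane[k] = p[k];
    return v;
}

static vec4 vec4_add(vec4 x, vec4 y)
{
    vec4 r;
    for (int k = 0; k < 4; k++) r.lane[k] = x.lane[k] + y.lane[k];
    return r;
}

static void vec4_store(float *p, vec4 v)
{
    for (int k = 0; k < 4; k++) p[k] = v.lane[k];
}

/* Adds a and b four elements at a time, then finishes any leftover
 * elements (n not a multiple of 4) with an ordinary scalar loop. */
void add_short_vector(float *c, const float *a, const float *b, size_t n)
{
    assert(c && a && b);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vec4_store(c + i, vec4_add(vec4_load(a + i), vec4_load(b + i)));
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Compared with a 64-element vector register, a four-wide register amortizes far less loop overhead per instruction, which fits the point above that these extensions lean on caches and data reuse rather than on massive memory bandwidth.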
Single MSP performance is 2.5x better than nearest competitor
Pleasantly parallel, ~20 GB/s per node out of 26 GB/s peak
Pairs of processes exchanging data; 4 MSPs share ~25 GB/s network BW
MG solves dense linear system using multigrid method
Single SMP node results, component of atmospheric modeling code
X1 keeps up with Earth Simulator