The Cray X1

John Kleint

UMD CMSC 714



Introduction

Cray X1

What's all this about vectors?

With conventional scalar processing, the processor fetches and executes several instructions for every element in the vector. It must increment the loop counter, test the loop bound, jump back to the start of the loop, and perform the next operation.

In vector processing, operands are operated on in batches. On the Cray X1, a single vector load instruction will load 64 elements into a 64-element-long vector and a single vector add instruction will add 64 pairs of elements, one pair per cycle. This reduces overhead and, more importantly, allows the functional units to be deeply pipelined for maximum cycle speed and maximum throughput.
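The contrast between the two styles can be sketched in plain C. This is only an illustration, not X1 code: the function names and the VLEN constant are made up for this example, and the inner loop of the second function merely stands in for what a vectorizing compiler would turn into one vector load per input, one vector add, and one vector store for each 64-element strip.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 64  /* vector length from the text above; the name is illustrative */

/* Scalar style: every element pays for an increment, a compare, and a branch. */
void add_scalar(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Strip-mined style: the outer loop runs once per strip of up to VLEN elements,
 * so loop overhead is paid once per strip instead of once per element. */
void add_strip_mined(const double *a, const double *b, double *c, size_t n)
{
    for (size_t start = 0; start < n; start += VLEN) {
        /* The last strip may be shorter than VLEN. */
        size_t len = (n - start < VLEN) ? (n - start) : VLEN;
        assert(len >= 1 && len <= VLEN);

        /* Stand-in for: vector load a, vector load b, vector add, vector store c. */
        for (size_t j = 0; j < len; j++)
            c[start + j] = a[start + j] + b[start + j];
    }
}
```

Both functions compute the same result; the second only makes the 64-element batching visible in the source.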

So why doesn't everybody use vector processing? The answer is memory bandwidth: if the processor is to sustain one three-operand operation per clock (c[i] = a[i] + b[i]), the memory must be able to transfer three operands per clock (two loads and one store, 24 bytes at 8 bytes each). Nowadays processors are an order of magnitude faster than memory, so to keep a vector processor fed, you'd need an order of magnitude more memory banks operating in tandem. More memory banks + more wires to connect them = more expensive. Another reason is that your code has to be expressible in terms of vector operations to benefit from vector processing, and most "desktop" software isn't.
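The bandwidth arithmetic can be written out as a small C sketch. The function names are hypothetical, and the clock rate is a free parameter rather than a figure from these slides; 8-byte operands are assumed, as in the 24-byte figure above.

```c
#include <assert.h>

/* Bytes that must move per clock to sustain one operation per clock. */
double bytes_per_clock(int operands, int bytes_per_operand)
{
    assert(operands > 0 && bytes_per_operand > 0);
    return (double)operands * bytes_per_operand;
}

/* Sustained bandwidth in GB/s (10^9 bytes per second) at a given clock rate
 * in GHz. The clock rate is an input, not a number taken from the slides. */
double required_bandwidth_gb_per_s(int operands, int bytes_per_operand,
                                   double clock_ghz)
{
    return bytes_per_clock(operands, bytes_per_operand) * clock_ghz;
}
```

For c[i] = a[i] + b[i] with 8-byte operands, bytes_per_clock(3, 8) gives 24, so a hypothetical 1 GHz pipeline would need 24 GB/s of sustained memory traffic for this one loop.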

There is one exception: "multimedia" applications. Modern PC architectures have adopted short vector processing, operating on two or four operands at a time, to tackle repetitive operations in video, graphics, and audio. This takes the form of Intel's SSE (Streaming SIMD Extensions) and AltiVec on Apple's PowerPC machines. They rely on data reuse and fast caches to keep the processor fed with data.
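As a rough illustration of short vector processing, the sketch below adds two float arrays four elements at a time using SSE intrinsics. The function name is hypothetical. The SSE path is compiled only when the compiler defines __SSE__ (GCC and Clang do so on x86 targets with SSE enabled); elsewhere the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* c[i] = a[i] + b[i], four floats per instruction where SSE is available. */
void add_floats_simd(const float *a, const float *b, float *c, size_t n)
{
    assert(n == 0 || (a && b && c));
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);          /* load 4 floats, no alignment required */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb)); /* 4 additions in one instruction */
    }
#endif
    /* Scalar loop: the leftover elements, or the whole array without SSE. */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Compared with the X1's 64-element vectors, each instruction here covers only four elements, which is why these extensions lean on caches rather than on many parallel memory banks.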

Cray X1 Node

Diagram adapted from John M. Levesque, http://www.csm.ornl.gov/workshops/SOS8/Levesque-SOS8.ppt

X1 Physical

Photograph taken from John M. Levesque, http://www.csm.ornl.gov/workshops/SOS8/Levesque-SOS8.ppt

X1 Single-Streaming Processor

Info taken from NASA's X1 Docs

Single Processor Performance (DGEMM)

Single MSP performance is 2.5x better than nearest competitor

X1 Node Summary

Diagram taken from Christian Bell, Wei Chen, Dan Bonachea, and Katherine Yelick, "Evaluating Support for Global Address Space Languages on the Cray X1," ICS 2004.

X1 Memory

Aggregate Memory Bandwidth

Pleasantly parallel, ~20 GB/s per node out of 26 GB/s peak

X1 Interconnect

MPI Swap Bandwidth

Pairs of processes exchanging data; 4 MSPs share ~25 GB/s network BW

NAS Parallel Benchmarks (MG)

MG approximates the solution of a 3-D discrete Poisson equation using the multigrid method

ORNL Spectral Shallow Water Model

Single SMP node results, component of atmospheric modeling code

LANL Parallel Ocean Program

X1 keeps up with Earth Simulator

Conclusions