
Introduction and Motivation

Central to this proposal is a performance simulator, PetaSIM, which is aimed at supporting the (conceptual and detailed) design phases of parallel algorithms, systems software, and hardware architecture. PetaSIM was originally conceived as a result of two week-long workshops - PAWS and PetaSoft - aimed at understanding hardware and software architectures ten years from now, when Petaflop ($10^{15}$ floating point operations per second) scale computing can be expected. It was clear from these meetings that the community needed better aids for performance estimation: the 8 groups (and 8 different machine designs) present found it difficult to compare designs, and in particular differed by a factor of one million in estimating the performance of a set of extremely simple algorithms - the PetaKernels - on their new designs. The most sophisticated PetaKernel was a regular finite difference problem solved by simple Jacobi iteration. These workshops emphasized the need for tools that allow the description of complex memory hierarchies (present in all future and most current machine designs) and the mapping of problems onto them in a way that yields reliable (if high level) performance estimates during the initial design and rapid prototyping stages.

PetaSIM is aimed at a middle ground - halfway between detailed instruction-level machine simulation and simple ``back of the envelope'' performance estimates. It takes care of the complexity - memory hierarchy, latencies, adaptivity, and multiple program components - that makes even high level performance estimates hard. It uses a crucial simplification - dealing with data in the natural blocks (called aggregates in HLAM) suggested by memory systems - which both speeds up the performance simulation and in many cases leads to greater insight into the essential issues governing performance. We motivate and illustrate the design of PetaSIM with the well known formulas for the parallel performance of simple regular applications on nodes without significant memory hierarchy. For such applications (Chapter 3 of [24]) one finds

SpeedUp = Number of Nodes / (1 + Overhead),

where the Overhead is proportional to $\frac{1}{n^{g}} \cdot \frac{t_{\rm comm}}{t_{\rm calc}}$,

where in this case the natural data block size is the grain size $n$, the number of data points located in each node. The power $g$ measures edge over area effects and is $1/d$ for a system of geometric dimension $d$; $t_{\rm comm}/t_{\rm calc}$ is the ratio of communication to compute performance of the hardware. Such a formula shows the importance of identifying natural data blocks, and how such high level analysis allows one to understand the relation of performance to the memory size, I/O, and computation capabilities of the architecture. PetaSIM generalizes this ``back of the envelope'' estimate to more complex problems and machines. It includes not just primitive point-to-point messaging performance (node to node, as used in the above estimate) but also collective mechanisms (such as multicast) that are present in most applications yet ignored in many simple analyses. Note that the simple estimate above is valid (with straightforward generalizations) on machines with a simple two level distributed memory hierarchy - namely, memory is either on or off processor - which is essentially the model built into the current generation of parallel programming systems as typified by HPF or MPI. As we described in Section 2.1, it is essential to generalize this machine model whether we want to provide input to parallel program generation tools or to performance estimation systems. Thus we believe our experience with HLAM and PetaSIM will be very valuable in helping to design the new generation of parallel programming environments needed for the increasingly complex high performance systems coming online.
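
As a concrete illustration, the following minimal Python sketch evaluates this estimate, taking the constant of proportionality in the Overhead as one purely for illustration; the names (t_comm, t_calc, grain size n, dimension d) follow the formula above.

    # Minimal sketch of the ``back of the envelope'' estimate above.
    # The proportionality constant is assumed to be 1 for illustration.

    def speedup(num_nodes, n, d, t_comm, t_calc):
        """SpeedUp = Number of Nodes / (1 + Overhead), with
        Overhead proportional to (t_comm / t_calc) / n**(1/d)."""
        g = 1.0 / d                            # edge-over-area exponent
        overhead = (t_comm / t_calc) / n ** g
        return num_nodes / (1.0 + overhead)

    # Example: a 2-D Jacobi-style problem with 10,000 points per node.
    # Growing the grain size n or improving t_comm/t_calc drives the
    # overhead toward zero and the speedup toward the node count.
    print(speedup(num_nodes=256, n=10_000, d=2, t_comm=10.0, t_calc=1.0))  # ~233

Note how this estimate sees the machine only through the single ratio t_comm/t_calc; it is exactly this simplification that fails on deeper memory hierarchies and motivates PetaSIM.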

Fortunately, we have good evidence that this naïve analysis can be generalized to more complex problems and machines: the Rutgers/UCSB group has studied granularity issues [31, 64] to identify natural data block sizes and computation clusters, based on computation/communication ratios, in more general hierarchical memories. They have developed preliminary performance prediction and optimization tools (PYRROS [63], D-PYRROS [41], RAPID [25]) based on task graphs, in which the impact of the single-processor memory hierarchy is addressed at the intra-task level and the impact of interprocessor communication delay is captured at the inter-task level. These techniques have been shown to be effective for a number of adaptive and static applications [29, 28, 21, 26] and will be integrated into PetaSIM; this technology is also the basis of the HLAM described in Section 2.2.
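
To make the task-graph approach concrete, here is a toy sketch (an assumed illustration, not the PYRROS or RAPID algorithms themselves) that attaches a compute cost to each task - where intra-task memory hierarchy effects would be modeled - and a communication delay to each edge, then computes the critical-path length, a standard lower bound on schedule makespan.

    # Toy task-graph cost model in the spirit described above; the graph,
    # costs, and delays are invented for illustration.
    from collections import defaultdict

    def critical_path(compute, edges):
        """compute: {task: cost}; edges: {(u, v): delay}, u preceding v."""
        succ = defaultdict(list)
        for (u, v), delay in edges.items():
            succ[u].append((v, delay))
        memo = {}
        def longest(u):          # longest-cost path starting at task u
            if u not in memo:
                memo[u] = compute[u] + max(
                    (delay + longest(v) for v, delay in succ[u]), default=0.0)
            return memo[u]
        return max(longest(u) for u in compute)

    # A four-task diamond: communication delays stretch the critical path.
    compute = {"a": 2.0, "b": 3.0, "c": 1.0, "d": 2.0}
    edges = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "d"): 1.0, ("c", "d"): 2.0}
    print(critical_path(compute, edges))  # a -> b -> d: 2+1+3+1+2 = 9.0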

As well as a machine description, PetaSIM requires a problem description, for which it will use the HLAM described in Section 2.2. Application emulators (further described in Section 3) also use the same ``aggregate'' description of applications needed by PetaSIM and will serve as an additional form of input. It is not yet clear whether we will convert the application emulators directly to the HLAM or design a separate API for this style of program description.
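
To suggest what such an aggregate-level pair of inputs might look like, here is a hypothetical sketch; the field names and values are assumptions for illustration only and do not reflect PetaSIM's actual input format or API.

    # Hypothetical aggregate-level inputs: a machine as memory-hierarchy
    # levels, a problem as data blocks (aggregates) mapped onto them.
    from dataclasses import dataclass

    @dataclass
    class MemoryLevel:
        name: str               # e.g. "cache", "node-memory", "remote"
        capacity_bytes: int
        bandwidth_gbs: float    # GB/s
        latency_us: float       # microseconds

    @dataclass
    class Aggregate:
        name: str               # e.g. one ghost face of a Jacobi sub-grid
        size_bytes: int
        level: str              # memory level where the block resides

    machine = [
        MemoryLevel("cache", 1 << 20, 50.0, 0.1),
        MemoryLevel("node-memory", 1 << 30, 5.0, 1.0),
        MemoryLevel("remote", 1 << 40, 0.5, 10.0),
    ]
    problem = [Aggregate("jacobi-face", 8 * 100 * 100, "remote")]

    # Charge each aggregate transfer latency + size/bandwidth at its level,
    # rather than simulating individual instructions.
    levels = {m.name: m for m in machine}
    for a in problem:
        m = levels[a.level]
        t_us = m.latency_us + a.size_bytes / (m.bandwidth_gbs * 1e3)
        print(f"{a.name}: {t_us:.1f} us at {m.name}")

Costing whole aggregates rather than individual words is what lets such a simulator run quickly while still exposing where the memory hierarchy dominates performance.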


