Our goal is to develop methodologies for making approximate predictions of the performance that sophisticated new applications could achieve on new high performance architectures. While many of the techniques we develop will apply to any type of application, we will focus on two broad classes of applications. Loosely synchronous adaptive applications include adaptive structured or unstructured multigrid codes, particle methods and multipole codes. Data exploration and data fusion applications are codes that carry out processing, analysis and exploration of one or several very large data sets. This class includes codes that analyze, explore and fuse sensor data from sensors located on different platforms (e.g. satellites, aircraft and ships), and codes that analyze and fuse data from conventional high-power microscopy, confocal microscopy, electron microscopy, computerized tomography, and magnetic resonance imaging.
The groups involved in this proposal will leverage their extensive experience with high-end applications through a multi-level process, whose main components are described below:
One component of the project is a hierarchical high level application modeling framework (HLAM), which builds on existing research in the group but will be substantially generalized in this project and integrated into a powerful simulation system. We believe there are several advantages to using simplified application abstractions rather than the full application. First, simulating full applications is expensive, especially for large problems on future high performance (PetaFlop) systems. Second, a well chosen abstraction can lead to better understanding of the key features that determine performance. Finally, abstractions can be generic, so that a single abstraction can represent a broader class of applications than any one full application. In HLAM, we first hierarchically divide an application (sometimes called in this context a meta-problem) into modules. A sophisticated parallel application may be composed of several coupled parallel programs, and the parallel programs can themselves be viewed as collections of parallel modules. These modules may be explicitly defined by a user, or they may be generated by a semi-automatic process such as an HPF compiler. Modules that represent distinct programs may execute on separate nodes of a networked meta-computer. An individual module may be sequential or data parallel; for example, we might use a data parallel module to represent a multi-threaded task that runs on a multiprocessor node. HLAM will cover a wide range of applications, among them data-intensive applications (with I/O from remote sites) and migratory Web programs.
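To make the decomposition concrete, the following sketch shows one way such a module hierarchy might be represented. The classes Module, LeafModule and CompositeModule, and everything about them, are our own illustrative assumptions rather than HLAM's actual interface.

\begin{verbatim}
// Illustrative sketch of an HLAM-style module hierarchy;
// class and field names are hypothetical, not HLAM's actual interface.
import java.util.ArrayList;
import java.util.List;

abstract class Module {
    final String name;
    Module(String name) { this.name = name; }
}

// Leaf module: sequential, or data parallel with some degree of parallelism.
class LeafModule extends Module {
    final boolean dataParallel;
    final int degreeOfParallelism;   // 1 for a sequential module
    LeafModule(String name, boolean dataParallel, int degree) {
        super(name);
        this.dataParallel = dataParallel;
        this.degreeOfParallelism = degree;
    }
}

// Composite module: a coupled collection of sub-modules, e.g. a
// meta-problem built from several parallel programs.
class CompositeModule extends Module {
    final List<Module> children = new ArrayList<>();
    CompositeModule(String name) { super(name); }
    void add(Module m) { children.add(m); }
}

public class MetaProblemExample {
    public static void main(String[] args) {
        CompositeModule metaProblem = new CompositeModule("meta-problem");
        CompositeModule program = new CompositeModule("flow-solver");
        program.add(new LeafModule("mesh-io", false, 1));
        // A data parallel module standing for a multi-threaded task on an SMP node:
        program.add(new LeafModule("sparse-solve", true, 64));
        metaProblem.add(program);
        metaProblem.add(new LeafModule("visualization", false, 1));
    }
}
\end{verbatim}

In this picture, a user or a semi-automatic tool such as an HPF compiler would populate the hierarchy, after which each module can be mapped to nodes of a networked meta-computer.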
Another component of the proposal is a performance simulator, PetaSIM, which is aimed at supporting the (conceptual and detailed) design phases of parallel algorithms, systems software and hardware architecture. PetaSIM occupies a middle ground, halfway between detailed instruction level machine simulation and simple ``back of the envelope'' performance estimates. It takes care of the complexities (memory hierarchy, latencies, adaptivity and multiple program components) that make even high level performance estimates hard. Its crucial simplification is to deal with data in the natural blocks (called aggregates in HLAM) suggested by memory systems; this both speeds up the performance simulation and, in many cases, yields greater insight into the essential issues governing performance.
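As a sketch of what the aggregate simplification buys, the estimate below charges each aggregate a single latency-plus-bandwidth transfer cost and a single compute cost. The formula, the class AggregateCost and all numbers are our own assumptions, not PetaSIM's actual cost model.

\begin{verbatim}
// Hypothetical aggregate-level cost estimate: the time to fetch one
// aggregate across a memory level or network link and then compute on it.
// The formula and all parameter values are illustrative assumptions,
// not PetaSIM's actual cost model.
public class AggregateCost {
    static double transferTime(double latencySec, double bytesPerSec, long aggregateBytes) {
        return latencySec + aggregateBytes / bytesPerSec;
    }
    static double computeTime(long flopsPerAggregate, double flopsPerSec) {
        return flopsPerAggregate / flopsPerSec;
    }
    public static void main(String[] args) {
        long aggregateBytes = 1L << 20;                     // 1 MB aggregate
        double t = transferTime(10e-6, 1e9, aggregateBytes) // 10 us latency, 1 GB/s link
                 + computeTime(50_000_000L, 1e9);           // 50 MFlop on a 1 GFlop/s node
        System.out.printf("estimated time per aggregate: %.6f s%n", t);
    }
}
\end{verbatim}

Because the unit of simulation in this simplified picture is a whole aggregate rather than an individual word or instruction, a long execution reduces to a modest number of per-aggregate estimates of this kind.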
PetaSIM defines a general framework in which the user specifies the computer and problem architectures and the primitive costs of I/O, communication and computation. The computer and problem can in principle be expressed at any level of granularity; typically the problem is divided into aggregates that fit into the lowest interesting level of the memory hierarchy exposed in the user-specified computer model. Note that the user is responsible for deciding on the ``lowest interesting level'': the same problem/machine mapping can be studied at different levels depending on which parts of the memory hierarchy are exposed for user (PetaSIM) control and which (lower) parts are assumed to be under automatic (cache) machine control. The computer and problem can both be described hierarchically, and PetaSIM will support both numeric and data intensive applications. Further, both distributed and shared memory architectures, and various combinations of the two, can be modeled.
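The sketch below illustrates the kind of machine description this implies, exposed down to a user-chosen level. The MemoryLevel and MachineSpec classes and the numbers in them are hypothetical, not PetaSIM's input format.

\begin{verbatim}
// Hypothetical machine description at a user-chosen granularity; the
// classes, fields and numbers are illustrative, not PetaSIM's input format.
class MemoryLevel {
    final String name;
    final long capacityBytes;
    final double latencySec, bytesPerSec;
    MemoryLevel(String name, long capacityBytes, double latencySec, double bytesPerSec) {
        this.name = name;
        this.capacityBytes = capacityBytes;
        this.latencySec = latencySec;
        this.bytesPerSec = bytesPerSec;
    }
}

public class MachineSpec {
    public static void main(String[] args) {
        // Expose the hierarchy only down to the "lowest interesting level"
        // (here, DRAM); caches below it stay under automatic machine control.
        MemoryLevel dram = new MemoryLevel("DRAM", 4L << 30, 100e-9, 10e9);
        MemoryLevel link = new MemoryLevel("interconnect", Long.MAX_VALUE, 10e-6, 1e9);
        long aggregateBytes = 8L << 20;  // aggregates sized to fit the exposed level
        System.out.println("aggregates resident per node: "
                           + dram.capacityBytes / aggregateBytes);
    }
}
\end{verbatim}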
We will provide both C and Java versions of the simulator, while the user interface will be developed as a Java applet. The visualization of the results will use a set of Java applets based on extensions to NPAC's current Java interface to the Pablo performance monitoring system from the University of Illinois.
The final component of the proposal is a set of application emulators. An application emulator is a suite of programs that, when run, exhibits computational and data access patterns resembling those observed in a particular type of application. We will construct two application emulators, motivated respectively by loosely synchronous adaptive applications and by data exploration and data fusion applications. As described earlier, the application emulators will be used to validate the HLAM/PetaSIM modeling process: we will use them to produce application and machine specifications at varying levels of granularity and then use PetaSIM to estimate the performance obtained on selected current and future architectures. We believe that our application emulators address some of the key applications targeted at future very high-end architectures. The application emulators will be shared with the performance modeling community; we believe that their general availability will help the community focus attention on crucial application classes.
We will develop an application emulator to model the performance characteristics of three classes of irregular adaptive scientific computations, along with coupled versions of multiple instances of any of these classes. The targeted computation classes are: (1) adaptive unstructured codes (e.g. unstructured multigrid solvers, integro-differential equation solvers and molecular dynamics codes), (2) structured adaptive codes (e.g. adaptive multigrid algorithms), and (3) particle codes (e.g. Direct Simulation Monte Carlo methods, Rokhlin-Greengard or Barnes-Hut fast multipole codes, particle-in-cell codes).
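As a flavor of what such an emulator might contain, the following kernel (AdaptiveEmulator, entirely our own construction) randomly refines and coarsens per-partition work to reproduce the load imbalance characteristic of adaptive, loosely synchronous codes; the refinement model and every parameter are illustrative assumptions.

\begin{verbatim}
import java.util.Arrays;
import java.util.Random;

// Minimal emulator kernel: per-partition work grows and shrinks over time
// steps to mimic adaptive refinement and coarsening. The refinement model
// and every parameter here are our own illustrative assumptions.
public class AdaptiveEmulator {
    public static void main(String[] args) {
        int partitions = 16, steps = 100;
        double refineProb = 0.1;            // chance a partition refines per step
        long[] workUnits = new long[partitions];
        Arrays.fill(workUnits, 1_000L);
        Random rng = new Random(42);
        for (int t = 0; t < steps; t++) {
            long maxWork = 0, totalWork = 0;
            for (int p = 0; p < partitions; p++) {
                if (rng.nextDouble() < refineProb) {
                    workUnits[p] *= 2;                             // local refinement
                } else if (rng.nextDouble() < refineProb) {
                    workUnits[p] = Math.max(1, workUnits[p] / 2);  // coarsening
                }
                maxWork = Math.max(maxWork, workUnits[p]);
                totalWork += workUnits[p];
            }
            // Loosely synchronous step: time is set by the most loaded partition.
            double imbalance = (double) maxWork * partitions / totalWork;
            if (t % 25 == 0) System.out.printf("step %d: imbalance %.2f%n", t, imbalance);
        }
    }
}
\end{verbatim}

Varying the refinement probability and the number of partitions mimics how adaptivity degrades load balance in this class of codes.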
We will also develop an application emulator that reproduces application characteristics found in many defense and high-end civilian applications involving sensor data analysis, sensor data fusion and real-time sensor data processing. We are focusing on emulating application scenarios that will be of practical relevance in a 5 to 15 year time frame. The application emulator will be coded as a stripped-down application suite that runs on distributed collections of multiprocessors and networked workstations.
The emulator will model this ambitious sensor data fusion application suite in a parameterized fashion; adjusting the parameters will make it possible to use the emulator for various application scenarios. The behavior we emulate will include computation, secondary storage accesses, tertiary storage accesses, remote object invocations, and program migration between processing nodes.
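The sketch below suggests one possible shape for this parameterization, with one adjustable rate per emulated behavior; the Behavior names and default rates are hypothetical rather than the emulator's actual parameters.

\begin{verbatim}
import java.util.EnumMap;
import java.util.Map;

// Hypothetical parameterization of the sensor data fusion emulator: each
// emulated behavior gets an adjustable rate so one suite can stand in for
// many scenarios. Names and defaults are illustrative assumptions only.
public class EmulatorParameters {
    enum Behavior {
        COMPUTATION, SECONDARY_STORAGE, TERTIARY_STORAGE,
        REMOTE_INVOCATION, PROGRAM_MIGRATION
    }

    public static void main(String[] args) {
        Map<Behavior, Double> eventsPerSecond = new EnumMap<>(Behavior.class);
        eventsPerSecond.put(Behavior.COMPUTATION, 1000.0);      // work bursts
        eventsPerSecond.put(Behavior.SECONDARY_STORAGE, 50.0);  // disk accesses
        eventsPerSecond.put(Behavior.TERTIARY_STORAGE, 0.5);    // archive accesses
        eventsPerSecond.put(Behavior.REMOTE_INVOCATION, 10.0);  // remote object calls
        eventsPerSecond.put(Behavior.PROGRAM_MIGRATION, 0.01);  // node-to-node moves
        eventsPerSecond.forEach((b, rate) -> System.out.println(b + ": " + rate + "/s"));
    }
}
\end{verbatim}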