 
  
  
   
Researchers working on predicting performance of large-scale applications have taken one of two approaches: detailed simulation on specific architectures or high-level modeling. The SimOS simulation environment is the most ambitious of the detailed simulation efforts and is capable of modeling a complete computer system including a full operating system [54]. It is, however, focused exclusively on SGI multiprocessors and the Irix operating system and it is not suitable, in its current form, for simulating tera/peta-flop level applications. Furthermore, while SimOS does provide multiple levels of simulation detail, these levels are far too detailed for the scale of applications considered in this proposal. Other detailed simulation efforts have limited themselves to modeling application performance (e.g. [32, 52]). However, detailed simulation of future generation computing systems is difficult to perform. It is difficult to predict the exact sequence of technological developments across orders of magnitude improvements in performance. Even if one were able to, do so the computational resource requirements would be prohibitive.
High-level modeling is more suitable for the scale of applications and architectures envisaged in this proposal. Research on high-level modeling of large-scale applications has taken one of two approaches: algorithmic analysis to isolate performance-critical program segments or modeling of families of stylized applications.
An exemplar of the first approach is the iso-efficiency metric proposed by Grama et al [35] for evaluating the scalability of algorithm-architecture combinations. This metric relates the problem size with the number of processors based on information about processor speed, network speed and type of interconnection networks. This metric, and others like it [58, 61], is primarily suitable for the analysis of regular problems, meaning problems whose behavior is independent of the data and can be easily parameterized with respect to problem size. The metrics are not suitable for predicting the performance of irregular adaptive scientific applications. Furthermore, as they are currently formulated, they do not take I/O into consideration.
An exemplar of the second approach is the Modeling Kernel toolkit [56]. This toolkit focuses on computational fluid dynamics applications and analyzes an annotated parse tree representation of these programs. The annotations are attributes related to the program's execution time and include measured quantities like iteration counts, branch frequencies, message sizes and basic block execution times. It does not model I/O costs, nor does it deal with irregular problems. Another limitation is the lack of a model for the memory hierarchy. Performance of the memory hierarchy is an important determinant of overall performance in modern and future machines. The multi-level modeling framework we are proposing provides a hierarchical and symbolic representation of a class of application codes and can help identify performance-critical components and analyze their performance at an appropriate level of detail, including memory hierarchy performance.
A similar effort is the Templates compilation by Barrett et al [9]. That work focuses on iterative methods for solving large sparse systems of linear equations. Thus, besides providing templates, that work suggests how to choose and implement an effective method, and how to specialize a method to specific matrix types. The work focuses on selecting solution methods for particular problem types and quick turn-around time for developing parallel versions of the programs. The templates provided by Barrett et al are an important resource for scientists but do not provide prediction for future computing systems (nor were they intended to).
Several research efforts have focused on measurement of large, long-running programs, a leading example being Paradyn [44]. Paradyn attempts to provide detailed, flexible performance information without incurring the space (and time) overhead typically associated with trace-based tools by dynamically instrumenting the application and automatically controlling this instrumentation in search of performance problems. In addition to gathering data, it also includes a ``Performance Consultant'' which provides automated isolation of performance problems. Paradyn is suitable for isolating performance problems and for obtaining information needed for constructing accurate cost functions. Prediction of performance on future computing systems is outside the scope of Paradyn.
Given the scale and the multi-level nature of the simulation needed for performance prediction on tera/peta-flop systems, the ability to control the level of detail based on computational requirements and accuracy is essential. Several research groups have investigated computational ``steering'', which allows either users or monitoring programs to adapt an application during execution. An exemplar of such systems is Falcon [37], an online performance measurement tool that includes a ``steering'' interface. The available controls in Falcon, and similar systems, are restricted to ``steering'' options exported by the application programmer.
As we expect PetaSIM to be an irregular adaptive program, an important component of this research is parallelization of such programs. There is a large body of research in this area, including previous research by several of the PIs on this proposal [4, 3, 2, 12, 16, 17, 29, 45]. Important techniques in this regard are runtime preprocessing of access patterns to optimize communication and other operations (the inspector-executor scheme), graph transformations and scheduling suitable for unstructured task graphs and adaptive data partitioning. We expect to build on this expertise and further develop these techniques as needed.
 
  
 