Robert Bennet, Kelvin Bryant, Alan Sussman, Raja Das, Joel Saltz.
There has been a great deal of recent interest in parallel I/O. This paper discusses issues in the design and implementation of a portable I/O library designed to optimize the performance of multiprocessor architectures that include multiple disks or disk arrays. The major emphasis of the paper is on optimizations that are made possible by the use of collective I/O, so that I/O requests for multiple processors can be combined to improve performance. We also present performance measurements from benchmarking Jovian, our implementation of an I/O library that currently performs collective local optimizations, on three application templates.
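As a rough illustration of why combining requests from several processors can help, the following Python sketch merges overlapping or adjacent byte ranges into fewer, larger requests. It is not Jovian's interface: the `(offset, length)` request representation and the name `coalesce_requests` are assumptions made for this example only.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (byte offset, length); hypothetical representation


def coalesce_requests(per_process: List[List[Request]]) -> List[Request]:
    """Merge overlapping or adjacent byte ranges requested by all processes."""
    # Flatten every process's requests into (start, end) spans sorted by start.
    spans = sorted(
        (off, off + length) for reqs in per_process for off, length in reqs
    )
    merged: List[List[int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Span touches or overlaps the previous one: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end - start) for start, end in merged]


requests = [
    [(0, 4), (16, 4)],  # process 0
    [(4, 4), (20, 4)],  # process 1
    [(8, 8)],           # process 2
]
combined = coalesce_requests(requests)
```

Here five small requests from three processes collapse into a single contiguous request, which is the kind of opportunity a collective interface exposes and independent per-processor calls hide.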
Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, Milan Hodoscek.
CHARMM (Chemistry at Harvard Macromolecular Mechanics) is a program that is widely used to model and simulate macromolecular systems. CHARMM has been parallelized by using the CHAOS runtime support library on distributed memory architectures. This implementation distributes both data and computations over processors. This data-parallel strategy should make it possible to simulate very large molecules on large numbers of processors.
In order to minimize communication among processors and to balance computational load, a variety of partitioning approaches are employed to distribute the atoms and computations over processors. In this implementation, atoms are partitioned based on geometrical positions and computational load by using unweighted or weighted recursive coordinate bisection. The experimental results reveal that taking computational load into account is essential. The performance of two iteration partitioning algorithms, atom decomposition and force decomposition, is also compared. A new irregular force decomposition algorithm is introduced and implemented.
The CHAOS library is designed to facilitate parallelization of irregular applications. This library (1) couples partitioners to the application programs, (2) remaps data and partitions work among processors, and (3) optimizes interprocessor communications. This paper presents an application of CHAOS that can be used to support efficient execution of irregular problems on distributed memory machines.
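The weighted recursive coordinate bisection mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the CHARMM or CHAOS partitioner: the function name `rcb`, the choice of cutting along the axis of largest extent, and the assumption that the number of partitions is a power of two are all choices made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def rcb(points: Sequence[Point], weights: Sequence[float], parts: int) -> List[int]:
    """Assign each point to one of `parts` partitions (assumed a power of two)."""
    owner = [0] * len(points)

    def split(ids: List[int], first: int, count: int) -> None:
        if count == 1 or len(ids) <= 1:
            for i in ids:
                owner[i] = first
            return
        # Cut along the coordinate axis with the largest spatial extent.
        axis = max(
            range(3),
            key=lambda a: max(points[i][a] for i in ids)
            - min(points[i][a] for i in ids),
        )
        ids = sorted(ids, key=lambda i: points[i][axis])
        half = sum(weights[i] for i in ids) / 2.0
        running, cut = 0.0, 0
        # Advance the cut while the left side stays within half the total weight.
        while cut < len(ids) - 1 and running + weights[ids[cut]] <= half:
            running += weights[ids[cut]]
            cut += 1
        cut = max(cut, 1)  # keep at least one point on the left side
        split(ids[:cut], first, count // 2)
        split(ids[cut:], first + count // 2, count - count // 2)

    split(list(range(len(points))), 0, parts)
    return owner


atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
unweighted = rcb(atoms, [1.0, 1.0, 1.0, 1.0], 2)
weighted = rcb(atoms, [3.0, 1.0, 1.0, 1.0], 2)
```

With unit weights the four atoms split two and two; when the first atom carries three units of work it gets a partition to itself, which shows in miniature why the load-aware variant can balance computation better than a purely geometric cut.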
R. Das, J. Saltz.
No abstract available...
R. Das, Y. Hwang, M. Uysal, J. Saltz, A. Sussman.
This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. We describe software primitives that (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. The performance of the primitives is characterized by examination of kernels from real applications and from a full implementation of a large unstructured adaptive application (the molecular dynamics code CHARMM).
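The first two kinds of primitive listed above (coordinating data movement and managing copies of off-processor data) are often explained with an inspector/executor pattern. The Python sketch below simulates that pattern in a single process under a block distribution. The names `inspector`, `gather`, and `fetch` are hypothetical and do not correspond to the library's actual routines.

```python
from typing import Callable, Dict, List, Tuple


def inspector(
    indices: List[int], my_rank: int, block: int
) -> Tuple[List[int], Dict[int, List[int]], Dict[int, int]]:
    """Translate global indices to buffer slots and build a fetch schedule."""
    lo = my_rank * block
    ghost_slot: Dict[int, int] = {}      # global index -> slot after local data
    schedule: Dict[int, List[int]] = {}  # owner rank -> global indices to fetch
    local_refs: List[int] = []
    for g in indices:
        owner = g // block
        if owner == my_rank:
            local_refs.append(g - lo)
        else:
            if g not in ghost_slot:      # duplicate references are fetched once
                ghost_slot[g] = block + len(ghost_slot)
                schedule.setdefault(owner, []).append(g)
            local_refs.append(ghost_slot[g])
    return local_refs, schedule, ghost_slot


def gather(
    local_data: List[int],
    schedule: Dict[int, List[int]],
    ghost_slot: Dict[int, int],
    fetch: Callable[[int, List[int]], List[int]],
) -> List[int]:
    """Executor step: one aggregated fetch per owner, stored in ghost slots."""
    buf = list(local_data) + [0] * len(ghost_slot)
    for owner, wanted in schedule.items():
        for g, value in zip(wanted, fetch(owner, wanted)):
            buf[ghost_slot[g]] = value
    return buf


# Simulated global array, block-distributed over 3 ranks with 4 elements each.
x_global = list(range(100, 112))
block, my_rank = 4, 1
indices = [5, 9, 0, 9, 6]  # global references made by rank 1's loop

local_refs, schedule, ghost_slot = inspector(indices, my_rank, block)
buf = gather(
    x_global[my_rank * block:(my_rank + 1) * block],
    schedule,
    ghost_slot,
    lambda owner, wanted: [x_global[g] for g in wanted],  # stands in for messages
)
values = [buf[r] for r in local_refs]
```

The schedule produced by the inspector can be reused for every later loop with the same reference pattern, so the preprocessing cost is paid once while each gather moves only one copy of each off-processor element.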
Raja Das, Mustafa Uysal, Joel Saltz, Yuan-Shin Hwang.
This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. We describe software primitives that (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. We present a detailed performance and scalability analysis of the communication primitives. This performance and scalability analysis is carried out using a workload generator, kernels from real applications and a large unstructured adaptive application (the molecular dynamics code CHARMM).
A. Sussman, J. Saltz, R. Das, S. Gupta, D. Mavriplis, R. Ponnusamy.
This paper describes a set of primitives (PARTI) developed to efficiently execute unstructured and block structured problems on distributed memory parallel machines. We present experimental data from a 3-D unstructured Euler solver run on the Intel Touchstone Delta to demonstrate the usefulness of our methods.
Raja Das, Joel Saltz, Reinhard v. Hanxleden.
An increasing fraction of the applications targeted by parallel computers makes heavy use of indirection arrays for indexing data arrays. Such irregular access patterns make it difficult for a compiler to generate efficient parallel code. A limitation of existing techniques addressing this problem is that they are only applicable for a single level of indirection. However, many codes using sparse data structures access their data through multiple levels of indirection.
This paper presents a method for transforming programs using multiple levels of indirection into programs with at most one level of indirection, thereby broadening the range of applications that a compiler can parallelize efficiently. A central concept of our algorithm is to perform program slicing on the subscript expressions of the indirect array accesses. Such slices peel off the levels of indirection, one by one, and create opportunities for aggregated data prefetching in between. A slice graph eliminates redundant preprocessing and gives an ordering in which to compute the slices. We present our work in the context of High Performance Fortran; an implementation in the Fortran D prototype compiler is in progress.
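A minimal sketch of the idea, written in Python rather than Fortran for brevity: a doubly indirect access `x[ia[ib[i]]]` is rewritten so that the inner subscript is computed first, as a separate step, leaving a loop with a single level of indirection. The array names and the helper `flatten_indirection` are hypothetical, and the sketch ignores data distribution. In a distributed setting, an aggregated prefetch would sit between the two steps.

```python
from typing import List


def flatten_indirection(ib: List[int], ia: List[int]) -> List[int]:
    """Slice for the subscript ia[ib[i]]: evaluate it once for all iterations."""
    return [ia[j] for j in ib]


x = [10.0, 20.0, 30.0, 40.0, 50.0]
ia = [4, 3, 2, 1, 0]
ib = [0, 2, 2, 4]
n = len(ib)

# Original loop body: two levels of indirection.
original = [x[ia[ib[i]]] for i in range(n)]

# Transformed program: the slice runs first, then a single-level loop.
t = flatten_indirection(ib, ia)
transformed = [x[t[i]] for i in range(n)]
```

Both versions read the same elements of `x`. The difference is that the transformed loop needs only `t` to be known before it starts, which is the form that single-level techniques can already handle.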
R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, J. Saltz.
We developed a dataflow framework which provides a basis for rigorously defining strategies to make use of runtime preprocessing methods for distributed memory multiprocessors.
In many programs, several loops access the same off-processor memory locations. Our runtime support gives us a mechanism for tracking and reusing copies of off-processor data. A key aspect of our compiler analysis strategy is to determine when it is safe to reuse copies of off-processor data. Another crucial function of the compiler analysis is to identify situations which allow runtime preprocessing overheads to be amortized. This dataflow analysis will make it possible to effectively use the results of interprocedural analysis in our efforts to reduce interprocessor communication and the need for runtime preprocessing.
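The question of when a copy of off-processor data may be reused can be illustrated with a much simplified forward scan over straight-line code. The Python sketch below is not the framework described in the abstract. It assumes a linear sequence of loops with no branches, assumes every loop uses the same access pattern (the indirection arrays never change), and uses the hypothetical name `plan_gathers`.

```python
from typing import List, Set, Tuple

Loop = Tuple[str, Set[str], Set[str]]  # (loop name, arrays read, arrays written)


def plan_gathers(loops: List[Loop]) -> List[Tuple[str, List[str]]]:
    """For each loop, list the arrays whose off-processor copies must be fetched."""
    valid: Set[str] = set()  # arrays whose fetched copies are still up to date
    plan: List[Tuple[str, List[str]]] = []
    for name, reads, writes in loops:
        needed = sorted(reads - valid)  # copies that are missing or stale
        plan.append((name, needed))
        valid |= reads                  # copies are available after the gather
        valid -= writes                 # a write makes existing copies stale
    return plan


loops: List[Loop] = [
    ("L1", {"x", "y"}, set()),
    ("L2", {"x"}, {"y"}),
    ("L3", {"x", "y"}, set()),
]
plan = plan_gathers(loops)
```

In this example `x` is fetched once and reused by all three loops, while `y` must be fetched again for `L3` because `L2` writes it. A real analysis would also have to handle control flow, procedure boundaries, and changes to the indirection arrays.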
R. Ponnusamy, R. Das, J. Saltz, D. Mavriplis, Alok Choudhary.
No abstract available...
R. Das, D. J. Mavriplis, J. Saltz, S. Gupta, R. Ponnusamy.
No abstract available...
Shamik D. Sharma, Ravi Ponnusamy, Bongki Moon, Yuan-Shin Hwang, Raja Das, Joel Saltz.
In adaptive irregular problems the data arrays are accessed via indirection arrays, and data access patterns change during computation. Implementing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This research presents efficient runtime primitives for such problems. This new set of primitives is part of the CHAOS library. It subsumes the previous PARTI library which targeted only static irregular problems. To demonstrate the efficacy of the runtime support, two real adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D which can allow compilers to generate more efficient code for adaptive problems. These language extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The performance of the compiler parallelized codes is compared with the hand parallelized versions.