These publications may be viewed as either PostScript or PDF files.


Chialin Chang, Anurag Acharya, Alan Sussman, and Joel Saltz

ACM SIGMOD Record, pages 58-66, Volume 27, Number 1, March 1998

University of Maryland Technical Report CS-TR-3867 and UMIACS-TR-98-04

As computational power and storage capacity increase, processing and analyzing large volumes of multi-dimensional datasets play an increasingly important part in many domains of scientific research. Several database research groups and vendors have developed object-relational database systems to provide some support for managing and/or visualizing multi-dimensional datasets. These systems, however, provide little or no support for analyzing or processing these datasets -- the assumption is that this is too application-specific to warrant common support. As a result, applications that process these datasets are usually decoupled from data storage and management, resulting in inefficiency due to copying and loss of locality. Furthermore, every application developer has to implement complex support for managing and scheduling the processing.

Our study of a large set of scientific applications over the past three years indicates that the processing for such datasets is often highly stylized and shares several important characteristics. Usually, both the input dataset as well as the result being computed have underlying multi-dimensional grids. The basic processing step usually consists of transforming individual input items, mapping the transformed items to the output grid and computing output items by aggregating, in some way, all the transformed input items mapped to the corresponding grid point. In this paper, we present the design of T2, a customizable parallel database that integrates storage, retrieval and processing of multi-dimensional datasets. T2 provides support for common operations including index generation, data retrieval, memory management, scheduling of processing across a parallel machine and user interaction. It achieves its primary advantage from the ability to seamlessly integrate data retrieval and processing for a wide variety of applications and from the ability to maintain and jointly process multiple datasets with different underlying grids.
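
The processing structure described above is simple enough to sketch. The following Python fragment (hypothetical names, not T2's actual interface) shows the transform/map/aggregate pattern the abstract refers to:

```python
# A minimal sketch (hypothetical names, not T2's interface) of the stylized
# processing: transform each input item, map it to a cell of the output grid,
# and aggregate all transformed items that land on the same cell.

def process(dataset, transform, to_output_cell, aggregate, initial):
    output = {}  # sparse output grid: cell index -> aggregated value
    for item in dataset:
        value = transform(item)        # per-item transformation
        cell = to_output_cell(item)    # map onto the output grid
        output[cell] = aggregate(output.get(cell, initial), value)
    return output

# Example: sum sensor readings onto a coarse 2-D grid of 5x5 cells.
readings = [((10.3, 4.7), 21.0), ((10.9, 4.1), 23.0), ((2.2, 8.8), 17.0)]
sums = process(readings,
               transform=lambda r: r[1],
               to_output_cell=lambda r: (int(r[0][0]) // 5, int(r[0][1]) // 5),
               aggregate=lambda acc, v: acc + v,
               initial=0.0)
print(sums)  # {(2, 0): 44.0, (0, 1): 17.0}
```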

 

Renato Ferreira, Bongki Moon, Jim Humphries, Alan Sussman, Joel Saltz, Robert Miller, and Angelo Demarzo 

Proceedings of the 1997 AMIA Annual Fall Symposium, Nashville, Tennessee, October 1997 

University of Maryland Technical Report: CS-TR-3777 and UMIACS-TR-97-35 
 

We present the design of the Virtual Microscope, a software system employing a client/server architecture to provide a realistic emulation of a high power light microscope.  We discuss several technical challenges related to providing the performance necessary to achieve rapid response time, mainly in dealing with the enormous amounts of data (tens to hundreds of gigabytes per slide) that must be retrieved from secondary storage and processed.  To effectively implement the data server, the system design relies on the computational power and high I/O throughput available from an appropriately configured parallel computer. 

 

Yuan-Shin Hwang and Joel Saltz 

Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing, August 1997 
 

Pointer analysis is essential for optimizing and parallelizing compilers.  It examines pointer assignment statements and estimates pointer-induced aliases among pointer variables or the possible shapes of dynamic recursive data structures.  However, previously proposed techniques perform pointer analysis without knowledge of the traversal patterns of the dynamic recursive data structures being constructed.  This paper presents an algorithm to identify the traversal patterns of recursive data structures and propagate this information back to the statements that define the data structures.  This approach recognizes the DEF/USE relationships between the statements that define and traverse dynamic recursive data structures.  The outcome of this technique will be useful for pointer analysis and parallelization.  Algorithms to perform pointer analysis and dependence testing using the knowledge of traversal patterns are also presented.

 

M. Uysal, A. Acharya and J. Saltz 

University of Maryland Technical Report:   CS-TR-3802 and UMIACS-TR-97-49 

May 1997 
 

I/O-intensive parallel programs have emerged as one of the leading consumers of cycles on parallel machines.  This change has been driven by two trends.  First, parallel scientific applications are being used to process larger datasets that do not fit in memory.  Second, a large number of parallel machines are being used for non-scientific applications.  Efficient execution of these applications requires high-performance I/O systems designed to meet their I/O requirements.  In this paper, we examine the I/O requirements of data-intensive parallel applications and the implications of these requirements for the design of I/O systems for parallel machines.  We attempt to answer the following questions.  First, what are the steady-state and peak I/O rates required?  Second, what spatial patterns, if any, occur in the sequence of I/O requests for individual applications?  Third, what is the degree of intra-processor and inter-processor locality in I/O accesses?  Fourth, does the application structure allow programmers to disclose future I/O requests to the I/O system?  Fifth, what patterns, if any, exist in the sequence of inter-arrival times of I/O requests?  To address these questions, we have analyzed I/O request traces for a diverse set of I/O-intensive parallel applications.  This set includes seven scientific applications and four non-scientific applications. 

 

M. Uysal, A. Acharya, R. Bennett, and J. Saltz 

International Parallel Processing Symposium '97, April 1997 
 
University of Maryland Technical Report CS-TR-3690 and UMIACS-TR-96-68 (extended version) 
 

We present a customizable simulator called netsim for high-performance point-to-point workstation networks that is accurate enough to be used for application-level performance analysis, yet is easy enough to customize for multiple architectures and software configurations.  Customization is accomplished without using any proprietary information, using only publicly available hardware specifications and information that can be readily determined using a suite of test programs.  We customized netsim for two platforms, a 16-node IBM SP-2 and an Alpha Farm, achieving less than 10% error for most test cases.  It achieves this accuracy at the cost of a 7-36 fold simulation slowdown with respect to the SP-2 and a 3-8 fold slowdown with respect to the Alpha Farm. 

 

Chialin Chang, Bongki Moon, Anurag Acharya, Carter Shock, Alan Sussman, and Joel Saltz 

Proceedings of the 13th International Conference on Data Engineering, Birmingham, United Kingdom, April 1997

University of Maryland Technical Report CS-TR-3689 and UMIACS-TR-96-67 

 

There are two major challenges for a high-performance remote-sensing database. First, it must provide low-latency retrieval of very large volumes of spatio-temporal data. Second, the order of magnitude reduction in data size achieved by post-processing makes it imperative, from a performance perspective, that the post-processing be done on the machine that holds the data. This paper describes the design, implementation and evaluation of Titan, a parallel shared-nothing database designed for remote-sensing data. The computational platform for Titan is a 16-processor IBM SP-2 with four fast disks attached to each processor. Titan is currently operational and contains about 24 GB of data from the Advanced Very High Resolution Radiometer (AVHRR) on the NOAA-7 satellite. The experimental results show that Titan provides good performance for global queries and interactive response times for local queries.

 

Bongki Moon and Joel Saltz 

To appear in:  IEEE Transactions on Knowledge and Data Engineering   

1997

Efficient storage and retrieval of multi-attribute datasets have become one of the essential requirements for many data-intensive applications.  The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries.  Several heuristic methods have been developed to decluster Cartesian product files across multiple disks to obtain high performance for disk accesses.  Though the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far.  In this paper we derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries.  These formulas disclose the limited scalability of the declustering methods and are corroborated by extensive simulation experiments.  From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions.
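
For readers unfamiliar with the two methods, the sketch below gives their standard definitions (illustrative Python; the paper's contribution is the scalability analysis, not the methods themselves):

```python
# Standard definitions of the two methods (illustrative Python). A bucket of
# a Cartesian product file has integer coordinates (i1, ..., ik); the method
# decides which of M disks stores it.
from functools import reduce

def disk_modulo(coords, num_disks):
    # Disk Modulo (DM): sum of coordinates, modulo the number of disks.
    return sum(coords) % num_disks

def fieldwise_xor(coords, num_disks):
    # Fieldwise XOR (FX): bitwise XOR of coordinates; num_disks is usually a
    # power of two, so the modulo just keeps the low-order bits.
    return reduce(lambda a, b: a ^ b, coords) % num_disks

# Disks touched by the 2-D range query [2,5] x [1,3] on 8 disks; the more
# evenly a query spreads over distinct disks, the better the parallelism.
touched = {disk_modulo((i, j), 8) for i in range(2, 6) for j in range(1, 4)}
print(sorted(touched))  # [0, 3, 4, 5, 6, 7]
```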

 

 Guy Edjlali, Alan Sussman, and Joel Saltz 

International Parallel Processing Symposium 1997, April 1997 
 

This paper describes a framework for providing the ability to use multiple specialized data parallel libraries and/or languages within a single application. The ability to use multiple libraries is required in many application areas, such as multidisciplinary complex physical simulations and remote-sensing image database applications. An application can consist of one program or multiple programs that use different libraries to parallelize operations on distributed data structures. The framework is embodied in a runtime library called Meta-Chaos that has been used to exchange data between data parallel programs written using High Performance Fortran, the Chaos and Multiblock Parti libraries developed at Maryland for handling various types of unstructured problems, and the runtime library for pC++, a data parallel version of C++ from Indiana University. Experimental results show that Meta-Chaos is able to move data between libraries efficiently, and that Meta-Chaos provides effective support for complex applications.

 

Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz 

Scientific Programming, January 1997 

University of Maryland Technical Report CS-TR-3510 and UMIACS-TR-95-83 
 

For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at runtime. In this paper, we discuss runtime support for data parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a runtime library to provide this support. We discuss how the runtime library can be used by compilers of HPF-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver on a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant when compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of non-dedicated workstations, which are likely to be an important resource for parallel programming in the future.
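
The redistribution step is easy to illustrate. Below is a minimal sketch (hypothetical names, not the Maryland library's API) of recomputing block bounds when the processor count changes:

```python
# A minimal sketch (hypothetical names, not the library's API) of the step
# the abstract describes: when the processor count changes, recompute each
# processor's block bounds; elements outside the new bounds must be moved.

def block_bounds(n, nprocs, rank):
    """Owned index range [lo, hi) for a block-distributed array of n elements."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

n = 1000
old = [block_bounds(n, 8, r) for r in range(8)]  # bounds on 8 processors
new = [block_bounds(n, 5, r) for r in range(5)]  # ...after shrinking to 5
print(old[0], new[0])  # (0, 125) (0, 200): rank 0 must receive [125, 200)
```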

 

M. Ranganathan, Anurag Acharya, Shamik Sharma, and Joel Saltz 

Proceedings of the USENIX 1997 Annual Technical Conference, Anaheim, California, January 1997 

University of Maryland Technical Report: CS-TR-3659 and UMIACS-TR-96-46 
 

In this paper, we investigate network-aware mobile programs, programs that can use mobility as a tool to adapt to variations in network characteristics. We present infrastructural support for mobility and network monitoring and show how adaptalk, a Java-based mobile Internet chat application, can take advantage of this support to dynamically place the chat server so as to minimize response time. We conclude that on-line network monitoring and adaptive placement of shared data structures can significantly improve the performance of distributed applications on the Internet.

 

Guy Edjlali, Serge Petiton, and Nahid Emad 

Proceedings of the International Conference on Information Systems, Analysis and Synthesis, June 1996 
 

The data-parallel paradigm, which is effective for many numerical algebra computations, is no longer sufficient when programming heterogeneous machines. To efficiently utilize these new resources we need to combine two paradigms, data parallelism and task parallelism, which leads us to define new algorithms.
In this paper we illustrate this point with an example: the computation of some eigenvalues of large non-structured sparse non-Hermitian matrices. We present a new algorithm, Interleaved Arnoldi, and compare its performance to the parallel Arnoldi algorithm. The different environments we consider are: a CM5, a CM5 and a Sparc 20, and a CM5 and 2 Sparc 20s. We show that significant results can be obtained even when the heterogeneous machine consists of a CM5 and 2 Sparc 20 workstations.

 

Guy Edjlali, Alan Sussman, and Joel Saltz 

University of Maryland Technical Report:  CS-TR-3633 and UMIACS-TR-96-30 

May 1996 
 

This paper describes a framework for providing the ability to use multiple specialized data parallel libraries and/or languages within a single application. The ability to use multiple libraries is required in many application areas, such as multidisciplinary complex physical simulations and remote-sensing image database applications. An application can consist of one program or multiple programs that use different libraries to parallelize operations on distributed data structures. The framework is embodied in a runtime library called Meta-Chaos that has been used to exchange data between data parallel programs written using High Performance Fortran, the Chaos and Multiblock Parti libraries developed at Maryland for handling various types of unstructured problems, and the runtime library for pC++, a data parallel version of C++ from Indiana University. Experimental results show that Meta-Chaos is able to move data between libraries efficiently, and that Meta-Chaos provides effective support for complex applications.

 

M. Ujaldon Martinez, S. Sharma, J. Saltz, and E. Zapata
 

Proceedings of the ACM International Conference on Supercomputing, pages 78-85, Philadelphia, Pennsylvania, May 25-27, 1996 
 

Sparse matrix problems are difficult to parallelize efficiently on distributed memory machines since non-zero elements are unevenly scattered and are accessed via multiple levels of indirection. Irregular distributions that achieve good load balance and locality are hard to compute, have high memory overheads and also lead to further indirection in locating distributed data. This paper evaluates alternative semi-regular distribution strategies which trade off the quality of the load balance and locality for lower distribution overheads and efficient lookup. The proposed techniques are compared to an irregular sparse matrix partitioner and the relative merits of each distribution method are outlined.

 

Gagan Agrawal, Anurag Acharya, and Joel Saltz 

Proceedings of the ACM International Conference on Supercomputing, pages 358-365, Philadelphia, Pennsylvania, May 25-27, 1996 
 

Overlapping memory accesses with computation is a standard technique for improving performance on modern architectures, which have deep memory hierarchies. In this paper, we present a compiler technique for overlapping accesses to secondary memory (disks) with computation. We have developed an Interprocedural Balanced Code Placement (IBCP) framework, which performs analysis on arbitrary recursive procedures and arbitrary control flow and replaces synchronous I/O operations with a balanced pair of asynchronous operations. We demonstrate how this analysis is useful for applications that perform frequent and large accesses to secondary memory, including applications that snapshot or checkpoint their computations and out-of-core applications.
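
The transformation itself can be illustrated outside of a compiler. In this Python sketch (purely illustrative, not IBCP compiler output; threads stand in for asynchronous file-system operations), the synchronous read is split into an issue placed early and a wait placed at the point of use:

```python
# An illustrative sketch: the synchronous read becomes an asynchronous issue
# placed early and a matching wait placed just before the data is used, so
# the disk access overlaps the intervening computation.
import os, tempfile
from concurrent.futures import ThreadPoolExecutor

path = os.path.join(tempfile.gettempdir(), "ibcp_demo.bin")
with open(path, "wb") as f:
    f.write(os.urandom(1 << 20))  # stand-in dataset so the sketch runs

def read_chunk(path, offset, size):
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

pool = ThreadPoolExecutor(max_workers=1)

# Before: data = read_chunk(path, 0, 1 << 20); compute(); use(data)
# After: a balanced pair of operations straddles the computation.
future = pool.submit(read_chunk, path, 0, 1 << 20)  # issue early
total = sum(range(1_000_000))                       # overlapped computation
data = future.result()                              # wait at the use point
print(len(data), total)
```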

 

Anurag Acharya, Mustafa Uysal, Robert Bennett, Assaf Mendelson, Michael Beynon, Jeff Hollingsworth, Joel Saltz, and Alan Sussman 

Fourth Annual Workshop on I/O in Parallel and Distributed Systems, Philadelphia, Pennsylvania, May 27, 1996 
 

Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of sensor data processing and linear algebra. After tuning, three of the four applications achieve effective I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We report the lessons learned in achieving high I/O performance from these applications, including the need for code restructuring, local disks on every node and overlapping I/O with computation. We also report our experience on achieving high performance on peer-to-peer configurations. Finally, we comment on the necessity of complex I/O interfaces like collective I/O and strided requests to achieve high performance.

 

Anurag Acharya 

Proceedings of the ACM International Conference on Supercomputing, pages 325-332, Philadelphia, Pennsylvania, May 1996 
 

A rule-based program consists of a set of if-then rules and a tuple-space. The rules are the code for the program and the tuple-space contains the data being processed by the program. Previous efforts to parallelize rule-based programs have achieved limited speedups. The main reason for these disappointing results is a high frequency of barrier synchronizations. Since little work is done between successive barrier synchronizations, the number of processors that can be effectively utilized is bounded. Even though required by language semantics, a large fraction of the barrier synchronizations are not necessary for most programs. This paper proposes a pair of simple language extensions that allow an implementation to efficiently detect and eliminate redundant barrier synchronizations. Simulation results based on a real implementation show that for a set of five benchmarks, this scheme is able to eliminate between 95.6% and 99.9% of the barrier synchronizations. This results in a multiplicative speedup of between 2.2 and 52.3 fold over and above the speedup achieved by a parallelizing compiler. For the programs studied, simulations indicate that speedups up to 115 fold relative to an optimized sequential version are possible.

 

Mudumbai Ranganathan, Anurag Acharya, Guy Edjlali, Alan Sussman, and Joel Saltz 

Proceedings of the ACM International Conference on Supercomputing, pages 229-236, Philadelphia, Pennsylvania, May 25-27, 1996 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3565 and UMIACS-TR-95-116 

May 1996   

We consider the problem of efficiently coupling multiple data-parallel programs at runtime. We propose an approach that establishes a mapping between data structures in different data-parallel programs and implements a user-specified consistency model. Mappings are established at runtime and can be added and deleted while the programs being coupled are in execution. Mappings, and the identity of the processors involved, do not have to be known at compile-time or even link-time. Programs can be made to interact with different granularities of interaction without requiring any re-coding. A priori knowledge of consistency requirements allows buffering of data as well as concurrent execution of the coupled applications. Efficient data movement is achieved by pre-computing an optimized schedule. We describe our prototype implementation and evaluate its performance using a set of synthetic benchmarks. We examine the variation of performance with variation in the consistency requirement. We demonstrate that the cost of the flexibility provided by our coupling scheme is not prohibitive when compared with a monolithic program that performs the same computation.

 

Bongki Moon, Anurag Acharya, and Joel Saltz 

Proceedings of the 1996 International Parallel Processing Symposium, April 1996 

University of Maryland Technical Report: CS-TR-3589 and UMIACS-TR-96-4 (extended version) 
 

Efficient storage and retrieval of large multidimensional datasets has been one of the key issues in large-scale scientific computations such as long running time-dependent simulations periodically generating snapshots of the state. The main challenge for efficiently handling such datasets is to minimize response time for multidimensional range queries. The grid file is one of the well known access methods for multidimensional and spatial data. We investigate effective and scalable declustering techniques for grid files with the primary goal of minimizing response time and the secondary goal of maximizing the fairness of data distribution. The main contributions of this paper are (1) the analytic and experimental evaluation of existing index-based declustering techniques and their extensions for grid files, and (2) the development of an improved proximity-based declustering algorithm called minimax which has been experimentally shown to scale well and consistently achieve better response time than all the other algorithms while maintaining perfect disk distribution.

 

Bongki Moon and Joel H. Saltz 

IEEE Transactions on Software Engineering, April 1996 

University of Maryland Technical Report: CS-TR-3590 and UMIACS-TR-96-5 
 

Efficient storage and retrieval of multi-attribute datasets has become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files over multiple disks to obtain high performance for disk accesses. Though the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. In this paper we derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods and are corroborated by extensive simulation experiments. From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions.

 

Bongki Moon, H. V. Jagadish, Christos Faloutsos, and Joel H. Saltz 

 IEEE Transactions on Knowledge and Data Engineering, March 1996 

University of Maryland Department of Computer Science Technical Report, CS-TR-3611 and UMIACS-TR-96-20 
 

Several schemes for the linear mapping of multidimensional space have been proposed for applications such as access methods for spatio-temporal databases, image compression and so on. In all these applications, one of the most desired properties of such linear mappings is clustering, which means the locality between objects in the multidimensional space is preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering. In this paper we provide closed-form formulas for the number of clusters required by a given query region of arbitrary shape (e.g., polygons and polyhedra) for the Hilbert space-filling curve. Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work, and they agree with the empirical results that the number of clusters depends on the hyper-surface area of the query region and not on its hyper-volume. We have also shown that the Hilbert curve achieves better clustering than the z-curve. From the practical point of view, the formulas given in this paper provide a simple measure which can be used to predict the required disk access behavior and hence the total access time.

Index Terms: locality-preserving linear mapping, range queries, multi-attribute access methods, data clustering, Hilbert curve, space-filling curves, fractals. 
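
The clustering metric is easy to compute for a concrete curve. The sketch below (illustrative; it uses the z-order curve, whose index is much simpler to compute than the Hilbert curve's) counts the contiguous runs of linear indices that a query rectangle occupies, which is exactly the number of clusters:

```python
# Illustrative: the number of clusters for a query region is the number of
# contiguous runs of linear indices it occupies -- i.e. how many separate
# sequential disk accesses a range query needs.

def morton(x, y, bits=8):
    # Interleave the bits of x and y to get the z-curve index.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def clusters(cells, curve):
    idx = sorted(curve(x, y) for (x, y) in cells)
    # Each break in a consecutive run of indices starts a new cluster.
    return 1 + sum(1 for a, b in zip(idx, idx[1:]) if b != a + 1)

query = [(x, y) for x in range(3, 7) for y in range(2, 6)]  # a 4x4 rectangle
print(clusters(query, morton))  # clusters the z-curve needs for this query
```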

 

G. Agrawal, J. Saltz 

Proceedings of Supercomputing 95, San Diego, California, December 3-8, 1995. 

University of Maryland: Department of Computer Science Technical Report CS-TR-3447 
 

Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications that have irregular data access patterns, when coded in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of runtime preprocessing routines and collective communication routines inserted for managing communication in such codes. We also present three new interprocedural optimizations: placement of scatter routines, deletion of data structures and use of coalescing and incremental routines. We then describe how program slicing can be used for further applying IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here using the Fortran D implementation system as the necessary infrastructure. We present experimental results from two codes compiled using our system to demonstrate the efficacy of the presented schemes.

 

Manuel Ujaldon, Shamik D. Sharma, Joel Saltz, and Emilio Zapata 

Proceedings of the 1995 Workshop on Irregular Problems, pages 78-85, September 1995 
 

Sparse matrix problems are difficult to parallelize efficiently on message-passing machines, since they access data through multiple levels of indirection. Inspector/executor strategies, which are typically used to parallelize such problems, impose significant preprocessing overheads. This paper describes the runtime support required by new compilation techniques for sparse matrices and evaluates their performance, highlighting optimizations and improvements over previous techniques.

 

Ravi Ponnusamy, Joel Saltz, Alok Choudhary, Yuan-Shin Hwang, and Geoffrey Fox 

IEEE Transactions on Parallel and Distributed Systems, pages 815-831, Volume 6, Number 8, August 1995 
 

University of Maryland: Department of Computer Science Technical Report CS-TR-3194 and UMIACS-TR-93-135 
 

This paper describes two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements and computational load. The second mechanism is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
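
The schedule-reuse idea rests on the inspector/executor pattern, which the following minimal sketch illustrates (names are illustrative, not the compiler's output; the communication is simulated):

```python
# The inspector examines the indirection array once and builds a
# communication schedule; the executor reuses that schedule on every
# iteration for as long as the indirection array is unchanged.

def inspector(indirection, owner):
    """Group the off-processor indices by the processor that owns them."""
    schedule = {}
    for i in indirection:
        schedule.setdefault(owner(i), []).append(i)
    return schedule

def executor(schedule, fetch):
    """Carry out the (simulated) gathers described by the schedule."""
    return {i: fetch(i) for indices in schedule.values() for i in indices}

owner = lambda i: i // 100        # block distribution: 100 elements per node
ind = [5, 205, 7, 310, 209]       # indirection array
sched = inspector(ind, owner)     # built once...
for step in range(10):            # ...and reused while `ind` does not change
    values = executor(sched, fetch=lambda i: i * 2.0)
print(values[310])                # 620.0
```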

 

 R. P. Nance, R. G. Wilmouth, B. Moon, H. A. Hassan, and J. Saltz 

AIAA Journal of Thermophysics and Heat Transfer, pages 471-477, Volume 9, Number 3, July-September 1995 
 
University of Maryland Technical Report: CS-TR-3425 and UMIACS-TR-95-25 
 

This paper describes a parallel implementation of the direct simulation Monte Carlo method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a favorable load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate the Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which are very similar to previous DSMC results, despite the increased resolution available. However, it yields significantly faster execution times than the scalar code, as well as very good load-balance and scalability characteristics.

 

S. Mukherjee, S. Sharma, M. Hill, J. Larus, A. Rogers, and J. Saltz 

Proceedings of the Fifth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming '95, pages 68-79, Santa Barbara, California, July 19-21, 1995 

ACM SIGPLAN Notices: Volume 30, Number 8 
 

Irregular computation problems underlie many important scientific applications. Although these problems are computationally expensive and would therefore seem appropriate candidates for parallel machines, their irregular and unpredictable run-time behavior makes this type of parallel program difficult to write and adversely affects run-time performance.
This paper explores three issues -- partitioning, mutual exclusion, and data transfer -- crucial to the efficient execution of irregular problems on distributed-memory machines. Unlike previous work, we studied the same programs running on three alternative software systems on the same hardware base (a Thinking Machines CM-5): the CHAOS irregular application library, transparent shared memory (TSM), and extensible shared memory (XSM). CHAOS and XSM performed equivalently for all three applications. Both systems are somewhat (13%) to significantly (991%) faster than TSM.

 

Gagan Agrawal, Alan Sussman, and Joel Saltz. 

IEEE Transactions on Parallel and Distributed Systems, pages 747-754, Volume 6, Number 7, July 1995 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3143, UMIACS-TR-93-94 
 

Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid codes) and/or irregularly coupled (called multiblock or irregularly coupled regular mesh problems). In this paper, we present a combined runtime and compile-time approach for parallelizing these applications on distributed memory parallel machines in an efficient and machine-independent fashion. We have designed and implemented a runtime library which can be used to port these applications to distributed memory machines. The library is currently implemented on several different systems. Since the design of the library is machine independent, it can be easily ported to other distributed memory machines and environments which support message passing. To further ease the task of application programmers, we have developed methods for integrating this runtime library with compilers for HPF-like parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University. We present experimental results to demonstrate the efficacy of our approach. We have experimented with a multiblock Navier-Stokes solver template and a multigrid code. Our experimental results show that our primitives have low runtime communication overheads. Further, the compiler parallelized codes perform within 20% of the code parallelized by manually inserting calls to the runtime library.

 

Yuan-Shin Hwang, Raja Das, Joel Saltz, Bernard Brooks, and Milan Hodoscek 

IEEE Computational Science and Engineering,  pages 18-29, Volume 2, Number 2, Summer 1995 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3374, UMIACS-TR-94-125 
 

CHARMM (Chemistry at Harvard Macromolecular Mechanics) is a program that is widely used to model and simulate macromolecular systems. CHARMM has been parallelized by using the CHAOS runtime support library on distributed memory architectures. This implementation distributes both data and computations over processors. This data-parallel strategy should make it possible to simulate very large molecules on large numbers of processors.
In order to minimize communication among processors and to balance computational load, a variety of partitioning approaches are employed to distribute the atoms and computations over processors. In this implementation, atoms are partitioned based on geometric position and computational load by using unweighted or weighted recursive coordinate bisection. The experimental results reveal that taking computational load into account is essential. The performance of two iteration partitioning algorithms, atom decomposition and force decomposition, is also compared. A new irregular force decomposition algorithm is introduced and implemented.
The CHAOS library is designed to facilitate the parallelization of irregular applications. This library (1) couples partitioners to the application programs, (2) remaps data and partitions work among processors, and (3) optimizes interprocessor communication. This paper presents an application of CHAOS that can be used to support the efficient execution of irregular problems on distributed memory machines.
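
Recursive coordinate bisection, the geometric partitioner mentioned above, is compact enough to sketch. The following is a minimal weighted version (illustrative, not the CHARMM implementation):

```python
# Recursively split the atoms along the coordinate of largest extent,
# placing the cut so each side carries a share of the computational load
# proportional to the number of parts it must fill.

def rcb(atoms, nparts):
    """atoms: list of ((x, y, z), weight); returns nparts groups."""
    if nparts == 1 or len(atoms) <= 1:
        return [atoms]
    dim = max(range(3), key=lambda d: max(a[0][d] for a in atoms) -
                                      min(a[0][d] for a in atoms))
    atoms = sorted(atoms, key=lambda a: a[0][dim])
    left = nparts // 2
    target = sum(w for _, w in atoms) * left / nparts  # load of the left side
    running, cut = 0.0, 1
    for k, (_, w) in enumerate(atoms):
        running += w
        if running >= target:
            cut = max(1, k + 1)
            break
    return rcb(atoms[:cut], left) + rcb(atoms[cut:], nparts - left)

parts = rcb([((i % 7, i % 5, i % 3), 1.0 + i % 4) for i in range(32)], 4)
print([sum(w for _, w in p) for p in parts])  # per-partition load
```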

 

Y.-S. Hwang, B. Moon, S. Sharma, R. Ponnusamy, R. Das, and J. Saltz 

Software Practice & Experience, pages 597-621, Volume 25, Number 6, June 1995 
 

In many scientific applications, arrays containing data are indirectly indexed through indirection arrays. Such scientific applications are called irregular programs and are a distinct class of applications that require special techniques for parallelization. 

This paper presents a library called CHAOS, which helps users implement irregular programs on distributed memory machines, such as the Paragon, Delta, CM-5, and SP-1. The CHAOS library provides efficient runtime primitives for distributing data and computation over processors; it supports efficient index translation mechanisms and provides users with high-level mechanisms for optimizing communication. CHAOS subsumes the previous PARTI library and supports a larger class of applications. In particular, it provides efficient support for parallelization of adaptive irregular programs where indirection arrays are modified during the course of the computation. To demonstrate the efficacy of CHAOS, two challenging real-life applications were parallelized using CHAOS primitives: a molecular dynamics code, CHARMM, and a particle-in-cell code, DSMC.
Besides providing runtime support to users, CHAOS can also be used by compilers to automatically parallelize irregular applications. This paper shows how CHAOS can be used in such a framework. By embedding CHAOS primitives in the Syracuse Fortran 90D/HPF compiler, kernels taken from the CHARMM and DSMC codes have been parallelized automatically.
 

G. Agrawal, J. Saltz, and R. Das 

Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 258-269, La Jolla, California,  June 18-21, 1995 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3446 and UMIACS-TR-95-42 
 

Partial Redundancy Elimination (PRE) is a general scheme for suppressing partial redundancies which encompasses traditional optimizations like loop invariant code motion and redundant code elimination. In this paper we address the problem of performing this optimization interprocedurally. We use interprocedural partial redundancy elimination for placement of communication and communication preprocessing statements while compiling for distributed memory parallel machines.

 

J. Wu, R. Das, J. Saltz, H. Berryman, and S. Hiranandani 

IEEE Transactions on Computers, pages 737-753, Volume 44, Number 6, June 1995 
 

This paper addresses the issue of compiling concurrent loop nests in the presence of complicated array references and irregularly distributed arrays. Arrays accessed within loops may contain accesses that make it impossible to precisely determine the reference pattern at compile time. This paper proposes a runtime support mechanism that is used effectively by a compiler to generate efficient code in these situations. The compiler accepts as input a Fortran 77 program enhanced with specifications for distributing data, and outputs a message passing program that runs on the nodes of a distributed memory machine. The runtime support for the compiler consists of a library of primitives designed to support irregular patterns of distributed array accesses and irregularly distributed array partitions. A variety of performance results on the Intel iPSC/860 are presented.

 

G. Edjlali, G. Agrawal, A. Sussman, and J. Saltz 

Proceedings of the Ninth International Parallel Processing Symposium, pages 827-832, Santa Barbara, California, April 24-28, 1995 

University of Maryland Technical Report: CS-TR-3350 and UMIACS-TR-94-109 (extended version) 
 

For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at runtime. In this paper, we discuss runtime support for data parallel programs in such an adaptive environment. Executing data parallel programs in an adaptive environment requires determining new loop bounds and communication patterns for the new set of processors. We have developed a runtime library to provide this support. We discuss how the runtime library can be used by compilers to generate code for an adaptive environment. We also present performance results for a multiblock Navier-Stokes solver run on a network of workstations using PVM for message passing. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computations.

 

B. Moon, M. Uysal, and J. Saltz 

Proceedings of the Ninth International Parallel Processing Symposium,  pages 812-819, Santa Barbara, California, April 24-28, 1995 
 

Current research in parallel programming is focused on closing the gap between globally indexed algorithms and the separate address spaces of processors on distributed memory multicomputers. A set of index translation schemes have been implemented as part of the CHAOS runtime support library, so that the library functions can be used to implement a global index space across a collection of separate local index spaces. These schemes include two software-cached translation schemes aimed at adaptive irregular problems as well as a distributed translation table technique for statically irregular problems. To evaluate and demonstrate the efficiency of the software-cached translation schemes, experiments have been performed with an adaptive irregular loop kernel and a full-fledged 3D DSMC code from NASA Langley on the Intel Paragon and Cray T3D. This paper also discusses and analyzes the operational conditions under which each scheme can produce optimal performance.
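
The distributed translation table is the easiest of these schemes to sketch. In the illustrative Python below, the global-index table is itself block-distributed, so translating index g means asking the processor that holds entry g (here the "processors" are plain lists rather than real nodes):

```python
# Illustrative sketch of a distributed translation table: the mapping from
# global index g to (owning processor, local offset) is itself block-
# distributed across the processors.

NPROCS, N = 4, 16
owner_of = [g * 7 % NPROCS for g in range(N)]  # an irregular distribution
local_of, counts = {}, [0] * NPROCS
for g in range(N):
    local_of[g] = counts[owner_of[g]]
    counts[owner_of[g]] += 1

block = N // NPROCS  # table entries for g in [p*block, (p+1)*block) live on p
table = [{g: (owner_of[g], local_of[g])
          for g in range(p * block, (p + 1) * block)} for p in range(NPROCS)]

def translate(g):
    keeper = g // block       # the processor holding this table entry
    return table[keeper][g]   # one message to `keeper` on a real machine

print(translate(11))  # (1, 2): element 11 is the third element on processor 1
```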

 

C. Chang, A. Sussman, and J. Saltz 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3438 and UMIACS-TR-95-35 

March 1995 
 

Object-oriented applications utilize language constructs such as pointers to synthesize dynamic complex data structures, such as linked lists, trees and graphs, with elements consisting of complex data types. Traditionally, however, applications executed on distributed memory parallel architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. Existing runtime systems that rely solely on global indices are not always applicable to object-oriented applications, since no global names or indices are imposed upon dynamic complex data structures linked by pointers.
 
We describe a portable object-oriented runtime library that has been designed to support applications that use dynamic distributed data structures, including both arrays and pointer-based data structures. In particular, CHAOS++ deals with complex data types and pointer-based data structures, by providing two abstractions, mobile objects and globally addressable objects. CHAOS++ uses preprocessing techniques to analyze communication patterns, and provides data exchange routines to perform efficient transfers between processors. Results for applications taken from three distinct classes are also presented to demonstrate the wide applicability and good performance characteristics of the runtime library.
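
The mobile-object abstraction amounts to user-supplied pack/unpack methods that let the runtime ship a pointer-based structure between address spaces. A minimal sketch (in Python for brevity; CHAOS++ itself is a C++ library):

```python
# The user supplies pack/unpack methods, so the runtime can move a
# pointer-based structure between address spaces without knowing its
# internal layout.

class TreeNode:
    def __init__(self, value, children=None):
        self.value, self.children = value, children or []

    def pack(self):
        # Flatten the pointer-based structure into a contiguous buffer.
        return (self.value, [c.pack() for c in self.children])

    @staticmethod
    def unpack(buf):
        value, kids = buf
        return TreeNode(value, [TreeNode.unpack(k) for k in kids])

root = TreeNode(1, [TreeNode(2), TreeNode(3, [TreeNode(4)])])
buffer = root.pack()             # what the runtime would ship to a peer
clone = TreeNode.unpack(buffer)  # rebuilt in the receiving address space
print(clone.children[1].children[0].value)  # 4
```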
 

J. Saltz, R. Ponnusamy, S. Sharma, B. Moon, Y.-S. Hwang, M. Uysal, and R. Das 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3437 and UMIACS-TR-95-34 

March 1995 
 

Procedures are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equation solvers) on distributed memory machines. These procedures are also designed for use in compilers for distributed memory multiprocessors. The portable CHAOS procedures are designed to support dynamic data distributions and to automatically generate send and receive messages by capturing communication patterns at runtime.

 

Bongki Moon, Gopal Patnaik, Robert Bennett, David Fyfe, Alan Sussman, Craig Douglas, K. Kailasanath, and Joel Saltz
 
Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing,  pages 575-580, February 1995.
 

One class of scientific and engineering applications involves structured meshes. One example of a code in this class is a flame modelling code developed at the Naval Research Laboratory (NRL). The numerical model used in the NRL flame code is predominantly based on structured finite volume methods. The chemistry of the reactive flow is modeled by a system of ordinary differential equations which is solved independently at each grid point. Thus, though the model uses a mesh structure, the workload at each grid point can vary considerably. It is this feature that requires the use of both structured and unstructured methods in the same code. We have applied the Multiblock PARTI and CHAOS runtime support libraries to parallelize the NRL flame code with minimal changes to the sequential code. We have also developed parallel algorithms to carry out dynamic load balancing. It has been observed that the overall performance scales reasonably well up to 256 Paragon processors and that the total runtime on a 256-node Paragon is about half that of a single-processor Cray C90. 

 

Chialin Chang, Alan Sussman, and Joel Saltz 

University of Maryland: Department of Computer Science Technical Report CS-TR-3416 and UMIACS-TR-95-19 

January 1995 
 

Traditionally, applications executed on distributed memory architectures in single-program multiple-data (SPMD) mode use distributed (multi-dimensional) data arrays. Good performance has been achieved by applying runtime techniques to such applications executing in a loosely synchronous manner. However, many applications utilize language constructs such as pointers to synthesize dynamic complex data structures, such as linked lists, trees and graphs, with elements consisting of complex composite data types. Existing runtime systems that rely on global indices cannot be used for these applications, as no global names or indices are imposed upon the elements of these data structures.
A portable object-oriented runtime library is presented to support applications that use dynamic distributed data structures, including both arrays and pointer-based data structures. In particular, CHAOS++ deals with complex data types and pointer-based data structures by providing mobile objects and globally addressable objects. Preprocessing techniques are used to analyze communication patterns, and data exchange primitives are provided to carry out efficient data transfer. Performance results for applications taken from three distinct classes are also included to demonstrate the wide applicability of the runtime library.

 

R. Bennett, K. Bryant, A. Sussman, R. Das, and J. Saltz 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3417 and UMIACS-TR-95-20 

January 1995 
 

There has been a great deal of recent interest in parallel I/O. This paper discusses issues in the design and implementation of a portable I/O library designed to optimize the performance of multiprocessor architectures that include multiple disks or disk arrays. The major emphasis of the paper is on optimizations that are made possible by the use of collective I/O, so that the I/O requests of multiple processors can be combined to improve performance. Performance measurements from benchmarking our implementation of the library, called Jovian, which currently performs collective local optimizations, on three application templates are also presented.
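
The core collective-I/O optimization can be sketched independently of the library. The fragment below (illustrative, not the Jovian interface) coalesces the gathered (offset, size) requests of all processes into a few large contiguous accesses:

```python
# Gather the (offset, size) requests of all processes, sort them, and
# coalesce touching or overlapping ranges into a few large contiguous
# accesses instead of many small seeks.

def coalesce(requests):
    """requests: list of (offset, size) pairs from all processes."""
    merged = []
    for off, size in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1]:  # touches previous
            last_off, last_size = merged[-1]
            merged[-1] = (last_off, max(last_size, off + size - last_off))
        else:
            merged.append((off, size))
    return merged

# Four processes each want three interleaved 4 KB records:
reqs = [(p * 4096 + s * 32768, 4096) for p in range(4) for s in range(3)]
print(len(reqs), "->", len(coalesce(reqs)))  # 12 -> 3
```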

 

F. Chong, S. Sharma, E. Brewer, and J. Saltz 

International Conference on Supercomputing 1995, December 1994 
 

This paper examines the parallelization of computations that can be characterized as a fine-grained, irregular, directed acyclic graph of tasks. Such computations typically arise in sparse matrix applications where loops may have loop-carried data dependencies. The fine grain of the computation and the irregular nature of the dependencies make such loops extremely hard to parallelize efficiently. We use runtime preprocessing to parallelize such computations. A preprocessing step analyzes the DAG and determines an optimized mapping of tasks to processors, an appropriate order for executing these tasks on each processor and a set of synchronization constraints. This information is used to distribute data across processors, and to schedule computation, communication and synchronization. We use sparse-matrix triangular solution as our example application and compare a range of runtime preprocessed techniques for its parallelization. Whereas previous studies have achieved maximum speedups of less than 4 on even simple banded matrices, we achieve a speedup of almost 35 on 128 processors of the CM-5 on an irregular matrix with 5300 rows. We find that send and receive overheads dominate the runtime of our application. Efficient parallelization of such loops requires even more support for low latency communication.
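
One standard form of such runtime preprocessing is level scheduling: assign each row of the triangular solve a level equal to its longest dependence chain, so that all rows in a level can execute concurrently. A minimal sketch (illustrative; the paper compares a range of techniques, of which this is only one):

```python
# Row i of a sparse lower-triangular solve depends on every row j < i with a
# nonzero L[i][j], so a preprocessing pass assigns each row a level equal to
# its longest dependence chain. All rows in the same level are independent
# and may run on different processors.

def levels(dependencies):
    """dependencies[i]: list of rows that row i needs; rows are in order."""
    level = [0] * len(dependencies)
    for i in range(len(dependencies)):
        for j in dependencies[i]:
            level[i] = max(level[i], level[j] + 1)
    return level

# Row -> rows it depends on, for a small lower-triangular sparsity pattern.
deps = {0: [], 1: [0], 2: [], 3: [1, 2], 4: [2], 5: [3, 4]}
print(levels([deps[i] for i in range(6)]))
# [0, 1, 0, 2, 1, 3]: waves {0,2}, {1,4}, {3}, {5} execute one after another
```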

 

Shamik D. Sharma, Ravi Ponnusamy, Bongki Moon, Yuan-Shin Hwang, Raja Das, and Joel Saltz 

Supercomputing 1994, pages 97-106, Washington, District of Columbia, November 14-18, 1994 
 

In adaptive irregular problems the data arrays are accessed via indirection arrays, and data access patterns change during computation. Implementing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration. This research presents efficient runtime primitives for such problems. This new set of primitives is part of the CHAOS library. It subsumes the previous PARTI library which targeted only static irregular problems. To demonstrate the efficacy of the runtime support, two real adaptive irregular applications have been parallelized using CHAOS primitives: a molecular dynamics code (CHARMM) and a particle-in-cell code (DSMC). The paper also proposes extensions to Fortran D which can allow compilers to generate more efficient code for adaptive problems. These language extensions have been implemented in the Syracuse Fortran 90D/HPF prototype compiler. The performance of the compiler parallelized codes is compared with the hand parallelized versions.

 

R. Bennett, K. Bryant, A. Sussman, and R. Das 

Proceedings of the 1994 Scalable Parallel Libraries Conference, pages 10-20, Mississippi State, Mississippi, October 12-14, 1994 
 

There has been a great deal of recent interest in parallel I/O. In this paper, we discuss the design and implementation of the Jovian library, which is intended to optimize the I/O performance of multiprocessor architectures that include multiple disks or disk arrays. We also present preliminary performance measurements from benchmarking the Jovian I/O library on the IBM SP1 distributed memory parallel machine for two application templates.
 

Rahul Parulekar, Larry Davis, Rama Chellappa, Joel Saltz, Alan Sussman, and John Townshend 

Proceedings of the International Joint Conference on Pattern Recognition, IEEE Computer Society Press, pages 234-238, Jerusalem, Israel, October 9-13, 1994 
 

We present the overall goals of our research program on the application of high performance computing to remote sensing applications, specifically, applications in land cover dynamics. This involves developing scalable and portable programs for a variety of image and map data processing applications, eventually integrated with new models for parallel I/O of large scale images and maps. After an overview of the multiblock PARTI run time support system, we explain extensions made to that system to support image processing applications, and then present an example involving multiresolution image processing. Results of running the parallel code on both a TMC CM5 and an Intel Paragon are discussed.

 

R. Das, M. Uysal, J. Saltz, and Y.-S. Hwang 

Journal of Parallel and Distributed Computing, pages 462-479, Volume 22, Number 3, September 1994 

University of Maryland Technical Report CS-TR-3163 and UMIACS-TR-93-109 
 

This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. We describe software primitives that (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. We present a detailed performance and scalability analysis of the communication primitives. This performance and scalability analysis is carried out using a workload generator, kernels from real applications and a large unstructured adaptive application (the molecular dynamics code CHARMM).

 

R. Nance, R. Wilmoth, B. Moon, H. Hassan, and J. Saltz. 

Proceedings of the 6th AIAA/ASME Joint Thermophysics and Heat Transfer Conference, Colorado Springs, Colorado, June 20-23, 1994 
 

This paper describes a parallel implementation of the direct simulation Monte Carlo (DSMC) method. Runtime library support is used for scheduling and execution of communication between nodes, and domain decomposition is performed dynamically to maintain a good load balance. Performance tests are conducted using the code to evaluate various remapping and remapping-interval policies, and it is shown that a one-dimensional chain-partitioning method works best for the problems considered. The parallel code is then used to simulate Mach 20 nitrogen flow over a finite-thickness flat plate. It is shown that the parallel algorithm produces results which compare well with experimental data. Moreover, it yields significantly faster execution times than the scalar code, as well as very good load-balance characteristics.

 

 

Ravi Ponnusamy, Yuan-Shin Hwang, Joel Saltz, Alok Choudhary, and Geoffrey Fox 

IEEE Parallel & Distributed Technology, Spring 1995 (revised version) 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3268 and UMIACS-TR-94-57

May 1994 
 

We present methods that make it possible to efficiently support an important subclass of irregular problems using data parallel languages. The approach we describe involves the use of a portable, compiler-independent, runtime support library called CHAOS. The CHAOS runtime support library contains procedures that support static and dynamic distributed array partitioning, partition loop iterations and indirection arrays, remap arrays from one distribution to another, and carry out index translation, buffer allocation and communication schedule generation.
The CHAOS runtime procedures are used by a prototype Fortran 90D compiler as runtime support for irregular problems. We present performance results of compiler-generated and hand-parallelized versions of two stripped-down application codes. The first code is derived from an unstructured mesh computational fluid dynamics flow solver and the second is derived from the molecular dynamics code CHARMM.
A method is described that makes it possible to emulate irregular distributions in HPF by reordering elements of data arrays and renumbering indirection arrays. We present results that suggest that an HPF compiler could use reordering and renumbering extrinsic functions to obtain performance comparable to that achieved by a compiler for a language (such as Fortran 90D) that directly supports irregular distributions.
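
The reordering/renumbering idea can be shown in a few lines. In this illustrative sketch, data elements are moved to new positions (the permutation below is assumed) and every reference in the indirection array is renumbered to match, leaving the results of indirect accesses unchanged:

```python
# Move data elements to new positions and renumber every entry of the
# indirection array to match, so all indirect accesses return the same
# values as before the reordering.

data = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
ind = [5, 0, 3, 0, 2]              # indirection array used in inner loops
new_home = [4, 0, 5, 1, 2, 3]      # new_home[old] = new position (assumed)

reordered = [0.0] * len(data)
for old, new in enumerate(new_home):
    reordered[new] = data[old]            # move the data once...
renumbered = [new_home[i] for i in ind]   # ...and renumber every reference

assert [reordered[i] for i in renumbered] == [data[i] for i in ind]
print(renumbered)  # [3, 4, 1, 4, 5]
```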

 

S. Sharma, R. Ponnusamy, B. Moon, Y. Hwang, R. Das, and J. Saltz 

University of Maryland and UMIACS Technical Report: CS-TR-3266 and UMIACS-TR-94-55 

May 1994 

A revised version appears in Proceedings of Supercomputing '94 
 

In adaptive irregular problems, data arrays are accessed via indirection arrays, and data access patterns change during computation.  Parallelizing such problems on distributed memory machines requires support for dynamic data partitioning, efficient preprocessing and fast data migration.  This paper describes CHAOS, a library of efficient runtime primitives that provides such support.  To demonstrate the effectiveness of the runtime support, two adaptive irregular applications have been parallelized using CHAOS primitives:  a molecular dynamics code (CHARMM) and a code for simulating gas flows (DSMC).  We have also proposed minor extensions to Fortran D which would enable compilers to parallelize irregular forall loops in such adaptive applications by embedding calls to primitives provided by a runtime library.  We have implemented our proposed extensions in the Syracuse Fortran 90D/HPF prototype compiler, and have used the compiler to parallelize kernels from two adaptive applications.

 

Bongki Moon and Joel Saltz. 

Proceedings of the Scalable High Performance Computing Conference 1994, pages 176-183, Knoxville, Tennessee, May 23-25, 1994 
 

In highly adaptive irregular problems such as many Particle-In-Cell (PIC) codes and Direct Simulation Monte Carlo (DSMC) codes, data access patterns may vary from time step to time step. This fluctuation may hinder efficient utilization of distributed memory parallel computers because of the resulting overhead for data redistribution and dynamic load balancing. It may also hinder efficient utilization of runtime pre-processing because the pre-processing requirements are sensitive to perturbations in the data access patterns. To efficiently parallelize such adaptive irregular problems on distributed memory parallel computers, several issues such as effective methods for domain partitioning, efficient index dereferencing and fast data transportation must be addressed. This paper presents efficient runtime support methods for such problems. These new runtime support primitives have recently been implemented and added to the CHAOS library. A new domain partitioning algorithm is introduced: a simple one-dimensional domain partitioning method is implemented and compared with unstructured mesh partitioners such as recursive coordinate bisection and recursive inertial bisection. A remapping decision policy has been investigated for dynamic load balancing in three-dimensional DSMC codes. Performance results are presented.
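
One-dimensional chain partitioning splits a line of per-cell workloads into contiguous chains, one per processor, while minimizing the heaviest chain. A minimal sketch of one standard formulation (binary search on the bottleneck load; illustrative, not the CHAOS implementation):

```python
# Split a line of cell workloads into P contiguous chains, minimizing the
# heaviest chain, by binary-searching on the bottleneck load and greedily
# checking feasibility.

def chain_partition(weights, nparts):
    def fits(cap):
        chains, load = 1, 0.0
        for w in weights:
            if load + w > cap:               # start a new chain
                chains, load = chains + 1, 0.0
            load += w
        return chains <= nparts
    lo, hi = max(weights), sum(weights)      # bounds on the bottleneck
    while hi - lo > 1e-9 * hi:
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if fits(mid) else (mid, hi)
    return hi

loads = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(round(chain_partition(loads, 4), 2))  # 9.0, the best bottleneck load
```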

 

Y.-S. Hwang, B. Moon, S. Sharma, R. Das, and J. Saltz 

Proceedings of the Workshop on Environments and Tools for Parallel Scientific Computing, pages 19-32, May 1994 
 

This paper describes how a runtime support library can be used as compiler runtime support for irregular applications. The CHAOS runtime support library carries out optimizations designed to reduce communication costs by performing software caching, communication coalescing and inspector/executor preprocessing. CHAOS also supplies special purpose routines to support specific types of irregular reductions, as well as runtime support for partitioning data and work between processors. A number of adaptive irregular codes have been parallelized using the CHAOS library and performance results from these codes are also presented in this paper.

 

Frederick T. Chong, Shamik D. Sharma, Eric Brewer, and Joel Saltz 

Parallel Processing Letters, pages 671-683, Volume 5, Number 4, 1995 

University of Maryland: Department of Computer Science Technical Report, CS-TR-3266 

March 1994 
 

We describe multiprocessor runtime support for fine-grained, irregular directed acyclic graphs (DAGs) such as those that arise from sparse-matrix triangular solves. We conduct our experiments on the CM-5, whose lower latencies and active-message support allow us to achieve unprecedented speedups for a general multiprocessor. Whereas previous implementations have maximum speedups of less than 4 on even simple banded matrices, we are able to obtain scalable performance on extremely small and irregular problems. On a matrix with only 5300 rows, we are able to achieve scalable performance with a speedup of almost 35 for 128 processors.
We achieve these speedups with non-matrix-specific methods which are applicable to any DAG. We compare a range of run-time preprocessed and dynamic approaches on matrices from the Harwell-Boeing benchmark set. Although precomputed data distributions and execution schedules produce the best performance, we find that it is challenging to keep their cost low enough to make them worthwhile on small, fine-grained problems.
Additionally, we find that a policy of frequent network polling can reduce communication overhead by a factor of three over the standard CM-5 policies. We present a detailed study of runtime overheads and demonstrate that send and receive processor overhead still dominate these applications on the CM-5. We conclude that these applications would benefit greatly from architectural support for low-overhead communication.
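
One standard way to preprocess such a DAG, shown below as an illustrative C sketch, is level scheduling: each row of the triangular matrix is assigned the length of its longest dependence chain, and rows in the same level can be solved in parallel. This is offered only as a concrete instance of runtime DAG preprocessing, not as the authors' scheduler; the matrix pattern is invented.

#include <stdio.h>

#define N 6

int main(void) {
    /* Lower-triangular pattern in CSR form (off-diagonal entries only). */
    int rowptr[N + 1] = {0, 0, 1, 1, 3, 4, 6};
    int colidx[6]     = {0, 0, 2, 3, 1, 4};
    /* row 1 <- {0}; row 3 <- {0,2}; row 4 <- {3}; row 5 <- {1,4} */

    int level[N];
    int maxlevel = 0;
    for (int i = 0; i < N; i++) {
        level[i] = 0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            if (level[colidx[k]] + 1 > level[i])
                level[i] = level[colidx[k]] + 1;
        if (level[i] > maxlevel) maxlevel = level[i];
    }

    /* Rows within one level have no mutual dependences. */
    for (int l = 0; l <= maxlevel; l++) {
        printf("level %d:", l);
        for (int i = 0; i < N; i++)
            if (level[i] == l) printf(" row %d", i);
        putchar('\n');
    }
    return 0;
}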

 

Raja Das, D. J. Mavriplis, J. Saltz, S. Gupta, and R. Ponnusamy 

Journal of the American Institute of Aeronautics and Astronautics (AIAA), pages 489-496, Volume 32, Number 3, March 1994 
 

This paper is concerned primarily with the implementation of a three-dimensional unstructured-grid Euler solver on massively parallel distributed memory computer architectures. The goal is to minimize solution time by achieving high computational rates with a numerically efficient algorithm. An unstructured multigrid algorithm with an edge-based data structure has been adopted, and a number of optimizations have been devised and implemented in order to accelerate the parallel computational rates. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code, while providing a basis for future automatic run-time support. Large practical unstructured grid problems are solved on the Intel iPSC/860 hypercube and the Intel Touchstone Delta machine. The quantitative effects of the various optimizations are demonstrated, and we show that their combined effect leads to roughly a factor of three performance improvement. The overall solution efficiency is compared with that obtained on the CRAY-YMP vector supercomputer.
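
The edge-based data structure mentioned above can be illustrated with a short C sketch (not the solver's actual code): fluxes are computed in a single loop over edges, and each edge scatter-adds its contribution to its two endpoints with opposite signs; the weights and values are invented.

#include <stdio.h>

#define NNODE 4
#define NEDGE 5

int main(void) {
    int    edge[NEDGE][2] = {{0,1},{1,2},{2,3},{0,2},{1,3}};
    double w[NEDGE]       = {1.0, 2.0, 3.0, 4.0, 5.0}; /* edge weights */
    double u[NNODE]       = {10.0, 20.0, 30.0, 40.0};  /* nodal values */
    double res[NNODE]     = {0};                       /* accumulated residual */

    /* One edge-based loop: compute a "flux" per edge, scatter-add it to
     * both endpoints with opposite signs (conservation). */
    for (int e = 0; e < NEDGE; e++) {
        int n1 = edge[e][0], n2 = edge[e][1];
        double flux = w[e] * (u[n2] - u[n1]);
        res[n1] += flux;
        res[n2] -= flux;
    }

    for (int n = 0; n < NNODE; n++)
        printf("res[%d] = %g\n", n, res[n]);
    return 0;
}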
 

Alan Sussman, Gagan Agrawal, and Joel Saltz 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3070.1, UMIACS-TR-93-36.1 

December 1993 
 

There exists a large class of scientific applications that are composed of irregularly coupled regular mesh (ICRM) computations. These problems are often referred to as block structured, or multiblock, problems and include the Block Structured Navier-Stokes solver developed at NASA Langley called TLNS3D.
Primitives are presented that are designed to help users efficiently program such problems on distributed memory machines. These primitives are also designed for use by compilers for distributed memory multiprocessors. Communication patterns are captured at runtime, and the appropriate send and receive messages are automatically generated. The primitives are also useful for parallelizing regular computations, since block structured computations also require all the runtime support necessary for regular computations.
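
A regular section move of the kind these primitives automate can be sketched, for the degenerate single-processor case, as a strided copy; the C fragment below is illustrative only, with invented bounds, and elides the distributed-memory message generation that the real primitive performs.

#include <stdio.h>

int main(void) {
    double src[12], dst[12] = {0};
    for (int i = 0; i < 12; i++) src[i] = i;

    /* Move src[2:10:2] (elements 2,4,6,8,10) into dst[1:9:2]:
     * a change of both offset and stride. */
    int src_lo = 2, src_stride = 2;
    int dst_lo = 1, dst_stride = 2;
    int count  = 5;
    for (int k = 0; k < count; k++)
        dst[dst_lo + k * dst_stride] = src[src_lo + k * src_stride];

    for (int i = 0; i < 12; i++) printf("%g ", dst[i]);
    putchar('\n');   /* 0 2 0 4 0 6 0 8 0 10 0 0 */
    return 0;
}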

 

Gagan Agrawal, Alan Sussman, and Joel Saltz. 

Proceedings of Supercomputing '93, pages 578-587, Portland, Oregon, November 15-19, 1993 

University of Maryland Technical Report: CS-TR-3052 and UMIACS-TR-93-29 
 

Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library for parallelizing this general class of applications on distributed memory parallel machines in an efficient and machine independent manner. In this paper we describe how this runtime library can be integrated with compilers for High Performance Fortran (HPF) style parallel programming languages. We discuss how we have integrated this runtime library with the Fortran 90D compiler being developed at Syracuse University and provide experimental data on a block structured Navier-Stokes solver template and a small multigrid example parallelized using this compiler and run on an Intel iPSC/860. We show that the compiler-parallelized code performs within 20% of the code parallelized by inserting calls to the runtime library manually.

 

Ravi Ponnusamy, Joel Saltz, and Alok Choudhary 

Proceedings of Supercomputing '93, pages 361-370, Portland, Oregon, November 15-19, 1993 

University of Maryland and UMIACS Technical Report: CS-TR-3055 and UMIACS-TR-93-32 
 

In this paper, we describe two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user-specified mapping procedure via a set of compiler directives. The directives allow the user to use program arrays to describe graph connectivity, spatial location of array elements and computational load. The second is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed results from inspectors (e.g., communication schedules, loop iteration partitions, and information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
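
The second mechanism, inspector reuse, amounts to caching a communication schedule and invalidating it conservatively. The C sketch below is illustrative only (not the compiler's implementation): a "modified" flag stands in for the conservative compile-time analysis, and the inspector work itself is elided.

#include <stdio.h>

typedef struct {
    int built;      /* has a schedule been computed? */
    int modified;   /* conservative flag: inputs may have changed */
    /* ... gather/scatter lists would live here ... */
} Schedule;

static void build_schedule(Schedule *s, const int *indir, int n) {
    (void)indir; (void)n;          /* inspector work elided */
    s->built = 1;
    s->modified = 0;
    printf("inspector: schedule (re)built\n");
}

static void executor(Schedule *s, const int *indir, int n) {
    if (!s->built || s->modified)  /* reuse unless invalidated */
        build_schedule(s, indir, n);
    printf("executor: loop runs with cached schedule\n");
}

int main(void) {
    int indir[4] = {3, 1, 2, 0};
    Schedule s = {0, 1};

    executor(&s, indir, 4);        /* first time step: builds schedule */
    executor(&s, indir, 4);        /* access pattern unchanged: reuses it */
    indir[0] = 2; s.modified = 1;  /* a writer invalidates the schedule */
    executor(&s, indir, 4);        /* rebuilds */
    return 0;
}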

 

Raja Das, Y. Hwang, M. Uysal, J. Saltz, and A. Sussman 

Proceedings of the Scalable Parallel Libraries Conference 1993, pages 45-56, Starkville, Mississippi, October 6-8, 1993 
 

This paper describes a number of optimizations that can be used to support the efficient execution of irregular problems on distributed memory parallel machines. We describe software primitives that (1) coordinate interprocessor data movement, (2) manage the storage of, and access to, copies of off-processor data, (3) minimize interprocessor communication requirements and (4) support a shared name space. The performance of the primitives is characterized by examination of kernels from real applications and from a full implementation of a large unstructured adaptive application (the molecular dynamics code CHARMM).

 

Gagan Agrawal, Alan Sussman, and Joel Saltz 

University of Maryland: Department of Computer Science and UMIACS Technical Reports CS-TR-3140, UMIACS-TR-93-92 

October 1993 
 

Scientific and engineering applications often involve structured meshes. These meshes may be nested (for multigrid or adaptive codes) and/or irregularly coupled (called Irregularly Coupled Regular Meshes). We have designed and implemented a runtime library for parallelizing this general class of applications on distributed memory parallel machines in an efficient and machine independent manner. One important communication primitive supported by this library is the regular section move, which copies a regular section from a distributed array to another distributed array, potentially involving changes in offset, stride and rotation of dimensions. In this paper we discuss the regular section analysis which is required for efficiently generating schedules for this kind of communication. We discuss the details of the analysis required when the distributions of arrays may be block or cyclic.
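
The index arithmetic at the heart of this analysis can be made concrete. The C sketch below (illustrative, with invented sizes) computes, for both BLOCK and CYCLIC distributions of an N-element array over P processors, the owning processor and local offset of a global index; the regular section analysis composes such maps with section strides and offsets.

#include <stdio.h>

#define N 16
#define P 4

int main(void) {
    int block = (N + P - 1) / P;  /* block size for a BLOCK distribution */
    for (int g = 0; g < N; g += 3) {
        int blk_owner = g / block, blk_local = g % block;
        int cyc_owner = g % P,     cyc_local = g / P;
        printf("global %2d: BLOCK -> (proc %d, local %d), "
               "CYCLIC -> (proc %d, local %d)\n",
               g, blk_owner, blk_local, cyc_owner, cyc_local);
    }
    return 0;
}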

 

Raja Das, Joel Saltz, and Reinhard v. Hanxleden. 

Proceedings of the 6th Workshop on Languages and Compilers for Parallel Computing, pages 152-168, Portland, Oregon, August 12-14, 1993 

University of Maryland Technical Report: CS-TR-3076 and UMIACS-TR-93-42 
 

An increasing fraction of the applications targeted by parallel computers makes heavy use of indirection arrays for indexing data arrays. Such irregular access patterns make it difficult for a compiler to generate efficient parallel code. A limitation of existing techniques addressing this problem is that they are only applicable for a single level of indirection. However, many codes using sparse data structures access their data through multiple levels of indirection.

This paper presents a method for transforming programs using multiple levels of indirection into programs with at most one level of indirection, thereby broadening the range of applications that a compiler can parallelize efficiently. A central concept of our algorithm is to perform program slicing on the subscript expressions of the indirect array accesses. Such slices peel off the levels of indirection, one by one, and create opportunities for aggregated data prefetching in between. A slice graph eliminates redundant preprocessing and gives an ordering in which to compute the slices. We present our work in the context of High Performance Fortran; an implementation in the Fortran D prototype compiler is in progress.
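
The transformation can be sketched concretely in C (illustrative only; the paper works in HPF/Fortran D). A loop accessing x[ia[ib[i]]] is split into a slice that gathers the inner subscript into a temporary, itself a single-level gather that can be prefetched in aggregate, followed by a loop with one level of indirection; all array contents are invented.

#include <stdio.h>

#define N 4

int main(void) {
    double x[8]  = {0, 10, 20, 30, 40, 50, 60, 70};
    int    ia[8] = {7, 6, 5, 4, 3, 2, 1, 0};
    int    ib[N] = {1, 3, 0, 2};
    double y[N];

    /* Slice 1: peel off the inner indirection (a single-level gather). */
    int t[N];
    for (int i = 0; i < N; i++) t[i] = ia[ib[i]];

    /* Slice 2: the remaining loop now has one level of indirection. */
    for (int i = 0; i < N; i++) y[i] = x[t[i]];

    for (int i = 0; i < N; i++) printf("%g ", y[i]);  /* 60 40 70 50 */
    putchar('\n');
    return 0;
}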

 

Raja Das and Joel Saltz 

Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 187-192, Norfolk, Virginia, March, 1993 
 

This paper is concerned with the implementation of a molecular dynamics code, CHARMM, on massively parallel distributed memory computer architectures using a data-parallel approach. The implementation is carried out by creating a set of software tools, which provide an interface between the parallelization issues and the sequential code. Large practical MD problems are solved on the Intel iPSC/860 hypercube. The overall solution efficiency is compared with that obtained when the implementation is done using data replication.

 

R. Ponnusamy, R. Das, J. Saltz, and D. Mavriplis 

Proceedings of COMPCON 1993, pages 205-212, San Francisco, California, February 23-26, 1993 

Over the past few years, we have developed methods that make it possible for a compiler to efficiently map many sparse and unstructured scientific problems to scalable multiprocessor architectures. These methods are implemented using compiler-embeddable runtime support procedures. These procedures can be thought of as supporting a type of weakly coherent distributed shared memory that can be closely coupled to distributed memory compilers. These procedures (1) coordinate inter-processor data movement, (2) manage the storage of and access to copies of off-processor data, (3) support a shared name space, and (4) couple runtime data and workload partitioners to compilers. We are in the process of implementing prototype High Performance Fortran compilers which use this runtime support to handle unstructured computations.

 

R. von Hanxleden, K. Kennedy, and J. Saltz 

Journal of Programming Languages, Special Issue on Compiling and Run-Time Issues for Distributed Access Machines, pages 259-282, Volume 2, Number 3, December 1993 

CRPC 1993, no. CRPC-TR93365-S 

December 1993 
 

Compiling irregular applications written in a data-parallel, High Performance Fortran-like language presents a challenging problem of growing importance. One principal difficulty with irregular problems is the general lack of access locality when distributing data and computation naively, based on simple functions of array and iteration indices. To address this problem, we extend classical, index-based data distributions to value-based distributions. This type of distribution allows the programmer to conveniently express, for example, the locality characteristics of an underlying physical problem. This gives the compiler an opportunity to improve both inter- and intra-processor locality, resulting in better performance and scalability, even when this locality is not inherent in the original data structures and access patterns.
This paper reports on the experience gained from implementing value-based distributions in a Fortran 77D compiler prototype. We propose a natural extension of index-based distributions as already present in Fortran D and HPF. This paper addresses the compilation issues involved and makes a quantitative comparison of index-based and value-based distributions for a Molecular Dynamics application.
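
The contrast between index-based and value-based distributions can be sketched in a few lines of C (illustrative only; the paper's mechanism is a language extension, not a library call). Element i is assigned to a processor either from i itself (BLOCK) or from a data value such as a particle coordinate, so spatial neighbors land together regardless of array order; all data are invented.

#include <stdio.h>

#define N 8
#define P 4

int main(void) {
    /* Particle coordinates in [0,1); array order is unrelated to space. */
    double coord[N] = {0.91, 0.12, 0.55, 0.31, 0.72, 0.18, 0.83, 0.47};
    int block = N / P;

    for (int i = 0; i < N; i++) {
        int by_index = i / block;           /* index-based (BLOCK)   */
        int by_value = (int)(coord[i] * P); /* value-based (spatial) */
        printf("elem %d (coord %.2f): index-based -> %d, value-based -> %d\n",
               i, coord[i], by_index, by_value);
    }
    return 0;
}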
 

A. Sussman, J. Saltz, R. Das, S. Gupta, D. Mavriplis, R. Ponnusamy, and K. Crowley 

Computing Systems in Engineering, pages 73-86, Volume 3,  Number 1-4, 1992 

This paper describes a set of primitives (PARTI) developed to efficiently execute unstructured and block structured problems on distributed memory parallel machines. We present experimental data from a 3-D unstructured Euler solver run on the Intel Touchstone Delta to demonstrate the usefulness of our methods.

 

R. v. Hanxleden, K. Kennedy, C. Koelbel, R. Das, and J. Saltz 

Proceedings of 5th Workshop on Languages and Compilers for Parallel Computing, pages 97-111, New Haven, Connecticut, August 3-5, 1992 
 

We developed a dataflow framework which provides a basis for rigorously defining strategies to make use of runtime preprocessing methods for distributed memory multiprocessors.
In many programs, several loops access the same off-processor memory locations. Our runtime support gives us a mechanism for tracking and reusing copies of off-processor data. A key aspect of our compiler analysis strategy is to determine when it is safe to reuse copies of off-processor data. Another crucial function of the compiler analysis is to identify situations which allow runtime preprocessing overheads to be amortized. This dataflow analysis will make it possible to effectively use the results of interprocedural analysis in our efforts to reduce interprocessor communication and the need for runtime preprocessing.
