
J: Previous Accomplishments

Previous work in Saltz's group at Maryland has focused on developing tools, compiler runtime support, and compilation techniques that help scientists and engineers develop high-speed parallel implementations of codes for irregular scientific problems (i.e. problems that are unstructured, sparse, adaptive, or block structured). We have developed a series of runtime support libraries (PARTI, CHAOS, CHAOS++) that carry out the preprocessing and data movement needed to efficiently implement irregular and block-structured scientific algorithms on distributed memory machines and networks of workstations [50, 40]. Our compilation research has played a major role in demonstrating that data parallel compilers can make effective use of a wide variety of runtime optimizations. Over the past few years, much of the group's emphasis in this area has been to: (1) develop techniques that allow data parallel compilers to optimize the complex coding constructs found in many irregular application programs, and (2) develop interprocedural optimizations that improve the placement of irregular runtime support.
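To make the flavor of this runtime support concrete, the following minimal Python sketch illustrates the inspector/executor pattern that libraries in the PARTI/CHAOS family are built around: a preprocessing (inspector) phase translates an irregular indirection array into a reusable communication schedule, and an executor phase performs the actual data movement. The function names and the toy two-process setup are illustrative assumptions, not the actual library API.

```python
# Minimal sketch of the inspector/executor pattern behind irregular-problem
# runtime libraries in the PARTI/CHAOS family.  The names and the toy
# "two process" setup are illustrative assumptions, not the library API.

BLOCK = 4   # global array x of 8 elements, distributed blockwise

# Each process owns one block of x; dictionaries stand in for process memories.
x_local = {0: [10.0, 11.0, 12.0, 13.0],   # process 0 owns x[0..3]
           1: [20.0, 21.0, 22.0, 23.0]}   # process 1 owns x[4..7]

def inspector(indirection, myrank):
    """Preprocessing: translate global indices into (owner, local index)
    pairs and note which elements are off-process.  A real library builds a
    communication schedule here that is reused across many executor calls."""
    schedule = []
    for g in indirection:
        owner, local = divmod(g, BLOCK)
        schedule.append((owner, local, owner != myrank))
    return schedule

def executor(schedule, myrank):
    """Data movement: gather the needed values using the precomputed
    schedule.  The off-process lookups stand in for the messages a real
    runtime would exchange."""
    return [x_local[owner][local] for owner, local, _off in schedule]

# Process 0 evaluates y[i] = x[ia[i]] with an irregular index array ia.
ia = [1, 6, 3, 4]                  # touches both local and remote elements
sched = inspector(ia, myrank=0)    # done once, amortized over many sweeps
print(executor(sched, myrank=0))   # -> [11.0, 22.0, 13.0, 20.0]
```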

Maryland's approach to high performance I/O has been to work experimentally with applications to document the optimizations needed to obtain a given level of performance, and then to design compiler transformations and runtime support libraries in response to this experimental work. We have carried out a detailed study involving I/O-intensive applications from two areas: satellite-data processing (earth science) and out-of-core sparse-matrix factorization (scientific computation) [1]. Our primary experimental platform consisted of a 16-processor IBM SP-2 with six fast disks attached to every processor. For each program the objective was simple: make it run as fast as possible and keep track of what was required to achieve this. The results of this exercise were encouraging. Most notably, we were able to obtain application-level I/O rates of over 100 MB/s for three of the four applications.

Maryland has also been exploring ways to support interoperability between sequential and parallel programs written using different languages and programming paradigms. Successful techniques would facilitate the design of complex scientific applications composed of separately developed components, and would provide the infrastructure required to make use of highly distributed data and computational resources. We have demonstrated the ability to compose parallel programs written using different programming paradigms (e.g. High Performance Fortran, HPC++ and MPI) [19, 53]. We have developed a prototype ``meta-library'' called Meta-Chaos that makes it possible to integrate multiple data parallel programs (written using different parallel programming paradigms) within a single application. Meta-Chaos also supports the integration of multiple data parallel libraries within a single program.
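As a rough illustration of the kind of mapping such a meta-library must compute, the sketch below copies a distributed array from one program's decomposition (block) to another's (cyclic). The decompositions, helper names, and two-process setup are illustrative assumptions, not the Meta-Chaos interface.

```python
# Sketch of copying a distributed array between two separately developed
# programs that use different data decompositions -- the kind of mapping a
# meta-library must compute.  The block and cyclic layouts and helper names
# are illustrative assumptions, not the Meta-Chaos interface.

N, NPROCS = 8, 2

def block_owner(i):     # layout used by the source program
    return i // (N // NPROCS), i % (N // NPROCS)

def cyclic_owner(i):    # layout used by the destination program
    return i % NPROCS, i // NPROCS

# Source data laid out blockwise across two processes.
src = {0: [0.0, 1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0, 7.0]}
dst = {p: [None] * (N // NPROCS) for p in range(NPROCS)}

# Build a "copy schedule": for every global element, which process/slot sends
# and which receives.  A real system aggregates these pairs into messages.
schedule = [(block_owner(i), cyclic_owner(i)) for i in range(N)]

for (sp, si), (dp, di) in schedule:
    dst[dp][di] = src[sp][si]       # stands in for a send/receive pair

print(dst)   # {0: [0.0, 2.0, 4.0, 6.0], 1: [1.0, 3.0, 5.0, 7.0]}
```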

Maryland has two sets of projects in the area of detailed simulation of hardware and systems software. First, we have developed Howsim [59], a coarse-grain simulator for I/O-intensive tasks on workstation clusters. Howsim was developed to evaluate architectural and OS policy alternatives for I/O-intensive tasks. Accordingly, Howsim simulates the I/O devices (storage and network) and the corresponding OS software at a fairly detailed level and the processor at a fairly coarse level. Howsim has been applied to micro-applications on both an IBM SP-2 and a cluster of Digital Alpha SMPs, with very encouraging early results.
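The sketch below illustrates the coarse-grain simulation style described above: the disk is modeled with explicit seek and transfer costs, while computation between requests is reduced to a single coarse delay. All parameters and names are invented for illustration and are not Howsim's model or interface.

```python
import heapq

# Toy discrete-event simulation in the coarse-grain style described above:
# the disk is modeled with explicit seek and transfer costs, while compute
# between requests is a single coarse "think time".  All parameters below
# are invented for illustration; they are not Howsim's model or interface.

SEEK_MS, XFER_MS_PER_KB, THINK_MS = 8.0, 0.03, 2.0

def simulate(requests):
    """requests: list of (issue_time_ms, size_kb) pairs.  Returns completion
    times, assuming a single disk served in arrival order (FCFS)."""
    pending = list(requests)
    heapq.heapify(pending)                 # order by issue time
    disk_free, completions = 0.0, []
    while pending:
        issue, size_kb = heapq.heappop(pending)
        start = max(issue, disk_free)      # wait if the disk is still busy
        finish = start + SEEK_MS + size_kb * XFER_MS_PER_KB
        disk_free = finish
        completions.append(finish)
    return completions

# One processor alternating a coarse compute phase with a 64 KB read.
reqs = [(i * THINK_MS, 64) for i in range(4)]
print(simulate(reqs))   # completion times (ms) of the four reads
```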

Second, Jeff Hollingsworth's prior research is also relevant to detailed performance modeling and simulation. His work has focused on performance measurement tools and attacks two sets of problems. First, he is developing techniques for efficient performance monitoring of large parallel applications. Second, he is providing assistance that helps users manage the collected data and reduce information overload. To date, dynamic performance monitoring has demonstrated the feasibility of efficiently monitoring the performance of large, long-running applications. Measurements indicate that this approach reduces the volume of data gathered by two to three orders of magnitude compared to traditional event logging. His work on Dynamic Instrumentation provides an efficient way to collect performance data for parallel computations [38]. Data collection is a critical problem for any parallel program performance measurement system: to understand the performance of parallel programs, it is necessary to collect data for full-sized data sets running on large numbers of processors, yet collecting large amounts of data can excessively slow down a program's execution and distort the collected data. Dynamic Instrumentation takes a new approach to data collection that defers instrumenting the program until it is in execution, permitting dynamic insertion and alteration of the instrumentation while the program runs. He has also developed a new data collection model that permits efficient yet detailed measurements of a program's performance. The search model and Dynamic Instrumentation have been incorporated into the Paradyn Parallel Performance Measurement Tool [44].
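The toy Python sketch below conveys the idea behind deferring instrumentation until execution: probes are inserted into a running program only while their data is needed and removed afterwards, so measurement overhead and data volume track the questions being asked. Paradyn and Dynamic Instrumentation do this by patching the running binary; the decorator-style probes and names here are an invented analogy, not the actual mechanism.

```python
import sys, time

# Toy illustration of deferred instrumentation: a timing probe is inserted
# into an already-running program only while its data is needed, then removed.
# Paradyn/Dynamic Instrumentation patch the running binary to do this; the
# decorator-style probes and names below are an invented analogy only.

totals = {}                      # accumulated wall-clock time per probe

def work(n):
    """The application routine we may want to measure."""
    return sum(i * i for i in range(n))

def insert_probe(name, module=sys.modules[__name__]):
    """Replace module.<name> with a timed wrapper while the program runs."""
    original = getattr(module, name)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            totals[name] = totals.get(name, 0.0) + time.perf_counter() - start
    wrapper._original = original
    setattr(module, name, wrapper)

def remove_probe(name, module=sys.modules[__name__]):
    """Restore the uninstrumented routine, stopping data collection."""
    setattr(module, name, getattr(module, name)._original)

work(100_000)            # runs uninstrumented: no data, no overhead
insert_probe("work")
work(100_000)            # this call is timed
remove_probe("work")
work(100_000)            # uninstrumented again
print(totals)            # e.g. {'work': 0.006...}
```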

At Rutgers, Professor Gerasoulis' previous accomplishments span related areas ranging from engineering to computer science and numerical mathematics. We present a sample of significant accomplishments: (1) His piecewise polynomial approximation method for singular integrals [27] has become the standard in materials science research and crack analysis in engineering [8]. (2) His fast algorithms for multiplying a singular matrix with a vector, a computation similar to the N-body problem, were the first to demonstrate the existence of faster than O(N^2) algorithms for the N-body problem. This work has generated significant research interest in the area and has become a standard reference in numerical computing and complexity theory [48, 34, 36]. (3) His work in scheduling and software systems has been extensive. Two of his papers with Tao Yang have been selected as significant research contributions in this area [30, 31], and his work in this area has been widely referenced. (4) He has been directing a major effort in the development of tools for mapping and scheduling applications on parallel machines. This work has been supported by ARPA under HPCD and has produced several Ph.D. graduates, including Tao Yang, a collaborator on this proposal. The PYRROS, PLUSPYR and D-Pyrros systems are the outcome of this project. Clustering algorithms developed for these tools have been used by several leading institutions and other ARPA-supported projects (MIT, Berkeley, NASA, UMD, CMU, RIACS, etc.). (5) Under the ARPA-supported HPCD project he has collaborated with Norm Zabusky, Sandeep Bhatt and others on the application of 3D fast algorithms in vortex dynamics [21]. This collaboration has resulted in orders-of-magnitude performance improvements, making it possible to compute resolutions that were not possible with the previous technology (going from 40,000-particle to million-particle simulations). (6) He has collaborated with the ship division of SAIC on the parallelization and improved robustness of the LAMP ship design codes [28]. The LAMP system was recently used in the design of the revolutionary Arsenal Ship configuration developed by the General Dynamics/Raytheon/SAIC team and is one of the applications cited in this proposal.

Tao Yang has conducted algorithm research and system development in the areas of performance prediction, scheduling, runtime systems and sparse matrix computation, high performance WWW/digital library servers, parallel image processing tools, and parallel radiosity. With Gerasoulis at Rutgers, he conducted research on granularity analysis and scheduling algorithms [31, 64] and designed and implemented the PYRROS system [63], which models an application with task graphs, schedules the tasks, predicts performance, and generates executable parallel code for nCUBE/Intel machines. The performance of the PYRROS scheduler is competitive with other well-designed algorithms, but its complexity is one order of magnitude lower, which is critical for dealing with large-scale applications. Other research groups have used this software; for example, MIT and Maryland used it in a sparse triangular solver, and PYRROS has been incorporated into an automatic task graph generation and cost abstraction system for Fortran developed by French scientists [14]. At UCSB, he extended the task graph model and scheduling algorithms to handle iterative computations involving loop and task parallelism [62]. With his students, he has developed the RAPID runtime system [25] for parallelizing irregular scientific code. Using this system, an effective solution is provided for the semi-automatic parallelization of sparse triangular solvers, Cholesky factorization, and LU decomposition with partial pivoting. Note that parallel sparse LU on distributed-memory machines with memory hierarchies remains an open problem in the literature. Using a new cache optimization and partitioning technique [26], the RAPID sparse LU code has achieved good speedups compared with a highly optimized sequential code recently developed at UC Berkeley. He has also developed a performance prediction and scheduling tool for modeling parallel image processing applications [43]. He is currently developing a nonlinear equation solver based on the parallel sparse LU algorithm. In collaborative work with the Alexandria project, he has developed prediction-based adaptive scheduling algorithms [7] and a multiprocessor WWW server for distributed data-intensive digital library applications [5]. These results are being used by Navy NRAD.
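To indicate the kind of problem PYRROS and RAPID automate, the sketch below maps a small weighted task graph onto two processors with a greedy list-scheduling heuristic. The heuristic and the example graph are generic illustrations; the clustering and scheduling algorithms actually used in these systems are considerably more sophisticated.

```python
# Sketch of scheduling a weighted task graph onto processors with a greedy
# list-scheduling heuristic.  The heuristic and the tiny example graph are
# generic illustrations; the clustering/scheduling algorithms in PYRROS and
# RAPID are considerably more sophisticated.

# Task graph: task -> (compute cost, list of predecessor tasks).
graph = {"A": (2, []), "B": (3, ["A"]), "C": (2, ["A"]), "D": (4, ["B", "C"])}
COMM = 1            # communication cost when a predecessor sits on another CPU
NPROCS = 2

proc_free = [0.0] * NPROCS     # time at which each processor becomes idle
placed = {}                    # task -> (finish time, processor)

# Visit tasks in a topological order (written out here for brevity).
for task in ["A", "B", "C", "D"]:
    cost, preds = graph[task]
    best = None
    for p in range(NPROCS):
        # Earliest time all predecessor data can be present on processor p.
        ready = max([placed[q][0] + (COMM if placed[q][1] != p else 0)
                     for q in preds] or [0.0])
        start = max(ready, proc_free[p])
        if best is None or start + cost < best[0]:
            best = (start + cost, p)
    placed[task] = best
    proc_free[best[1]] = best[0]

print(placed)  # {'A': (2.0, 0), 'B': (5.0, 0), 'C': (5.0, 1), 'D': (10.0, 0)}
```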

Geoffrey Fox has worked in this general area for 15 years, at both Caltech and Syracuse. His recent relevant work includes the DARPA-sponsored Common Runtime Support for High Performance Parallel Languages (PCRC) project, a collaboration including Syracuse, Maryland, Cooperating Systems, Florida, Indiana, Rochester and Texas. This project is building common runtime support for Fortran, C++ and Java parallel compilers (originally Ada had been intended as the third language). It includes support for both regular applications (Syracuse) and irregular cases (Maryland, where the interoperable Meta-Chaos system was developed). The PCRC activity has also continued the development of the Syracuse HPF compiler [10, 13, 51], which was supported by a previous DARPA grant. This was probably the first HPF compiler to demonstrate the viability of the language and the key technology ideas needed. The research prototype was licensed by the Portland Group, whose commercial product is highly regarded. We intend to build some of the application emulator activity directly on the PCRC runtime support. Other relevant Syracuse activity is in applications, which have always been Fox's focus. The book Parallel Computing Works, which essentially described Fox's work at Caltech, highlighted 50 separate significant parallel applications. The most relevant current Syracuse application project is an NSF-funded Grand Challenge studying the collision of two black holes, in which Syracuse is playing a major role in both the physics and computer science parts of the activity. We will use this as an adaptive mesh application emulator, and we expect Syracuse to be involved in other important applications during the period of this proposal.


