Data Intensive Computing

Active Data Repository Project - High Performance Multidimensional Databases

Titan is a prototype distributed database engine designed for efficient retrieval and processing of large multidimensional datasets. One of its major goals is to integrate and overlap a wide range of user-defined operations, in particular order-independent operations, with the basic retrieval functions. Titan targets parallel and distributed architectures that have been configured to support high I/O rates. Currently, we are exploring the use of Titan for three tasks: generating data composites from low-level satellite sensor data; processing and browsing digitized slides from high-power light or electron microscopy; and exploration and analysis of data generated by large scientific simulations.
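
Order independence is what makes the integration possible: when the user-defined operation is commutative and associative, retrieved blocks can be folded into the result in whatever order the disks deliver them. The following sketch illustrates the idea; the interface (process_query, aggregate) and the per-pixel maximum composite are invented for illustration and are not Titan's actual API.

    def process_query(blocks, aggregate, initial):
        """Fold a user-defined, order-independent operation over data blocks.

        `blocks` may yield blocks in whatever order retrieval completes;
        because `aggregate` is commutative and associative, the arrival
        order does not affect the result.
        """
        result = initial
        for block in blocks:
            result = aggregate(result, block)
        return result

    # Example: a per-pixel maximum composite over (toy) satellite tiles.
    tiles = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]
    composite = process_query(
        tiles,
        aggregate=lambda acc, t: [max(a, b) for a, b in zip(acc, t)],
        initial=[0, 0, 0],
    )
    print(composite)  # [3, 6, 9]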

Ultimately, this database engine will be accessible as:

a stand-alone tool (Titan) designed to carry out optimized operations on data repositories and to send the results to parallel or sequential clients,

a database extender module associated with a relational database, or

compiler or user-level runtime support; in this form, the software will be able to carry out generalized reduction operations associated with out-of-core scientific codes.

The techniques employed involve the generation and use of optimized schedules, in which preprocessing of data access patterns is used to plan processing, communication, local I/O, and non-local I/O. The engine makes use of clustering and declustering methods we have developed for rapid retrieval of large multidimensional datasets from large disk farms.
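
The flavor of schedule generation can be conveyed with a small sketch. The code below groups requested blocks by owning disk and then drains the per-disk queues round-robin; the function names and policy are illustrative only, and the real engine also plans communication and non-local I/O and overlaps them with computation.

    from collections import defaultdict

    def build_schedule(requested_blocks, owner_of):
        """Group requested block ids by owning disk, preserving on-disk order."""
        per_disk = defaultdict(list)
        for block in sorted(requested_blocks):
            per_disk[owner_of(block)].append(block)
        return per_disk

    def execute(schedule, read, process):
        # Drain the per-disk queues round-robin so every disk stays busy;
        # a real engine would issue asynchronous reads and overlap
        # `process` with the outstanding I/O.
        queues = list(schedule.values())
        while any(queues):
            for q in queues:
                if q:
                    process(read(q.pop(0)))

    # Eight blocks spread across three disks by a simple modulo rule.
    sched = build_schedule(range(8), owner_of=lambda b: b % 3)
    execute(sched, read=lambda b: "block %d" % b, process=print)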

Active Data Repository Research Project (presentation, in Adobe PDF)

High Performance I/O

Parallelization and characterization of data-intensive applications

The goal of this project is to answer the question: what does it take to achieve good I/O performance for real-life I/O-intensive tasks on parallel machines suitable for I/O?

Interprocedural compiler analysis for overlapping large I/O operations with computation

The goal of this project is to develop compiler techniques for scheduling large I/O operations that are beyond the capacity of standard operating-system prefetch and write-behind mechanisms.
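
One way to picture the intended transformation is shown below in Python (only an illustration; the project targets compiled scientific codes): a blocking read is split into an asynchronous initiation and a wait, and the wait is sunk past computation that does not depend on the data, so the I/O and the computation overlap.

    from concurrent.futures import ThreadPoolExecutor

    def read_big_array(path):
        with open(path, "rb") as f:
            return f.read()

    def before(path):
        data = read_big_array(path)    # the CPU idles while the read completes
        unrelated = sum(range(10**6))  # computation independent of `data`
        return len(data) + unrelated

    def after(path):
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(read_big_array, path)  # initiate the read
            unrelated = sum(range(10**6))               # overlapped computation
            data = future.result()                      # wait only when needed
        return len(data) + unrelated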

Scheduling and resource management for I/O-intensive tasks on peer-to-peer systems

The goal of this project is to develop techniques for achieving high performance for I/O-intensive tasks on peer-to-peer systems.

Data placement techniques for multidimensional datasets on large disk farms

The goal of this project is to develop data placement techniques for multidimensional datasets. This includes declustering techniques, which assign data blocks to individual disks, as well as clustering techniques, which determine the placement of blocks within each disk.
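
As a deliberately simplified illustration of the two ideas (our actual placement techniques differ in detail), the sketch below linearizes 2-D block coordinates with a Z-order key: striding the sorted order across disks declusters spatial neighbors, while the key order clusters the blocks assigned to each disk.

    def morton_key(x, y, bits=16):
        """Interleave the bits of x and y into a single Z-order key."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)
            key |= ((y >> i) & 1) << (2 * i + 1)
        return key

    def place_blocks(coords, num_disks):
        """Return {disk: [block coordinates in on-disk order]}."""
        ordered = sorted(coords, key=lambda c: morton_key(*c))
        placement = {d: [] for d in range(num_disks)}
        for rank, c in enumerate(ordered):
            # Round-robin along the curve declusters spatial neighbors;
            # the key order within each list clusters blocks on a disk.
            placement[rank % num_disks].append(c)
        return placement

    print(place_blocks([(x, y) for x in range(4) for y in range(4)], num_disks=4))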


Our Recent Publications in this Area

More Information on High Performance I/O Research at Maryland

The University of Maryland is a member of the Scalable I/O Consortium

ESDIS Final Report on the EOSDIS project on High-Performance I/O Techniques

Our High Performance Multidimensional Database project will be incorporated into the NPACI data-intensive computing infrastructure


Tools for Metacomputing

Parallel Program Interoperability

We are developing techniques that allow parallel compute and data objects to offer their services to local and remote clients. Our goal is to allow users to compose programs running on any combination of distributed-memory machines, shared-memory machines, and networks of workstations. This research is motivated by two important classes of applications: sensor data processing and integration, and complex physical simulations. An important early result of this work is the development of runtime support (Meta-Chaos) that allows communication between diverse runtime libraries. The ability to use multiple libraries is needed in several application areas, such as multidisciplinary complex physical simulations and remote-sensing applications. In these scenarios, an application can consist of a single program or multiple programs that use a variety of runtime libraries for partitioning datasets and/or for inter-process communication.
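
The underlying problem can be seen in a tiny example. In the sketch below (hypothetical code, not the Meta-Chaos API), one library holds an array in a block distribution while another expects a cyclic distribution, and a copy between them is planned through the shared logical index space:

    def block_owner(i, n, nprocs):
        # Library A: contiguous blocks of ceil(n / nprocs) elements.
        return i // ((n + nprocs - 1) // nprocs)

    def cyclic_owner(i, n, nprocs):
        # Library B: elements dealt out cyclically.
        return i % nprocs

    def build_copy_plan(n, nprocs):
        """For each logical index, record (source proc, destination proc)."""
        return [(block_owner(i, n, nprocs), cyclic_owner(i, n, nprocs))
                for i in range(n)]

    # Messages needed to copy a 12-element array between the two layouts
    # on 3 processes:
    for i, (src, dst) in enumerate(build_copy_plan(12, 3)):
        if src != dst:
            print("element %d: proc %d -> proc %d" % (i, src, dst))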

Another important component is the control language, which specifies the range and frequency of information exchange between the programs being coupled. It uses a set of mappings between data regions belonging to different programs; each mapping identifies the regions of interest and the frequency at which they are to be kept consistent. Our implementation allows the mappings to be specified either at initialization time or later, during actual execution of the program. We have used early prototypes of these systems to demonstrate the ability to couple High Performance Fortran applications with other HPF applications, as well as with applications developed using the Maryland CHAOS and Multiblock Parti libraries.
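
To make the notion of a mapping concrete, here is one hypothetical rendering of a coupling specification (the actual control language differs in syntax): each entry names a data region in each program and the interval at which the two regions are to be kept consistent.

    # Each mapping: a (program, array, region) triple on each side, plus
    # the number of time steps between consistency operations. All names
    # here are invented for the example.
    couplings = [
        {"source": ("ocean", "surface_temp", (slice(0, 64), slice(0, 64))),
         "target": ("atmosphere", "sst_in", (slice(0, 64), slice(0, 64))),
         "every_n_steps": 10},
        {"source": ("atmosphere", "wind_stress", (slice(0, 64), slice(0, 64))),
         "target": ("ocean", "forcing", (slice(0, 64), slice(0, 64))),
         "every_n_steps": 5},
    ]

    def due(step):
        """Mappings whose regions must be made consistent at this step."""
        return [c for c in couplings if step % c["every_n_steps"] == 0]

    print([c["source"][1] for c in due(10)])  # both mappings fire at step 10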


Our Recent Publications in this Area

Meta-Chaos will be employed in the NPACI consortium's Adaptable Scalable Tools/Environments projects


Resource Aware Computing


One of the biggest challenges for efficient data access over the Internet and other high-end wide-area networks is effective adaptation to resource fluctuations. Several emerging high-end applications are likely to benefit from an ability to adapt to changes in the availability of network and computational resources. These applications fall into two roughly defined categories: metacomputing applications and customized multicast (customcast) applications.

Metacomputing applications often synthesize information obtained from multiple sources into output data products. These sources can include archived information, information generated by physical simulations, data from scientific instruments such as light or electron microscopes, and data from sensors based on land, aircraft, or satellites.

We anticipate that generating such data products will require moving large amounts of data through network links of variable quality. A common metacomputing scenario organizes the computation as a pipeline consisting of a sequence of programs. For instance, in defense and disaster-response scenarios, there is a need to generate data products that combine information from satellite, aircraft, and ground-based sensors. There are often significant computational costs associated with combining the data. Resource-aware scheduling may be needed to deal with changes in network characteristics, changes in the computational demands associated with data combination, or changes in the availability of computational resources.

Customized multicast (customcast) applications involve customized propagation of data obtained from a single source. An example of customcast arises in an application under development by our group with the Department of Pathology at Johns Hopkins Medical School. This application, the Virtual Microscope, aims to provide a realistic digital emulation of a high-power light microscope. An important design goal is to achieve interactive response while the dataset is simultaneously explored by multiple users. The utility of customcast in this scenario stems from the spatial and temporal locality in the request patterns. While individual datasets are quite large (10-100 Gbytes per specimen), most users access only those portions of a dataset that contain features of medical interest. Moreover, owing to the psychophysical limits of human data processing, users frequently dwell on portions of the dataset that are of particular interest.

Mobile programs can move an active thread of control from one site to another during execution. This flexibility has many potential advantages. For example, a program that searches distributed data repositories can improve its performance by migrating to the repositories and performing the search on-site instead of fetching all the data to its current location. Similarly, an Internet video-conferencing application can minimize overall response time by positioning its server based on the location of its users. Applications running on mobile platforms can react to a drop in network bandwidth by moving network-intensive computations to a proxy host on the static network. The primary advantage of mobility in these scenarios is that it can be used as a tool to adapt to variations in the operating environment. Applications can use online information about their operating environment and knowledge of their own resource requirements to make judicious decisions about placement of computation and data.

To investigate resource-aware computing, we have designed and implemented Sumatra, an extension of Java that supports resource-aware mobile programs, together with a distributed resource monitor that provides the information Sumatra programs require.
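
The placement decision at the heart of such programs can be stated compactly. The sketch below is written in Python for brevity (Sumatra itself extends Java), and choose_site and its cost model are our own illustration: it picks the site that minimizes estimated transfer-plus-compute time, using bandwidths as a resource monitor would report them.

    def choose_site(data_bytes, compute_secs, bandwidth_to_data, sites):
        """Pick the site minimizing time to pull the data plus compute on it.

        bandwidth_to_data maps each site to its measured bandwidth
        (bytes/sec) to the data's current location.
        """
        return min(sites,
                   key=lambda s: data_bytes / bandwidth_to_data[s] + compute_secs)

    # With a slow client link, migrating the search to the repository wins.
    print(choose_site(10**9, 30.0,
                      {"client": 10**6, "repository": 10**12},
                      ["client", "repository"]))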

Our Recent Publications in this Area


Ontology Based Indexing Schemes


Infectious disease researchers and clinicians have difficulty formulating queries that capture the kinds of questions they wish to pose. The taxonomy used to describe microbiological data is complex, and the terminology changes over time. To overcome these difficulties, we have developed methods for integrating semantic knowledge, stored in the form of thesauri (or, more precisely, ontologies), in ways that support efficient indexing of large databases. The goal of this effort is to give clinicians the ability to express complex queries without becoming experts on the underlying data model. Thus far, we have employed this technique to index relational databases. We will soon adapt the method to support efficient indices for sensor and scientific datasets.
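
The essential mechanism is easy to sketch. In the toy example below, with invented terms and records, a query on a general term is expanded through the ontology to all of its descendants before consulting an ordinary index:

    ontology = {  # parent term -> child terms (invented examples)
        "gram_negative_bacteria": ["enterobacteriaceae", "pseudomonas"],
        "enterobacteriaceae": ["escherichia_coli", "klebsiella"],
    }

    def expand(term):
        """Return the term plus every descendant in the ontology."""
        terms = [term]
        for child in ontology.get(term, []):
            terms.extend(expand(child))
        return terms

    def query(index, term):
        """Answer a query on a general term via an ordinary inverted index."""
        return sorted({rec for t in expand(term) for rec in index.get(t, [])})

    index = {"escherichia_coli": ["isolate-17"], "pseudomonas": ["isolate-42"]}
    print(query(index, "gram_negative_bacteria"))  # ['isolate-17', 'isolate-42']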

Our Recent Publications in this Area

This work is being carried out in collaboration with the Parallel Understanding Systems (PLUS) research group


Irregular Problems and High Performance Architectures (The CHAOS Project)


The CHAOS project has developed methods to map a broad range of challenging applications onto high-performance computer architectures. A major focus of this work is the development of parallelization techniques for irregular scientific problems -- problems that are unstructured, sparse, adaptive, or block-structured. The project works extensively with application developers in many disciplines and with parallel compiler vendors.

Based upon our experience in developing runtime libraries and in parallelizing applications, we have developed several compilation techniques. Our goal is to automate our parallelization and optimization techniques through the use of compilers. We have employed the Fortran D compilation system (developed primarily at Rice University) as the infrastructure for implementing our techniques.

For efficiently parallelizing applications that use multiple levels of indirection, we have developed a program-slicing technique that transforms a loop with multiple levels of indirection into a series of loops with at most one level of indirection. In another effort, we have developed aggressive interprocedural optimizations for communication and I/O placement in irregular data-parallel programs. We have developed an Interprocedural Partial Redundancy Elimination technique for performing interprocedural placement of communication preprocessing and collective communication statements. We are currently working on Interprocedural Balanced Code Placement, which will allow us to overlap computation and communication across procedure boundaries. We are also working on generating distributed-memory code from Fortran 90 codes that use pointers and recursive data structures.
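
The slicing transformation is easiest to see on a tiny example, rendered here in Python for brevity (the compiler itself operates on Fortran):

    a  = [float(i) for i in range(10)]
    ia = [9, 8, 7, 6, 5]
    ib = [0, 2, 4, 1, 3]
    n  = 5

    # Before: two levels of indirection in a single loop.
    out_before = [a[ia[ib[i]]] for i in range(n)]

    # After slicing: the first loop materializes the inner subscript, so
    # each loop has at most one level of indirection and can be handled
    # by the usual inspector/executor machinery.
    tmp = [ia[ib[i]] for i in range(n)]
    out_after = [a[tmp[i]] for i in range(n)]

    assert out_before == out_after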

Additional Information and Recent Publications from CHAOS project


High-end Medical Applications -- Johns Hopkins Department of Pathology


The Center for Computer Science in Medicine is founded on the principle that close collaboration between researchers in computer science and medicine can lead to dramatic advances in both areas.

Areas of interest in medical informatics include:

Development of high performance spatial database engines for real-time analysis of high-power microscopy data

Tools for medical database interoperability and program migration

Generation of medical knowledge bases using techniques for extracting and synthesizing information from clinical data

Development of algorithms for medical data mining

Tools for characterizing performance of distributed systems

Tools and compilation methods for computationally intensive biomedical problems

Our Recent Publications in this Area

Department of Pathology, Johns Hopkins Medical Institutions