DataCutter

Principal Investigators

Joel Saltz, M.D., Ph.D.
Alan Sussman, Ph.D.
Mike Beynon, Ph.D.
Tahsin Kurc, Ph.D.
Umit Catalyurek, Ph.D.

Software Distribution

  • Current version

    Related Information

  • Publication List
  • Presentation Slides
  • User's Manual (pdf)
  • Class Reference (pdf)
  • Middleware for Filtering Large Archival Scientific Datasets in a Grid Environment

    Increasingly powerful computers have made it possible for computational scientists and engineers to model physical phenomena in great detail. As a result, overwhelming amounts of data are being generated by scientific and engineering simulations. In addition, large amounts of data are being gathered by sensors of various sorts, attached to devices such as satellites and microscopes. The primary goal of generating data through large-scale simulations or sensors is to better understand the causes and effects of physical phenomena. Thus, the exploration and analysis of large datasets plays an increasingly important role in many domains of scientific research.

    The continuing increase in the capabilities of high performance computers and sensor devices implies that datasets with sizes up to petabytes will be common in the near future. Such vast amounts of data require the use of archival storage systems distributed across a wide-area network, and simulation or sensor datasets generated or acquired by one group may need to be accessed over a wide-area network by other groups. Efficient storage, retrieval, and processing of multiple large scientific datasets on remote archival storage systems is therefore one of the major challenges that need to be addressed for efficient exploration and analysis of these datasets. Software support is needed to allow users to obtain needed subsets of very large, remotely stored datasets.

    DataCutter is a middleware infrastructure that enables processing of scientific datasets stored in archival storage systems across a wide-area network. DataCutter provides support for subsetting of datasets through multi-dimensional range queries, and for application-specific aggregation operations on scientific datasets stored in an archival storage system.

    DataCutter provides a core set of services, on top of which application developers can implement more application-specific services or combine them with existing Grid services such as metadata management, resource management, and authentication services. The main design objective in DataCutter is to extend and apply features of the Active Data Repository (ADR), namely support for accessing subsets of datasets via range queries and user-defined filtering operations, to very large datasets in a shared distributed computing environment. In ADR, data processing is performed where the data is stored (i.e., at the data server). In a Grid environment, however, it may not always be feasible to perform data processing at the server, for several reasons. First, resources at a server (e.g., memory, disk space, processors) may be shared by many other competing users, so it may not be efficient or cost-effective to perform all processing at the server. Second, datasets may be stored on distributed collections of storage systems, so accessing data from a centralized server may be very expensive. Moreover, distributed collections of shared computational and storage systems can provide a more powerful and cost-effective environment than a centralized server, if they can be used effectively.

    Therefore, to make efficient use of distributed shared resources within the DataCutter framework, the application processing structure is decomposed into a set of processes, called filters. DataCutter uses these distributed processes to carry out a rich set of queries and application-specific data transformations. Filters can execute anywhere (e.g., on computational farms), but are intended to run on a machine close (in terms of network connectivity) to the archival storage server or within a proxy server.
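    The filter decomposition described above can be sketched as a chain of small processing stages through which data records stream. This is only an illustrative sketch in Python under hypothetical names (`Filter`, `RangeSelect`, `Scale`, `run_pipeline`); the actual DataCutter filter interface is different, and in the real system each filter may run as a separate process on a different host.

    ```python
    # Sketch: application processing decomposed into a chain of filters.
    # All names here are hypothetical, for illustration only.

    class Filter:
        """One stage of a decomposed processing pipeline."""
        def process(self, item):
            raise NotImplementedError

    class RangeSelect(Filter):
        """Subsetting: drop records that fall outside a query range."""
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
        def process(self, item):
            if self.lo <= item <= self.hi:
                yield item

    class Scale(Filter):
        """An application-specific transformation applied to each record."""
        def __init__(self, factor):
            self.factor = factor
        def process(self, item):
            yield item * self.factor

    def _apply(f, stream):
        # Helper so each stage binds its own filter (not the loop variable).
        for item in stream:
            yield from f.process(item)

    def run_pipeline(filters, stream):
        """Stream records through the filters in order; in the real system
        each stage could be placed on a different machine."""
        for f in filters:
            stream = _apply(f, stream)
        return list(stream)

    records = [1, 5, 10, 20, 50]
    result = run_pipeline([RangeSelect(5, 20), Scale(2)], records)
    print(result)  # [10, 20, 40]
    ```

    Because each stage consumes a stream and produces a stream, stages can be placed independently, which is the property that lets filters run close to the archival storage server or on intermediate proxy hosts.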

    Another goal of DataCutter is to provide common support for subsetting very large datasets through multi-dimensional range queries. Very large datasets may result in a large set of large data files, and thus a large space to index. A single index for such a dataset could be very large and expensive to query and manipulate. To ensure scalability, DataCutter uses a multi-level hierarchical indexing scheme.
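    A two-level version of such a hierarchical index can be sketched as follows: a small summary index keeps one bounding box per file, and a detailed index per file keeps bounding boxes for the data chunks inside it; a range query consults the detailed index only for files whose summary box overlaps the query. This Python sketch uses hypothetical names (`SummaryIndex`, `DetailedIndex`) and 2-D boxes; the actual DataCutter index structures and on-disk layout differ.

    ```python
    # Sketch of a two-level hierarchical index for multi-dimensional
    # range queries. Names and structure are hypothetical.

    def overlaps(a, b):
        """True if axis-aligned 2-D boxes ((xmin, ymin), (xmax, ymax)) intersect."""
        (ax0, ay0), (ax1, ay1) = a
        (bx0, by0), (bx1, by1) = b
        return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

    class DetailedIndex:
        """Per-file index: bounding box of each data chunk in the file."""
        def __init__(self, chunks):
            self.chunks = chunks  # {chunk_id: box}
        def query(self, box):
            return [cid for cid, b in self.chunks.items() if overlaps(b, box)]

    class SummaryIndex:
        """Dataset-level index: one bounding box per file's detailed index."""
        def __init__(self):
            self.files = {}  # {file_name: (file_box, DetailedIndex)}
        def add(self, name, box, detailed):
            self.files[name] = (box, detailed)
        def query(self, box):
            # Open a detailed index only when the file's box overlaps the query,
            # so one huge flat index is never built or scanned.
            hits = {}
            for name, (fbox, detailed) in self.files.items():
                if overlaps(fbox, box):
                    chunks = detailed.query(box)
                    if chunks:
                        hits[name] = chunks
            return hits

    idx = SummaryIndex()
    idx.add("f0", ((0, 0), (10, 10)),
            DetailedIndex({"c0": ((0, 0), (5, 5)), "c1": ((5, 5), (10, 10))}))
    idx.add("f1", ((20, 20), (30, 30)),
            DetailedIndex({"c2": ((20, 20), (30, 30))}))
    print(idx.query(((4, 4), (6, 6))))  # {'f0': ['c0', 'c1']}
    ```

    The summary level stays small even for datasets spanning many large files, which is what keeps query and index-manipulation costs scalable.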

    DataCutter is also being integrated with the Storage Resource Broker (SRB), under development at the San Diego Supercomputer Center through the NPACI consortium. The SRB provides transparent access to distributed storage resources in a Grid environment, and DataCutter will enhance the SRB services to allow for subsetting and filtering of large archival datasets stored through the SRB.