Increasingly powerful computers have made it possible for computational
scientists and engineers to model physical phenomena in great detail. As a result,
overwhelming amounts of data are being generated by scientific and engineering
simulations. In addition, large amounts of data are being gathered by sensors of various
sorts, attached to devices such as satellites and microscopes. The primary goal of
generating data through large-scale simulations or sensors is to better understand the
causes and effects of physical phenomena. Thus, the exploration and analysis of large
datasets play an increasingly important role in many domains of scientific research. The
continuing increase in the capabilities of high-performance computers and sensor devices
implies that datasets with sizes up to petabytes will be common in the near future. Such
vast amounts of data require the use of archival storage systems distributed across a
wide-area network. Simulation or sensor datasets generated or acquired by one group may
need to be accessed over a wide-area network by other groups. Efficient storage, retrieval,
and processing of multiple large scientific datasets on remote archival storage systems are
therefore among the major challenges that must be addressed to explore and analyze these
datasets effectively. Software support is needed to let users extract the subsets they
require from very large, remotely stored datasets.
DataCutter is a middleware infrastructure that enables processing of scientific
datasets stored in archival storage systems across a wide-area network. DataCutter
provides support for subsetting datasets through multi-dimensional range queries and for
application-specific aggregation on scientific datasets stored in an archival storage
system.
DataCutter provides a core set of services on top of which application developers can
implement more application-specific services, and which they can combine with existing Grid
services such as metadata management, resource management, and authentication. The main
design objective of DataCutter is to extend the features of the Active Data Repository
(ADR), namely support for accessing subsets of datasets via range queries and user-defined
filtering operations, to very large datasets in a shared distributed computing environment.
In ADR, data processing is performed where the data is stored (i.e., at the data server).
In a Grid environment, however, it may not always be
feasible to perform data processing at the server, for several reasons. First, resources
at a server (e.g., memory, disk space, processors) may be shared by many other competing
users, so it may not be efficient or cost-effective to perform all processing at the
server. Second, datasets may be stored on distributed collections of storage systems, so
that accessing data from a centralized server may be very expensive. Moreover, distributed
collections of shared computational and storage systems can provide a more powerful and
cost-effective environment than a centralized server, if they can be used effectively.
Therefore, to make efficient use of distributed shared resources within the DataCutter
framework, the application processing structure is decomposed into a set of processes,
called filters. DataCutter uses these distributed processes to carry out a rich
set of queries and application-specific data transformations. Filters can execute anywhere
(e.g., on computational farms), but are intended to run on a machine close (in terms of
network connectivity) to the archival storage server or within a proxy server.
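The decomposition into filters can be illustrated with a minimal sketch. It is not DataCutter's API: the function names, the record layout, and the use of Python generators as the connections between filters are all assumptions made for illustration, and the three stages run in a single process here, whereas the text above describes filters as distributed processes that can be placed on different machines.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record layout: (x, y, value).
Point = Tuple[float, float, float]


def read_filter(chunks: Iterable[List[Point]]) -> Iterator[Point]:
    """Source filter: emits records chunk by chunk, as if read from storage."""
    for chunk in chunks:
        yield from chunk


def subset_filter(points: Iterable[Point],
                  lo: Tuple[float, float],
                  hi: Tuple[float, float]) -> Iterator[Point]:
    """Keeps only records whose (x, y) lies inside the inclusive query box."""
    for x, y, v in points:
        if lo[0] <= x <= hi[0] and lo[1] <= y <= hi[1]:
            yield (x, y, v)


def mean_filter(points: Iterable[Point]) -> float:
    """Sink filter: an application-specific aggregation (here, a mean)."""
    total, count = 0.0, 0
    for _, _, v in points:
        total += v
        count += 1
    return total / count if count else float("nan")


def run_pipeline(chunks: Iterable[List[Point]],
                 lo: Tuple[float, float],
                 hi: Tuple[float, float]) -> float:
    """Chains the three filters: read, then subset, then aggregate."""
    return mean_filter(subset_filter(read_filter(chunks), lo, hi))


if __name__ == "__main__":
    stored = [
        [(0.0, 0.0, 1.0), (1.0, 1.0, 3.0)],
        [(5.0, 5.0, 100.0), (2.0, 2.0, 5.0)],
    ]
    print(run_pipeline(stored, lo=(0.0, 0.0), hi=(2.0, 2.0)))  # prints 3.0
```

Because each stage consumes and produces a stream of records, the subsetting stage could in principle run near the storage server and forward only the selected records, which is the placement the paragraph above recommends.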
Another goal of DataCutter is to provide common support for subsetting very large
datasets through multi-dimensional range queries. A very large dataset may comprise many
large data files, and thus a large space to index. A single index for such a
dataset could itself be very large and expensive to query and manipulate. To ensure scalability,
DataCutter uses a multi-level hierarchical indexing scheme.
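The idea behind a hierarchical index can be sketched in a few lines. The two-level layout below (one bounding box per file, then one bounding box per chunk within a file) is an assumption chosen for illustration, as are the names `FileEntry`, `overlaps`, and `query`; the text above does not specify how many levels DataCutter uses or which structure backs each level, and plain linear scans stand in for whatever spatial structure an actual implementation would use.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# An axis-aligned box: (lower corner, upper corner), any number of dimensions.
Box = Tuple[Tuple[float, ...], Tuple[float, ...]]


def overlaps(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(alo, ahi, blo, bhi))


@dataclass
class FileEntry:
    """Top-level entry: one per data file (hypothetical layout)."""
    name: str
    bounds: Box  # box enclosing every chunk in the file
    chunks: List[Tuple[int, Box]] = field(default_factory=list)  # (chunk id, box)


def query(summary: List[FileEntry], q: Box) -> List[Tuple[str, int]]:
    """Return (file name, chunk id) pairs whose boxes intersect the query box."""
    hits = []
    for entry in summary:
        # Level 1: skip whole files that cannot contain matching data.
        if not overlaps(entry.bounds, q):
            continue
        # Level 2: consult the per-file index only for surviving files.
        for chunk_id, box in entry.chunks:
            if overlaps(box, q):
                hits.append((entry.name, chunk_id))
    return hits


if __name__ == "__main__":
    summary = [
        FileEntry("a.dat", ((0, 0), (10, 10)),
                  [(0, ((0, 0), (5, 10))), (1, ((5, 0), (10, 10)))]),
        FileEntry("b.dat", ((20, 20), (30, 30)),
                  [(0, ((20, 20), (30, 30)))]),
    ]
    print(query(summary, ((6, 1), (8, 3))))  # prints [('a.dat', 1)]
```

The point of the split is that the top level stays small (one entry per file), while the detailed per-file entries need to be loaded and searched only for files that the top level could not rule out.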
DataCutter is also being integrated with the Storage Resource Broker (SRB), under
development at the San Diego Supercomputer Center through the NPACI consortium. The SRB provides transparent
access to distributed storage resources in a Grid environment, and DataCutter will enhance
the SRB services to allow for subsetting and filtering of large archival datasets stored
through the SRB.