Final report on the ESDIS project on High-Performance I/O Techniques

Anurag Acharya, Joel Saltz, Alan Sussman

Department of Computer Science

University of Maryland, College Park MD 20742

1. Overview

The goals of this project were two-fold: to understand the I/O requirements of algorithms for data product generation and to develop techniques that help meet these requirements on suitably configured machines. Towards the first goal, we have analyzed a variety of existing data product generation programs and have successfully parallelized two of them, pathfinder and climate, which generate the Level 2 and Level 3 data products for the AVHRR Land Pathfinder effort. We have also developed a parallel I/O library suitable for parallel data product generation programs. We describe our experiences in Section 2. We also present our suggestions regarding the structure of EOSDIS data product generation programs, the organization of the files used to store the data products, and the runtime support needed for effective parallelization of data product generation applications.

Based on our understanding of the I/O and processing requirements of these applications, we have developed several techniques to help meet them on suitably configured machines. These techniques deal with (1) declustering multi-dimensional datasets on large disk farms, (2) partitioning satellite data products for efficient retrieval, (3) overlapping I/O, computation and communication to perform data retrieval and processing on the same processors and (4) interprocedural analysis to automate the placement of asynchronous I/O operations. We describe these techniques in Sections 4 and 5. Based on these techniques, we have developed Titan, a high-performance database for remote-sensing data. The computational platform for Titan is a 16-processor IBM SP-2 with four fast disks attached to each processor. Titan is currently operational and contains about 24 GB of AVHRR data from the NOAA-7 satellite. Titan supports interactive queries over its data, handling full-globe queries as well as localized queries. Experimental results show that Titan provides good performance for global queries, and interactive response times for localized queries. We describe the design and evaluation of Titan in Section 3.

Based on our experience with Titan, we are currently in the process of developing an extensible framework for managing extremely large multi-dimensional datasets. We plan to implement this framework both as a stand-alone system for efficient storage, retrieval and processing of large data repositories and as a database extender which allows multi-dimensional datasets to be integrated with commercial relational databases which store other forms of data, in particular metadata associated with the datasets.

2. Analysis and parallelization of data product generation

We collected data product generation programs from four groups: (1) pathfinder and climate from the GSFC DAAC; (2) gaps and extract from the GIMMS group (branch 923); (3) the SeaWiFS processing chain from the SeaWiFS project; (4) LAS and ADAPS from the EROS Data Center. From each source, we acquired multiple programs that formed a processing chain which took Level 1 data as input and generated Level 2 and Level 3 data products.

In spite of significant differences in the science algorithms and the organization of the code, all these applications have a common structure. Programs that generate Level 2 products process Level 1 files which contain information from a single satellite orbit and generate a single file that contains a composited multi-band image for the area of interest. Before composition, individual values are corrected for various distortions and are navigated to the desired projection and resolution. The composition operation is a complex max-reduction operation: the specific predicate used to determine when one pixel is preferable to another depends on the program and the dataset. The reduction operation is performed by: (1) creating a temporary image; (2) processing all the inputs with a fixed chunk size; (3) processing and navigating the IFOVs in a chunk; (4) performing the max-reduction. Given the similarities, we selected pathfinder and climate as the prototypical programs for the generation of Level 2 and Level 3 data products.
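For concreteness, the four-step reduction can be sketched as follows. This is a minimal Python sketch; the function and parameter names are ours, and the real replacement predicates are program- and dataset-specific.

```python
def composite(chunks, better, height, width, empty=None):
    """Out-of-core style max-reduction over chunks of corrected,
    navigated IFOVs.  `better(new, cur)` is the program-specific
    predicate deciding when a new reading replaces the current one."""
    out = [[empty] * width for _ in range(height)]   # (1) temporary image
    for chunk in chunks:                             # (2) fixed chunk size
        for (row, col), value in chunk:              # (3) navigated IFOVs
            cur = out[row][col]
            if cur is empty or better(value, cur):   # (4) max-reduction
                out[row][col] = value
    return out
```

In the real programs the temporary image does not fit in memory, so the reduction is performed band by band over bounding boxes, as discussed in Section 6.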

We parallelized pathfinder by partitioning the output image into equal-sized horizontal strips. Each processor is responsible for all the processing needed to generate its partition of the output image. We chose to partition the output image (instead of the input data) because this allows all combination operations to be local to individual processors; no inter-processor communication is needed. We chose a horizontal partitioning scheme to take advantage of the row-major storage format used in all files (input, ancillary, and output). Horizontal striping allows I/O to be performed in large contiguous blocks.

Each processor computes the map from the input data set to the output image by subsampling (one scan line per chunk) all input files. It then reads the chunks that intersect with its partition of the output image. For each chunk, it maps each input pixel into the output image. Pixels that map into its partition are processed further, others are ignored. The individual partitions of the output image are also too large to be stored in main memory. Therefore, the composition operation is still out-of-core. Once all processing is completed, the final result is produced by concatenating the individual partitions.
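The ownership test used in this scheme can be sketched as follows. The helper names are ours, and a real implementation works on file offsets of row-major strips rather than row indices.

```python
def strip_bounds(num_rows, num_procs, rank):
    """Rows [lo, hi) of the output image owned by processor `rank` under
    equal-sized horizontal striping; the last strip absorbs any remainder."""
    rows_per_proc = num_rows // num_procs
    lo = rank * rows_per_proc
    hi = num_rows if rank == num_procs - 1 else lo + rows_per_proc
    return lo, hi

def owns(num_rows, num_procs, rank, out_row):
    """True if output row `out_row` falls in this processor's strip,
    i.e. pixels mapping to it are processed further rather than ignored."""
    lo, hi = strip_bounds(num_rows, num_procs, rank)
    return lo <= out_row < hi
```

Because ownership is a pure function of the output row, every processor can decide locally, without communication, which mapped pixels to keep.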

In climate, the mapping between the pixels of the input image and those of the output image is data-independent and can be computed a priori. The amount of computation to be done is proportional to the amount of input data. We parallelized climate by horizontally partitioning the output image. Each processor reads the data that maps to its partition of the output image. Load balance is achieved by ensuring that all processors read approximately equal amounts of data.

The total I/O performed by pathfinder is over 28 GB and the total I/O performed by climate is 75.5 MB. The original version of pathfinder ran for 18,800 seconds on a single processor of an IBM SP-2. Of this, about 13,600 seconds (76% of the time) were spent waiting for I/O. The final version took 1200 seconds using 12 processors, of which 10-15% of the time was spent waiting for I/O -- pathfinder is now computation-bound. The maximum aggregate application-level I/O rate was 644 MB/s. For climate, the execution time was reduced from 200 seconds to 32 seconds (on eight processors), of which 4-5% was spent waiting for I/O. The maximum aggregate application-level I/O bandwidth for climate was 36 MB/s. These experiments were conducted on an IBM SP-2 configured with 16 thin nodes, two Fast/Wide SCSI adaptors per node, and three IBM Starfire 7200 disks (7 MB/s application-level I/O bandwidth) per SCSI adaptor. More details about this tuning and evaluation effort can be found in our paper titled "Tuning the Performance of I/O-intensive Parallel Applications", which appeared in the Fourth Annual Workshop on I/O in Parallel and Distributed Systems (IOPADS'96).

3. Titan: a high-performance database for remote-sensing data

We have designed, implemented and evaluated Titan, a parallel shared-nothing database for handling remote-sensing data. Titan consists of two parts: (1) a front-end that interacts with querying clients, performs initial query processing and partitions data retrieval and computation; and (2) a back-end that retrieves the data and performs post-processing and composition operations. The front-end is a single host which can be located anywhere on the network. The back-end is a set of processing nodes on a dedicated network that store the data and do the computation. The current implementation of Titan uses one node of the 16-processor IBM SP-2 as the front-end and the remaining 15 nodes as the back-end. No data is stored on the disks of the node used as the front-end.

Titan partitions its data set into coarse-grained data blocks and uses a simplified R-tree to index these blocks. The index is stored at the front-end, which uses it to build a plan for the retrieval and processing of the required data blocks. The size of this index for 24 GB of AVHRR data is 11.6 MB, which is small enough to be held in primary memory.

Titan queries specify four constraints: (1) temporal bounds (a range in Coordinated Universal Time), (2) spatial bounds (a quadrilateral on the surface of the globe), (3) sensor type and number, and (4) resolution of the output image. The result of a query is a multi-band image. Each pixel in the result image is generated by composition over all the sensor readings for the corresponding area on the earth's surface.

When the front-end receives a query, it searches the index for all data blocks that intersect with the query. It uses the location information for each block (which is stored in the index) to determine the set of data blocks to be retrieved by each back-end node. In addition, the front-end partitions the output image among all the back-end nodes. Currently, the output image is evenly partitioned by blocks of rows and columns, assigning each back-end node approximately the same number of output pixels. Under this partitioning scheme, data blocks residing on the disks of a node may be processed by other nodes; each back-end node processes the data blocks corresponding to its partition of the output image. The front-end distributes the data block requests and output image partitions to all back-end nodes.
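The index search and plan construction can be approximated as follows. This is a hedged sketch: the actual index is a simplified R-tree, whereas the linear scan over minimum bounding rectangles below only illustrates the intersection test and the grouping of blocks by back-end node; the field names are hypothetical.

```python
def overlaps(a, b):
    """Closed-interval overlap test; a and b are (lo, hi) pairs."""
    return a[0] <= b[1] and b[0] <= a[1]

def build_plan(index, query):
    """Scan a list of block descriptors (each carrying its temporal and
    spatial bounds plus its home node) and group the ids of intersecting
    blocks by the back-end node that stores them."""
    plan = {}
    for block in index:
        if (overlaps(block["time"], query["time"])
                and overlaps(block["lat"], query["lat"])
                and overlaps(block["lon"], query["lon"])):
            plan.setdefault(block["node"], []).append(block["id"])
    return plan
```

The resulting per-node lists are what the front-end distributes, together with the output image partitions, to the back-end nodes.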

Each back-end node computes a schedule for retrieving the blocks from its disks. This schedule tries to balance the needs of all nodes that will process these data blocks. As soon as a data block arrives in primary memory, it is dispatched to all nodes that will process it. Once a data block is available for processing (either retrieved from local disk or forwarded by another node), a simple quadrature scheme is used to search for sensor readings that intersect with the local partition of the output image. After all data blocks have been processed, the output image can either be returned to the front-end for forwarding to the querying client, or it can be stored in a file for later retrieval.

Data layout decisions in Titan were motivated by the format of AVHRR data and the common query patterns identified by NASA researchers and our collaborators in the University of Maryland Geography Department. We distributed the AVHRR data over a large disk farm, using the declustering algorithms described in the next section to compute the data distribution.

Titan is currently operational on a 16-processor IBM SP-2 with four IBM Starfire 7200 disks attached to each processor. It contains about 24 GB of AVHRR data from the NOAA-7 satellite.

We have run a sequence of experiments on Titan to evaluate our techniques for partitioning the images into chunks, declustering the chunks over a large disk farm and placing the chunks assigned to individual disks. Experimental results show that Titan provides good performance for global queries, and interactive response times for local queries. A global query for a 10-day composite of normalized vegetation index takes less than 100 seconds; similar queries for Australia and the United Kingdom take 4 seconds and 1.5 seconds, respectively. Our data distribution techniques improved disk parallelism (the number of disks active for individual queries) by 48 to 70 percent. The total estimated retrieval time was reduced by between 8 and 33 percent. We also evaluated schemes for placement of data blocks assigned to a single disk and found that the average length of a read (without an intervening seek) can be improved by about a factor of two. The design, implementation and evaluation of Titan are described in our paper titled "Titan: A High-Performance Remote-sensing Database", which appeared in the International Conference on Data Engineering, 1997.

4. Declustering algorithms for multi-dimensional range queries

We investigated data declustering techniques for multidimensional datasets with the primary goal of minimizing response time and the secondary goal of maximizing disk space utilization. First, we extended the three best-known index-based schemes (Disk-Modulo, Fieldwise-XOR and Hilbert-Curve) for declustering Cartesian product files to grid files, which allow better utilization of disk space. Using simulation experiments, we showed that the scalability of Disk-Modulo and Fieldwise-XOR for multidimensional range queries is limited. That is, as the number of disks is increased beyond a threshold, the response time no longer decreases. This result is corroborated by an analytical study. The response time for Hilbert-Curve scales better than Disk-Modulo or Fieldwise-XOR, but the difference between its performance and the best possible performance increases with the degree of skew in the data distribution. As an alternative to the index-based schemes, we developed a declustering algorithm based on a proximity measure. To evaluate this algorithm, we compared its performance with that of the three index-based schemes mentioned above as well as other proximity-based schemes in the literature. Our evaluation was based on two real datasets and three synthetic datasets. The real datasets were: (1) a sequence of snapshots from a Direct Simulation Monte Carlo code (from NASA Langley) and (2) stock price data for a basket of stocks over a period of time. This algorithm has also been used in the Titan database described earlier. Results from our simulation experiments indicate that the proposed algorithm achieves better declustering than the algorithms we compared it to, particularly for configurations with a large number of disks. This research is described in our paper titled "Study of Scalable Declustering Algorithms for Parallel Grid Files", which was presented at the Tenth International Parallel Processing Symposium (IPPS'96).
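For concreteness, two of the index-based schemes and the response-time metric used in our simulations can be sketched as follows (a two-dimensional, illustrative sketch only; the proximity-based algorithm itself is not reproduced here):

```python
def disk_modulo(i, j, num_disks):
    # Disk-Modulo: grid cell (i, j) is assigned to disk (i + j) mod M
    return (i + j) % num_disks

def fieldwise_xor(i, j, num_disks):
    # Fieldwise-XOR: bitwise XOR of the cell coordinates, mod M
    return (i ^ j) % num_disks

def response_time(cells, assign, num_disks):
    """Cost model for a range query: if each disk delivers one block per
    round, the response time is the maximum number of the query's cells
    assigned to any single disk."""
    counts = [0] * num_disks
    for i, j in cells:
        counts[assign(i, j, num_disks)] += 1
    return max(counts)
```

On a diagonal range query, for instance, Fieldwise-XOR maps every cell to the same disk; this kind of worst-case imbalance is one illustration of why the index-based schemes can stop scaling beyond a certain number of disks.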

5. Interprocedural analysis for placement of I/O operations

We developed an Interprocedural Balanced Code Placement (IBCP) technique for compiler placement of I/O calls. The goal of this technique is to maximize the overlap between I/O and computation. Each synchronous I/O operation is replaced by a balanced pair of asynchronous operations. The asynchronous operations are placed so as to achieve overlap with the computation, while maintaining correctness and safety. To be able to overlap disk accesses with computation, it is important for the compiler to analyze code across procedure boundaries. We implemented a Fortran source-to-source transformation tool which performs the IBCP analysis. We used this tool to compile "satellite", a satellite-data processing template based on pathfinder which repeatedly modifies an out-of-core image. Our results show that the use of compiler-placed asynchronous write operations can reduce the I/O overhead for this template by 25%-40% and improve the overall performance by 13.3%-14.7%. Performing interprocedural analysis for placement was critical for getting better performance; almost no overlap would have been possible if the analysis had been restricted to single procedures. These results are included in our paper "An Interprocedural Framework for Placement of Asynchronous I/O Operations", which was presented at the 1996 ACM International Conference on Supercomputing (ICS'96).

6. Lessons learnt and suggestions

In this section, we briefly present the lessons learnt from our experience. We believe they would be useful to science algorithm developers as well as to people working on configuring the hardware and software systems.

Satellite data generation programs are relatively easy to parallelize: Given the common structure of different data product generation programs, we believe that the parallelization scheme described in this report should be suitable for most, if not all, data product generation programs. Since communication between peers is needed only for putting together the final output, this scheme should work as well on shared memory as on distributed memory machines.

Proper code restructuring is important: As far as possible, I/O should be done in the outermost nest of a nested loop. Embedding I/O calls in the inner nests of a loop usually results in a sequence of small requests interleaved with seeks. It is usually possible to restructure the loop nests so that the I/O is performed in the outermost nest and only computation is done in the inner nests. This restructuring is illustrated by the following example, which is based on the composition module of pathfinder. For the applications we studied, this was not a difficult operation.

Original code:

for (i = 0; i < num_input_pixels; i++)
    for (j = 0; j < num_bands; j++)
        map input pixel to output image
        seek to output pixel in this band
        write pixel value for this band

Restructured code:

determine the bounding box of the output pixels involved
for (j = 0; j < num_bands; j++)
    read in the bounding box for this band
    for (i = 0; i < num_input_pixels; i++)
        map input pixel to output image
        update output image pixel for this band
    write out the bounding box for this band

Information about future requests is usually available: In the parallelization scheme described above, processors subsample the input files in the partitioning phase. At the end of this phase, every processor has complete information about its future requests for input reads. For the modified version of the out-of-core max-reduction (where the modification consisted of a pair of simple loop-splitting and loop-reordering transformations), information about updates to all frequency bands of the output image is known before any updates are performed.

Intermediate data should be kept on local disks: It is possible to partition the intermediate data so that each processor reads from and writes to its own local disk(s). Bandwidths for local disk access are substantially higher than the bandwidths for non-local accesses. In addition, local accesses are guaranteed not to interfere with I/O requests from other processors. This increases the utility of the file cache and makes the overall behavior of the application more predictable.

stdio is usually not suitable for satellite data processing: Many programs use the fread/fwrite interface for I/O, which introduces an extra level of buffering and requires one extra copy. Since individual requests are usually large, the buffering performed by fread/fwrite provides no performance advantage, and the read/write interface is likely to perform better.

Geo-location information should be placed at the top of the file: In the absence of information that can be used to map the IFOVs contained in a file, our parallelization scheme is forced to have each processor subsample all the files. This is inefficient and limits scalability. Providing geo-location information at the beginning of each file would allow each processor to read an amount of data proportional to the number of files.

Diskful machines are important: Diskful machines (machines with local disks) allow problems to be partitioned such that most of the I/O requests are satisfied by local disks. As noted above, local disk accesses have a higher application-level bandwidth with the associated benefit of guaranteed lack of contention for the disk and the file cache. In combination with code restructuring to exploit locality, diskful machines can improve both the I/O performance and the overall execution time for out-of-core applications.

Complex I/O interfaces are not required: After code restructuring, most requests in the studied applications were large. For large requests, the interface is usually less important. Small strided requests were a recurrent pattern in the original versions of pathfinder and climate. However, we found that these patterns were caused by the embedding of small I/O requests in the innermost loops. Relatively straightforward loop restructuring (loop splitting, interchanging the order of nested loops, and fusing multiple requests) was sufficient to coalesce these requests into large block I/O requests. None of the applications studied required collective I/O. This is not surprising given the size of the requests after code restructuring. All of the applications are parallelized in SPMD fashion, and in our earth-science applications all processes are independent (apart from initial and possibly final synchronization). Independent I/O requests were able to utilize the servers when they would have been idle in a collective-I/O model.

Compiler-directed placement of I/O operations can eliminate I/O waiting time: Our experiments showed that complete overlap of write operations with computation can be achieved through flow-sensitive interprocedural analysis. Note that almost no overlap would have been possible if the analysis had been restricted to single procedures.

The declustering algorithm mentioned above scales well: As the number of disks is increased, it consistently achieves a better response time than all the other algorithms (with a few exceptions when the number of disks is small). It also achieves perfect data balance and maximizes disk space utilization. Furthermore, it rarely maps buckets that are close in the data space to the same disk, indicating that the distributions it generates are probably quite close to the optimal distribution.