Tuning the I/O Performance of Earth Science Applications

Joel Saltz, Anurag Acharya, Alan Sussman, Jeff Hollingsworth, Michael Beynon
University of Maryland and Center of Excellence in Space Data & Information Sciences (CESDIS), NASA Goddard

Maryland is targeting three different types of I/O-intensive applications: earth science applications (sensor data processing), out-of-core sparse matrix factorization, and queries against datasets generated by programs that carry out scientific simulations. We describe our experience with one of our Earth Science applications in detail and outline the lessons learned thus far from Maryland's application-driven research program in Scalable I/O.

We are making use of an IBM SP-2 multiprocessor with 16 thin nodes. Each node is identically configured and contains one Power2 processor and 64 megabytes of main memory. Each node also has an aggressive I/O subsystem consisting of two fast-wide SCSI buses, each with three 2.2 gigabyte IBM Starfire 7200 SCSI disks. Each disk has a rated minimum bandwidth of 8 megabytes/second, and each SCSI bus has a maximum bandwidth of 20 megabytes/second. The overall system has 96 disks totaling 192 gigabytes, and a theoretical maximum aggregate disk bandwidth of 640 megabytes/second.

For each application, the objective was simply to make the program run as fast as possible and to keep track of what was required to achieve this speed. The results have been encouraging: while the achievable I/O rates have been far smaller than the theoretical maxima, we have been able to tune all of the sensor data and sparse matrix applications to the point that I/O time does not predominate. A query against a dataset produced by a scientific calculation involves very little computation, so for that application we simply do our best to optimize I/O time.

Our set of applications includes two earth science programs, Pathfinder and Climate, which constitute a complete processing chain for Advanced Very High Resolution Radiometer (AVHRR) satellite data. Pathfinder AVHRR data sets are the primary data sets publicly available from NASA and consist of global, multi-channel data from NOAA meteorological satellites. Pathfinder and Climate are currently in production use by the Goddard Distributed Active Archive Center, and together they are representative of a large class of NASA earth science applications. Furthermore, their structure is similar to that of the large set of programs currently being developed to process data from the Earth Observing System satellites. Our efforts to optimize the I/O costs of Pathfinder and Climate have both been successful. We focus this discussion on Pathfinder, as it is more complex and requires more I/O than Climate.

Pathfinder processes AVHRR global area coverage (4.4 km resolution) data. The input to the program consists of a set of files, each containing 40-45 megabytes of sensor data. Pathfinder is often run on a day's worth of data from fourteen files; each file corresponds to slightly more than a single orbit's data. Pathfinder performs calibration for instrument drift, topographic correction, masking of cloudy pixels, registration of sensor readings with locations on the ground, and compositing of multiple sensor data pixels corresponding to the same ground location.
In addition to satellite sensor data, Pathfinder reads several ancillary files, such as a topographic map of the world, a global land-sea mask, global ozone data, and ephemeris data specific to the satellite. The output is a twelve-band 5004x2168 Interrupted Goode Homolosine image of the world at 8 km resolution. The unoptimized sequential Pathfinder code took 18,880 seconds on a single SP-2 processor; 13,600 seconds of that time was due to I/O. The total amount of I/O performed by the program is over 28 Gbytes.

To optimize I/O performance, a variety of program restructurings had to be carried out. In the unoptimized program, sensor data was input to Pathfinder 3584 bytes at a time; it proved possible to restructure the code to read 512 Kbytes of data at a time. A recurrent I/O pattern found elsewhere in Pathfinder (and Climate) was the embedding of small I/O requests in the innermost loops. Each such occurrence resulted in nested sequences of small requests with fixed strides; the request size was almost always two bytes and the subsequent seek distance was 20 megabytes. Relatively straightforward loop restructuring transformations were sufficient to aggregate the I/O and move it to the outermost loop.

Pathfinder inputs sensor readings obtained from satellites traversing polar orbits and outputs a 12-band, 8 km resolution image of the world. Our approach to mapping Pathfinder onto a multiprocessor is to partition the output image horizontally (by latitude) and to have each processor sample the input to determine the map between the input data and the output image. This information is used to partition the input among the processors. The output image is repeatedly referenced, since Pathfinder accumulates multiple sensor data values into each element of the output image. By partitioning the output image, we avoid having to communicate when portions of the output image are updated. The output image is too large to be stored in main memory in our SP-2 configuration, so the process of carrying out successive updates to the output image is formulated as an out-of-core computation. Once the updates are complete, the final results are produced by rewriting the partitioned image from local disks and local memories into persistent storage.

I/O is carried out by Jovian-2, a multi-threaded parallel I/O library developed at the University of Maryland. Jovian-2 currently runs on the IBM SP-2 and is being ported to an ATM-connected network of Digital Alpha Sable multiprocessors and to a network of Pentium PCs running Linux. Jovian-2 consists of two parts: a client proxy that runs in the same thread as the application, and a separate server thread. The server thread can serve requests from both local and remote processes; local requests are optimized for fast processing. Jovian-2 allows both peer-to-peer and client-server I/O configurations. It also allows the application process running on each node to control the scheduling of the associated server thread. User files can be striped across disks in an arbitrary way. Jovian-2 assumes that a standard Unix file system with asynchronous I/O calls is available on each node. On the SP-2, our implementation uses the high-speed user-space communication primitives provided by IBM's Message Passing Library (MPL).

To test the capabilities of the underlying IBM AIX file system, we employed a modified version of the IOZONE benchmark to carry out consecutive, contiguous reads and writes of varying request sizes on a single processor.
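The core of such a measurement is simple. The fragment below is not the modified IOZONE code; it is a minimal sketch, using only standard Unix calls, of how sequential read bandwidth can be probed for a given request size (writes can be measured analogously). A real benchmark must also control for buffer-cache effects, for example by using files much larger than main memory.

    /* Minimal sequential-read bandwidth probe (illustrative sketch, not IOZONE).
     * Usage: readbw <file> <request_size_in_bytes>                             */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s file request_size\n", argv[0]);
            return 1;
        }
        size_t reqsize = (size_t) atol(argv[2]);
        char *buf = malloc(reqsize);
        int fd = open(argv[1], O_RDONLY);
        if (buf == NULL || fd < 0) {
            perror("setup");
            return 1;
        }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, reqsize)) > 0)   /* consecutive, contiguous reads */
            total += n;

        gettimeofday(&t1, NULL);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1.0e6;
        printf("%lld bytes in %.2f seconds = %.2f Mbytes/second\n",
               total, secs, (double) total / (1024.0 * 1024.0) / secs);

        close(fd);
        free(buf);
        return 0;
    }

Timings from a loop like this are dominated by the buffer cache unless the file is much larger than memory, which is one reason a full benchmark such as IOZONE was used for the measurements reported below.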
These single-processor reads and writes were directed at varying numbers of disks and varying numbers of SCSI strings. The maximum achievable read or write bandwidth for 4 Kbyte requests was 10 Mbytes/second. The best rate for 1 Mbyte requests was 15 Mbytes/second for reads and 17 Mbytes/second for writes. These rates were achieved using a total of four disks, configured as two disks on each of two SCSI strings; employing three disks on each of the two SCSI strings yielded minimal additional improvement.

Our experiments on Pathfinder used configurations with varying numbers of processors. On each processor, Jovian-2 was configured to use a total of four disks on two SCSI strings. Once Pathfinder was properly restructured and partitioned, most I/O could be carried out locally. The local I/O requests were large (512 Kbytes) and consecutive requests were frequently directed at contiguous file locations. The restructured code also allowed the use of asynchronous I/O. The net effect of this restructuring was that local I/O overheads were reduced to a few percent of total runtime (e.g., local I/O was 4% of total execution time when Pathfinder was run on 10 client nodes and 2 server nodes).

The non-local I/O required to distribute the input dataset posed much more of a problem. Our best overall results were obtained when we set aside a small number of processors as input data servers. For instance, when we used a total of 12 processors, we achieved the best overall runtime using two processors as I/O servers. For this configuration, we measured an end-to-end execution time of 1276 seconds. Non-local I/O constituted 11% of this execution time, computation accounted for 85%, and local I/O accounted for the remaining 4%. The per-processor bandwidth of the input phase of the non-local I/O was 4.7 Mbytes/second, and the per-processor bandwidth of the output phase was 9.0 Mbytes/second.

The goal of the studies described here is to provide guidelines for obtaining high I/O performance from I/O-intensive applications. The Pathfinder example described above includes many features that are typical of our experience. The first conclusion is already well known: it is crucial to restructure applications so that they make large, contiguous I/O requests. We have found it possible, with varying amounts of effort, to reformulate the applications we have encountered so that they make large I/O requests. The second conclusion is that the applications we have worked with can be profitably restructured so that most I/O is carried out on files that are local to a processor. This is a potentially important point, since relatively simple software and inexpensive I/O configurations can provide respectable single-node I/O performance; it is much more difficult to obtain corresponding performance levels from parallel I/O libraries and file systems.

The other set of lessons from our I/O performance tuning studies is that the common wisdom about I/O requests and disk access strategies does not appear to hold for many I/O-intensive applications. Strided requests were not required by any of the applications we studied. Appropriate restructuring of an application to optimize its I/O requests can transform requests that appear to require strided access into unit-stride requests, as the sketch below illustrates.
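As a concrete illustration, consider the recurring pattern described earlier: two-byte requests in an inner loop, separated by 20 megabyte seeks. The fragment below is a hypothetical sketch, not Pathfinder code; the record size, field count, and file layout are assumptions made for illustration. It shows the general form of the rewrite: interchanging the loops and fusing the small requests replaces the strided two-byte reads with one large contiguous read per record, after which the two-byte values are extracted in memory.

    /* Illustrative transformation of strided small reads into large contiguous
     * reads.  RECORD_SIZE, NRECORDS, NFIELDS and the layout are assumptions.   */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define RECORD_SIZE (20L * 1024 * 1024)   /* assumed stride between requests     */
    #define NRECORDS    14                    /* assumed number of records in a file */
    #define NFIELDS     1000                  /* assumed 2-byte fields per record    */

    /* Before: the inner loop issues 2-byte reads separated by 20 Mbyte seeks. */
    void strided_version(FILE *f, int16_t out[NFIELDS][NRECORDS])
    {
        for (int field = 0; field < NFIELDS; field++)
            for (int rec = 0; rec < NRECORDS; rec++) {
                fseek(f, (long) rec * RECORD_SIZE + 2L * field, SEEK_SET);
                fread(&out[field][rec], sizeof(int16_t), 1, f);
            }
    }

    /* After: interchange the loops and aggregate the I/O -- one large contiguous
     * read per record, with the 2-byte values then extracted from the buffer.   */
    void contiguous_version(FILE *f, int16_t out[NFIELDS][NRECORDS])
    {
        int16_t *buf = malloc(NFIELDS * sizeof(int16_t));
        for (int rec = 0; rec < NRECORDS; rec++) {
            fseek(f, (long) rec * RECORD_SIZE, SEEK_SET);
            fread(buf, sizeof(int16_t), NFIELDS, f);    /* single contiguous request */
            for (int field = 0; field < NFIELDS; field++)
                out[field][rec] = buf[field];
        }
        free(buf);
    }

In Pathfinder itself, the aggregated requests could then be made large and asynchronous, which is what reduced local I/O to a few percent of the total run time.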
We found it possible to eliminate strided requests by interchanging the order of nested loops and by fusing multiple requests. We also noted that none of our applications appeared to require collective I/O operations. All the programs are data parallel, but large I/O requests can be generated locally on each processor, so there does not appear to be a significant advantage in combining requests from multiple processors into a collective I/O request.

In addition to support from the Scalable I/O Consortium, this project has also been supported by a grant from the Earth Science Data & Information System Project (ESDIS) at NASA Goddard Space Flight Center and by the University of Maryland NSF Grand Challenge project in Land Cover Dynamics. We would like to thank ESDIS managers Tonjua Hines and Steve Kempler for their constant support and encouragement, and to thank Peter Smith and Mary James of the Goddard DAAC for providing the Pathfinder and Climate codes.