Tuning the I/O Performance of Earth Science Applications

Joel Saltz, Anurag Acharya, Alan Sussman, Jeff Hollingsworth, Michael Beynon
University of Maryland and Center of Excellence in Space Data & Information Sciences (CESDIS), NASA Goddard

Maryland is targeting three different types of I/O-intensive applications: earth science applications (sensor data processing), out-of-core sparse matrix factorization, and queries against datasets generated by programs that carry out scientific simulations. We describe our experience with one of our Earth Science applications in detail and outline the lessons learned thus far from Maryland's application-driven research program in Scalable I/O.

We are making use of an IBM SP-2 multiprocessor with 16 thin nodes. Each node is identically configured and contains one Power2 processor and 64 megabytes of main memory. Each node also has an aggressive I/O subsystem consisting of two fast-wide SCSI buses, each with three 2.2 gigabyte IBM Starfire 7200 SCSI disks. Each disk has a rated minimum bandwidth of 8 megabytes/second, and each SCSI bus has a maximum bandwidth of 20 megabytes/second. The overall system has 96 disks totaling 192 gigabytes, and a theoretical maximum aggregate disk bandwidth of 640 megabytes/second.

For each application, the objective was simply to make the program run as fast as possible and to keep track of what was required to achieve this speed. The results have been encouraging: while the achievable I/O rates have been far smaller than the theoretical maxima, we have been able to tune all of the sensor data and sparse matrix applications to the point that I/O time does not predominate. A query against a dataset produced by a scientific calculation involves very little computation, so for that application we simply do our best to optimize I/O time.

Our set of applications includes two earth science programs, Pathfinder and Climate, which constitute a complete processing chain for Advanced Very High Resolution Radiometer (AVHRR) satellite data. Pathfinder AVHRR data sets are the primary data sets publicly available from NASA and consist of global, multi-channel data from NOAA meteorological satellites. Pathfinder and Climate are currently in production use by the Goddard Distributed Active Archive Center, and together they are representative of a large class of NASA earth science applications. Furthermore, their structure is similar to that of the large set of programs currently being developed to process data from the Earth Observing System satellites. Our efforts to optimize the I/O costs of Pathfinder and Climate have both been successful. We focus this discussion on Pathfinder, as it is more complex and requires more I/O than Climate.

Pathfinder processes AVHRR global area coverage (4.4 km resolution) data. The input to the program consists of a set of files, each containing 40-45 megabytes of sensor data. Pathfinder is often run on a day's worth of data from fourteen files; each file corresponds to slightly more than a single orbit's data. Pathfinder performs calibration for instrument drift, topographic correction, masking of cloudy pixels, registration of sensor readings with locations on the ground, and compositing of multiple sensor data pixels corresponding to the same ground location.
In addition to satellite sensor data, Pathfinder reads several ancillary files, such as a topographic map of the world, a global land-sea mask, global ozone data, and ephemeris data specific to the satellite. The output is a twelve-band 5004x2168 Interrupted Goode Homolosine image of the world at 8 km resolution. The unoptimized sequential Pathfinder code took 18,880 seconds on a single SP-2 processor; 13,600 seconds of that time was due to I/O. The total amount of I/O performed by the program is over 28 Gbytes.

To optimize I/O performance, a variety of program restructurings had to be carried out. In the unoptimized program, sensor data was input to Pathfinder 3584 bytes at a time; it proved possible to restructure the code to read 512 Kbytes of data at a time. A recurrent I/O pattern found elsewhere in Pathfinder (and Climate) was the embedding of small I/O requests in the innermost loops. Each such occurrence resulted in nested sequences of small requests with fixed strides; the request size was almost always two bytes and the subsequent seek distance was 20 megabytes. Relatively straightforward loop restructuring transformations were sufficient to aggregate the I/O and move it to the outermost loop.

Pathfinder inputs sensor readings obtained from satellites traversing polar orbits and outputs a 12-band, 8 km resolution image of the world. Our approach to mapping Pathfinder onto a multiprocessor is to partition the output image horizontally (by latitude) and to have each processor sample the input to determine the map between the input data and the output image. This information is used to partition the input among the processors. The output image is repeatedly referenced, since Pathfinder accumulates multiple sensor data values into each element of the output image. By partitioning the output image, we avoid having to communicate when portions of the output image are updated. The output image is too large to be stored in main memory in our SP-2 configuration, so the process of carrying out successive updates to the output image is formulated as an out-of-core computation. Once the updates are complete, the final results are produced by rewriting the partitioned image from local disks and local memories into persistent storage.

I/O is carried out by Jovian-2, a multi-threaded parallel I/O library developed at the University of Maryland. Jovian-2 currently runs on the IBM SP-2 and is being ported to an ATM-connected network of Digital Alpha Sable multiprocessors and to a network of Pentium PCs running Linux. Jovian-2 consists of two parts: a client proxy that runs in the same thread as the application, and a separate server thread. The server thread can serve requests from both local and remote processes; local requests are optimized for fast processing. Jovian-2 allows both peer-to-peer and client-server I/O configurations. It also allows the application process running on each node to control the scheduling of the associated server thread. User files can be striped across disks in an arbitrary way. Jovian-2 assumes that a standard Unix file system with asynchronous I/O calls is available on each node. On the SP-2, our implementation uses the high-speed user-space communication primitives provided by IBM's Message Passing Library (MPL).

To test the capabilities of the underlying IBM AIX file system, we employed a modified version of the IOZONE benchmark to carry out consecutive, contiguous reads and writes of varying request sizes on a single processor.
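The core of such a measurement is simple. The fragment below is not the modified IOZONE code; it is a minimal sketch, using only standard Unix calls, of how sequential read bandwidth can be probed for a given request size (writes can be measured analogously). A real benchmark must also control for buffer-cache effects, for example by using files much larger than main memory.

    /* Minimal sequential-read bandwidth probe (illustrative sketch, not IOZONE).
     * Usage: readbw <file> <request_size_in_bytes>                             */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s file request_size\n", argv[0]);
            return 1;
        }
        size_t reqsize = (size_t) atol(argv[2]);
        char *buf = malloc(reqsize);
        int fd = open(argv[1], O_RDONLY);
        if (buf == NULL || fd < 0) {
            perror("setup");
            return 1;
        }

        struct timeval t0, t1;
        gettimeofday(&t0, NULL);

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, reqsize)) > 0)   /* consecutive, contiguous reads */
            total += n;

        gettimeofday(&t1, NULL);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1.0e6;
        printf("%lld bytes in %.2f seconds = %.2f Mbytes/second\n",
               total, secs, (double) total / (1024.0 * 1024.0) / secs);

        close(fd);
        free(buf);
        return 0;
    }

Timings from a loop like this are dominated by the buffer cache unless the file is much larger than memory, which is one reason a full benchmark such as IOZONE was used for the measurements reported below.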
These single-processor reads and writes were directed at varying numbers of disks and varying numbers of SCSI strings. The maximum achievable read or write bandwidth for 4 Kbyte requests was 10 Mbytes/second. The best rate for 1 Mbyte requests was 15 Mbytes/second for reads and 17 Mbytes/second for writes. These rates were achieved using a total of four disks, configured as two disks on each of two SCSI strings; employing three disks on each of the two SCSI strings yielded minimal additional improvement.

Our experiments on Pathfinder used configurations with varying numbers of processors. On each processor, Jovian-2 was configured to use a total of four disks on two SCSI strings. Once Pathfinder was properly restructured and partitioned, most I/O could be carried out locally. The local I/O requests were large (512 Kbytes) and consecutive requests were frequently directed at contiguous file locations. The restructured code also allowed the use of asynchronous I/O. The net effect of this restructuring was that local I/O overheads were reduced to a few percent of total runtime (e.g., local I/O was 4% of total execution time when Pathfinder was run on 10 client nodes and 2 server nodes).

The non-local I/O required to distribute the input dataset posed much more of a problem. Our best overall results were obtained when we set aside a small number of processors as input data servers. For instance, when we used a total of 12 processors, we achieved the best overall runtime using two processors as I/O servers. For this configuration, we measured an end-to-end execution time of 1276 seconds. Non-local I/O constituted 11% of this execution time, computation accounted for 85%, and local I/O accounted for the remaining 4%. The per-processor bandwidth of the input phase of the non-local I/O was 4.7 Mbytes/second, and the per-processor bandwidth of the output phase was 9.0 Mbytes/second.

The goal of the studies described here is to provide guidelines for obtaining high I/O performance from I/O-intensive applications. The Pathfinder example described above includes many features that are typical of our experience. The first conclusion is already well known: it is crucial to restructure applications so that they make large, contiguous I/O requests. We have found it possible, with varying amounts of effort, to reformulate the applications we have encountered so that they make large I/O requests. The second conclusion is that the applications we have worked with can be profitably restructured so that most I/O is carried out on files that are local to a processor. This is a potentially important point, since relatively simple software and inexpensive I/O configurations can provide respectable single-node I/O performance; it is much more difficult to obtain corresponding performance levels from parallel I/O libraries and file systems.

The other set of lessons from our I/O performance tuning studies is that the common wisdom about I/O requests and disk access strategies does not appear to hold for many I/O-intensive applications. Strided requests were not required by any of the applications we studied. Appropriate restructuring of an application to optimize its I/O requests can transform requests that appear to require strided access into unit-stride requests, as the sketch below illustrates.
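As a concrete illustration, consider the recurring pattern described earlier: two-byte requests in an inner loop, separated by 20 megabyte seeks. The fragment below is a hypothetical sketch, not Pathfinder code; the record size, field count, and file layout are assumptions made for illustration. It shows the general form of the rewrite: interchanging the loops and fusing the small requests replaces the strided two-byte reads with one large contiguous read per record, after which the two-byte values are extracted in memory.

    /* Illustrative transformation of strided small reads into large contiguous
     * reads.  RECORD_SIZE, NRECORDS, NFIELDS and the layout are assumptions.   */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define RECORD_SIZE (20L * 1024 * 1024)   /* assumed stride between requests     */
    #define NRECORDS    14                    /* assumed number of records in a file */
    #define NFIELDS     1000                  /* assumed 2-byte fields per record    */

    /* Before: the inner loop issues 2-byte reads separated by 20 Mbyte seeks. */
    void strided_version(FILE *f, int16_t out[NFIELDS][NRECORDS])
    {
        for (int field = 0; field < NFIELDS; field++)
            for (int rec = 0; rec < NRECORDS; rec++) {
                fseek(f, (long) rec * RECORD_SIZE + 2L * field, SEEK_SET);
                fread(&out[field][rec], sizeof(int16_t), 1, f);
            }
    }

    /* After: interchange the loops and aggregate the I/O -- one large contiguous
     * read per record, with the 2-byte values then extracted from the buffer.   */
    void contiguous_version(FILE *f, int16_t out[NFIELDS][NRECORDS])
    {
        int16_t *buf = malloc(NFIELDS * sizeof(int16_t));
        for (int rec = 0; rec < NRECORDS; rec++) {
            fseek(f, (long) rec * RECORD_SIZE, SEEK_SET);
            fread(buf, sizeof(int16_t), NFIELDS, f);    /* single contiguous request */
            for (int field = 0; field < NFIELDS; field++)
                out[field][rec] = buf[field];
        }
        free(buf);
    }

In Pathfinder itself, the aggregated requests could then be made large and asynchronous, which is what reduced local I/O to a few percent of the total run time.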
We found it possible to eliminate strided requests by interchanging the order of nested loops and by fusing multiple requests. We also noted that none of our applications appeared to require collective I/O operations. All the programs are data parallel, but large I/O requests can be generated locally on each processor, so there does not appear to be a significant advantage in combining requests from multiple processors into a collective I/O request.

In addition to support from the Scalable I/O Consortium, this project has also been supported by a grant from the Earth Science Data & Information System Project (ESDIS) at NASA Goddard Space Flight Center and by the University of Maryland NSF Grand Challenge project in Land Cover Dynamics. We would like to thank ESDIS managers Tonjua Hines and Steve Kempler for their constant support and encouragement, and to thank Peter Smith and Mary James of the Goddard DAAC for providing the Pathfinder and Climate codes.