Report for Scalable I/O, November 1996

Scalable I/O quarterly report for Q396

Joel Saltz, Anurag Acharya


In this quarter, we have made progress on four fronts. First, we have completed the ports of our benchmark satellite data processing programs, Pathfinder and Climate, to a cluster of multiprocessor DEC Alpha workstations at the University of Maryland and to the Beowulf cluster of Pentium PCs running Linux at the NASA Goddard Space Flight Center. Ports to both platforms use Jovian-2, which we had ported to these platforms in the previous quarter (see report for Q296). This effort has uncovered performance problems in the MPI implementations on both platforms, and we are working with the LAM MPI implementors to fix them.

As our SP-2 experiments with Jovian-2 and Pathfinder showed, there is a delicate scheduling relationship between the client and server threads in peer-to-peer configurations (reported in our IOPADS'96 paper). If the application thread is given priority, the server on the same node can and does fall behind requests from other nodes; if the server is given priority, application performance suffers (in the worst case, the application starves). To better understand this scheduling problem, we are extending the Jovian-2 library to support several strategies, and we propose that the application be allowed to control the scheduling of the server on its own node. The first strategy is "Favor I/O Requests": the server sleeps waiting for data requests from other processors; when a request arrives, it wakes up, schedules the disk request, and goes back to sleep; when the disk I/O completes, it wakes up again and sends the data to the requesting processor. The second strategy, "Favor Local Computations", handles off-processor requests only when I/O is needed for the local client. A variation on this strategy ensures that the server thread is also scheduled at fixed intervals to handle off-processor requests, even when the local client is not performing I/O.

Second, we have completed our first round of experiments on Titan, our satellite image database. The goal of these experiments was to evaluate our techniques for partitioning the images into chunks, declustering the chunks over a large disk farm, and placing the chunks assigned to individual disks. Experimental results on our 16-processor SP-2 (six disks per processor) show that Titan provides good performance for global queries and interactive response times for local queries. A global query for a 10-day composite of the normalized vegetation index takes less than 100 seconds; similar queries for Australia and the United Kingdom take 4 seconds and 1.5 seconds, respectively. As mentioned in our previous report, Titan contains 30 GB of AVHRR data. Our declustering improved disk parallelism (the number of disks active for individual queries) by 48 to 70 percent, and reduced the total estimated retrieval time by between 8 and 33 percent. We also evaluated schemes for the placement of data blocks assigned to a single disk, and found that the average length of a read (without an intervening seek) can be improved by about a factor of two. A detailed description of this effort has been submitted for publication; it is currently available as UMD Technical Report CS-TR-3689.

Third, we have identified a suite of I/O-intensive parallel applications that is representative of a large class of such applications. We are currently obtaining detailed I/O traces (file-system-level and disk-level) for each of these applications.

Finally, we are refining the MPI-IO interface, both to make it easier to use and to make it available to non-MPI as well as MPI users. The major issues being addressed are interoperability and simplification of the interface. The thrust of the interoperability effort is to let users choose among different file formats at open time, so that each user can trade off efficiency against portability.