This section describes the major problems encountered during this
work. These problems worsened the existing time constraints and hindered
more thorough experimentation.
-
Shared libraries under OSF/1 do not work as advertised. Using them
while linking in the aio library for asynchronous I/O and the
pthreads library reproducibly caused core dumps
before main() was ever called. The solution was to pass the
-non_shared and -threads options to the linker: the first avoids shared
libraries, and the second links in the thread-safe versions of libraries
and pulls in pthreads automatically. Node 10 runs DEC Unix 4.0 beta
and did not exhibit this behavior.
-
Jovian-2 executes asynchronous calls such as aio_read() and
aio_write(). In a test case that copied a large file to a small
number of servers, the system eventually stopped allowing any further
calls to execute successfully. OSF/1's implementation of the aio_* calls
is somewhat different from the SP-2's implementation, even though both
follow the same POSIX standard. At first, I thought this was an
OS bug and tried to create a simple test case to illustrate the problem.
It turns out that each OSF/1 aio_* call must be followed by a
call to aio_return() to get the return value of the call and, as
a side effect, to free the internal resources used in the static kernel tables
for asynchronous I/O calls. Without this, all aio_* calls after
the 64th would fail.
-
It was unclear whether the ATM switch was really being used for all
messages. The farm is set up to use two sets of node names: the
normal name, and the normal name suffixed with
-a. The second name refers to the ATM connection.
This is fine, but the only MPI implementation that would work (MPICH)
always uses the current node as the first node and looks to the config
file for the other nodes to allocate. The problem is that the first node is
discovered by getting the hostname, which lacks the suffix designating
the ATM switch. One hypothesis is that the ATM switch is therefore
used only for communication between pairs of nodes that exclude the first
node (the one where the job was started). MPICH provides a -nolocal option
to mpirun so that it does not assume the current node is the first node to
run on. Unfortunately, this option did not work as it should, and the
behavior could not be avoided.
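
The aio_return() requirement described in the second item above can be sketched as follows. This is a minimal illustration written against the POSIX asynchronous I/O interface, not code taken from Jovian-2: the helper name read_with_aio and its parameters are hypothetical, and the sketch assumes a single outstanding request that the caller is willing to wait for.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Jovian-2): read up to len bytes from
 * the start of the file at path using POSIX asynchronous I/O.
 * Returns the byte count reported by aio_return(), or -1 on failure.
 */
ssize_t read_with_aio(const char *path, void *buf, size_t len)
{
    struct aiocb cb;
    const struct aiocb *pending[1];
    ssize_t nread;
    int err;
    int fd;

    assert(path != NULL && buf != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Describe the request: which file, which buffer, how much, where. */
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE; /* no completion signal */

    if (aio_read(&cb) != 0) {
        close(fd);
        return -1;
    }

    /* Wait until the request is no longer in progress. */
    pending[0] = &cb;
    while (aio_error(&cb) == EINPROGRESS)
        (void)aio_suspend(pending, 1, NULL);

    err = aio_error(&cb);

    /*
     * Collect the result exactly once, even if the request failed.
     * Per the text above, on OSF/1 this call is also what releases the
     * kernel table entry held by the request.
     */
    nread = aio_return(&cb);

    close(fd);
    if (err != 0) {
        errno = err;
        return -1;
    }
    return nread;
}
```

A caller would use it like an ordinary read, for example read_with_aio("data.bin", buf, sizeof buf), where the file name is only a placeholder. The point of the sketch is that aio_return() is called on every path that successfully queued a request, so no completed request is left holding resources.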
Generated by latex2html-95.1
Tue May 14 18:14:17 EDT 1996