We're running Lustre 1.6.3 and Linux 2.6.18 on our 972-node
(5832-processor) machines, and we're seeing some interesting problems
when we run executables from a Lustre filesystem. When we run
5000-processor jobs, we often see some - maybe only a few, maybe a
couple of dozen - fail with illegal-instruction and other traps, where
examining the core file shows that the instructions in question are just
fine (and the same as on jobs that succeeded). Has anybody else seen
similar problems running executables from a Lustre filesystem?
The setup in our lab only has MGS+MDT and one OST on one node, and two
OSTs on another, exported to the rest via socklnd over our Ethernet
emulation. This originally showed up in some Fortran code, but we have
also been able to reproduce it with a generated C program that contains
nothing but 50,000 "x = x + 1" lines. On the theory that this has
something to do with I/O being completed prematurely - i.e. while
buffers are in fact still being filled - we produced a variant of the
program that walks through the entire program text to make sure the
pages all get loaded well before they're accessed, and the failures do
not occur in this mode. Stranger still, after a few runs (more than
one) with the page-scanning turned on, runs without the page-scanning
also start to succeed. Copy the executable to a new location, though,
and the failures start all over again. This seems to support the theory
that there's a race in the I/O completion code, but it doesn't tell us
much more than that.
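
For what it's worth, the page-scanning variant boils down to something
like the sketch below. This is not our exact code, just the shape of it,
and it assumes a GNU toolchain where the linker provides
__executable_start and etext to bracket the text segment:

    #include <stdio.h>
    #include <unistd.h>

    extern char __executable_start;   /* start of text segment (GNU ld) */
    extern char etext;                /* first address past the text segment */

    static void pretouch_text(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        volatile const char *p;
        unsigned long sum = 0;

        /* Read one byte from every page of our own text so each page is
         * demand-paged in (from Lustre) before the real code path runs. */
        for (p = &__executable_start; p < &etext; p += pagesize)
            sum += *p;

        /* Print the checksum so the compiler can't optimize the loop away. */
        fprintf(stderr, "pretouch: text checksum %lu\n", sum);
    }

    int main(void)
    {
        pretouch_text();
        /* ... the generated "x = x + 1" workload would go here ... */
        return 0;
    }

The volatile read per page is only there to keep the compiler from
throwing the loop away; the point is simply to fault every text page in
before the generated code is reached.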
There's a significant chance that the problem is architecture-specific
(our CPU architecture is MIPS with weak memory ordering) and/or in Linux
rather than Lustre, but the same test ran fine under Lustre 1.6beta on
Linux 2.6.15, and it also runs fine with our current versions when the
executable lives on other filesystems (e.g. NFS or ext3 over NBD). If
anybody has any suggestions about places to look, parameters to tweak
for the sake of experimentation, etc., it would
be most appreciated.