On Jan 20, 2009 11:11 +0000, Herbert Fruchtl wrote:
> It is not particularly stable. Every 1-3 weeks the filesystem goes awol and I
> have to reboot the machine. This morning I did an "ls -lR" on the front-end
> (which serves as MDS), just to count the files, and it took more than one hour.
> "top" showed "ls" taking up anything between 5% and 90% of a CPU during this
> time (most of the time in the 10-30% range). Is this normal?
>
> The crash this weekend was preceded by half a day during which the front-end
> kept losing and regaining connection to the filesystem. It worked for a while,
> then "df" gave an "input/output error", or "Cannot send after transport
> endpoint", then recovered again. It seemed OK all the time from the compute
> nodes, and from one external Lustre client (until it went away completely).
>
> I have inherited this cluster and I am not an expert in filesystems. The
> timeout is set to its default 100s. How do I find out what's wrong?
There was a series of bugs related to "statahead" in the 1.6.5.1 release
that could cause problems with "ls -lR" type workloads. You can
disable this feature with "echo 0 > /proc/fs/lustre/llite/*/statahead_max",
or upgrade at least the clients to the 1.6.6 release.
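
If the clients are managed centrally, a minimal sketch of doing this on all
of them (assuming pdsh is installed; the node list is only an example) is:

  # check the current statahead setting on each client
  pdsh -w node[01-16] 'cat /proc/fs/lustre/llite/*/statahead_max'
  # disable statahead until the clients can be upgraded to 1.6.6
  pdsh -w node[01-16] 'echo 0 > /proc/fs/lustre/llite/*/statahead_max'

Note that a /proc setting like this does not persist across a remount of the
filesystem, so it would need to be reapplied (or put in a startup script)
until the upgrade is done.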
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.