On Jan 20, 2009 11:11 +0000, Herbert Fruchtl wrote:
> It is not particularly stable. Every 1-3 weeks the filesystem goes awol and I
> have to reboot the machine. This morning I did an "ls -lR" on the front-end
> (which serves as MDS), just to count the files, and it took more than one hour.
> "top" showed "ls" taking up anything between 5% and 90% of a CPU during this
> time (most of the time in the 10-30% range). Is this normal?
>
> The crash this weekend was preceded by half a day during which the front-end
> kept losing and regaining connection to the filesystem. It worked for a while,
> then "df" gave an "input/output error", or "Cannot send after transport
> endpoint", then recovered again. It seemed OK all the time from the compute
> nodes, and from one external Lustre client (until it went away completely).
>
> I have inherited this cluster and I am not an expert in filesystems. The
> timeout is set to its default 100s. How do I find out what's wrong?
There was a series of bugs related to "statahead" in the 1.6.5.1 release
that could cause problems with "ls -lR" type workloads. You can
disable this feature with "echo 0 > /proc/fs/lustre/llite/*/statahead_max",
or upgrade at least the clients to the 1.6.6 release.
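
If the clients are managed centrally, a minimal sketch of doing this on all
of them (assuming pdsh is installed; the node list is only an example) is:

  # check the current statahead setting on each client
  pdsh -w node[01-16] 'cat /proc/fs/lustre/llite/*/statahead_max'
  # disable statahead until the clients can be upgraded to 1.6.6
  pdsh -w node[01-16] 'echo 0 > /proc/fs/lustre/llite/*/statahead_max'

Note that a /proc setting like this does not persist across a remount of the
filesystem, so it would need to be reapplied (or put in a startup script)
until the upgrade is done.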
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.