Herb Wartens
2006-Jun-18 11:18 UTC
[Lustre-discuss] Re: Lustre-discuss Digest, Vol 5, Issue 11
lustre-discuss-request@clusterfs.com wrote:

> Message: 1
> Date: Mon, 12 Jun 2006 14:23:23 -0400 (EDT)
> From: allen.todd@sig.com (Allen Todd)
> Subject: [Lustre-discuss] Lustre 1.4.6.1 system hangs
> To: lustre-discuss@clusterfs.com
> Message-ID: <200606121823.k5CINNN21816@sauza.dev.susq.com>
> Content-Type: text/plain; charset=us-ascii
>
> Sorry if this is a repeat, but I never saw it come across the list last week
> and the archive does not contain messages for the current month....
>
> Summary: after upgrade from 1.2.6 to 1.4.6, system hangs have made
>          our Lustre file system almost unusable
> Version: 1.4.6
> Platform: i386
> OS/Version: Novell SUSE Linux Enterprise Server 9 w/ SP2
>
> First question -- anyone out there running Lustre on SLES9? Seems like
> most people are using a Red Hat variant.

Here at LLNL we are using Lustre on our BGL front-end nodes as clients.
These are running SLES just fine.

> Second plea -- if you have any advice on figuring out how to diagnose the
> cause of the system hangs w/ Lustre and Linux, I would appreciate that even
> if you haven't had problems like I am currently experiencing.

To help pinpoint your problem, I would recommend trying to test on a small
set of systems where some are dedicated as servers (MDSs and OSTs) and some
as clients. This will probably help you pinpoint which subsystem is causing
the issue.

> On to the problem description....
>
> Our first Lustre volumes were set up with Novell SLES9 SP2 and the Lustre 1.2.6
> components included on the SLES9 CD (lustre-lite-1.2.6.suse.2-0.3.i586.rpm). We
> were running with clients and servers on the same machines and things were
> working reasonably well, but recovery timeouts from down machines, and a rarely
> encountered deadlock, inspired me to upgrade to the latest release RPMs
> compiled for SLES9 -- 1.4.6.1.
>
> So far 1.4.6.1 and 1.4.6 have been very unstable for us. Out of our cluster of
> 150 machines, we have been experiencing 5 to 7 system hangs per day. I have
> LKCD tools installed, but as none of the machines have keyboards, I have not
> been able to get any system core dumps. Before hanging, Ganglia often indicates
> that the machine is experiencing high load averages (6 to 20 or so).
>
> It is hard for me to pinpoint what may be causing the hang, because we are
> using a configuration with the Lustre client and Lustre server running on the
> same machines. We have 5 Lustre volumes, each with 30 IBM e326 dual Opteron 270
> servers. 29 of the machines are OSTs and the 30th is the MDS. They all use
> a second internal SATA drive for the OST/MDS. When there is no Lustre traffic
> (but file systems mounted) the systems do not hang. Once our computation
> programs start using the Lustre file system, it can take minutes or hours for
> a system hang to occur. Rebooting gets the file system functional again until
> a different node fails. Very seldom is it the same node failing twice in a row.
>
> Errors in the syslog only seem to be from other clients unable to access the
> OST that has hung, with no errors from the hanging system itself until it is
> rebooted and recovery has started.
>
> Lustre components installed include:
>
> # rpm -qa | grep lustre
> kernel-bigsmp-2.6.5-7.244_lustre.1.4.6
> lustre-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp
> lustre-modules-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp
> lustre-debuginfo-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp
>
> or
>
> # rpm -qa | grep lustre
> lustre-modules-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp
> lustre-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp
> kernel-bigsmp-2.6.5-7.252_lustre.1.4.6.1
> lustre-debuginfo-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp
>
> I am also still using the same config.xml that I used for Lustre 1.2.6, and
> I am mounting the file systems with: lconf --node client /etc/lustre/config.xml

My best advice here would be to review the migration guide for moving to
1.4.6. It has quite a few examples of what should be done. Hopefully this
will help solve your issue.

-Herb
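The dedicated-server test suggested above could look roughly like the
following, reusing the lconf invocation style from Allen's post. This is a
sketch only: the profile names "mds1" and "ost1" and the separation of roles
are assumptions, and must match the node profiles actually defined in your
config.xml.

```shell
# Sketch, assuming node profiles "mds1", "ost1", and "client" exist in
# config.xml -- only "client" appears in the original post.

# On the dedicated MDS node:
lconf --node mds1 /etc/lustre/config.xml

# On each dedicated OST node:
lconf --node ost1 /etc/lustre/config.xml

# On a dedicated client node (same invocation as in the original post):
lconf --node client /etc/lustre/config.xml
```

With the roles split this way, a hang that follows the client nodes points
at the client stack, while a hang that stays with the OST nodes points at
the server side.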
Andreas Dilger
2006-Jun-18 11:18 UTC
[Lustre-discuss] Re: Lustre-discuss Digest, Vol 5, Issue 11
On Jun 12, 2006 11:57 -0700, Herb Wartens wrote:

> allen.todd@sig.com (Allen Todd) wrote:
> > Second plea -- if you have any advice on figuring out how to diagnose the
> > cause of the system hangs w/ Lustre and Linux, I would appreciate that even
> > if you haven't had problems like I am currently experiencing.
>
> To help pinpoint your problem, I would recommend trying to test on a small
> set of systems where some are dedicated as servers (MDSs and OSTs) and some
> as clients. This will probably help you pinpoint which subsystem is causing
> the issue.

Actually, along that vein, if you can split the compute nodes such that
compute nodes in group A only write to a filesystem with OSTs in group B,
then you should be able to avoid the deadlock I mentioned previously. How
easy this is depends on how your cluster is set up.

Cheers, Andreas

--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.