Herb Wartens
2006-Jun-18 11:18 UTC
[Lustre-discuss] Re: Lustre-discuss Digest, Vol 5, Issue 11
lustre-discuss-request@clusterfs.com wrote:

> Message: 1
> Date: Mon, 12 Jun 2006 14:23:23 -0400 (EDT)
> From: allen.todd@sig.com (Allen Todd)
> Subject: [Lustre-discuss] Lustre 1.4.6.1 system hangs
> To: lustre-discuss@clusterfs.com
> Message-ID: <200606121823.k5CINNN21816@sauza.dev.susq.com>
> Content-Type: text/plain; charset=us-ascii
>
> Sorry if this is a repeat, but I never saw it come across the list last week
> and the archive does not contain messages for the current month....
>
> Summary: after upgrade from 1.2.6 to 1.4.6, system hangs have made
>          our Lustre file system almost unusable
> Version: 1.4.6
> Platform: i386
> OS/Version: Novell SUSE Linux Enterprise Server 9 w/ SP2
>
> First question -- anyone out there running Lustre on SLES9? Seems like
> most people are using a Red Hat variant.

Here at LLNL we are using Lustre on our BGL front-end nodes as clients.
These are running SLES just fine.

> Second plea -- if you have any advice on figuring out how to diagnose the
> cause of the system hangs w/ Lustre and Linux, I would appreciate that even
> if you haven't had problems like I am currently experiencing.

To help pinpoint your problem, I would recommend trying to test on a small
set of systems where some are dedicated as servers (MDSs and OSTs) and some
as clients. This will probably help you pinpoint which subsystem is causing
the issue.

> On to the problem description....
>
> Our first Lustre volumes were set up with Novell SLES9 SP2 and the Lustre 1.2.6
> components included on the SLES9 CD (lustre-lite-1.2.6.suse.2-0.3.i586.rpm). We
> were running with clients and servers on the same machines and things were
> working reasonably well, but recovery timeouts from down machines, and a rarely
> encountered deadlock, inspired me to upgrade to the latest release RPMs
> compiled for SLES9 -- 1.4.6.1.
>
> So far 1.4.6.1 and 1.4.6 have been very unstable for us. Out of our cluster of
> 150 machines, we have been experiencing 5 to 7 system hangs per day. I have
> LKCD tools installed, but as none of the machines have keyboards, I have not
> been able to get any system core dumps. Before hanging, Ganglia often indicates
> that the machine is experiencing high load averages (6 to 20 or so).
>
> It is hard for me to pinpoint what may be causing the hang, because we are
> using a configuration with the Lustre client and Lustre server running on the
> same machines. We have 5 Lustre volumes, each with 30 IBM e326 dual Opteron 270
> servers. 29 of the machines are OSTs and the 30th is the MDS. They all use
> a second internal SATA drive for the OST/MDS. When there is no Lustre traffic
> (but file systems mounted) the systems do not hang. Once our computation
> programs start using the Lustre file system, it can take minutes or hours for
> a system hang to occur. Rebooting gets the file system functional again until
> a different node fails. Very seldom is it the same node failing twice in a row.
>
> Errors in the syslog only seem to be from other clients unable to access the
> OST that has hung, with no errors from the hanging system itself until it is
> rebooted and recovery has started.
>
> Lustre components installed include:
>
> # rpm -qa | grep lustre
> kernel-bigsmp-2.6.5-7.244_lustre.1.4.6
> lustre-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp
> lustre-modules-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp
> lustre-debuginfo-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp
>
> or
>
> # rpm -qa | grep lustre
> lustre-modules-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp
> lustre-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp
> kernel-bigsmp-2.6.5-7.252_lustre.1.4.6.1
> lustre-debuginfo-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp
>
> I am also still using the same config.xml that I used for Lustre 1.2.6, and
> I am mounting the file systems with: lconf --node client /etc/lustre/config.xml

My best advice here would be to review the migration guide for moving to
1.4.6. It has quite a few examples of what should be done. Hopefully this
will help solve your issue.

-Herb
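The dedicated-server test suggested above could look roughly like the
following, reusing the lconf invocation style from Allen's post. This is a
sketch only: the profile names "mds1" and "ost1" and the separation of roles
are assumptions, and must match the node profiles actually defined in your
config.xml.

```shell
# Sketch, assuming node profiles "mds1", "ost1", and "client" exist in
# config.xml -- only "client" appears in the original post.

# On the dedicated MDS node:
lconf --node mds1 /etc/lustre/config.xml

# On each dedicated OST node:
lconf --node ost1 /etc/lustre/config.xml

# On a dedicated client node (same invocation as in the original post):
lconf --node client /etc/lustre/config.xml
```

With the roles split this way, a hang that follows the client nodes points
at the client stack, while a hang that stays with the OST nodes points at
the server side.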
Andreas Dilger
2006-Jun-18 11:18 UTC
[Lustre-discuss] Re: Lustre-discuss Digest, Vol 5, Issue 11
On Jun 12, 2006 11:57 -0700, Herb Wartens wrote:

> allen.todd@sig.com (Allen Todd) wrote:
> > Second plea -- if you have any advice on figuring out how to diagnose the
> > cause of the system hangs w/ Lustre and Linux, I would appreciate that even
> > if you haven't had problems like I am currently experiencing.
>
> To help pinpoint your problem, I would recommend trying to test on a small
> set of systems where some are dedicated as servers (MDSs and OSTs) and some
> as clients. This will probably help you pinpoint which subsystem is causing
> the issue.

Actually, along that vein, if you can split the compute nodes such that
compute nodes in group A only write to a filesystem with OSTs in group B,
then you should be able to avoid the deadlock I mentioned previously. How
easy this is depends on how your cluster is set up.

Cheers, Andreas

--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.