Help Needed: We're having trouble with our MDS server. Nothing suspicious in the logs - at some point they simply stop being written.

The scenario is as follows: we have an MDS running on DRBD, 2 OSSes and about 10 clients. The traffic pattern is lots of small file reads and writes. We provision Joomla! sites. A Joomla! site has about 17,000 small files, and we write 1 new Joomla! site every 30 seconds. This happens all day long and does not stop.

During operation, the load on the MDS is around 2 (it's an 8-core machine with RAID 10 on a 3ware card, pretty heavily equipped, and it should handle much more). iostat shows a constant 5 MB/s of reads and 100 kB/s-7 MB/s of writes, with about 5000 r/w ops per second.

Then, all of a sudden, the MDS stops responding, ssh sessions die and only a hard restart helps. After the restart, /var/log/messages contains only normal information (some timeout chit-chat).

While this happens randomly, there is an almost sure way to trigger it: issue sysctl -w lnet.debug=0 on all clients and servers. After that the file system becomes super responsive, the load on the MDS stays low, our gig-e link is well utilized (unlike when LNet logging is enabled) - and after a few minutes the MDS dies as described above.

I know that this is too little information to ask for help, but maybe you could at least tell me where to look?

Gary
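P.S. For what it's worth, we push the debug setting out with nothing fancier than a loop roughly like this (the hostnames are placeholders for our real nodes):

    # Disable LNet debug logging on every Lustre node (hostnames are examples).
    for host in mds oss1 oss2 client01 client02; do
        ssh "$host" sysctl -w lnet.debug=0
    done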
On 2010-05-12, at 15:42, Gary Brooks wrote:
> We're having trouble with our MDS server. Nothing suspicious in the logs - at some point they simply stop being written.
>
> The scenario is as follows: we have an MDS running on DRBD, 2 OSSes and about 10 clients. The traffic pattern is lots of small file reads and writes. We provision Joomla! sites. A Joomla! site has about 17,000 small files, and we write 1 new Joomla! site every 30 seconds. This happens all day long and does not stop.

At 17000/30 ≈ 570 files created per second, you should be OK as far as load goes. I assume you have enough inodes on both the MDT and OST filesystems.

> During operation, the load on the MDS is around 2 (it's an 8-core machine with RAID 10 on a 3ware card, pretty heavily equipped, and it should handle much more). iostat shows a constant 5 MB/s of reads and 100 kB/s-7 MB/s of writes, with about 5000 r/w ops per second.
>
> Then, all of a sudden, the MDS stops responding, ssh sessions die and only a hard restart helps. After the restart, /var/log/messages contains only normal information (some timeout chit-chat).
>
> While this happens randomly, there is an almost sure way to trigger it: issue sysctl -w lnet.debug=0 on all clients and servers. After that the file system becomes super responsive, the load on the MDS stays low, our gig-e link is well utilized (unlike when LNet logging is enabled) - and after a few minutes the MDS dies as described above.
>
> I know that this is too little information to ask for help, but maybe you could at least tell me where to look?

You need to connect up a serial console and/or something like netdump to get the actual error messages from the console when it hangs. If there are no error messages on the console, use "sysrq-p" or "sysrq-t" to see whether it is stuck in some thread.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
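PS: something along these lines is what I have in mind; the mount point, interface, IP and MAC below are only examples, adjust for your setup:

    # Check free inodes on the MDT and OSTs from any client (mount point is an example).
    lfs df -i /mnt/lustre

    # Make sure magic sysrq is enabled so sysrq-t / sysrq-p work from the console.
    sysctl -w kernel.sysrq=1

    # Optionally send console messages to another machine over UDP with netconsole
    # (local interface, plus target port/IP/MAC, are placeholders).
    modprobe netconsole netconsole=@/eth0,6666@192.168.1.10/00:11:22:33:44:55

    # Dump all task states to the console (and the netconsole target) while the
    # node still responds; on a hung node use Alt-SysRq-T on the keyboard instead.
    echo t > /proc/sysrq-trigger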
Gary Brooks wrote:
> Then, all of a sudden, the MDS stops responding, ssh sessions die and only a hard restart helps. After the restart, /var/log/messages contains only normal information (some timeout chit-chat).

Is your hardware using the bnx2 NIC driver? We've just been seeing very similar issues on Lustre clients on brand new Dell PowerEdge R610s. The workaround is to turn off MSI-X, but there has recently been a fix merged into the mainline kernel which has also been backported by Red Hat.

> While this happens randomly, there is an almost sure way to trigger it: issue sysctl -w lnet.debug=0 on all clients and servers. After that the file system becomes super responsive, the load on the MDS stays low, our gig-e link is well utilized (unlike when LNet logging is enabled) - and after a few minutes the MDS dies as described above.

We have not been able to trigger it in any predictable fashion either.

GREG

> I know that this is too little information to ask for help, but maybe you could at least tell me where to look?
>
> Gary

--
Greg Matthews 01235 778658
Senior Computer Systems Administrator
Diamond Light Source, Oxfordshire, UK
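PS: a rough sketch of the kind of change involved, in case it is useful - the disable_msi option name is from memory, so double-check it against "modinfo bnx2" for your driver version:

    # Confirm the NIC is actually using the bnx2 driver.
    ethtool -i eth0

    # Disable MSI for bnx2 at module load time (RHEL/CentOS 5 style config).
    echo "options bnx2 disable_msi=1" >> /etc/modprobe.conf

    # Reload the driver (or reboot) for the option to take effect.
    modprobe -r bnx2 && modprobe bnx2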
Greg,

We are using CentOS and 3ware RAID adapters. We had the MDS on RAID 0 (by accident) and swapped it out for a RAID 10 machine with a 3ware card. We updated the firmware on the 3ware cards and brought the stripe size down from 256k to 64k. After changing out the equipment the problem stopped.

I heard that some Western Digital drives might cause strange system locks. Our network cards are Intel PRO/1000s.

I hope this helps you.

On Tue, May 18, 2010 at 4:10 AM, Gregory Matthews <greg.matthews at diamond.ac.uk> wrote:
> Is your hardware using the bnx2 NIC driver? We've just been seeing very similar issues on Lustre clients on brand new Dell PowerEdge R610s. The workaround is to turn off MSI-X, but there has recently been a fix merged into the mainline kernel which has also been backported by Red Hat.
>
> We have not been able to trigger it in any predictable fashion either.
>
> GREG
>
> --
> Greg Matthews 01235 778658
> Senior Computer Systems Administrator
> Diamond Light Source, Oxfordshire, UK
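P.S. For anyone else wanting to check their 3ware setup, the firmware and stripe size can be read back with something like the following; the controller and unit numbers are whatever tw_cli reports on your box:

    # List the controllers tw_cli can see.
    tw_cli show

    # Show the units, drives and firmware level on controller 0.
    tw_cli /c0 show all

    # Show the details (including stripe size) of unit 0.
    tw_cli /c0/u0 show all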
We have had another hang, but this time we had KVM access to the machine (and the screen blanker wasn't on). I took some screenshots: the first one is an error I got after reboot, the BMP one is what I saw when I first logged in to the KVM, and the other ones are what I saw when trying to type 'root' - it started printing traces.

http://amber.leeware.com/wi/lustre-death/

After the reboot there was a command timeout message from the RAID card. While it was hung, it reported "too little hardware resources".

--
Andrew
http://CloudAccess.net/
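P.S. Following Andreas' earlier suggestion about capturing the console, here is a rough outline of a CentOS 5-era crash-capture setup using kdump (the memory reservation and serial settings are only examples), in case it helps anyone else hitting this:

    # Reserve memory for the crash kernel by appending this to the kernel line
    # in /boot/grub/grub.conf (adjust the size for your RAM):
    #   crashkernel=128M@16M

    # Enable and start kdump so a vmcore gets written on the next crash.
    chkconfig kdump on
    service kdump start

    # Alternatively, mirror the console to a serial port by also appending:
    #   console=tty0 console=ttyS0,115200
    # to the kernel line and capturing the output from another machine.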