thr3ads.net - Lustre discuss - [Lustre-discuss] OSS crash [Dec 2008]

Denise Hummel

2008-Dec-04 18:34 UTC

[Lustre-discuss] OSS crash

Hi Andreas;

I thought there might be filesystem corruption, however when I run
e2fsck there are no issues reported.
The system is now crashing about every hour.  I did get the messages on
the console and am including them in this message.  I am not great at
deciphering the messages, however it looks like a storage problem.  Let
me know what you think.  I have quite a few scientists impatiently
waiting to get back on the system.  Thanks!

<ffffffff8032105b>{scheduled_timeout+375}<ffffffffa048f31d>{:ost:ost_brw_write+9885}
Will spare you the hex and give the messages - let me know if you need
it.
{:ost:ost_brw_read+8528} {default_wake_function+0}
{:ptlrpc:lustre_msg_check_version+69}
{:ost:ost_bulk_timeout+0} {:ost:ost_handle+12187}
{lnet:lnet_match_block_msg+920}
{ptlrpc:ptlrpc_server_handle_request+2830}
{libcfs:lcw_update_time+30}  {__mod_timer+293}
{ptlrpc:ptlrpc_main+2456} {default_wake_function+0}
{ptlrpc:ptlrpc_retry_rqbds+0}  {ptlrpc:ptlrpc_retry_rqbds+0}
{child_rip+8}  {ptlrpc:ptlrpc_main+0}
{child_rip+0}
Code: 0f 0b 04 6b 3d a0 ff ff ff ff 36 05 48 8b 43 20 66 44 29 58
RIP ldisk:ldiskfs_mb_use_best_found+256 RSP
<0> Kernel panic - not syncing Oops


On Dec 03, 2008  19:30 -0700, Hummel, Denise wrote:
> We have a lustre filesystem that has been pretty stable since June 2008 on
> a 200 node cluster until three weeks ago.  The OSS kernel panic has
> escalated since then to now about every 2 hours.
> The MDT/MGS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> The OSS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> One OST raid 6 ~9TB (I know it is larger than currently tested) - at 58%

Running with OSTs > 8TB exposes you to filesystem corruption.

Cheers, Andreas
--
Andreas Dilger

Andreas Dilger

2008-Dec-04 19:30 UTC

[Lustre-discuss] OSS crash

On Dec 04, 2008  11:34 -0700, Denise Hummel wrote:
> I thought there might be filesystem corruption, however when I run
> e2fsck there are no issues reported.
> The system is now crashing about every hour.  I did get the messages on
> the console and am including them in this message.  I am not great at
> deciphering the messages, however it looks like a storage problem.  Let
> me know what you think.  I have quite a few scientists impatiently
> waiting to get back on the system.  Thanks!
> 
> {child_rip+0}
> Code: 0f 0b 04 6b 3d a0 ff ff ff ff 36 05 48 8b 43 20 66 44 29 58
> RIP ldisk:ldiskfs_mb_use_best_found+256 RSP

A quick search for ldiskfs_mb_use_best_found in bugzilla reveals bug 16101.

This is an indication that the kernel is trying to use space beyond 8TB and
hitting a bug.  We fixed that bug in 1.6.6, but we are aware of at least one
other problem, found when we tested OSTs with 14TB filesystems
(bug 17530).
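
The lookup described above starts from the symbol on the RIP line of the
oops.  As a rough sketch only (the regular expressions and the helper names
faulting_symbol and stack_symbols are illustrative assumptions, not part of
Lustre or of any kernel tool), the symbol to search for could be pulled out
of hand-copied console output like this:

```python
import re

# Frames look like "{:ost:ost_handle+12187}" or "{child_rip+8}": an optional
# module name, then the symbol, then a decimal offset.  These patterns are an
# assumption based on the console output quoted in this thread.
FRAME_RE = re.compile(r"\{:?(?:(?P<module>\w+):)?(?P<symbol>\w+)\+(?P<offset>\d+)\}")
RIP_RE = re.compile(r"RIP\s+:?(?:(?P<module>\w+):)?(?P<symbol>\w+)\+(?P<offset>\d+)")


def faulting_symbol(console_text):
    """Return (module, symbol, offset) from the RIP line, or None if absent."""
    m = RIP_RE.search(console_text)
    if m is None:
        return None
    return m.group("module"), m.group("symbol"), int(m.group("offset"))


def stack_symbols(console_text):
    """Return the symbol names of all well-formed {...} frames, in order."""
    return [m.group("symbol") for m in FRAME_RE.finditer(console_text)]


# A few lines copied from the console output earlier in the thread.
SAMPLE = """\
{:ost:ost_bulk_timeout+0} {:ost:ost_handle+12187}
{lnet:lnet_match_block_msg+920}
{child_rip+0}
RIP ldisk:ldiskfs_mb_use_best_found+256 RSP
"""

module, symbol, offset = faulting_symbol(SAMPLE)
print(symbol)                 # the name to search for in the bug tracker
print(stack_symbols(SAMPLE))  # the call chain leading up to the fault
```

The RIP symbol is usually the most specific search term; the frames above it
mainly show which request path (here the OST write/read handlers) triggered it.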

One option is to download the latest e2fsprogs (1.41.3 I think) from
sourceforge and use this to shrink the OST filesystem to 8TB to avoid
this problem until such a time that Lustre supports OSTs > 8TB.
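
To make the shrink target concrete, here is a small arithmetic sketch.  It
assumes a 4 KiB filesystem block size, reads "8TB" as 8 TiB, and takes the
"~9TB at 58%" figure from the original report; the device path is a
placeholder.  The command strings are only printed, never executed, and the
real block size should be checked on the OST before using any number from
here.

```python
TIB = 1024 ** 4


def target_blocks(target_bytes, block_size=4096):
    """Number of whole filesystem blocks that fit in target_bytes."""
    return target_bytes // block_size


def data_fits(fs_bytes, used_fraction, target_bytes):
    """True if the data currently stored would fit in the smaller filesystem."""
    return fs_bytes * used_fraction < target_bytes


target = 8 * TIB                 # assumption: "8TB" means 8 TiB
blocks = target_blocks(target)   # assumption: 4 KiB blocks
print(blocks)                    # 2147483648, i.e. 2**31 blocks

# "~9TB at 58%" from the original report, treated as approximate.
print(data_fits(9 * TIB, 0.58, target))

# Hypothetical device path.  resize2fs interprets a size given without a unit
# suffix as a count of filesystem blocks, and expects a forced e2fsck run
# before shrinking; the OST must be unmounted for both steps.
device = "/dev/OST_DEVICE"
print(f"e2fsck -f {device}")
print(f"resize2fs {device} {blocks}")
```

With these assumptions the target works out to exactly 2**31 blocks, and the
existing data fits with room to spare.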

> On Dec 03, 2008  19:30 -0700, Hummel, Denise wrote:
> > We have a lustre filesystem that has been pretty stable since June 2008 on
> > a 200 node cluster until three weeks ago.  The OSS kernel panic has
> > escalated since then to now about every 2 hours.
> > The MDT/MGS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> > The OSS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> > One OST raid 6 ~9TB (I know it is larger than currently tested) - at 58%
> 
> Running with OSTs > 8TB exposes you to filesystem corruption.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Denise Hummel

2008-Dec-08 20:32 UTC

[Lustre-discuss] OSS crash

Hi Andreas;

Thanks for the advice to shrink the filesystem.  It took about 21 hours
to complete (from 9TB to 8TB with about 5.5TB of data).  The system has
been up since Friday afternoon and seems to be happy.

Denise
