Klaus Steden
2008-Feb-04 21:54 UTC
[Lustre-discuss] Client hangs when reading from Lustre ...
Hello,

I'm trying to figure out something odd ... a node in my cluster hangs when I run 'df', or 'find -exec file {}' or other commands like that.

No other clients in the cluster exhibit the same behaviour. I'm seeing a lot of messages like this in its syslog:

-- cut --
Feb 4 13:51:37 tiger-0-6 kernel: LustreError: 5827:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -30 req@000001010d30e400 x9218/t0 o8->ost1_UUID@tiger-oss-0-0.local_UUID:6 lens 240/272 ref 1 fl Rpc:R/0/40000 rc 0/-30
-- cut --

I haven't had problems with this node before, but I'm hoping someone out there can maybe make a suggestion as to where to look to figure out what's going on.

The cluster is running kernel version 2.6.9-42.0.2.EL_lustre.1.4.7.1 with ROCKS 4.1.

thanks,
Klaus
Andreas Dilger
2008-Feb-04 23:22 UTC
[Lustre-discuss] Client hangs when reading from Lustre ...
On Feb 04, 2008 13:54 -0800, Klaus Steden wrote:
> I'm trying to figure out something odd ... a node in my cluster hangs when I
> run 'df', or 'find -exec file {}' or other commands like that.
>
> No other clients in the cluster exhibit the same behaviour. I'm seeing a lot
> of messages like this in its syslog:
>
> -- cut --
> Feb 4 13:51:37 tiger-0-6 kernel: LustreError:
> 5827:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err
> == -30 req@000001010d30e400 x9218/t0
> o8->ost1_UUID@tiger-oss-0-0.local_UUID:6 lens 240/272 ref 1 fl Rpc:R/0/40000
> rc 0/-30
> -- cut --

/usr/include/asm/errno.h says -30 = -EROFS. That means your OST filesystem has likely been remounted read-only because of a detected filesystem error. Check your /var/log/messages for something like "LDISKFS-fs error ...: Remounting filesystem read-only". This will be accompanied by the reason the filesystem is read-only.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
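The errno lookup described above can also be done programmatically rather than by reading errno.h. The following is a minimal sketch, assuming a Linux host with Python available; the helper name `decode_lustre_err` and the regular expression are illustrative only and are not part of Lustre.

```python
import errno
import os
import re

# Matches the "err == -30" field of a ptlrpc_check_status() log line.
# The pattern is an assumption based on the single log line quoted above.
ERR_FIELD = re.compile(r"err\s*==\s*(-?\d+)")


def decode_lustre_err(line):
    """Return (number, symbolic name, description) for the err field, or None."""
    match = ERR_FIELD.search(line)
    if match is None:
        return None
    # The kernel reports errors as negative errno values.
    number = abs(int(match.group(1)))
    name = errno.errorcode.get(number, "UNKNOWN")
    return number, name, os.strerror(number)


sample = ("LustreError: 5827:0:(client.c:576:ptlrpc_check_status()) @@@ "
          "type == PTL_RPC_MSG_ERR, err == -30")
print(decode_lustre_err(sample))
```

On Linux, errno 30 maps to EROFS, which matches the reading of errno.h given above.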
Klaus Steden
2008-Feb-04 23:47 UTC
[Lustre-discuss] Client hangs when reading from Lustre ...
Thanks Andreas ...

That would make sense, although the only error message (or, message vaguely resembling an error message) that I could find was this one:

-- cut --
/var/log/messages.1:Feb 1 09:28:09 tiger-oss-0-0 kernel: LDISKFS-fs error (device sdb): ldiskfs_journal_start_sb: Detected aborted journal
-- cut --

I'm assuming that's causing the problem -- but what's the next step? Punt all the clients, stop Lustre, and run e2fsck on the affected device?

Klaus
Andreas Dilger
2008-Feb-05 01:07 UTC
[Lustre-discuss] Client hangs when reading from Lustre ...
On Feb 04, 2008 15:47 -0800, Klaus Steden wrote:
> Thanks Andreas ... That would make sense, although the only error message
> (or, message vaguely resembling an error message) that I could find was this
> one:
>
> -- cut --
> /var/log/messages.1:Feb 1 09:28:09 tiger-oss-0-0 kernel: LDISKFS-fs error
> (device sdb): ldiskfs_journal_start_sb: Detected aborted journal
> -- cut --
>
> I'm assuming that's causing the problem -- but what's the next step? Punt
> all the clients, stop Lustre, and run e2fsck on the affected device?

Yes. An aborted journal means an error at the journal layer... Maybe with a "JBD" error message?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
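The log search suggested in this thread (LDISKFS-fs errors, the read-only remount notice, an aborted journal, JBD messages) could be scripted on the OSS. This is a rough sketch only: the function names are hypothetical, and the patterns are taken from the strings quoted in this thread, so real logs may word things differently.

```python
import re

# Substrings quoted in this thread; treat the list as a starting point,
# not as a complete catalogue of ldiskfs/JBD messages.
PATTERNS = [
    re.compile(r"LDISKFS-fs error"),
    re.compile(r"Remounting filesystem read-only"),
    re.compile(r"Detected aborted journal"),
    re.compile(r"\bJBD\b"),
]


def find_fs_errors(lines):
    """Return the lines that match any of the patterns above."""
    return [line.rstrip("\n") for line in lines
            if any(p.search(line) for p in PATTERNS)]


def scan_file(path):
    """Scan one syslog file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_fs_errors(handle)


sample_log = [
    "Feb 1 09:28:09 tiger-oss-0-0 kernel: LDISKFS-fs error (device sdb): "
    "ldiskfs_journal_start_sb: Detected aborted journal\n",
]
for hit in find_fs_errors(sample_log):
    print(hit)
```

Calling `scan_file("/var/log/messages")` and the rotated files such as `/var/log/messages.1` would cover the locations mentioned above, assuming the account running it can read them.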
Klaus Steden
2008-Feb-05 01:37 UTC
[Lustre-discuss] Client hangs when reading from Lustre ...
On 2/4/08 5:07 PM, "Andreas Dilger" <adilger at sun.com> did etch on stone tablets:
> Yes. An aborted journal means an error at the journal layer... Maybe with
> a "JBD" error message?

I didn't see anything like that, but I did see a bundle of journal commit errors, a number of errors from the SCSI layer, and a message about the LUN being remounted read-only.

Two questions ...

1. Assuming all the bad blocks can be re-mapped at the device layer, what is the potential for data loss from running e2fsck?
2. Is it possible to get notification from a cluster component when something like this happens, via SNMP, Ganglia, or some other monitoring system?

cheers,
Klaus