thr3ads.net - Lustre discuss - [Lustre-discuss] Client hangs when reading from Lustre ... [Feb 2008]

Klaus Steden

2008-Feb-04 21:54 UTC

[Lustre-discuss] Client hangs when reading from Lustre ...

Hello,

I'm trying to figure out something odd ... a node in my cluster hangs when I
run 'df', or 'find -exec file {}' or other commands like that.

No other clients in the cluster exhibit the same behaviour. I'm seeing a lot
of messages like this in its syslog:

-- cut --
Feb  4 13:51:37 tiger-0-6 kernel: LustreError:
5827:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err
== -30 req@000001010d30e400 x9218/t0
o8->ost1_UUID@tiger-oss-0-0.local_UUID:6 lens 240/272 ref 1 fl Rpc:R/0/40000
rc 0/-30
-- cut --

I haven't had problems with this node before, but I'm hoping someone out
there can maybe make a suggestion as to where to look to figure out what's
going on.

The cluster is running kernel version 2.6.9-42.0.2.EL_lustre.1.4.7.1 with
ROCKS 4.1.

thanks,
Klaus

Andreas Dilger

2008-Feb-04 23:22 UTC

On Feb 04, 2008  13:54 -0800, Klaus Steden wrote:
> I'm trying to figure out something odd ... a node in my cluster hangs when I
> run 'df', or 'find -exec file {}' or other commands like that.
>
> No other clients in the cluster exhibit the same behaviour. I'm seeing a lot
> of messages like this in its syslog:
>
> -- cut --
> Feb  4 13:51:37 tiger-0-6 kernel: LustreError:
> 5827:0:(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err
> == -30 req@000001010d30e400 x9218/t0
> o8->ost1_UUID@tiger-oss-0-0.local_UUID:6 lens 240/272 ref 1 fl Rpc:R/0/40000
> rc 0/-30
> -- cut --

/usr/include/asm/errno.h says -30 = -EROFS.  That means your OST filesystem
has likely been remounted read-only because of a detected filesystem error.
Check your /var/log/messages for something like "LDISKFS-fs error ...:
Remounting filesystem read-only".  This will be accompanied by the reason
the filesystem is read-only.
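
The errno lookup described here can also be done without opening the header. The following is only a minimal sketch, assuming it runs on a Linux host (where errno 30 is EROFS); the function name decode_err is made up for illustration and is not part of Lustre.

```python
import errno
import os
import re


def decode_err(log_line):
    """Extract the 'err == -N' value from a LustreError line and name it.

    Returns (code, symbolic_name, description), or None if the line
    carries no error value.
    """
    match = re.search(r"err\s*==\s*-?(\d+)", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numbers to names for the platform Python runs on.
    name = errno.errorcode.get(code, "UNKNOWN")
    return code, name, os.strerror(code)


# Abbreviated from the syslog excerpt quoted in this thread.
line = ("(client.c:576:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, "
        "err == -30 req ...")
print(decode_err(line))
```

The mapping comes from the host the script runs on, so it should be run on a machine with the same errno table as the client that logged the message.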

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Klaus Steden

2008-Feb-04 23:47 UTC

Thanks Andreas ... That would make sense, although the only error message
(or, message vaguely resembling an error message) that I could find was this
one:

-- cut --
/var/log/messages.1:Feb  1 09:28:09 tiger-oss-0-0 kernel: LDISKFS-fs error
(device sdb): ldiskfs_journal_start_sb: Detected aborted journal
-- cut --

I'm assuming that's causing the problem -- but what's the next step? Punt
all the clients, stop Lustre, and run e2fsck on the affected device?

Klaus

Andreas Dilger

2008-Feb-05 01:07 UTC

On Feb 04, 2008  15:47 -0800, Klaus Steden wrote:
> Thanks Andreas ... That would make sense, although the only error message
> (or, message vaguely resembling an error message) that I could find was this
> one:
>
> -- cut --
> /var/log/messages.1:Feb  1 09:28:09 tiger-oss-0-0 kernel: LDISKFS-fs error
> (device sdb): ldiskfs_journal_start_sb: Detected aborted journal
> -- cut --
>
> I'm assuming that's causing the problem -- but what's the next step? Punt
> all the clients, stop Lustre, and run e2fsck on the affected device?

Yes.  An aborted journal means an error at the journal layer...  Maybe with
a "JBD" error message?
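
Searching the OSS logs for the messages mentioned in this thread can be scripted. This is a rough sketch, not a complete list of what the kernel may print: the patterns are taken only from the strings quoted in the thread, and the helper names journal_related and scan_file are made up for illustration.

```python
import re

# Strings mentioned in the thread; other kernels may word these differently.
PATTERNS = [
    re.compile(r"LDISKFS-fs error"),
    re.compile(r"\bJBD\b"),
    re.compile(r"Remounting filesystem read-only"),
    re.compile(r"aborted journal"),
]


def journal_related(lines):
    """Return the lines that match any of the journal/filesystem patterns."""
    return [ln.rstrip("\n") for ln in lines
            if any(p.search(ln) for p in PATTERNS)]


def scan_file(path):
    """Scan one log file, e.g. /var/log/messages or a rotated copy."""
    with open(path, errors="replace") as fh:
        return journal_related(fh)


# The first line is the one quoted in the thread; the second is a placeholder.
sample = [
    "Feb  1 09:28:09 tiger-oss-0-0 kernel: LDISKFS-fs error (device sdb): "
    "ldiskfs_journal_start_sb: Detected aborted journal\n",
    "an unrelated line\n",
]
for hit in journal_related(sample):
    print(hit)
```

Rotated logs such as /var/log/messages.1 need to be scanned as well, since the message found in this thread was in a rotated file.
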
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Klaus Steden

2008-Feb-05 01:37 UTC

On 2/4/08 5:07 PM, "Andreas Dilger" <adilger at sun.com> did etch on stone
tablets:

> On Feb 04, 2008  15:47 -0800, Klaus Steden wrote:
>> Thanks Andreas ... That would make sense, although the only error message
>> (or, message vaguely resembling an error message) that I could find was this
>> one:
>>
>> -- cut --
>> /var/log/messages.1:Feb  1 09:28:09 tiger-oss-0-0 kernel: LDISKFS-fs error
>> (device sdb): ldiskfs_journal_start_sb: Detected aborted journal
>> -- cut --
>>
>> I'm assuming that's causing the problem -- but what's the next step? Punt
>> all the clients, stop Lustre, and run e2fsck on the affected device?
>
> Yes.  An aborted journal means an error at the journal layer...  Maybe with
> a "JBD" error message?

I didn't see anything like that, but I did see a bundle of journal commit
errors, a number of errors from the SCSI layer, and a message about the LUN
being remounted read-only.

Two questions ... 

1. Assuming all the bad blocks can be re-mapped at the device layer, what is
the potential for data loss from running e2fsck?

2. Is it possible to get notification from a cluster component when
something like this happens, via SNMP, Ganglia, or some other monitoring
system?
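
For the second question, one possible starting point is a periodic check on each OSS for backing filesystems that have gone read-only. This is only a sketch under stated assumptions: that the OST backing filesystem appears in /proc/mounts with type ldiskfs, and that an error-triggered remount shows up there as the "ro" option. The function names and the sample mount points are hypothetical, and forwarding the result to SNMP, Ganglia, or another monitoring system is not shown.

```python
def read_only_mounts(mounts_text, fstype="ldiskfs"):
    """Return (device, mount_point) pairs of `fstype` mounts mounted 'ro'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, kind, options = fields[:4]
        # Split the option list so 'errors=remount-ro' is not mistaken for 'ro'.
        if kind == fstype and "ro" in options.split(","):
            found.append((device, mount_point))
    return found


def check_local():
    """Run the check against this host's /proc/mounts."""
    with open("/proc/mounts") as fh:
        return read_only_mounts(fh.read())


# Hypothetical /proc/mounts content; the mount points are made up.
example = (
    "/dev/sda1 / ext3 rw 0 0\n"
    "/dev/sdb /mnt/ost1 ldiskfs ro 0 0\n"
    "/dev/sdc /mnt/ost2 ldiskfs rw,errors=remount-ro 0 0\n"
)
print(read_only_mounts(example))
```

A non-empty result from check_local() could be turned into an alert by whatever mechanism the site already uses; the fstype argument can be changed if the backing filesystem is reported under a different type.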

cheers,
Klaus
