John,
Were there any network error messages in the lustre debug log?
Normally these don't come out on the console any more. You can turn
this on with...
echo neterror > /proc/sys/lnet/printk
/proc/sys/lnet/debug is the same actually. The syntax is either a number
(decimal or hex: -1 == 0xffffffff) or a list of symbols to set the mask.
You can also include '+' (following symbols are added) or '-' (following
symbols are removed).
The full set of symbols is...
trace inode super ext2 malloc cache info ioctl neterror net warning
buffs other dentry page dlmtrace error emerg ha rpctrace vfstrace
reada mmap config console quota sec
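For example (a minimal sketch of the syntax above; it assumes the lnet
module is loaded so /proc/sys/lnet exists on your kernel), the mask could
be set like this...
echo -1 > /proc/sys/lnet/debug                        # everything (0xffffffff)
echo "neterror net warning" > /proc/sys/lnet/debug    # exactly these symbols
echo "+rpctrace" > /proc/sys/lnet/debug               # add rpctrace to the current mask
echo "-trace" > /proc/sys/lnet/debug                  # remove trace from the current mask
Reading the file back (cat /proc/sys/lnet/debug) should show the mask
currently in effect.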
Cheers,
Eric
---------------------------------------------------
|Eric Barton Barton Software |
|9 York Gardens Tel: +44 (117) 330 1575 |
|Clifton Mobile: +44 (7909) 680 356 |
|Bristol BS8 4LL Fax: call first |
|United Kingdom E-Mail: eeb@bartonsoftware.com|
---------------------------------------------------
> -----Original Message-----
> From: lustre-discuss-bounces@clusterfs.com
> [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of
> John R. Dunning
> Sent: 10 January 2007 1:15 PM
> To: lustre-discuss@clusterfs.com
> Subject: [Lustre-discuss] Here's an odd problem
>
> I have a test cluster system, using lustre as its rootfs,
> that I've been using
> for several months. It's generally been pretty trouble-free,
> at least when I
> don't do something dumb to it :-}
>
> Yesterday I was running some test stuff on it, which had
> nothing in particular
> to do with lustre, when for no obvious reason everything on the client
> wedged. I rebooted the client, and it wouldn't come up. I
> debugged further,
> and discovered that it was no longer able to mount the root
> at boot time.
>
> I've dug further into it, and it's not at all clear to me
> what's going on.
> I'm not really able to see what debug info might be available
> to the cluster
> client which is trying to use this thing as its rootfs, but
> when I try to
> mount the fs from another random client, the mount just
> hangs. I looked in
> various logs on both the client and the various servers, and
> there was nothing
> obvious pointing to error conditions. On the client, it
> would pop out a
> message every 5 seconds of the form
>
> Jan 10 07:58:23 localhost LustreError:
> 3674:0:(client.c:950:ptlrpc_expire_one_request()) Skipped 1
> previous similar message
> Jan 10 07:58:23 localhost Lustre:
> 3674:0:(peer.c:238:lnet_debug_peer()) 10.2.2.21@tcp
> 2 up 8 8 8 8 7 0
>
> Which suggests to me that it really is something on server 21
> which is hung
> up. That server appears to be idling happily, and responds to other
> requests. That server has an OST and the MDT on it; when I
> tried to unmount
> the OST, that also hung, and in its log, I saw a bunch of
> messages like
>
> Jan 10 08:03:09 localhost LustreError:
> 8544:0:(ldlm_lib.c:560:target_handle_connect()) @@@ UUID
> 'scx1-OST0000_UUID' is not available for connect (stopping)
>
>
> Usually when I've seen other lustre issues kind of like this, they're
> accompanied by lots of commentary in the logs about stuff that it's
> unhappy about, but this time the appearance is of an otherwise contented
> machine on which a piece of lustre is just "stuck".
>
> Anybody seen anything like this? CFS folks, if I can
> reproduce this, anything
> in particular you'd like me to look for?
>
> TIA...
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>