John,
Were there any network error messages in the lustre debug log?
Normally these don't come out on the console any more. You can turn
this on with...
echo neterror > /proc/sys/lnet/printk
/proc/sys/lnet/debug is the same actually. The syntax is either a number
(decimal or hex: -1 == 0xffffffff) or a list of symbols to set the mask.
You can also include '+' (following symbols are added) or '-' (following
symbols are removed).
The full set of symbols is...
trace inode super ext2 malloc cache info ioctl neterror net warning
buffs other dentry page dlmtrace error emerg ha rpctrace vfstrace
reada mmap config console quota sec
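For example (a minimal sketch of the syntax above; it assumes the lnet
module is loaded so /proc/sys/lnet exists on your kernel), the mask could
be set like this...
echo -1 > /proc/sys/lnet/debug                        # everything (0xffffffff)
echo "neterror net warning" > /proc/sys/lnet/debug    # exactly these symbols
echo "+rpctrace" > /proc/sys/lnet/debug               # add rpctrace to the current mask
echo "-trace" > /proc/sys/lnet/debug                  # remove trace from the current mask
Reading the file back (cat /proc/sys/lnet/debug) should show the mask
currently in effect.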
Cheers,
Eric
---------------------------------------------------
|Eric Barton Barton Software |
|9 York Gardens Tel: +44 (117) 330 1575 |
|Clifton Mobile: +44 (7909) 680 356 |
|Bristol BS8 4LL Fax: call first |
|United Kingdom E-Mail: eeb@bartonsoftware.com|
---------------------------------------------------
> -----Original Message-----
> From: lustre-discuss-bounces@clusterfs.com
> [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of
> John R. Dunning
> Sent: 10 January 2007 1:15 PM
> To: lustre-discuss@clusterfs.com
> Subject: [Lustre-discuss] Here's an odd problem
>
> I have a test cluster system, using lustre as its rootfs,
> that I've been using
> for several months. It's generally been pretty trouble-free,
> at least when I
> don't do something dumb to it :-}
>
> Yesterday I was running some test stuff on it, which had
> nothing in particular
> to do with lustre, when for no obvious reason everything on the client
> wedged. I rebooted the client, and it wouldn't come up. I
> debugged further,
> and discovered that it was no longer able to mount the root
> at boot time.
>
> I've dug further into it, and it's not at all clear to me
> what's going on.
> I'm not really able to see what debug info might be available
> to the cluster
> client which is trying to use this thing as its rootfs, but
> when I try to
> mount the fs from another random client, the mount just
> hangs. I looked in
> various logs on both the client and the various servers, and
> there was nothing
> obvious pointing to error conditions. On the client, it
> would pop out a
> message every 5 seconds of the form
>
> Jan 10 07:58:23 localhost LustreError:
> 3674:0:(client.c:950:ptlrpc_expire_one_request()) Skipped 1
> previous similar message
> Jan 10 07:58:23 localhost Lustre:
> 3674:0:(peer.c:238:lnet_debug_peer()) 10.2.2.21@tcp
> 2 up 8 8 8 8 7 0
>
> Which suggests to me that it really is something on server 21
> which is hung
> up. That server appears to be idling happily, and responds to other
> requests. That server has an OST and the MDT on it; when I
> tried to unmount
> the OST, that also hung, and in its log, I saw a bunch of
> messages like
>
> Jan 10 08:03:09 localhost LustreError:
> 8544:0:(ldlm_lib.c:560:target_handle_connect()) @@@ UUID
> 'scx1-OST0000_UUID' is not available for connect (stopping)
>
>
> Usually when I've seen other lustre issues kind of like this, they're
> accompanied by lots of commentary in the logs about stuff that it's
> unhappy about, but this time the appearance is of an otherwise contented
> machine on which a piece of lustre is just "stuck".
>
> Anybody seen anything like this? CFS folks, if I can
> reproduce this, anything
> in particular you'd like me to look for?
>
> TIA...
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss@clusterfs.com
> https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
>