thr3ads.net - Lustre discuss - [Lustre-discuss] Error on restarted Lustre disk [Jan 2010]

Ms. Megan Larko

2010-Jan-06 21:25 UTC

[Lustre-discuss] Error on restarted Lustre disk

Happy New Year to One and All!

My OSS (CentOS 5, Lustre 1.6.4.3smp) had an OS failure on 28 Dec
2009. The failure was a software error that had nothing to do with
Lustre or the physical hard disks. I fixed the error (my typo) in the
OS and restarted when I returned from my Christmas break. The mounted
disk was accessible again. The users, including myself, are not
having any perceptible issues with the Lustre disk. However, I am
still seeing messages like these in /var/log/messages on the client,
24 hours after access was restored:

Jan  6 16:09:05 cn1 kernel: Lustre: 6736:0:(import.c:837:ptlrpc_connect_interpret()) MGS@MGC192.168.64.210@o2ib_0 changed server handle from 0x9815e39016e41295 to 0x9815e39016f0b846
Jan  6 16:09:05 cn1 kernel: but is still in recovery
Jan  6 16:09:05 cn1 kernel: Lustre: MGC192.168.64.210@o2ib: Reactivating import
Jan  6 16:09:05 cn1 kernel: Lustre: Skipped 2 previous similar messages
Jan  6 16:09:05 cn1 kernel: Lustre: MGC192.168.64.210@o2ib: Connection restored to service MGS using nid 192.168.64.210@o2ib.
Jan  6 16:09:05 cn1 kernel: Lustre: Skipped 2 previous similar messages
Jan  6 16:14:05 cn1 kernel: Lustre: 6736:0:(import.c:837:ptlrpc_connect_interpret()) MGS@MGC192.168.64.210@o2ib_0 changed server handle from 0x9815e39016f0b846 to 0x9815e39016fd8717
Jan  6 16:14:05 cn1 kernel: but is still in recovery
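
To check how often these reconnects recur and whether each new handle
becomes the next "old" handle, the messages can be tallied with a short
script. This is only a minimal sketch, assuming the syslog layout shown
in the excerpt above; the function name handle_changes and the default
path are illustrative, not part of any Lustre tool.

```python
import re
import sys

# Syslog timestamp at the start of a line, e.g. "Jan  6 16:09:05".
TS = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)", re.MULTILINE)
# The handle-change message; \s+ tolerates lines wrapped by a mail client.
HANDLE = re.compile(
    r"changed server handle from\s+(0x[0-9a-f]+)\s+to\s+(0x[0-9a-f]+)"
)


def handle_changes(log_text):
    """Return (timestamp, old_handle, new_handle) for each handle change."""
    changes = []
    for m in HANDLE.finditer(log_text):
        # The message belongs to the last timestamp seen before it.
        stamps = TS.findall(log_text, 0, m.start())
        stamp = " ".join(stamps[-1].split()) if stamps else None
        changes.append((stamp, m.group(1), m.group(2)))
    return changes


if __name__ == "__main__":
    # Assumed default location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as fh:
        for stamp, old, new in handle_changes(fh.read()):
            print(stamp, old, "->", new)
```

Run against the client's log, this prints one line per handle change,
which makes the interval between reconnects easy to read off.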

There are absolutely no kernel or Lustre messages of any kind on the
OSS (just the "MARK"s). The disk is not full. The OSTs that make up
the disk are within 1 percent of one another in usage. This is a
76 TB usable volume.

[larkoc@crew ~]$ df -h /crewdat
Filesystem            Size  Used Avail Use% Mounted on
ic-mds1@o2ib:/crew8    76T   51T   21T  72% /crewdat

Should the recovery take this long? Will the messages about
"changed server handle from 0x9815e39016e41295 to 0x9815e39016f0b846"
go away once the disk is out of recovery? Why does the client think
the server handle changed? I did have to reboot (several times) to
fix my OS error. There were no hardware changes at all during this
OSS repair. The hardware was not even moved.

Thanks for any insight you may be able to provide.

Sincerely,
megan
