I've just finished recovering from a weird problem on our test system, and wondered if anybody else has seen this kind of thing.

Lustre 1.6b5
Kernel: vanilla (more or less) 2.6.15
Opteron SMP clients and servers

The clients boot diskless, using Lustre as their rootfs. Over the last few weeks the servers have been bounced numerous times while I played around with hardware, evaluating interface cards and the like. Some of the time I didn't restart the clients when rebooting the servers, either because I forgot or because something crashed unexpectedly. Up until today, everything seemed to be behaving itself pretty well. The clients didn't work, of course, while their rootfs was down, but when the servers came back up the clients seemed to reconnect and carry on, so I didn't worry overly much about it.

Today I tripped over evidence that the rootfs was corrupted, and that the corruption wasn't limited to in-memory structures; it was on the disks. I rebuilt the fs (i.e. reformatted all the OSTs and so on; rough commands at the bottom of this mail), installed a new rootfs, and then discovered that it was *still* corrupted. Looking at the server logs, there were a number of errors from one OSS where one of the old clients had been trying to reconnect to it and replay transactions.

So: is it an incredibly bad idea to allow an old, stale client to try to reconnect to a freshly-reconstituted server? I had the impression that Lustre had sufficient protocol in place to keep that kind of skewage from causing problems, but if that's not the case, it would certainly account for the lossage. If it is supposed to be safe, I guess this means I've probably found a bug, and should try to characterize it further.
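(For reference, the rebuild amounted to roughly the following on the servers; fsname, NID, devices and mount points below are placeholders rather than the exact ones on our system.)

    # on the MGS/MDS node: wipe and recreate the MDT (with co-located MGS)
    mkfs.lustre --reformat --fsname=testfs --mgs --mdt /dev/sda
    mount -t lustre /dev/sda /mnt/mdt

    # on each OSS, for each OST device: reformat and remount
    mkfs.lustre --reformat --fsname=testfs --ost --mgsnode=10.0.0.1@tcp0 /dev/sdb
    mount -t lustre /dev/sdb /mnt/ost0

The stale clients were never rebooted or remounted between the reformat and the corruption showing up again.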