When bringing up the cluster after a full powerdown, the MDS/MGS node was reporting the following for each of the OSTs:

Jul 8 17:16:18 ID6317 kernel: LustreError: 13b-9: Test01-OST0000 claims to have registered, but this MGS does not know about it, preventing registration.
Jul 8 17:16:18 ID6317 kernel: LustreError: 26184:0:(mgs_handler.c:660:mgs_handle()) MGS handle cmd=253 rc=-2

I have two OSSes, and checking back through my mkfs commands it looks like I forgot to enable failover in the options. I found that I could update that flag using tunefs.lustre, and reading into it a bit further it seemed I should run it with the --writeconf flag as well.

So I unmounted the OSTs and ran:

    tunefs.lustre --param failover.mode=failout /dev/iscsi/ost-1.target0

on each of them. After doing this (and possibly remounting the MDS/MGS), I was able to mount the OSTs, and then mounted the client, but all data was missing. The filesystem reports 11% full, which is about right for the data that was on there, but there are no files.

After reading the docs more carefully I found that I should have done things properly (fully shut down and unloaded the filesystem, then run the writeconf beginning with the MGS). So I ran through the procedure again more carefully, but the filesystem is in the same state: it appears to be fine, just shows used space and no files.

I was unable to reproduce this in another test cluster (no data loss there). So I'm wondering if these files are recoverable at all? Can anyone point me in the right direction, if there is one?

Dave
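For reference, the config-log regeneration procedure described in the Lustre Operations Manual looks roughly like the sketch below. The device names and mount points here are placeholders, not the poster's actual paths; the key points are that every target gets --writeconf while the whole filesystem is stopped, and that the MGS/MDT is rewritten and remounted before the OSTs.

    # 1. Stop the filesystem completely: unmount clients, then the MDT, then every OST
    umount /mnt/lustre            # on each client
    umount /mnt/mdt               # on the MDS/MGS node
    umount /mnt/ost0              # on each OSS, for every OST

    # 2. Regenerate the configuration logs, MGS/MDT first, then the OSTs
    tunefs.lustre --writeconf /dev/mdtdev                                              # on the MDS/MGS
    tunefs.lustre --param failover.mode=failout --writeconf /dev/iscsi/ost-1.target0   # on each OSS

    # 3. Remount in order: MGS/MDT first, then the OSTs, then the clients
    mount -t lustre /dev/mdtdev /mnt/mdt
    mount -t lustre /dev/iscsi/ost-1.target0 /mnt/ost0
    mount -t lustre mgsnode@tcp0:/Test01 /mnt/lustre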
Unmount the MDS and mount it as type ldiskfs, then list the ROOT directory. If there are no files there, then it seems that somehow you have deleted or reformatted the MDS filesystem.

You could also check lost+found at that point, in case your files were moved there by e2fsck for some reason.

Check 'dumpe2fs -h' on the MDS device to see what the format time is.

If there are no more files on the MDS, then the best you can do is to run lfsck, link all the orphan objects into the Lustre lost+found directory, and look at the file contents to identify them. If you have a backup, it would be easier to just restore from that. Sorry.

Cheers, Andreas
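Spelled out, the checks suggested above might look roughly like the following; the device path and mount point are placeholders for the actual MDT device on the MDS node.

    # With the MDT unmounted from Lustre, mount it read-only as plain ldiskfs
    mount -t ldiskfs -o ro /dev/mdtdev /mnt/mdt-ldiskfs
    ls -l /mnt/mdt-ldiskfs/ROOT          # the Lustre namespace lives under ROOT/
    ls -l /mnt/mdt-ldiskfs/lost+found    # anything e2fsck may have moved aside
    umount /mnt/mdt-ldiskfs

    # Check when the backing filesystem was created (a recent date suggests a reformat)
    dumpe2fs -h /dev/mdtdev | grep -i created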
If the device really has been reformatted and the data is very important, there is also the option of following the recovery procedure we went through last year, after an MDT was accidentally reformatted. It took several months, partly because I was busy with lots of parallel tasks, but in the end it was mostly successful. Not perfect, but at least a large part of the directory structure could be recovered (also thanks to helpful discussions with Andreas). I probably should write a document describing what needs to be done and publish the tools. Even with that it will be time consuming, although using the existing tools it should take much less time than it did last time...

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
Thanks all. The data was not mission critical, so I decided to give up on it. I believe the source of the problem was a replication issue with the underlying systems that control the MDT. I'll be focusing on backup policy and procedures from here on out, as the data does become mission critical. Nevertheless, it would be nice to see those recovery procedures and tools published.

tyvm,
Dave