We lost an OST several months ago and could not recover it. We decided to deactivate it until we bring some new storage online and can just rebuild the entire file system. However, the MDT still knows about all the files that were on the lost OST, and this results in things like "invalid argument" and "?--------? ? .." in directory listings. The files cannot be removed by standard commands. We end up doing something like this:

    mv Dir Tmp
    cp -r Tmp Dir        (this produces lots of 'cp: cannot stat ...' for the missing files)
    mv Tmp /lost+found   (this moves all the missing file names more or less out of the way)

Is there some way to remove these files from the MDT - as though they never existed - without reformatting the entire file system?

Thanks,

Charlie Taylor
UF HPC Center
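The move/copy/park workaround described in this message could be wrapped up roughly as follows. This is only a sketch: the function name `quarantine_dir`, its arguments, and the temporary naming scheme are made up for illustration and are not from the thread.

```shell
#!/bin/sh
# Sketch of the "mv, cp -r, mv to a holding area" workaround.
# quarantine_dir, its arguments and the ".quarantine.$$" suffix are hypothetical.
quarantine_dir() {
    dir=$1        # directory that contains entries which cannot be stat'ed
    holding=$2    # holding area for the old tree (e.g. a lost+found directory)
    tmp="${dir}.quarantine.$$"

    # Step 1: move the damaged directory aside.
    mv "$dir" "$tmp" || return 1

    # Step 2: copy back whatever is still readable. cp is expected to print
    # "cannot stat" for the missing files and exit non-zero, so that is not fatal.
    cp -rp "$tmp" "$dir" || echo "cp reported errors (expected for missing files)" >&2

    # Step 3: park the old tree, including the unreadable names, out of the way.
    mkdir -p "$holding" || return 1
    mv "$tmp" "$holding"/
}
```

Something like `quarantine_dir /scratch/some/Dir /scratch/lost+found` would then perform the three steps in one go. Note that this only hides the stale names; it does not remove them from the MDT.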
Miguel Afonso Oliveira
2010-Apr-18 13:35 UTC
[Lustre-discuss] Lost Files - How to remove from MDT
Hi,

You are going to have to use "unlink" with something like this:

    for file in lost_files
        unlink $file

Cheers,

Miguel Afonso Oliveira

P.S.: To build a list of all your lost files you can do an rsync with the dry-run flag.
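A slightly more defensive version of that loop might look like the sketch below. It assumes `lost_files` is a plain text file with one absolute path per line; the function name `unlink_from_list` is invented for this example.

```shell
#!/bin/sh
# Sketch: unlink every path named in a list file, one path per line.
# unlink_from_list is a hypothetical helper, not an existing tool.
# Paths are assumed to be absolute, so none of them starts with "-".
unlink_from_list() {
    list=$1
    failed=0
    # Read line by line so that paths containing spaces survive intact.
    while IFS= read -r path; do
        [ -n "$path" ] || continue          # skip blank lines
        if ! unlink "$path"; then
            echo "failed: $path" >&2
            failed=$((failed + 1))
        fi
    done < "$list"
    # Print the number of paths that could not be unlinked.
    echo "$failed"
}
```

Called as `unlink_from_list lost_files`, it prints how many entries failed, which makes it easy to tell whether the unlink had any effect at all.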
Brian J. Murrell
2010-Apr-18 13:38 UTC
[Lustre-discuss] Lost Files - How to remove from MDT
On Sun, 2010-04-18 at 09:30 -0400, Charles Taylor wrote:
> Is there some way to remove these files from the MDT - as though they never existed - without reformatting the entire file system?

lfsck is the documented, supported method.

b.
On Apr 18, 2010, at 9:38 AM, Brian J. Murrell wrote:
> lfsck is the documented, supported method.

Yes, but we attempted that at one time with a smaller file system (for a different reason). After letting it run for over a day, we estimated that it would have taken seven to ten days to finish. That just wasn't practical for us at the time and still isn't. This file system would probably take a couple of weeks to lfsck. I'm sorry to say we can't take the file system offline for that long.

We may just have to leave it as is until we put some new storage in place and can migrate the good data off it. I just thought I'd ask.

Thanks for the reply though,

Charlie Taylor
UF HPC Center
On Apr 18, 2010, at 9:35 AM, Miguel Afonso Oliveira wrote:
> You are going to have to use "unlink" with something like this:
>
>     for file in lost_files
>         unlink $file

Nope. That's really no different than "rm" and produces the same result...

    unlink /scratch/crn/bwang/NCS/1O5P/1o5p_wat.prmtop
    unlink: cannot unlink `/scratch/crn/bwang/NCS/1O5P/1o5p_wat.prmtop': Invalid argument

Thanks for the suggestion though,

Charlie Taylor
UF HPC Center
Miguel Afonso Oliveira
2010-Apr-18 14:47 UTC
[Lustre-discuss] Lost Files - How to remove from MDT
Hi again,

Sorry, I forgot to mention that this only works if the "offending" OST still exists. If you can no longer re-include the OST where these files existed, you can still create a new one with the same index, and then you can unlink.

MAO
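For reference, formatting a replacement OST with a fixed index would involve a `mkfs.lustre` call along the lines of the one printed by the sketch below. The helper only prints the command and never runs it. The file system name, MGS NID and device are placeholders rather than values from this thread; only the index 17 corresponds to the "idx 17" seen in the console log quoted elsewhere in the thread. Whether further steps are needed (for example regenerating configuration logs) depends on the Lustre version and is not covered in this discussion.

```shell
#!/bin/sh
# Sketch: print (do NOT execute) a mkfs.lustre command for a replacement OST
# that reuses the index of the lost one.
# print_replacement_ost_cmd and all argument values below are hypothetical.
print_replacement_ost_cmd() {
    fsname=$1     # Lustre file system name (placeholder)
    mgsnode=$2    # NID of the MGS (placeholder)
    index=$3      # index of the lost OST
    device=$4     # block device for the new OST (placeholder)
    printf 'mkfs.lustre --ost --fsname=%s --mgsnode=%s --index=%s %s\n' \
        "$fsname" "$mgsnode" "$index" "$device"
}

# Example with made-up values; review the output before running anything.
print_replacement_ost_cmd scratch mgs@tcp0 17 /dev/sdX
```

Printing instead of executing is deliberate: formatting the wrong device is unrecoverable, so the command should be reviewed by hand first.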
On Apr 18, 2010, at 10:47 AM, Miguel Afonso Oliveira wrote:
> Sorry I forgot to mention this only works if the "offending" OST still exists. If at this time you can no longer re-include the OST where these files existed then you can still create a new one with the same index and then you can unlink.

Ok, thanks. We may go ahead and try that.

Charlie
On Sunday 18 April 2010, Charles Taylor wrote:
> This file system would probably take a couple of weeks to lfsck. I'm sorry to say we can't take the file system offline for that long.

You don't need to take the filesystem offline for lfsck. Also, I have rewritten large parts of lfsck and fixed the parallelization code. I need to review all the patches again and probably also make an hg or git repository out of them. Unfortunately, I always have more tasks than I manage to do... But given that I fixed several bugs and added safety checks, I think my version is actually better than upstream.

Let me know if you are interested and I can put a tarball of e2fsprogs-sun-ddn on my home page.

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
On 2010-04-18, at 07:16, Charles Taylor wrote:
> Nope. That's really no different than "rm" and produces the same result...
>
> unlink /scratch/crn/bwang/NCS/1O5P/1o5p_wat.prmtop
> unlink: cannot unlink `/scratch/crn/bwang/NCS/1O5P/1o5p_wat.prmtop': Invalid argument

It surprises me that "unlink" doesn't work, since that is the answer I was going to give also. Did you also verify that, after this message is printed, the file isn't actually unlinked? I suspect that the file name was unlinked but an error was returned from destroying the OST object, which is fine since the OST is dead and gone anyway.

What error messages are posted on the console log (dmesg/syslog)?

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.
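One way to run the check asked for here is to ignore the exit status of unlink and look at the parent directory listing afterwards. Testing the file itself with stat would not help, because stat is exactly what fails for these entries. The sketch below does that; both function names are made up, and it assumes file names without embedded newlines.

```shell
#!/bin/sh
# Sketch: attempt an unlink, then check whether the NAME is still returned by
# a directory listing, regardless of what unlink reported.
# name_still_listed and try_unlink_and_verify are hypothetical helpers.
name_still_listed() {
    path=$1
    dir=$(dirname "$path")
    base=$(basename "$path")
    # A plain listing only needs readdir, so it shows names even when
    # stat on the entry fails. -F -x matches the whole name literally.
    ls -1a "$dir" 2>/dev/null | grep -F -x -q -e "$base"
}

try_unlink_and_verify() {
    path=$1
    unlink "$path" 2>/dev/null
    rc=$?
    if name_still_listed "$path"; then
        echo "still-present rc=$rc"
    else
        echo "removed rc=$rc"
    fi
}
```

A result of "removed" together with a non-zero rc would match the case suspected in this message (name gone, object destroy failed), while "still-present" means the metadata entry survived.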
Miguel Afonso Oliveira
2010-Apr-18 17:26 UTC
[Lustre-discuss] Lost Files - How to remove from MDT
Hi all,

I had a similar problem with one of my filesystems and, from my experience, until the actual OST (or another one with the same original index) is present, unlink will not work. It will give the error message and it will keep on showing the reference to the file. At the time it somehow made sense to me: an erroneous unlink on a missing file would destroy the file metadata. But now I'm not quite sure whether this ought to be treated as a feature or as a bug. We can always create an OST with the index of the missing one and unlink. That seems to me a safer option, but then it would require better documentation.

All the best,

MAO
While I'm thinking about it, that brings up an interesting question. All the OSTs for this file system were originally formatted under 1.6.3. We have since upgraded to 1.8.x. If we reformat the missing OST with the same index under 1.8.2 and add it back into the file system (sans its data), should we expect trouble? We were reluctant to do so since we doubt that this is a tested scenario, but perhaps we are being overly paranoid.

Should it be OK to mix OSTs formatted under different versions (1.6 vs 1.8) of Lustre? It seems like it should be OK, but you can't test everything and this seems like a bit of an outlier.

Regards,

Charlie Taylor
UF HPC Center
On Apr 18, 2010, at 11:46 AM, Bernd Schubert wrote:
> You don't need to take the filesystem offline for lfsck.

You sure about that? Looking at http://wiki.lustre.org/manual/LustreManual18_HTML/LustreRecovery.html#50598012_37365 step 1 says "Stop the Lustre File System".

> Also, I have rewritten large parts of lfsck and also fixed the parallelization code. I need to review all patches again and probably also make a hg or git repository out of it. Unfortunately, I always have more tasks to do than I manage to do... But given the fact that I fixed several bugs and added safety checks, I think my version actually is better than upstream.
>
> Let me know if you are interested and I can put a tar ball of e2fsprogs-sun-ddn on my home page.

Sure, we can try it, but it seems to me that by the time you generate the OST data and run lfsck against the MDT, much could change on a file system being used by 600+ active clients.

Regards,

Charlie
On Apr 18, 2010, at 1:14 PM, Andreas Dilger wrote:
> This surprises me that "unlink" doesn't work, since that is the answer I was going to give also. Did you also verify that after this message is posted, that the file isn't actually unlinked? I suspect that the file name was unlinked, but an error is returned from destroying the OST object, but that is fine since the OST is dead and gone anyway.

Nope. They are still there following the Invalid argument error. It seems that before we deactivated the OST we could remove the files (although we got an error message), but once the OST was deactivated, we get the error message and the file (err, its metadata) remains.

> What error messages are posted on the console log (dmesg/syslog)?

Lots of the following, but there is a find running as well so I don't think it is necessarily from the rm command.

    Lustre: 4286:0:(lov_pack.c:67:lov_dump_lmm_v1()) stripe_size 1048576, stripe_count 1
    Lustre: 4286:0:(lov_pack.c:76:lov_dump_lmm_v1()) stripe 0 idx 17 subobj 0x0/0x3dbe6b
    Lustre: 4286:0:(lov_pack.c:64:lov_dump_lmm_v1()) objid 0x38f59c8, magic 0x0bd10bd0, pattern 0x1
    Lustre: 4286:0:(lov_pack.c:67:lov_dump_lmm_v1()) stripe_size 1048576, stripe_count 1
    Lustre: 4286:0:(lov_pack.c:76:lov_dump_lmm_v1()) stripe 0 idx 17 subobj 0x0/0x3dc0fa

Charlie
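The "idx 17" in those console messages is the index of the OST holding the stripe, which is the index a replacement OST would have to reuse. A small filter like the one sketched below can pull the distinct indices out of a saved log. It assumes the index is printed in decimal, as it appears above, and that OST targets are labelled with a four-digit hexadecimal index (OST0011 for index 17); the function name is invented for this example.

```shell
#!/bin/sh
# Sketch: read console log lines on stdin and print each distinct OST index
# found in lov_dump_lmm_v1 "stripe N idx M" lines, plus the matching OSTxxxx label.
# ost_indices_from_log is a hypothetical helper.
ost_indices_from_log() {
    # Keep only the "stripe N idx M subobj ..." lines and extract M.
    # The "stripe_size ..." lines do not match because of the underscore.
    sed -n 's/.*lov_dump_lmm_v1.* stripe [0-9][0-9]* idx \([0-9][0-9]*\) subobj.*/\1/p' |
        sort -n -u |
        while read -r idx; do
            # Decimal index, then the four-digit hex form used in target labels.
            printf '%s OST%04x\n' "$idx" "$idx"
        done
}
```

Feeding it the log excerpt above, for example via `dmesg | ost_indices_from_log`, should report a single line for index 17.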
Lundgren, Andrew
2010-Apr-19 17:54 UTC
[Lustre-discuss] Lost Files - How to remove from MDT
I was also going to recommend the unlink. We have had to do this as well, and the unlink worked for us. It did need to be run with privileges for the file (root in our case).

--
Andrew