I have an odd problem. I am trying to empty all files from a set of OST as indicated below, by making a list via lfs find and then sending that list to lfs_migrate. However, I have just gotten this message back from the lfs find:

llapi_semantic_traverse: Failed to open '/lustre/umt3/data13/daits/p15.6.3.10/prod/W1J_munu216465_simul': No such file or directory (2)
error: find failed for umt3-OST0021.

The running script contains the following, active code lines:

umdist03OBD="-obd umt3-OST001c \
-obd umt3-OST001d -obd umt3-OST001e \
-obd umt3-OST001f -obd umt3-OST0020 -obd umt3-OST0021 "
# lfs find /lustre/umt3 ${umdist03OBD} > ./check_umdist03.txt

dmesg on the machine where this ran shows nothing. There are still some 60GB of files on this target server, spread more or less evenly over the 6 OST. dmesg on the MGS shows no errors at this point. On the OSS, I see this but not much else:

LustreError: 5226:0:(ldlm_resource.c:861:ldlm_resource_add()) lvbo_init failed for resource 9101: rc -2

Can someone give me an idea of what is wrong here? And what can be done about it, if anything?

Thanks,
bob
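The workflow described in this message (build a list of OSTs, list the files on them, hand the list to lfs_migrate) can be sketched roughly as follows. This is only an illustration: the helper name build_obd_args is made up here, and the "-obd" spelling is copied from the script quoted in the post rather than checked against any particular Lustre release.

```shell
# Hypothetical helper: print one "-obd <fsname>-OST<index>" option per index.
# $1 is the filesystem name; the remaining arguments are OST index suffixes.
build_obd_args() {
    _fs=$1
    shift
    _args=
    for _idx in "$@"; do
        _args="$_args -obd ${_fs}-OST${_idx}"
    done
    # Drop the leading space left by the first append.
    printf '%s\n' "${_args# }"
}
```

Assuming the Lustre client tools are installed, it might be used like this (file and mount names are the ones from the post):

    umdist03OBD=$(build_obd_args umt3 001c 001d 001e 001f 0020 0021)
    lfs find /lustre/umt3 ${umdist03OBD} > ./check_umdist03.txt
    lfs_migrate -y < ./check_umdist03.txt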
On 2010-11-29, at 20:18, Bob Ball wrote:
> I have an odd problem. I am trying to empty all files from a set of OST
> as indicated below, by making a list via lfs find and then sending that
> list to lfs_migrate. However, I have just gotten this message back from
> the lfs find:
>
> llapi_semantic_traverse: Failed to open
> '/lustre/umt3/data13/daits/p15.6.3.10/prod/W1J_munu216465_simul': No
> such file or directory (2)
> error: find failed for umt3-OST0021.

This may mean that the file was deleted while "lfs find" was running.

> On the OSS, I see this but not much else:
> LustreError: 5226:0:(ldlm_resource.c:861:ldlm_resource_add()) lvbo_init
> failed for resource 9101: rc -2
>
> Can someone give me an idea of what is wrong here? And what can be
> done about it, if anything?

This might mean that the file was deleted at the same time the MDS crashed, and the objects were removed but the MDS file was not. It is possible to just delete this file using the "unlink" command - it does not contain any data in any case.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
OK, thanks. Scary, to see errors out of lfs find.

bob

On 11/30/2010 1:47 AM, Andreas Dilger wrote:
> This might mean that the file was deleted at the same time the MDS crashed, and the objects were removed but the MDS file was not. It is possible to just delete this file using the "unlink" command - it does not contain any data in any case.
OK, well, that file was just not anywhere, and it was only 1. But now that the OST is "completely empty", I find that it is not really empty. For example:

[root@umdist03 d0]# pwd
/mnt/ost/O/0/d0
[root@umdist03 d0]# ls -l
total 182976
-rw-rw-rw- 1 daits users 45002956 Jul 5 20:52 1162976
-rw-rw-rw- 1 daits users 44569036 Jul 7 02:53 1200608
-rw-rw-rw- 1 daits users 49108913 Jun 28 04:43 1218976
-rw-rw-rw- 1 daits users 48658429 Jul 16 13:29 1254176
-rwSrwSrw- 1 root root 0 Sep 2 15:11 128
-rwSrwSrw- 1 root root 0 Sep 2 15:11 9152
-rwSrwSrw- 1 root root 0 Sep 2 15:11 9216
-rwSrwSrw- 1 root root 0 Sep 2 15:11 9248

Some time back we had an MDT issue, and upon running e2fsck, saw a LOT of corrupted entries that were just deleted. I suspect that these may have been entries pointing to these files?

"lfs find" comes up empty handed for this OST, indeed, there are 6 OST here, each with about 10GB worth of files of this kind. Are those 60GB just lost? Short of pawing through these, by hand, to see what we can make of the content, is there a snowball's chance in Hades of identifying these files?

Can I simply copy them out of this "ldiskfs" mount of the file system, back into some recovery directory in the real file system, so that users can pick through them?

After they are moved, the file system will be reformatted and returned to use.

bob

On 11/30/2010 8:53 AM, Bob Ball wrote:
> OK, thanks. Scary, to see errors out of lfs find.
>
> bob
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
On 2010-11-30, at 11:17, Bob Ball wrote:
> [root@umdist03 d0]# ls -l
> total 182976
> -rw-rw-rw- 1 daits users 45002956 Jul 5 20:52 1162976
> -rw-rw-rw- 1 daits users 44569036 Jul 7 02:53 1200608
> -rw-rw-rw- 1 daits users 49108913 Jun 28 04:43 1218976
> -rw-rw-rw- 1 daits users 48658429 Jul 16 13:29 1254176
> -rwSrwSrw- 1 root root 0 Sep 2 15:11 128
> -rwSrwSrw- 1 root root 0 Sep 2 15:11 9152
> -rwSrwSrw- 1 root root 0 Sep 2 15:11 9216
> -rwSrwSrw- 1 root root 0 Sep 2 15:11 9248
>
> Some time back we had an MDT issue, and upon running e2fsck, saw a LOT
> of corrupted entries that were just deleted. I suspect that these may
> have been entries pointing to these files?

Likely, yes.

> "lfs find" comes up empty handed for this OST, indeed, there are 6 OST
> here, each with about 10GB worth of files of this kind. Are those 60GB
> just lost? Short of pawing through these, by hand, to see what we can
> make of the content, is there a snowball's chance in Hades of identifying
> these files?

They can be mapped back to an MDS inode number, and the user/group information is intact, but that doesn't help if the MDS inodes were deleted by e2fsck since there will not be any file name available.

> Can I simply copy them out of this "ldiskfs" mount of the file system,
> back into some recovery directory in the real file system, so that users
> can pick through them?

Yes, just rsync the non-zero-length files from the ldiskfs-mounted OST filesystem into a new "lost+found" directory created in the lustre mountpoint on a client. If you "chmod 1775 /path/to/lustre/lost+found" the owners of the file will be able to read/delete their files, but others will not (like /tmp).

> After they are moved, the file system will be reformatted and returned to use.

The whole Lustre filesystem, or the OST? If you are replacing the OST, then you should still do a backup of last_rcvd, CONFIGS/, and O/0/LAST_ID from the OST, and then restore them to the OST after it is reformatted. This process was very recently discussed on this list.

Cheers, Andreas
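The recovery and backup steps suggested in this reply could be sketched along the following lines. This is a rough outline under stated assumptions, not a tested procedure: the function names and all paths are placeholders invented for illustration, rsync's --files-from option is just one way to pick out the non-empty objects, and using cp for the configuration files is a guess (the thread does not say which tool the earlier discussion used). LAST_ID is skipped in the object listing because it is a bookkeeping file rather than user data.

```shell
# Hypothetical helpers; /mnt/ost-style paths are placeholders.

# Print the non-empty object files under <ost_mnt>/O/0, relative to it.
# $1: ldiskfs mount point of the OST.
list_nonempty_objects() {
    ( cd "$1/O/0" && find . -type f -size +0 ! -name LAST_ID -print )
}

# Copy those objects to <dest>, keeping the d* subdirectory layout.
# $1: ldiskfs mount point, $2: rsync destination (local or host:path).
recover_objects() {
    list_nonempty_objects "$1" | rsync -a --files-from=- "$1/O/0/" "$2/"
}

# Save the files named in the reply before the OST is reformatted.
# $1: ldiskfs mount point, $2: local backup directory.
backup_ost_config() {
    mkdir -p "$2/O/0" &&
    cp -p "$1/last_rcvd" "$2/" &&
    cp -Rp "$1/CONFIGS" "$2/" &&
    cp -p "$1/O/0/LAST_ID" "$2/O/0/"
}
```

A possible invocation, with one destination directory per OST since object names such as 1162976 are only meaningful within the OST they came from:

    mkdir /lustre/umt3/lost+found && chmod 1775 /lustre/umt3/lost+found
    recover_objects /mnt/ost /lustre/umt3/lost+found/umt3-OST0021
    backup_ost_config /mnt/ost /root/umt3-OST0021-config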
Thanks. Can you tell me how to do the mapping back to the MDS inode? For example, is 1162976 in the list below the MDS inode? May as well look.

I am following the directions in the recent threads to redo the OSTs (not the whole Lustre file system). I will restore the indicated files.

bob

On 11/30/2010 4:17 PM, Andreas Dilger wrote:
> On 2010-11-30, at 11:17, Bob Ball wrote:
>> [root@umdist03 d0]# ls -l
>> total 182976
>> -rw-rw-rw- 1 daits users 45002956 Jul 5 20:52 1162976
>> -rw-rw-rw- 1 daits users 44569036 Jul 7 02:53 1200608
>> -rw-rw-rw- 1 daits users 49108913 Jun 28 04:43 1218976
>> -rw-rw-rw- 1 daits users 48658429 Jul 16 13:29 1254176
>> -rwSrwSrw- 1 root root 0 Sep 2 15:11 128
>> -rwSrwSrw- 1 root root 0 Sep 2 15:11 9152
>> -rwSrwSrw- 1 root root 0 Sep 2 15:11 9216
>> -rwSrwSrw- 1 root root 0 Sep 2 15:11 9248
>
> They can be mapped back to an MDS inode number, and the user/group information is intact, but that doesn't help if the MDS inodes were deleted by e2fsck since there will not be any file name available.
On 2010-11-30, at 15:46, Bob Ball wrote:
> Thanks. Can you tell me how to do the mapping back to the MDS inode? For example, is 1162976 in the list below the MDS inode? May as well look.

You can use the "ll_decode_filter_fid" tool on the OST object files (e.g. the "1162976" file below) and it will print out the MDS inode number and generation.

> On 11/30/2010 4:17 PM, Andreas Dilger wrote:
>> On 2010-11-30, at 11:17, Bob Ball wrote:
>>> [root@umdist03 d0]# ls -l
>>> total 182976
>>> -rw-rw-rw- 1 daits users 45002956 Jul 5 20:52 1162976

Cheers, Andreas
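As a sketch of how the suggestion above might be applied to every leftover object rather than one file at a time: the function name and the DECODE variable below are hypothetical, the latter added only so the loop can be dry-run with another command. The sketch assumes ll_decode_filter_fid accepts object file paths as arguments, as the reply implies, and makes no claim about its output format.

```shell
# Hypothetical helper: run the decoder on each non-empty object file.
# $1: ldiskfs mount point of the OST (for example /mnt/ost).
# DECODE may be set to another command (such as echo) for a dry run.
decode_nonempty_objects() {
    find "$1/O/0" -type f -size +0 ! -name LAST_ID \
        -exec "${DECODE:-ll_decode_filter_fid}" {} +
}
```

For example, "DECODE=echo decode_nonempty_objects /mnt/ost" only prints the paths that would be examined, while "decode_nonempty_objects /mnt/ost > objects.txt" keeps the decoder's output for later comparison against the MDS.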