Dear all,

After an annual e2fsck of all OSTs, two of our OSTs became read-only with this error:

Jul 25 08:37:34 com04 kernel: LDISKFS-fs error (device sdb1): ldiskfs_dx_find_entry: bad entry in directory #222863370: inode out of bounds - offset=3280896, inode=656179638, rec_len=4096, name_len=0
Jul 25 08:37:34 com04 kernel: Aborting journal on device sdb1-8.
Jul 25 08:37:34 com04 kernel: LDISKFS-fs (sdb1): Remounting filesystem read-only

tune2fs showed the OSTs in state "clean with error". After unmounting and running e2fsck again, the two OSTs could be mounted normally (and the state changed to "clean").

However, we then began to see hundreds of "lvbo_init failed" errors on several OSTs, not only on the two that had gone read-only. Three of our OSTs have logged hundreds of these errors since the annual e2fsck:

Aug 1 17:48:26 com04 kernel: LustreError: 5493:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 1 previous similar message
Aug 1 17:59:02 com04 kernel: LustreError: 5632:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-publicfs-OST001d_UUID: lvbo_init failed for resource 2997406: rc -2
Aug 1 17:59:02 com04 kernel: LustreError: 5632:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 1 previous similar message
Aug 1 18:10:51 com04 kernel: LustreError: 5602:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-publicfs-OST001d_UUID: lvbo_init failed for resource 3240254: rc -2
Aug 1 18:10:51 com04 kernel: LustreError: 5602:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 2 previous similar messages
Aug 1 18:21:49 com04 kernel: LustreError: 5642:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-publicfs-OST001f_UUID: lvbo_init failed for resource 3204200: rc -2
Aug 1 18:21:49 com04 kernel: LustreError: 5642:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 6 previous similar messages
Aug 1 18:53:18 com04 kernel: LustreError: 5324:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-publicfs-OST001f_UUID: lvbo_init failed for resource 12856264: rc -2
Aug 1 18:53:18 com04 kernel: LustreError: 5324:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 1 previous similar message

According to previous discussions, it seems that the related objects have been deleted or moved to lost+found. I am not sure about the following:

1. Can the command "ll_recover_lost_found_objs" get back all the lost objects?
2. If not, how can I get a list of damaged files?
3. As users continue to write new data to the OSTs, will the number of damaged objects increase?

Do you have any suggestions? Thank you very much!

Lu Wang
Computing Center
IHEP, China
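As a starting point for question 2, the object IDs that the OSS is complaining about can be collected from the syslog itself. The following is only a minimal sketch: the regular expression is an assumption derived from the sample lines quoted above, and the function name `collect_failed_objects` is hypothetical. Because the kernel suppresses repeats ("Skipped N previous similar messages"), the result is a lower bound on the affected objects, not a complete list, and mapping object IDs back to file names needs a further step that is not shown here.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample log lines in this thread (an assumption,
# not an official format description).
LVBO_RE = re.compile(
    r"filter-(?P<ost>\S+)_UUID: lvbo_init failed for resource "
    r"(?P<objid>\d+): rc (?P<rc>-?\d+)"
)

ENOENT = 2  # rc -2 is -ENOENT: the object was not found on the OST


def collect_failed_objects(lines):
    """Group object IDs that failed lvbo_init with rc -2 by OST name."""
    failed = defaultdict(set)
    for line in lines:
        match = LVBO_RE.search(line)
        if match and int(match.group("rc")) == -ENOENT:
            failed[match.group("ost")].add(int(match.group("objid")))
    # Sort IDs so the output is stable and easy to diff between runs.
    return {ost: sorted(ids) for ost, ids in failed.items()}


# Sample lines copied from the message above.
SAMPLE = [
    "Aug 1 17:59:02 com04 kernel: LustreError: 5632:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-publicfs-OST001d_UUID: lvbo_init failed for resource 2997406: rc -2",
    "Aug 1 17:59:02 com04 kernel: LustreError: 5632:0:(ldlm_resource.c:862:ldlm_resource_add()) Skipped 1 previous similar message",
    "Aug 1 18:10:51 com04 kernel: LustreError: 5602:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-publicfs-OST001d_UUID: lvbo_init failed for resource 3240254: rc -2",
    "Aug 1 18:21:49 com04 kernel: LustreError: 5642:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-publicfs-OST001f_UUID: lvbo_init failed for resource 3204200: rc -2",
    "Aug 1 18:53:18 com04 kernel: LustreError: 5324:0:(ldlm_resource.c:862:ldlm_resource_add()) filter-publicfs-OST001f_UUID: lvbo_init failed for resource 12856264: rc -2",
]

if __name__ == "__main__":
    for ost, ids in collect_failed_objects(SAMPLE).items():
        print(ost, ids)
```

In practice the function could be fed the lines of /var/log/messages from each OSS instead of `SAMPLE`.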
On 2011-08-01 21:13, WANG Lu wrote:
> According to previous discussions, it seems that the related objects have been deleted or moved to lost+found. I am not sure:
> 1. if the command "ll_recover_lost_found_objs" can get back all the lost objects

ll_recover_lost_found_objs will recover valid objects in the OST's lost+found directory.

> 2. if not, how can I get a list of damaged files?
> 3. as users continue to write new data to the OSTs, will the number of damaged objects increase?
>
> Do you have any suggestions? Thank you very much!
>
> Lu Wang
> Computing Center
> IHEP, China
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

--
Bobi Jam <bobijam at whamcloud.com>
Some updated information:

1. After running "ll_recover_lost_found_objs", we still get the "lvbo_init failed" errors.

2. There are no files under "lost+found" on some of the affected OSTs. Here is the output of debugfs:

# debugfs -c /dev/sdb1
debugfs 1.41.10.sun2 (24-Feb-2010)
/dev/sdb1: catastrophic mode - not reading inode or group bitmaps
debugfs: ls
 2 (12) .    2 (12) ..    11 (20) lost+found    103784449 (16) CONFIGS
 12 (20) last_rcvd    13 (20) health_check    222863361 (3996) O
debugfs: cd lost+found
debugfs: ls
 11 (12) .    2 (4084) ..    0 (4096)    0 (4096)    0 (4096)

3. We are currently running Lustre 1.8.5.

Thank you in advance for your help!

Lu Wang
CC-IHEP

> -----Original Message-----
> From: "WANG Lu" <wanglu at ihep.ac.cn>
> Sent: 2011-08-01
> To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
> Cc:
> Subject: [Lustre-discuss] lvbo_init failed after e2fsck
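
Since lost+found is empty, one way to check whether a reported object still exists is to stat it directly under the O directory shown in the debugfs listing. The sketch below only builds the debugfs command; it does not run it. It assumes the usual Lustre 1.8 ldiskfs layout, in which an object in group 0 is stored at O/0/d(objid % 32)/objid; the helper names `object_path` and `debugfs_stat_command` are hypothetical, and the layout should be confirmed against your own OST before relying on it.

```python
def object_path(objid, group=0, subdirs=32):
    """Return the expected on-disk path of an OST object.

    Assumes the Lustre 1.8 layout O/<group>/d<objid % 32>/<objid>;
    both defaults are assumptions, not values read from the OST.
    """
    return f"O/{group}/d{objid % subdirs}/{objid}"


def debugfs_stat_command(device, objid):
    """Build (but do not execute) a read-only debugfs stat command."""
    # -c opens the device in catastrophic (read-only) mode,
    # -R runs a single debugfs request and exits.
    return ["debugfs", "-c", "-R", f"stat {object_path(objid)}", device]


if __name__ == "__main__":
    # Resource IDs taken from the lvbo_init messages in this thread.
    for objid in (2997406, 3240254, 3204200, 12856264):
        print(" ".join(debugfs_stat_command("/dev/sdb1", objid)))
```

If debugfs reports that the path does not exist, that is consistent with the rc -2 (ENOENT) in the log: the MDS still references an object that is no longer on the OST.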