Michael Sternberg
2008-Jul-02 00:52 UTC
[Lustre-discuss] OSS: bad header in inode - invalid magic
Hi,

I repeatedly encounter "invalid magic" in one particular inode of one of my OSS volumes (1 of 4, each 5 TB), with the consequence of lustre remounting R/O. I run 2.6.18-53.1.13.el5_lustre.1.6.4.3smp on RHEL5.1 on a cluster with approx. 150 client nodes.

The error appears on the OSS as:

Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul 1 15:43:58 oss01 kernel: Remounting filesystem read-only
Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 45 credits: rc = -30
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) Skipped 6 previous similar messages
Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
Jul 1 15:43:58 oss01 kernel: LustreError: 19569:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
[... many repeats]

Three login nodes signaled, about 10 to 15 minutes apart, the same wall(8) message:

Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake()) ASSERTION(pc != NULL) failed
Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG

Twice in the past, I followed this recovery procedure from the Manual and the Wiki:

http://wiki.lustre.org/index.php?title=Fsck_Support#Using_e2fsck_on_a_backing_filesystem%7Cusing
    Using e2fsck on a backing filesystem -- nice walk-through

http://manual.lustre.org/manual/LustreManual16_HTML/Failover.html#50446391_pgfId-1287654
    8.4.1 Starting/Stopping a Resource
    [i.e., simply umounting the device on the OSS - is this correct?]

http://manual.lustre.org/manual/LustreManual16_HTML/LustreInstallation.html#50446385_43530
    4.2.1.5 Stopping a Server

In other words:

    umount the OSS
    perform fsck on the block device
    remount the OSS

So, last time I did:

[root@oss01 ~]# umount /mnt/ost2
[root@oss01 ~]# e2fsck -fp /dev/dm-3
lustre-OST0002: recovering journal
lustre-OST0002: ext3 recovery flag is clear, but journal has data.
lustre-OST0002: Run journal anyway
lustre-OST0002: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)
[root@oss01 ~]# mount -t ldiskfs /dev/dm-3 /mnt/ost2
[root@oss01 ~]# umount /mnt/ost2
[root@oss01 ~]# e2fsck -fp /dev/dm-3
lustre-OST0002: 342355/427253760 files (4.2% non-contiguous), 139324997/1708984375 blocks

To my surprise, there were no errors. I did the same today after the error above, but left out the "-p" flag; still, fsck did not find an error (except the journal replay??):

[root@oss01 ~]# e2fsck -f /dev/dm-3
e2fsck 1.40.4.cfs1 (31-Dec-2007)
lustre-OST0002: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
lustre-OST0002: ***** FILE SYSTEM WAS MODIFIED *****
lustre-OST0002: 343702/427253760 files (4.4% non-contiguous), 137003893/1708984375 blocks
[root@oss01 ~]#

I haven't mounted back yet for fear this would stall the system again in a couple of days.
How can I locate the "bad" inode - should I try? Is this an inode of the lustre FS or the underlying ext3 on the OST?

Are there version dependencies of e2fsck with lustre? I am running lustre-1.6.4.3 and e2fsck-1.40.4.

I would appreciate any pointers. Thank you for your attention and help.

Michael
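On the question of locating the inode: the message is logged by LDISKFS-fs against device dm-3, so the number appears to refer to an inode of the backing ldiskfs filesystem on the OST rather than to a Lustre-level file. The following is only a sketch of one way to inspect it, not a procedure taken from this thread. The helper names `bad_inodes` and `debugfs_cmds` are hypothetical, and the mapping from the kernel name `dm-3` to `/dev/dm-3` is assumed from the commands above. `debugfs -c` opens the device read-only, and `stat <inode>` and `ncheck` are standard debugfs requests; `ncheck` may take a long time on a 5 TB volume.

```shell
# Print unique "device inode" pairs from LDISKFS-fs "bad header" messages.
# Reads the named log files, or stdin when none are given.
bad_inodes() {
    sed -n 's/.*LDISKFS-fs error (device \([^)]*\)).*bad header in inode #\([0-9][0-9]*\).*/\1 \2/p' "$@" | sort -u
}

# Print (do not run) read-only debugfs commands for one device/inode pair.
# Assumption: the kernel device name (e.g. dm-3) maps to /dev/<name>.
debugfs_cmds() {
    printf 'debugfs -c -R "stat <%s>" /dev/%s\n' "$2" "$1"
    printf 'debugfs -c -R "ncheck %s" /dev/%s\n' "$2" "$1"
}

# Example (assumed log location):
#   bad_inodes /var/log/messages | while read -r dev ino; do
#       debugfs_cmds "$dev" "$ino"
#   done
```

The second helper only prints the commands so they can be reviewed before anything is run against the unmounted OST device.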
Brian J. Murrell
2008-Jul-02 13:35 UTC
[Lustre-discuss] OSS: bad header in inode - invalid magic
On Tue, 2008-07-01 at 19:52 -0500, Michael Sternberg wrote:
> Hi,

Hi,

> Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)

I would suggest making sure you have the *latest* e2fsprogs from Sun and running an e2fsck on that volume. You may have to use flags to force a thorough check. Check the manpage.

> Jul 1 15:43:58 oss01 kernel: Remounting filesystem read-only
> Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
> Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) can't get handle for 45 credits: rc = -30
> Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(fsfilt-ldiskfs.c:417:fsfilt_ldiskfs_brw_start()) Skipped 6 previous similar messages
> Jul 1 15:43:58 oss01 kernel: LustreError: 25462:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> Jul 1 15:43:58 oss01 kernel: LustreError: 19569:0:(filter_io_26.c:705:filter_commitrw_write()) error starting transaction: rc = -30
> [... many repeats]

These are all just fallout from the R/O remount.

> Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
> login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake()) ASSERTION(pc != NULL) failed
> Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
> login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG

This looks like bug 13888, fixed in 1.6.5.

b.
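The "fallout" reading matches the return code itself: on Linux, errno 30 is EROFS ("Read-only file system"), which is what a write attempt would return once the device has been remounted read-only. To confirm that a saved log excerpt contains nothing but that code, something like the following could be used. This is a minimal sketch; the function name `rc_summary` is hypothetical and the log path in the example is an assumption.

```shell
# Tally the "rc = N" values found in LustreError lines.
# Reads the named log files, or stdin when none are given.
# Output: one "<rc> <count>" line per distinct return code.
rc_summary() {
    sed -n 's/.*LustreError.*rc = \(-\{0,1\}[0-9][0-9]*\).*/\1/p' "$@" |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example (assumed log location):
#   rc_summary /var/log/messages
```

Any return code other than -30 in the output would point at a problem that is not explained by the read-only remount alone.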
Michael Sternberg
2008-Jul-02 14:01 UTC
[Lustre-discuss] OSS: bad header in inode - invalid magic
Hi,

On Jul 2, 2008, at 8:35 , Brian J. Murrell wrote:
> On Tue, 2008-07-01 at 19:52 -0500, Michael Sternberg wrote:
>> Jul 1 15:43:58 oss01 kernel: LDISKFS-fs error (device dm-3): ldiskfs_ext_find_extent: bad header in inode #405012501: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
>
> I would suggest making sure you have the *latest* e2fsprogs from Sun and running an e2fsck on that volume. You may have to use flags to force a thorough check. Check the manpage.

Ah - I tried, but ran into a symbol error:

[root@oss01 ~]# wget http://downloads.lustre.org/public/tools/e2fsprogs/latest/e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
[root@oss01 ~]# rpm -Fvh ./e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
[root@oss01 ~]# e2fsck -fp /dev/dm-3
e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2fs_mmp_update

This is on RHEL5.1.

>> Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
>> login1 kernel: LustreError: 5612:0:(ptlrpcd.c:72:ptlrpcd_wake()) ASSERTION(pc != NULL) failed
>> Message from syslogd@ at Tue Jul 1 16:00:02 2008 ...
>> login1 kernel: LustreError: 5612:0:(tracefile.c:431:libcfs_assertion_failed()) LBUG
>
> This looks like bug 13888 fixed in 1.6.5.

OK, will go forward with 1.6.5.1, hoping OFED-1.3 etc. issues are solved.

Thank you very much,
Michael
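A "symbol lookup error" at run time generally means the binary was resolved against a shared library that does not export the named symbol. Whether that is the cause here is not established in the thread. The sketch below shows one way to check which libext2fs the new e2fsck picks up; the helper name `lib_path_from_ldd` is hypothetical, and `ldd` and `nm -D` are the standard glibc and binutils tools.

```shell
# Read `ldd` output on stdin and print the resolved path of the first
# library whose name starts with the given prefix (e.g. libext2fs).
lib_path_from_ldd() {
    awk -v lib="$1" 'index($1, lib) == 1 && $2 == "=>" { print $3; exit }'
}

# Example, to be run on the OSS:
#   lib=$(ldd "$(command -v e2fsck)" | lib_path_from_ldd libext2fs)
#   nm -D "$lib" | grep -w ext2fs_mmp_update
# If grep prints nothing, the resolved library does not export the symbol.
```

Comparing the resolved library path with the package that owns it (`rpm -qf`) would then show whether the library and the e2fsck binary come from the same e2fsprogs build.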
Brian J. Murrell
2008-Jul-02 14:26 UTC
[Lustre-discuss] OSS: bad header in inode - invalid magic
On Wed, 2008-07-02 at 09:01 -0500, Michael Sternberg wrote:
> Hi,

Hi.

> Ah - I tried, but ran into a symbol error:
>
> [root@oss01 ~]# wget http://downloads.lustre.org/public/tools/e2fsprogs/latest/e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
> [root@oss01 ~]# rpm -Fvh ./e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
> [root@oss01 ~]# e2fsck -fp /dev/dm-3
> e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2fs_mmp_update

Ahhh. This isn't good. Can you file a bug in our bugzilla about that?

> OK, will go forward with 1.6.5.1, hoping OFED-1.3 etc. issues are solved.

Me too!

b.
Michael Sternberg
2008-Jul-02 15:58 UTC
[Lustre-discuss] OSS: bad header in inode - invalid magic
On Jul 2, 2008, at 9:26 , Brian J. Murrell wrote:
> On Wed, 2008-07-02 at 09:01 -0500, Michael Sternberg wrote:
>> Ah - I tried, but ran into a symbol error:
>>
>> [root@oss01 ~]# wget http://downloads.lustre.org/public/tools/e2fsprogs/latest/e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
>> [root@oss01 ~]# rpm -Fvh ./e2fsprogs-1.40.7.sun3-0redhat.rhel5.x86_64.rpm
>> [root@oss01 ~]# e2fsck -fp /dev/dm-3
>> e2fsck: symbol lookup error: e2fsck: undefined symbol: ext2fs_mmp_update
>
> Ahhh. This isn't good. Can you file a bug in our bugzilla about that?

Done: https://bugzilla.lustre.org/show_bug.cgi?id=16265

Update to CentOS-5.2 in progress ...

Michael