Dear All,

We have an emergency with our Lustre filesystem. We installed lustre-1.6.6
with Linux kernel 2.6.22.19 on all of the MGS, MDT, and OST servers and on
the clients, and they have been running very well. Today, however, we hit a
disk array hardware problem (one of the hard disks in the RAID 6 disk array
crashed), and soon afterwards the Lustre filesystem crashed as well. After
we replaced the bad disk with a new one, the disk array appears to have
rebuilt the RAID 6 data correctly, and the file servers can see the
partitions on that array. However, the OST partition on that array can no
longer be mounted:

root@wd2:~# mount -t ldiskfs /dev/sdb1 /mnt/mnt
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
       missing codepage or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

The dmesg output shows:

[ 3314.530762] LDISKFS-fs error (device sdb1): ldiskfs_check_descriptors: Block bitmap for group 11152 not in group (block 3407085568)!
[ 3314.531701] LDISKFS-fs: group descriptors corrupted!

If I run:

./tunefs.lustre --writeconf /dev/sdb1

I get:

Reading CONFIGS/mountdata

   Read previous values:
Target:     cwork2-OST0000
Index:      0
Lustre FS:  cwork2
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.10.50@tcp

   Permanent disk data:
Target:     cwork2-OST0000
Index:      0
Lustre FS:  cwork2
Mount type: ldiskfs
Flags:      0x102
            (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.10.50@tcp

tunefs.lustre: Unable to mount /dev/sdb1: Invalid argument
tunefs.lustre FATAL: failed to write local files
tunefs.lustre: exiting with 22 (Invalid argument)

The command "dumpe2fs /dev/sdb1" gives:

Filesystem volume name:   cwork2-OST0000
Last mounted on:          <not available>
Filesystem UUID:          4f4323df-73a5-4e93-9a2d-2c2b9a6c3c60
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extents sparse_super large_file
Filesystem flags:         signed directory hash
Default mount options:    (none)
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              101335040
Block count:              405336007
Reserved block count:     20266800
Free blocks:              164142148
Free inodes:              119852810
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      927
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Filesystem created:       Thu Oct 16 15:29:21 2008
.....
  Block bitmap at 2820539742 (+2415232350), Inode bitmap at 2820539691 (+2415232299)
  Inode table at 2820539857-2820540368 (+2415232465)
  40232 free blocks, 20387 free inodes, 0 directories
dumpe2fs: /dev/sdb1: error reading bitmaps: Can't read an block bitmap

It seems that the backend ext3 filesystem is still there, but has errors.
Could anyone suggest a way to recover the OST partitions? Can I use e2fsck
to fix the problems on the OST partitions? The MGS and MDT seem to be OK,
because they are not on that disk array.

Thanks very much for your kind help.

Best Regards,

T.H.Hsieh
Brian J. Murrell
2009-Mar-09 18:13 UTC
[Lustre-discuss] OST crash with group descriptors corrupted
On Mon, 2009-03-09 at 19:39 +0800, thhsieh wrote:
> Dear All,
>
> We have an emergency with our Lustre filesystem.
>
> Today, however, we hit a disk array hardware problem (one of the hard
> disks in the RAID 6 disk array crashed), and soon afterwards the
> Lustre filesystem crashed as well.
>
> The dmesg output shows:
>
> [ 3314.530762] LDISKFS-fs error (device sdb1): ldiskfs_check_descriptors: Block bitmap for group 11152 not in group (block 3407085568)!
> [ 3314.531701] LDISKFS-fs: group descriptors corrupted!

It looks like your disk error has resulted in on-disk corruption.
AFAIK, RAID is supposed to prevent this. No idea why it didn't in this
case. Maybe check with your RAID vendor.

> It seems that the backend ext3 filesystem is still there, but has
> errors.

Indeed.

> Could anyone suggest a way to recover the OST partitions? Can I use
> e2fsck to fix the problems on the OST partitions?

Yes, e2fsck should correct the problem(s). Be aware that there is a
possibility that the only way for e2fsck to correct the state of the
filesystem is to (re-)move data from the filesystem. To what extent
will depend completely on how much on-disk corruption has taken place.

You can get an idea of what e2fsck will do without actually doing
anything to the disk data by giving it the "-n" argument. You can
decide based on that "dry-run" e2fsck output whether the corrective
action it will take is acceptable to you.

b.
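As an illustration of that dry-run approach, the check on the OST device
from this thread might look roughly like the following (the device name
/dev/sdb1 is taken from the messages above; the log file names are only
suggestions):

    # report what e2fsck would change, without modifying the device
    e2fsck -fn /dev/sdb1 2>&1 | tee /root/ost0000-fsck-dryrun.log

    # if the proposed fixes look acceptable, run the actual repair
    e2fsck -fy /dev/sdb1 2>&1 | tee /root/ost0000-fsck.log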
Hello,

Thanks very much for your kind reply. We ran e2fsck (version 1.41.4) on
all of the OST partitions, and thousands of errors were reported and
fixed. However, we have now hit a serious error that I have no idea how
to fix: even after e2fsck has finished, one of the OST partitions still
has a problem. The command:

./tunefs.lustre --writeconf /dev/sdb1

shows:

checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     cwork2-OST0000
Index:      0
Lustre FS:  cwork2
Mount type: ldiskfs
Flags:      0x2
            (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.10.50@tcp

   Permanent disk data:
Target:     cwork2-OST0000
Index:      0
Lustre FS:  cwork2
Mount type: ldiskfs
Flags:      0x102
            (OST writeconf )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.10.50@tcp

tunefs.lustre: Unable to mount /dev/sdb1: Invalid argument
tunefs.lustre FATAL: failed to write local files
tunefs.lustre: exiting with 22 (Invalid argument)

and the kernel log shows the following errors:

[80083.964462] LDISKFS-fs: group descriptors corrupted!
[81423.119834] LDISKFS-fs error (device sdb1): ldiskfs_check_descriptors: Checksum for group 11165 failed (0!=20224)

We also tried e2fsck with the backup superblock at 32768, but after some
more corrections we ran into the same problem again. How can we fix this
kind of problem? In any case, we are trying to rescue as much of the
existing data as possible, and will reformat the whole filesystem after
that.

Is there any other information I should provide to make the situation
clearer? Please let me know. I am really thankful for your suggestions.

Best Regards,

T.H.Hsieh

On Mon, Mar 09, 2009 at 02:13:15PM -0400, Brian J. Murrell wrote:
> Yes, e2fsck should correct the problem(s). Be aware that there is a
> possibility that the only way for e2fsck to correct the state of the
> filesystem is to (re-)move data from the filesystem.
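For reference, the backup-superblock attempt described above would, with
the 4096-byte block size shown in the earlier dumpe2fs output, look
something like this (device name again taken from the messages above):

    # -b selects the first backup superblock, -B gives the block size in bytes
    e2fsck -fy -b 32768 -B 4096 /dev/sdb1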
Brian J. Murrell
2009-Mar-10 14:04 UTC
[Lustre-discuss] OST crash with group descriptors corrupted
On Tue, 2009-03-10 at 18:14 +0800, thhsieh wrote:
> Hello,

Hi.

> Thanks very much for your kind reply.

NP.

> We ran e2fsck (version 1.41.4) on all of the OST partitions.

You did that with or without "-n" in the command arguments?

> [80083.964462] LDISKFS-fs: group descriptors corrupted!
> [81423.119834] LDISKFS-fs error (device sdb1): ldiskfs_check_descriptors: Checksum for group 11165 failed (0!=20224)

Hrm. I don't know enough about the innards of ext3 to parse that.
Maybe (well, no maybes about it) Andreas will know if he is reading.

> We also tried e2fsck with the backup superblock at 32768, but after
> some more corrections we ran into the same problem again.

Again, with or without e2fsck's "-n" argument?

b.
Hello,

We ran it without the "-n" command argument, so the filesystem has been
modified.

Best Regards,

T.H.Hsieh

On Tue, Mar 10, 2009 at 10:04:24AM -0400, Brian J. Murrell wrote:
> > We ran e2fsck (version 1.41.4) on all of the OST partitions.
>
> You did that with or without "-n" in the command arguments?
Hello,

I am wondering whether it is possible to give up the problematic OST and
activate only the other OSTs, so that we can rescue at least part of the
data files.

We have six OSTs in total, and one of them has a problem that I currently
have no idea how to repair. If I activate only the remaining five OSTs,
can I get back (at most) 5/6 of the data files, or will I only get back
junk (if files are split into fragments that are distributed across all
of the OSTs)?

If activating only the five OSTs can get back some of the data files,
what is the procedure I should follow?

I am under time pressure to recover the system, so I am considering the
worst case....

Thanks so much for your kind replies.

Best Regards,

T.H.Hsieh
Andreas Dilger
2009-Mar-10 19:47 UTC
[Lustre-discuss] OST crash with group descriptors corrupted
On Mar 10, 2009 23:42 +0800, thhsieh wrote:
> I am wondering whether it is possible to give up the problematic OST
> and activate only the other OSTs, so that we can rescue at least part
> of the data files.

Yes, this is always possible. Just mount lustre as normal, and on all
clients + MDS run "lctl set_param osc.*OST{number}*.active=0" so that
they will return an EIO error instead of hanging and waiting for the
failed OST to return.

That said, I don't think this is necessarily a fatal problem.

> We have six OSTs in total, and one of them has a problem that I
> currently have no idea how to repair. If I activate only the remaining
> five OSTs, can I get back (at most) 5/6 of the data files, or will I
> only get back junk (if files are split into fragments that are
> distributed across all of the OSTs)?

Lustre by default places each file on a single OST, so you should be
able to get back 5/6 of your files.

> If activating only the five OSTs can get back some of the data files,
> what is the procedure I should follow?

Use "lfs find --obd {OST_UUID} /mount/point" to find files that are on
the failed OST. Hmm, there isn't a way to specify the opposite, however
"lfs find ! --obd {OST_UUID}", please file a bug for that, it is
relatively easy to implement (or you could take a crack at it in
lustre/utils/lfs.c::lfs_find()).

> I am under time pressure to recover the system, so I am considering
> the worst case....

I think you can possibly recover this OST.

> > > You did that with or without "-n" in the command arguments?
> > >
> > > > [80083.964462] LDISKFS-fs: group descriptors corrupted!
> > > > [81423.119834] LDISKFS-fs error (device sdb1): ldiskfs_check_descriptors: Checksum for group 11165 failed (0!=20224)

It looks like this is a simple bug in the ldiskfs code AND in the e2fsck
code. The feature that enables group checksums (uninit_bg) was disabled
in the superblock for some reason, but e2fsck didn't clear the checksum
from disk. Now, the kernel is returning "0" for the checksum (because
this feature is disabled) but there is an old checksum value on disk.

The easiest way to fix this (short of modifying the kernel and/or e2fsck)
is to re-enable the uninit_bg feature, and re-run e2fsck. Note that
running with uninit_bg is preferable in any case, as it improves
performance.

# tune2fs -O uninit_bg /dev/XXX
# e2fsck -fy /dev/XXX

This will report an error for all of the checksum values and correct
them, but then hopefully your filesystem can be mounted again. Please
file a separate bug on this, it needs to be fixed in our uninit_bg code
to ignore the checksum if the feature is disabled, and in e2fsck to zero
this value if the feature is disabled.

> > > > We ran e2fsck (version 1.41.4) on all of the OST partitions.

Note that e2fsck 1.41.4 is the upstream e2fsprogs, not the Lustre-patched
e2fsprogs-1.40.11.sun1. While the majority of Lustre (now ext4)
functionality is included in 1.41.4, it isn't all there. In this case I
don't know whether it matters or not.

Also note that the "uninit_bg" feature was called "uninit_groups" in the
1.40.11 release of the Lustre e2fsprogs (this was changed beyond our
control), so adjust the above steps accordingly.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
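Applied to the setup in this thread, the deactivation and search might look
roughly as follows (the OST name cwork2-OST0000 comes from the
tunefs.lustre output above; the client mount point /mnt/cwork2 and the
output file name are only placeholders):

    # on the MDS and on every client: stop waiting for the failed OST,
    # so I/O against it returns EIO instead of hanging
    lctl set_param osc.cwork2-OST0000*.active=0

    # on a client: list the files that have objects on the failed OST
    lfs find --obd cwork2-OST0000_UUID /mnt/cwork2 > /tmp/files-on-ost0000.txt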
Hello,

Thanks so much to Andreas, Megan, and Brian. Following Andreas's
suggestions, all of the OSTs are now recovered and functional.

Because of the large time pressure, and because unfortunately I had to be
out of the office with only limited network access, I could not get the
Lustre-patched e2fsprogs-1.40.11.sun1 to work, so I stayed with
e2fsprogs-1.41.4. After running

# tune2fs -O uninit_bg /dev/XXX
# e2fsck -fy /dev/XXX

I still could not run "tunefs.lustre --writeconf /dev/XXX"; the kernel
log complained about a missing journal. I therefore ran:

# tune2fs -j /dev/XXX

and this time everything works!!!!! :)

Now I have put the system back on-line for users to download their data.
To be safe, I guess I still need to run:

lfs find --obd {OST_UUID} /mount/point

or is there anything else I need to do to ensure the consistency of the
Lustre filesystem? Please give me your suggestions.

Thanks very much again.

T.H.Hsieh

On Tue, Mar 10, 2009 at 01:47:39PM -0600, Andreas Dilger wrote:
> The easiest way to fix this (short of modifying the kernel and/or
> e2fsck) is to re-enable the uninit_bg feature, and re-run e2fsck.
>
> # tune2fs -O uninit_bg /dev/XXX
> # e2fsck -fy /dev/XXX
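For anyone hitting the same "Checksum for group N failed" symptom, the
sequence that worked in this thread, collected from the messages above, was
roughly the following (the device name and mount point are placeholders;
the tune2fs -j step is only needed if the journal turns out to be missing
afterwards):

    # re-enable group checksums so the stale on-disk checksums are rewritten
    # (the feature is named uninit_groups in the Lustre-patched e2fsprogs 1.40.11)
    tune2fs -O uninit_bg /dev/sdb1

    # full forced check; expect one checksum error per group to be corrected
    e2fsck -fy /dev/sdb1

    # only if tunefs.lustre then complains about a missing journal
    tune2fs -j /dev/sdb1

    # regenerate the Lustre configuration logs and remount the OST
    tunefs.lustre --writeconf /dev/sdb1
    mount -t lustre /dev/sdb1 /mnt/ost0000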