Michael Ulbrich
2016-Mar-24 13:47 UTC
[Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Joseph,

thanks for this information, although it does not sound too optimistic ...

So, if I understand you correctly: if we had a metadata backup from o2image
_before_ the crash, we could have looked up the missing info to remove the
loop from group chain 73, right?

But how could the loop issue be fixed and, at the same time, the damage to
the data be minimized? There is a recent file-level backup from which
damaged or missing files could be restored later.

151 4054438912 15872 2152 13720 10606 1984
152 4094595072 15872 10753 5119 5119 1984
153 4090944512 15872 1818 14054 9646 1984 <--
154 4083643392 15872 571 15301 4914 1984
155 4510758912 15872 4834 11038 6601 1984
156 4492506112 15872 6532 9340 5119 1984

Could you describe a "brute force" way to dd out and edit record #153 so as
to remove the loop and minimize the potential loss of data at the same time?
So that fsck would have a chance to complete and fix the remaining issues?

Thanks a lot for your help ... Michael

On 03/24/2016 02:10 PM, Joseph Qi wrote:
> Hi Michael,
> So I think the block of record #153 goes wrong, which points next to
> block 4083643392 of record #19.
> But the problem is we don't know the right info of the block of record
> #153, otherwise we can dd out, edit it and then dd in to fix it.
>
> Thanks,
> Joseph
>
> On 2016/3/24 18:38, Michael Ulbrich wrote:
>> Hi Joseph,
>>
>> ok, got it! Here's the loop in chain 73:
>>
>> Group Chain: 73 Parent Inode: 13 Generation: 1172963971
>> CRC32: 00000000 ECC: 0000
>> ## Block# Total Used Free Contig Size
>> 0 4280773632 15872 11487 4385 1774 1984
>> 1 2583263232 15872 5341 10531 5153 1984
>> 2 4543613952 15872 5329 10543 5119 1984
>> 3 4532662272 15872 10753 5119 5119 1984
>> 4 4539963392 15872 3223 12649 7530 1984
>> 5 4536312832 15872 5219 10653 5534 1984
>> 6 4529011712 15872 6047 9825 3359 1984
>> 7 4525361152 15872 4475 11397 5809 1984
>> 8 4521710592 15872 3182 12690 5844 1984
>> 9 4518060032 15872 5881 9991 5131 1984
>> 10 4236966912 15872 10753 5119 5119 1984
>> 11 4098245632 15872 10756 5116 3388 1984
>> 12 4514409472 15872 8826 7046 5119 1984
>> 13 3441144832 15872 15 15857 9680 1984
>> 14 4404892672 15872 7563 8309 5119 1984
>> 15 4233316352 15872 9398 6474 5114 1984
>> 16 4488855552 15872 6358 9514 5119 1984
>> 17 3901115392 15872 9932 5940 3757 1984
>> 18 4507108352 15872 6557 9315 6166 1984
>> 19 4083643392 15872 571 15301 4914 1984 <--
>> 20 4510758912 15872 4834 11038 6601 1984
>> 21 4492506112 15872 6532 9340 5119 1984
>> 22 4496156672 15872 10753 5119 5119 1984
>> 23 4503457792 15872 10718 5154 5119 1984
>> ...
>> 154 4083643392 15872 571 15301 4914 1984 <--
>> 155 4510758912 15872 4834 11038 6601 1984
>> 156 4492506112 15872 6532 9340 5119 1984
>> 157 4496156672 15872 10753 5119 5119 1984
>> 158 4503457792 15872 10718 5154 5119 1984
>> ...
>> 289 4083643392 15872 571 15301 4914 1984 <--
>> 290 4510758912 15872 4834 11038 6601 1984
>> 291 4492506112 15872 6532 9340 5119 1984
>> 292 4496156672 15872 10753 5119 5119 1984
>> 293 4503457792 15872 10718 5154 5119 1984
>>
>> etc.
>>
>> So the loop begins at record #154 and spans 135 records, right?
>>
>> Will back up the fs metadata as soon as I have some external storage at hand.
>>
>> Thanks a lot so far ... Michael
>>
>> On 03/24/2016 10:41 AM, Joseph Qi wrote:
>>> Hi Michael,
>>> It seems that a dead loop happens in chain 73. You have formatted using 2K
>>> block and 4K cluster, so each chain should have 1522 or 1521 records.
>>> But at first glance, I cannot figure out which block goes wrong, because >>> the output you pasted indicates all blocks are different. So I suggest >>> you investigate the all blocks which belong to chain 73 and try to find >>> out if there is a loop there. >>> BTW, have you backed up the metadata using o2image? >>> >>> Thanks, >>> Joseph >>> >>> On 2016/3/24 16:40, Michael Ulbrich wrote: >>>> Hi Joseph, >>>> >>>> thanks a lot for your help. It is very much appreciated! >>>> >>>> I ran debugsfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system: >>>> >>>> root at s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 > >>>> debugfs_drbd1.log 2>&1 >>>> >>>> Inode: 13 Mode: 0644 Generation: 1172963971 (0x45ea0283) >>>> FS Generation: 1172963971 (0x45ea0283) >>>> CRC32: 00000000 ECC: 0000 >>>> Type: Regular Attr: 0x0 Flags: Valid System Allocbitmap Chain >>>> Dynamic Features: (0x0) >>>> User: 0 (root) Group: 0 (root) Size: 11381315956736 >>>> Links: 1 Clusters: 2778641591 >>>> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014 >>>> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014 >>>> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014 >>>> dtime: 0x0 -- Thu Jan 1 01:00:00 1970 >>>> ctime_nsec: 0x00000000 -- 0 >>>> atime_nsec: 0x00000000 -- 0 >>>> mtime_nsec: 0x00000000 -- 0 >>>> Refcount Block: 0 >>>> Last Extblk: 0 Orphan Slot: 0 >>>> Sub Alloc Slot: Global Sub Alloc Bit: 7 >>>> Bitmap Total: 2778641591 Used: 1083108631 Free: 1695532960 >>>> Clusters per Group: 15872 Bits per Cluster: 1 >>>> Count: 115 Next Free Rec: 115 >>>> ## Total Used Free Block# >>>> 0 24173056 9429318 14743738 4533995520 >>>> 1 24173056 9421663 14751393 4548629504 >>>> 2 24173056 9432421 14740635 4588817408 >>>> 3 24173056 9427533 14745523 4548692992 >>>> 4 24173056 9433978 14739078 4508568576 >>>> 5 24173056 9436974 14736082 4636369920 >>>> 6 24173056 9428411 14744645 4563390464 >>>> 7 24173056 9426950 14746106 4479459328 >>>> 8 24173056 9428099 14744957 4548851712 >>>> 9 24173056 9431794 14741262 4585389056 >>>> ... >>>> 105 24157184 9414241 14742943 4690652160 >>>> 106 24157184 9419715 14737469 4467999744 >>>> 107 24157184 9411479 14745705 4431525888 >>>> 108 24157184 9413235 14743949 4559327232 >>>> 109 24157184 9417948 14739236 4500950016 >>>> 110 24157184 9411013 14746171 4566691840 >>>> 111 24157184 9421252 14735932 4522916864 >>>> 112 24157184 9416726 14740458 4537550848 >>>> 113 24157184 9415358 14741826 4676303872 >>>> 114 24157184 9420448 14736736 4526662656 >>>> >>>> Group Chain: 0 Parent Inode: 13 Generation: 1172963971 >>>> CRC32: 00000000 ECC: 0000 >>>> ## Block# Total Used Free Contig Size >>>> 0 4533995520 15872 6339 9533 3987 1984 >>>> 1 4530344960 15872 10755 5117 5117 1984 >>>> 2 2997109760 15872 10753 5119 5119 1984 >>>> 3 4526694400 15872 10753 5119 5119 1984 >>>> 4 3022663680 15872 10753 5119 5119 1984 >>>> 5 4512092160 15872 9043 6829 2742 1984 >>>> 6 4523043840 15872 4948 10924 9612 1984 >>>> 7 4519393280 15872 6150 9722 5595 1984 >>>> 8 4515742720 15872 4323 11549 6603 1984 >>>> 9 3771028480 15872 10753 5119 5119 1984 >>>> ... 
>>>> 1513 5523297280 15872 1 15871 15871 1984 >>>> 1514 5526947840 15872 1 15871 15871 1984 >>>> 1515 5530598400 15872 1 15871 15871 1984 >>>> 1516 5534248960 15872 1 15871 15871 1984 >>>> 1517 5537899520 15872 1 15871 15871 1984 >>>> 1518 5541550080 15872 1 15871 15871 1984 >>>> 1519 5545200640 15872 1 15871 15871 1984 >>>> 1520 5548851200 15872 1 15871 15871 1984 >>>> 1521 5552501760 15872 1 15871 15871 1984 >>>> 1522 5556152320 15872 1 15871 15871 1984 >>>> >>>> Group Chain: 1 Parent Inode: 13 Generation: 1172963971 >>>> CRC32: 00000000 ECC: 0000 >>>> ## Block# Total Used Free Contig Size >>>> 0 4548629504 15872 10755 5117 2496 1984 >>>> 1 2993490944 15872 59 15813 14451 1984 >>>> 2 2489713664 15872 10758 5114 3726 1984 >>>> 3 3117609984 15872 3958 11914 6165 1984 >>>> 4 2544472064 15872 10753 5119 5119 1984 >>>> 5 3040948224 15872 10753 5119 5119 1984 >>>> 6 2971587584 15872 10753 5119 5119 1984 >>>> 7 4493871104 15872 8664 7208 3705 1984 >>>> 8 4544978944 15872 8711 7161 2919 1984 >>>> 9 4417209344 15872 3253 12619 6447 1984 >>>> ... >>>> 1513 5523329024 15872 1 15871 15871 1984 >>>> 1514 5526979584 15872 1 15871 15871 1984 >>>> 1515 5530630144 15872 1 15871 15871 1984 >>>> 1516 5534280704 15872 1 15871 15871 1984 >>>> 1517 5537931264 15872 1 15871 15871 1984 >>>> 1518 5541581824 15872 1 15871 15871 1984 >>>> 1519 5545232384 15872 1 15871 15871 1984 >>>> 1520 5548882944 15872 1 15871 15871 1984 >>>> 1521 5552533504 15872 1 15871 15871 1984 >>>> 1522 5556184064 15872 1 15871 15871 1984 >>>> >>>> ... all following group chains are similarly structured up to #73 which >>>> looks as follows: >>>> >>>> Group Chain: 73 Parent Inode: 13 Generation: 1172963971 >>>> CRC32: 00000000 ECC: 0000 >>>> ## Block# Total Used Free Contig Size >>>> 0 2583263232 15872 5341 10531 5153 1984 >>>> 1 4543613952 15872 5329 10543 5119 1984 >>>> 2 4532662272 15872 10753 5119 5119 1984 >>>> 3 4539963392 15872 3223 12649 7530 1984 >>>> 4 4536312832 15872 5219 10653 5534 1984 >>>> 5 4529011712 15872 6047 9825 3359 1984 >>>> 6 4525361152 15872 4475 11397 5809 1984 >>>> 7 4521710592 15872 3182 12690 5844 1984 >>>> 8 4518060032 15872 5881 9991 5131 1984 >>>> 9 4236966912 15872 10753 5119 5119 1984 >>>> ... >>>> 2059651 4299026432 15872 4334 11538 4816 1984 >>>> 2059652 4087293952 15872 7003 8869 2166 1984 >>>> 2059653 4295375872 15872 6626 9246 5119 1984 >>>> 2059654 4288074752 15872 509 15363 9662 1984 >>>> 2059655 4291725312 15872 6151 9721 5119 1984 >>>> 2059656 4284424192 15872 10052 5820 5119 1984 >>>> 2059657 4277123072 15872 7383 8489 5120 1984 >>>> 2059658 4273472512 15872 14 15858 5655 1984 >>>> 2059659 4269821952 15872 2637 13235 7060 1984 >>>> 2059660 4266171392 15872 10758 5114 3674 1984 >>>> ... >>>> >>>> Assuming this would go on forever I stopped debugfs.ocfs2. >>>> >>>> With debugs.ocfs2 from ocfs2-tools 1.8.4 I get an identical result. >>>> >>>> Please let me know if I can provide any further information and help to >>>> fix this issue. >>>> >>>> Thanks again + Best regards ... Michael >>>> >>>> On 03/24/2016 01:30 AM, Joseph Qi wrote: >>>>> Hi Michael, >>>>> Could you please use debugfs to check the output? >>>>> # debugfs.ocfs2 -R 'stat //global_bitmap' <device> >>>>> >>>>> Thanks, >>>>> Joseph >>>>> >>>>> On 2016/3/24 6:38, Michael Ulbrich wrote: >>>>>> Hi ocfs2-users, >>>>>> >>>>>> my first post to this list from yesterday probably didn't get through. >>>>>> >>>>>> Anyway, I've made some progress in the meantime and may now ask more >>>>>> specific questions ... 
>>>>>> >>>>>> I'm having issues with an 11 TB ocfs2 shared filesystem on Debian Wheezy: >>>>>> >>>>>> Linux s1a 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux >>>>>> >>>>>> the kernel modules are: >>>>>> >>>>>> modinfo ocfs2 -> version: 1.5.0 >>>>>> >>>>>> using stock ocfs2-tools 1.6.4-1+deb7u1 from the distri. >>>>>> >>>>>> As an alternative I cloned and built the latest ocfs2-tools from >>>>>> markfasheh's ocfs2-tools on github which should be version 1.8.4. >>>>>> >>>>>> The filesystem runs on top of drbd, is used to roughly 40 % and suffers >>>>>> from read-only remounts and hanging clients since the last reboot. This >>>>>> may be DLM problems but I suspect they stem from some corrupt disk >>>>>> structures. Before that it all ran stable for months. >>>>>> >>>>>> This situation made me want to run fsck.ocfs2 and now I wonder how to do >>>>>> that. The filesystem is not mounted. >>>>>> >>>>>> With the stock ocfs-tools 1.6.4: >>>>>> >>>>>> root at s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1 >>>>>> fsck.ocfs2 1.6.4 >>>>>> Checking OCFS2 filesystem in /dev/drbd1: >>>>>> Label: ocfs2_ASSET >>>>>> UUID: 6A1A0189A3F94E32B6B9A526DF9060F3 >>>>>> Number of blocks: 5557283182 >>>>>> Block size: 2048 >>>>>> Number of clusters: 2778641591 >>>>>> Cluster size: 4096 >>>>>> Number of slots: 16 >>>>>> >>>>>> I'm checking fsck_drbd1.log and find that it is making progress in >>>>>> >>>>>> Pass 0a: Checking cluster allocation chains >>>>>> >>>>>> until it reaches "chain 73" and goes into an infinite loop filling the >>>>>> logfile with breathtaking speed. >>>>>> >>>>>> With the newly built ocfs-tools 1.8.4 I get: >>>>>> >>>>>> root at s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1 >>>>>> fsck.ocfs2 1.8.4 >>>>>> Checking OCFS2 filesystem in /dev/drbd1: >>>>>> Label: ocfs2_ASSET >>>>>> UUID: 6A1A0189A3F94E32B6B9A526DF9060F3 >>>>>> Number of blocks: 5557283182 >>>>>> Block size: 2048 >>>>>> Number of clusters: 2778641591 >>>>>> Cluster size: 4096 >>>>>> Number of slots: 16 >>>>>> >>>>>> Again watching the verbose output in fsck_drbd1.log I find that this >>>>>> time it proceeds up to >>>>>> >>>>>> Pass 0a: Checking cluster allocation chains >>>>>> o2fsck_pass0:1360 | found inode alloc 13 at block 13 >>>>>> >>>>>> and stays there without any further progress. I've terminated this >>>>>> process after waiting for more than an hour. >>>>>> >>>>>> Now - I'm lost somehow ... and would very much appreciate if anybody on >>>>>> this list would share his knowledge and give me a hint what to do next. >>>>>> >>>>>> What could be done to get this file system checked and repaired? Am I >>>>>> missing something important or do I just have to wait a little bit >>>>>> longer? Is there a version of ocfs2-tools / fsck.ocfs2 which will >>>>>> perform as expected? >>>>>> >>>>>> I'm prepared to upgrade the kernel to 3.16.0-0.bpo.4-amd64 but shy away >>>>>> from taking that risk without any clue of whether that might solve my >>>>>> problem ... >>>>>> >>>>>> Thanks in advance ... 
Michael Ulbrich
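Michael's "brute force" question above would, in practice, start with capturing
the descriptors involved. A minimal sketch, assuming the 2K filesystem block
size and the /dev/drbd1 device from the fsck output, the 'group' command of
debugfs.ocfs2, and the o2image syntax of these ocfs2-tools releases; everything
except the o2image backup is read-only, and the filesystem should be unmounted
on all nodes first:

  # metadata backup first, as discussed in the thread (writes only the image file)
  o2image /dev/drbd1 /path/to/external/drbd1.o2image

  # pretty-print the suspect descriptor of record #153 and the one it points to
  debugfs.ocfs2 -R 'group 4090944512' /dev/drbd1
  debugfs.ocfs2 -R 'group 4083643392' /dev/drbd1

  # raw 2K copies for offline inspection and possible editing later
  dd if=/dev/drbd1 of=gd_153.bin bs=2048 count=1 skip=4090944512
  dd if=/dev/drbd1 of=gd_19.bin bs=2048 count=1 skip=4083643392

The skip values work out because debugfs.ocfs2 prints group descriptor locations
as filesystem block numbers, and the filesystem block size here is 2048 bytes.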
Hi Michael,

On 2016/3/24 21:47, Michael Ulbrich wrote:
> Hi Joseph,
>
> thanks for this information although this does not sound too optimistic ...
>
> So, if I understand you correctly, if we had a metadata backup from
> o2image _before_ the crash we could have looked up the missing info to
> remove the loop from group chain 73, right?

If we have a metadata backup, we can use o2image to restore it, but this may
lose some data.

> But how could the loop issue be fixed and at the same time the damage to
> the data be minimized? There is a recent file level backup from which
> damaged or missing files could be restored later.
>
> 151 4054438912 15872 2152 13720 10606 1984
> 152 4094595072 15872 10753 5119 5119 1984
> 153 4090944512 15872 1818 14054 9646 1984 <--
> 154 4083643392 15872 571 15301 4914 1984
> 155 4510758912 15872 4834 11038 6601 1984
> 156 4492506112 15872 6532 9340 5119 1984
>
> Could you describe a "brute force" way how to dd out and edit record
> #153 to remove the loop and minimize potential loss of data at the same
> time? So that fsck would have a chance to complete and fix the remaining
> issues?

This is dangerous until we know exactly what info the block should store.
My idea is to find out the actual block of record #154 and let block
4090944512 of record #153 point to it. This is a bit complicated and should
only be done with a deep understanding of the disk layout.

I have gone through the fsck.ocfs2 patches and found that the following may
help:

commit efca4b0f2241 (Break a chain loop in group desc)

But as you said, you have already upgraded to version 1.8.4, so I'm sorry,
currently I don't have a better idea.

Thanks,
Joseph

> Thanks a lot for your help ... Michael
>
> On 03/24/2016 02:10 PM, Joseph Qi wrote:
>> Hi Michael,
>> So I think the block of record #153 goes wrong, which points next to
>> block 4083643392 of record #19.
>> But the problem is we don't know the right info of the block of record
>> #153, otherwise we can dd out, edit it and then dd in to fix it.
>>
>> Thanks,
>> Joseph
>>
>> On 2016/3/24 18:38, Michael Ulbrich wrote:
>>> Hi Joseph,
>>>
>>> ok, got it! Here's the loop in chain 73:
>>>
>>> Group Chain: 73 Parent Inode: 13 Generation: 1172963971
>>> CRC32: 00000000 ECC: 0000
>>> ## Block# Total Used Free Contig Size
>>> 0 4280773632 15872 11487 4385 1774 1984
>>> 1 2583263232 15872 5341 10531 5153 1984
>>> 2 4543613952 15872 5329 10543 5119 1984
>>> 3 4532662272 15872 10753 5119 5119 1984
>>> 4 4539963392 15872 3223 12649 7530 1984
>>> 5 4536312832 15872 5219 10653 5534 1984
>>> 6 4529011712 15872 6047 9825 3359 1984
>>> 7 4525361152 15872 4475 11397 5809 1984
>>> 8 4521710592 15872 3182 12690 5844 1984
>>> 9 4518060032 15872 5881 9991 5131 1984
>>> 10 4236966912 15872 10753 5119 5119 1984
>>> 11 4098245632 15872 10756 5116 3388 1984
>>> 12 4514409472 15872 8826 7046 5119 1984
>>> 13 3441144832 15872 15 15857 9680 1984
>>> 14 4404892672 15872 7563 8309 5119 1984
>>> 15 4233316352 15872 9398 6474 5114 1984
>>> 16 4488855552 15872 6358 9514 5119 1984
>>> 17 3901115392 15872 9932 5940 3757 1984
>>> 18 4507108352 15872 6557 9315 6166 1984
>>> 19 4083643392 15872 571 15301 4914 1984 <--
>>> 20 4510758912 15872 4834 11038 6601 1984
>>> 21 4492506112 15872 6532 9340 5119 1984
>>> 22 4496156672 15872 10753 5119 5119 1984
>>> 23 4503457792 15872 10718 5154 5119 1984
>>> ...
>>> 154 4083643392 15872 571 15301 4914 1984 <-- >>> 155 4510758912 15872 4834 11038 6601 1984 >>> 156 4492506112 15872 6532 9340 5119 1984 >>> 157 4496156672 15872 10753 5119 5119 1984 >>> 158 4503457792 15872 10718 5154 5119 1984 >>> ... >>> 289 4083643392 15872 571 15301 4914 1984 <-- >>> 290 4510758912 15872 4834 11038 6601 1984 >>> 291 4492506112 15872 6532 9340 5119 1984 >>> 292 4496156672 15872 10753 5119 5119 1984 >>> 293 4503457792 15872 10718 5154 5119 1984 >>> >>> etc. >>> >>> So the loop begins at record #154 and spans 135 records, right? >>> >>> Will backup fs metadata as soon as I have some external storage at hand. >>> >>> Thanks a lot so far ... Michael >>> >>> On 03/24/2016 10:41 AM, Joseph Qi wrote: >>>> Hi Michael, >>>> It seems that dead loop happens in chain 73. You have formatted using 2K >>>> block and 4K cluster, so each chain should have 1522 or 1521 records. >>>> But at first glance, I cannot figure out which block goes wrong, because >>>> the output you pasted indicates all blocks are different. So I suggest >>>> you investigate the all blocks which belong to chain 73 and try to find >>>> out if there is a loop there. >>>> BTW, have you backed up the metadata using o2image? >>>> >>>> Thanks, >>>> Joseph >>>> >>>> On 2016/3/24 16:40, Michael Ulbrich wrote: >>>>> Hi Joseph, >>>>> >>>>> thanks a lot for your help. It is very much appreciated! >>>>> >>>>> I ran debugsfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system: >>>>> >>>>> root at s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 > >>>>> debugfs_drbd1.log 2>&1 >>>>> >>>>> Inode: 13 Mode: 0644 Generation: 1172963971 (0x45ea0283) >>>>> FS Generation: 1172963971 (0x45ea0283) >>>>> CRC32: 00000000 ECC: 0000 >>>>> Type: Regular Attr: 0x0 Flags: Valid System Allocbitmap Chain >>>>> Dynamic Features: (0x0) >>>>> User: 0 (root) Group: 0 (root) Size: 11381315956736 >>>>> Links: 1 Clusters: 2778641591 >>>>> ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014 >>>>> atime: 0x54010183 -- Sat Aug 30 00:41:07 2014 >>>>> mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014 >>>>> dtime: 0x0 -- Thu Jan 1 01:00:00 1970 >>>>> ctime_nsec: 0x00000000 -- 0 >>>>> atime_nsec: 0x00000000 -- 0 >>>>> mtime_nsec: 0x00000000 -- 0 >>>>> Refcount Block: 0 >>>>> Last Extblk: 0 Orphan Slot: 0 >>>>> Sub Alloc Slot: Global Sub Alloc Bit: 7 >>>>> Bitmap Total: 2778641591 Used: 1083108631 Free: 1695532960 >>>>> Clusters per Group: 15872 Bits per Cluster: 1 >>>>> Count: 115 Next Free Rec: 115 >>>>> ## Total Used Free Block# >>>>> 0 24173056 9429318 14743738 4533995520 >>>>> 1 24173056 9421663 14751393 4548629504 >>>>> 2 24173056 9432421 14740635 4588817408 >>>>> 3 24173056 9427533 14745523 4548692992 >>>>> 4 24173056 9433978 14739078 4508568576 >>>>> 5 24173056 9436974 14736082 4636369920 >>>>> 6 24173056 9428411 14744645 4563390464 >>>>> 7 24173056 9426950 14746106 4479459328 >>>>> 8 24173056 9428099 14744957 4548851712 >>>>> 9 24173056 9431794 14741262 4585389056 >>>>> ... 
>>>>> 105 24157184 9414241 14742943 4690652160 >>>>> 106 24157184 9419715 14737469 4467999744 >>>>> 107 24157184 9411479 14745705 4431525888 >>>>> 108 24157184 9413235 14743949 4559327232 >>>>> 109 24157184 9417948 14739236 4500950016 >>>>> 110 24157184 9411013 14746171 4566691840 >>>>> 111 24157184 9421252 14735932 4522916864 >>>>> 112 24157184 9416726 14740458 4537550848 >>>>> 113 24157184 9415358 14741826 4676303872 >>>>> 114 24157184 9420448 14736736 4526662656 >>>>> >>>>> Group Chain: 0 Parent Inode: 13 Generation: 1172963971 >>>>> CRC32: 00000000 ECC: 0000 >>>>> ## Block# Total Used Free Contig Size >>>>> 0 4533995520 15872 6339 9533 3987 1984 >>>>> 1 4530344960 15872 10755 5117 5117 1984 >>>>> 2 2997109760 15872 10753 5119 5119 1984 >>>>> 3 4526694400 15872 10753 5119 5119 1984 >>>>> 4 3022663680 15872 10753 5119 5119 1984 >>>>> 5 4512092160 15872 9043 6829 2742 1984 >>>>> 6 4523043840 15872 4948 10924 9612 1984 >>>>> 7 4519393280 15872 6150 9722 5595 1984 >>>>> 8 4515742720 15872 4323 11549 6603 1984 >>>>> 9 3771028480 15872 10753 5119 5119 1984 >>>>> ... >>>>> 1513 5523297280 15872 1 15871 15871 1984 >>>>> 1514 5526947840 15872 1 15871 15871 1984 >>>>> 1515 5530598400 15872 1 15871 15871 1984 >>>>> 1516 5534248960 15872 1 15871 15871 1984 >>>>> 1517 5537899520 15872 1 15871 15871 1984 >>>>> 1518 5541550080 15872 1 15871 15871 1984 >>>>> 1519 5545200640 15872 1 15871 15871 1984 >>>>> 1520 5548851200 15872 1 15871 15871 1984 >>>>> 1521 5552501760 15872 1 15871 15871 1984 >>>>> 1522 5556152320 15872 1 15871 15871 1984 >>>>> >>>>> Group Chain: 1 Parent Inode: 13 Generation: 1172963971 >>>>> CRC32: 00000000 ECC: 0000 >>>>> ## Block# Total Used Free Contig Size >>>>> 0 4548629504 15872 10755 5117 2496 1984 >>>>> 1 2993490944 15872 59 15813 14451 1984 >>>>> 2 2489713664 15872 10758 5114 3726 1984 >>>>> 3 3117609984 15872 3958 11914 6165 1984 >>>>> 4 2544472064 15872 10753 5119 5119 1984 >>>>> 5 3040948224 15872 10753 5119 5119 1984 >>>>> 6 2971587584 15872 10753 5119 5119 1984 >>>>> 7 4493871104 15872 8664 7208 3705 1984 >>>>> 8 4544978944 15872 8711 7161 2919 1984 >>>>> 9 4417209344 15872 3253 12619 6447 1984 >>>>> ... >>>>> 1513 5523329024 15872 1 15871 15871 1984 >>>>> 1514 5526979584 15872 1 15871 15871 1984 >>>>> 1515 5530630144 15872 1 15871 15871 1984 >>>>> 1516 5534280704 15872 1 15871 15871 1984 >>>>> 1517 5537931264 15872 1 15871 15871 1984 >>>>> 1518 5541581824 15872 1 15871 15871 1984 >>>>> 1519 5545232384 15872 1 15871 15871 1984 >>>>> 1520 5548882944 15872 1 15871 15871 1984 >>>>> 1521 5552533504 15872 1 15871 15871 1984 >>>>> 1522 5556184064 15872 1 15871 15871 1984 >>>>> >>>>> ... all following group chains are similarly structured up to #73 which >>>>> looks as follows: >>>>> >>>>> Group Chain: 73 Parent Inode: 13 Generation: 1172963971 >>>>> CRC32: 00000000 ECC: 0000 >>>>> ## Block# Total Used Free Contig Size >>>>> 0 2583263232 15872 5341 10531 5153 1984 >>>>> 1 4543613952 15872 5329 10543 5119 1984 >>>>> 2 4532662272 15872 10753 5119 5119 1984 >>>>> 3 4539963392 15872 3223 12649 7530 1984 >>>>> 4 4536312832 15872 5219 10653 5534 1984 >>>>> 5 4529011712 15872 6047 9825 3359 1984 >>>>> 6 4525361152 15872 4475 11397 5809 1984 >>>>> 7 4521710592 15872 3182 12690 5844 1984 >>>>> 8 4518060032 15872 5881 9991 5131 1984 >>>>> 9 4236966912 15872 10753 5119 5119 1984 >>>>> ... 
>>>>> 2059651 4299026432 15872 4334 11538 4816 1984 >>>>> 2059652 4087293952 15872 7003 8869 2166 1984 >>>>> 2059653 4295375872 15872 6626 9246 5119 1984 >>>>> 2059654 4288074752 15872 509 15363 9662 1984 >>>>> 2059655 4291725312 15872 6151 9721 5119 1984 >>>>> 2059656 4284424192 15872 10052 5820 5119 1984 >>>>> 2059657 4277123072 15872 7383 8489 5120 1984 >>>>> 2059658 4273472512 15872 14 15858 5655 1984 >>>>> 2059659 4269821952 15872 2637 13235 7060 1984 >>>>> 2059660 4266171392 15872 10758 5114 3674 1984 >>>>> ... >>>>> >>>>> Assuming this would go on forever I stopped debugfs.ocfs2. >>>>> >>>>> With debugs.ocfs2 from ocfs2-tools 1.8.4 I get an identical result. >>>>> >>>>> Please let me know if I can provide any further information and help to >>>>> fix this issue. >>>>> >>>>> Thanks again + Best regards ... Michael >>>>> >>>>> On 03/24/2016 01:30 AM, Joseph Qi wrote: >>>>>> Hi Michael, >>>>>> Could you please use debugfs to check the output? >>>>>> # debugfs.ocfs2 -R 'stat //global_bitmap' <device> >>>>>> >>>>>> Thanks, >>>>>> Joseph >>>>>> >>>>>> On 2016/3/24 6:38, Michael Ulbrich wrote: >>>>>>> Hi ocfs2-users, >>>>>>> >>>>>>> my first post to this list from yesterday probably didn't get through. >>>>>>> >>>>>>> Anyway, I've made some progress in the meantime and may now ask more >>>>>>> specific questions ... >>>>>>> >>>>>>> I'm having issues with an 11 TB ocfs2 shared filesystem on Debian Wheezy: >>>>>>> >>>>>>> Linux s1a 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux >>>>>>> >>>>>>> the kernel modules are: >>>>>>> >>>>>>> modinfo ocfs2 -> version: 1.5.0 >>>>>>> >>>>>>> using stock ocfs2-tools 1.6.4-1+deb7u1 from the distri. >>>>>>> >>>>>>> As an alternative I cloned and built the latest ocfs2-tools from >>>>>>> markfasheh's ocfs2-tools on github which should be version 1.8.4. >>>>>>> >>>>>>> The filesystem runs on top of drbd, is used to roughly 40 % and suffers >>>>>>> from read-only remounts and hanging clients since the last reboot. This >>>>>>> may be DLM problems but I suspect they stem from some corrupt disk >>>>>>> structures. Before that it all ran stable for months. >>>>>>> >>>>>>> This situation made me want to run fsck.ocfs2 and now I wonder how to do >>>>>>> that. The filesystem is not mounted. >>>>>>> >>>>>>> With the stock ocfs-tools 1.6.4: >>>>>>> >>>>>>> root at s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1 >>>>>>> fsck.ocfs2 1.6.4 >>>>>>> Checking OCFS2 filesystem in /dev/drbd1: >>>>>>> Label: ocfs2_ASSET >>>>>>> UUID: 6A1A0189A3F94E32B6B9A526DF9060F3 >>>>>>> Number of blocks: 5557283182 >>>>>>> Block size: 2048 >>>>>>> Number of clusters: 2778641591 >>>>>>> Cluster size: 4096 >>>>>>> Number of slots: 16 >>>>>>> >>>>>>> I'm checking fsck_drbd1.log and find that it is making progress in >>>>>>> >>>>>>> Pass 0a: Checking cluster allocation chains >>>>>>> >>>>>>> until it reaches "chain 73" and goes into an infinite loop filling the >>>>>>> logfile with breathtaking speed. 
>>>>>>> >>>>>>> With the newly built ocfs-tools 1.8.4 I get: >>>>>>> >>>>>>> root at s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1 >>>>>>> fsck.ocfs2 1.8.4 >>>>>>> Checking OCFS2 filesystem in /dev/drbd1: >>>>>>> Label: ocfs2_ASSET >>>>>>> UUID: 6A1A0189A3F94E32B6B9A526DF9060F3 >>>>>>> Number of blocks: 5557283182 >>>>>>> Block size: 2048 >>>>>>> Number of clusters: 2778641591 >>>>>>> Cluster size: 4096 >>>>>>> Number of slots: 16 >>>>>>> >>>>>>> Again watching the verbose output in fsck_drbd1.log I find that this >>>>>>> time it proceeds up to >>>>>>> >>>>>>> Pass 0a: Checking cluster allocation chains >>>>>>> o2fsck_pass0:1360 | found inode alloc 13 at block 13 >>>>>>> >>>>>>> and stays there without any further progress. I've terminated this >>>>>>> process after waiting for more than an hour. >>>>>>> >>>>>>> Now - I'm lost somehow ... and would very much appreciate if anybody on >>>>>>> this list would share his knowledge and give me a hint what to do next. >>>>>>> >>>>>>> What could be done to get this file system checked and repaired? Am I >>>>>>> missing something important or do I just have to wait a little bit >>>>>>> longer? Is there a version of ocfs2-tools / fsck.ocfs2 which will >>>>>>> perform as expected? >>>>>>> >>>>>>> I'm prepared to upgrade the kernel to 3.16.0-0.bpo.4-amd64 but shy away >>>>>>> from taking that risk without any clue of whether that might solve my >>>>>>> problem ... >>>>>>> >>>>>>> Thanks in advance ... Michael Ulbrich >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Ocfs2-users mailing list >>>>>>> Ocfs2-users at oss.oracle.com >>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Ocfs2-users mailing list >>>>>> Ocfs2-users at oss.oracle.com >>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Ocfs2-users mailing list >>>>> Ocfs2-users at oss.oracle.com >>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-users >>>>> >>>>> . >>>>> >>>> >>>> >>> >>> . >>> >> >> > > . >
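For the edit step Joseph outlines (find the group that should really follow
record #153 and repoint block 4090944512 at it), a sketch of the "dd out, edit,
dd in" cycle on the gd_153.bin copy dumped above. It assumes the on-disk struct
ocfs2_group_desc layout from ocfs2_fs.h, where bg_next_group is the
little-endian 64-bit field at byte offset 24 (0x18) of the descriptor, and it
assumes the zeroed CRC32/ECC values in the stat output mean metaecc is off, so
no block checksum needs recomputing. The NEXT value below is only a
hypothetical placeholder; the correct block number would have to be determined
first, e.g. from a pre-crash o2image. Do this only on an unmounted filesystem
and only with a verified backup:

  # inspect the current next-group pointer in the dumped copy (8 bytes at 0x18)
  hexdump -C -s 0x18 -n 8 gd_153.bin

  # patch the dumped copy: write NEXT as a little-endian u64 at offset 24
  NEXT=123456789   # hypothetical placeholder -- substitute the verified block number
  python3 -c 'import struct,sys; sys.stdout.buffer.write(struct.pack("<Q", int(sys.argv[1])))' "$NEXT" \
      | dd of=gd_153.bin bs=1 seek=24 count=8 conv=notrunc

  # verify, then write the single 2K block back where it came from
  hexdump -C -s 0x18 -n 8 gd_153.bin
  dd if=gd_153.bin of=/dev/drbd1 bs=2048 count=1 seek=4090944512 conv=notrunc

The fsck.ocfs2 change Joseph found, commit efca4b0f2241 ("Break a chain loop in
group desc"), is meant to let fsck break such a cycle on its own; whether the
1.8.4 build used here already contains it, and why pass 0a still stalls, would
have to be checked against the ocfs2-tools git history.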