Michael Ulbrich
2016-Mar-24 08:40 UTC
[Ocfs2-users] fsck.ocfs2 loops + hangs but does not check
Hi Joseph,

thanks a lot for your help. It is very much appreciated!

I ran debugfs.ocfs2 from ocfs2-tools 1.6.4 on the mounted file system:

root at s1a:~# debugfs.ocfs2 -R 'stat //global_bitmap' /dev/drbd1 > debugfs_drbd1.log 2>&1

        Inode: 13   Mode: 0644   Generation: 1172963971 (0x45ea0283)
        FS Generation: 1172963971 (0x45ea0283)
        CRC32: 00000000   ECC: 0000
        Type: Regular   Attr: 0x0   Flags: Valid System Allocbitmap Chain
        Dynamic Features: (0x0)
        User: 0 (root)   Group: 0 (root)   Size: 11381315956736
        Links: 1   Clusters: 2778641591
        ctime: 0x54010183 -- Sat Aug 30 00:41:07 2014
        atime: 0x54010183 -- Sat Aug 30 00:41:07 2014
        mtime: 0x54010183 -- Sat Aug 30 00:41:07 2014
        dtime: 0x0 -- Thu Jan  1 01:00:00 1970
        ctime_nsec: 0x00000000 -- 0
        atime_nsec: 0x00000000 -- 0
        mtime_nsec: 0x00000000 -- 0
        Refcount Block: 0
        Last Extblk: 0   Orphan Slot: 0
        Sub Alloc Slot: Global   Sub Alloc Bit: 7
        Bitmap Total: 2778641591   Used: 1083108631   Free: 1695532960
        Clusters per Group: 15872   Bits per Cluster: 1
        Count: 115   Next Free Rec: 115
        ##    Total       Used       Free       Block#
        0     24173056    9429318    14743738   4533995520
        1     24173056    9421663    14751393   4548629504
        2     24173056    9432421    14740635   4588817408
        3     24173056    9427533    14745523   4548692992
        4     24173056    9433978    14739078   4508568576
        5     24173056    9436974    14736082   4636369920
        6     24173056    9428411    14744645   4563390464
        7     24173056    9426950    14746106   4479459328
        8     24173056    9428099    14744957   4548851712
        9     24173056    9431794    14741262   4585389056
        ...
        105   24157184    9414241    14742943   4690652160
        106   24157184    9419715    14737469   4467999744
        107   24157184    9411479    14745705   4431525888
        108   24157184    9413235    14743949   4559327232
        109   24157184    9417948    14739236   4500950016
        110   24157184    9411013    14746171   4566691840
        111   24157184    9421252    14735932   4522916864
        112   24157184    9416726    14740458   4537550848
        113   24157184    9415358    14741826   4676303872
        114   24157184    9420448    14736736   4526662656

        Group Chain: 0   Parent Inode: 13   Generation: 1172963971
        CRC32: 00000000   ECC: 0000
        ##      Block#       Total   Used    Free    Contig   Size
        0       4533995520   15872   6339    9533    3987     1984
        1       4530344960   15872   10755   5117    5117     1984
        2       2997109760   15872   10753   5119    5119     1984
        3       4526694400   15872   10753   5119    5119     1984
        4       3022663680   15872   10753   5119    5119     1984
        5       4512092160   15872   9043    6829    2742     1984
        6       4523043840   15872   4948    10924   9612     1984
        7       4519393280   15872   6150    9722    5595     1984
        8       4515742720   15872   4323    11549   6603     1984
        9       3771028480   15872   10753   5119    5119     1984
        ...
        1513    5523297280   15872   1       15871   15871    1984
        1514    5526947840   15872   1       15871   15871    1984
        1515    5530598400   15872   1       15871   15871    1984
        1516    5534248960   15872   1       15871   15871    1984
        1517    5537899520   15872   1       15871   15871    1984
        1518    5541550080   15872   1       15871   15871    1984
        1519    5545200640   15872   1       15871   15871    1984
        1520    5548851200   15872   1       15871   15871    1984
        1521    5552501760   15872   1       15871   15871    1984
        1522    5556152320   15872   1       15871   15871    1984

        Group Chain: 1   Parent Inode: 13   Generation: 1172963971
        CRC32: 00000000   ECC: 0000
        ##      Block#       Total   Used    Free    Contig   Size
        0       4548629504   15872   10755   5117    2496     1984
        1       2993490944   15872   59      15813   14451    1984
        2       2489713664   15872   10758   5114    3726     1984
        3       3117609984   15872   3958    11914   6165     1984
        4       2544472064   15872   10753   5119    5119     1984
        5       3040948224   15872   10753   5119    5119     1984
        6       2971587584   15872   10753   5119    5119     1984
        7       4493871104   15872   8664    7208    3705     1984
        8       4544978944   15872   8711    7161    2919     1984
        9       4417209344   15872   3253    12619   6447     1984
        ...
        1513    5523329024   15872   1       15871   15871    1984
        1514    5526979584   15872   1       15871   15871    1984
        1515    5530630144   15872   1       15871   15871    1984
        1516    5534280704   15872   1       15871   15871    1984
        1517    5537931264   15872   1       15871   15871    1984
        1518    5541581824   15872   1       15871   15871    1984
        1519    5545232384   15872   1       15871   15871    1984
        1520    5548882944   15872   1       15871   15871    1984
        1521    5552533504   15872   1       15871   15871    1984
        1522    5556184064   15872   1       15871   15871    1984

... all following group chains are similarly structured up to #73, which
looks as follows:

        Group Chain: 73   Parent Inode: 13   Generation: 1172963971
        CRC32: 00000000   ECC: 0000
        ##        Block#       Total   Used    Free    Contig   Size
        0         2583263232   15872   5341    10531   5153     1984
        1         4543613952   15872   5329    10543   5119     1984
        2         4532662272   15872   10753   5119    5119     1984
        3         4539963392   15872   3223    12649   7530     1984
        4         4536312832   15872   5219    10653   5534     1984
        5         4529011712   15872   6047    9825    3359     1984
        6         4525361152   15872   4475    11397   5809     1984
        7         4521710592   15872   3182    12690   5844     1984
        8         4518060032   15872   5881    9991    5131     1984
        9         4236966912   15872   10753   5119    5119     1984
        ...
        2059651   4299026432   15872   4334    11538   4816     1984
        2059652   4087293952   15872   7003    8869    2166     1984
        2059653   4295375872   15872   6626    9246    5119     1984
        2059654   4288074752   15872   509     15363   9662     1984
        2059655   4291725312   15872   6151    9721    5119     1984
        2059656   4284424192   15872   10052   5820    5119     1984
        2059657   4277123072   15872   7383    8489    5120     1984
        2059658   4273472512   15872   14      15858   5655     1984
        2059659   4269821952   15872   2637    13235   7060     1984
        2059660   4266171392   15872   10758   5114    3674     1984
        ...

Assuming this would go on forever, I stopped debugfs.ocfs2.

With debugfs.ocfs2 from ocfs2-tools 1.8.4 I get an identical result.

Please let me know if I can provide any further information and help to
fix this issue.

Thanks again + Best regards ... Michael

On 03/24/2016 01:30 AM, Joseph Qi wrote:
> Hi Michael,
> Could you please use debugfs to check the output?
> # debugfs.ocfs2 -R 'stat //global_bitmap' <device>
>
> Thanks,
> Joseph
>
> On 2016/3/24 6:38, Michael Ulbrich wrote:
>> Hi ocfs2-users,
>>
>> my first post to this list from yesterday probably didn't get through.
>>
>> Anyway, I've made some progress in the meantime and may now ask more
>> specific questions ...
>>
>> I'm having issues with an 11 TB ocfs2 shared filesystem on Debian Wheezy:
>>
>> Linux s1a 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
>>
>> The kernel modules are:
>>
>> modinfo ocfs2 -> version: 1.5.0
>>
>> using the stock ocfs2-tools 1.6.4-1+deb7u1 from the distribution.
>>
>> As an alternative I cloned and built the latest ocfs2-tools from
>> markfasheh's ocfs2-tools on github, which should be version 1.8.4.
>>
>> The filesystem runs on top of drbd, is filled to roughly 40 %, and has
>> suffered from read-only remounts and hanging clients since the last
>> reboot. These may be DLM problems, but I suspect they stem from some
>> corrupt disk structures. Before that it all ran stable for months.
>>
>> This situation made me want to run fsck.ocfs2, and now I wonder how to
>> do that. The filesystem is not mounted.
>>
>> With the stock ocfs2-tools 1.6.4:
>>
>> root at s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1
>> fsck.ocfs2 1.6.4
>> Checking OCFS2 filesystem in /dev/drbd1:
>>   Label:              ocfs2_ASSET
>>   UUID:               6A1A0189A3F94E32B6B9A526DF9060F3
>>   Number of blocks:   5557283182
>>   Block size:         2048
>>   Number of clusters: 2778641591
>>   Cluster size:       4096
>>   Number of slots:    16
>>
>> I'm watching fsck_drbd1.log and see that it makes progress in
>>
>> Pass 0a: Checking cluster allocation chains
>>
>> until it reaches "chain 73" and goes into an infinite loop, filling the
>> logfile with breathtaking speed.
>>
>> With the newly built ocfs2-tools 1.8.4 I get:
>>
>> root at s1a:~# fsck.ocfs2 -v -f /dev/drbd1 > fsck_drbd1.log 2>&1
>> fsck.ocfs2 1.8.4
>> Checking OCFS2 filesystem in /dev/drbd1:
>>   Label:              ocfs2_ASSET
>>   UUID:               6A1A0189A3F94E32B6B9A526DF9060F3
>>   Number of blocks:   5557283182
>>   Block size:         2048
>>   Number of clusters: 2778641591
>>   Cluster size:       4096
>>   Number of slots:    16
>>
>> Again watching the verbose output in fsck_drbd1.log, I find that this
>> time it proceeds up to
>>
>> Pass 0a: Checking cluster allocation chains
>> o2fsck_pass0:1360 | found inode alloc 13 at block 13
>>
>> and stays there without any further progress. I terminated this process
>> after waiting for more than an hour.
>>
>> Now I'm somewhat lost ... and would very much appreciate it if anybody
>> on this list would share their knowledge and give me a hint what to do
>> next.
>>
>> What can be done to get this file system checked and repaired? Am I
>> missing something important, or do I just have to wait a little longer?
>> Is there a version of ocfs2-tools / fsck.ocfs2 which will perform as
>> expected?
>>
>> I'm prepared to upgrade the kernel to 3.16.0-0.bpo.4-amd64 but shy away
>> from taking that risk without any clue of whether it might solve my
>> problem ...
>>
>> Thanks in advance ... Michael Ulbrich
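The numbers above are already enough to bound how long each allocation
chain should be. A minimal back-of-the-envelope sketch in Python, using
only values from the debugfs output (the round-robin spread of groups
over chains is an assumption, but the chain totals above happen to
confirm it):

    # How many records should each chain of the global bitmap hold?
    # All inputs are taken from the debugfs.ocfs2 stat output above.
    total_clusters     = 2778641591   # "Clusters"
    clusters_per_group = 15872        # "Clusters per Group"
    chains             = 115          # "Count"

    # One group descriptor per cluster group, the last group partial:
    groups = -(-total_clusters // clusters_per_group)   # ceiling division
    print(groups)                                       # 175066

    # Spread round-robin over the 115 chains:
    base, extra = divmod(groups, chains)
    print(base, extra)   # 1522 36 -> chains 0..35 hold 1523 records,
                         #            chains 36..114 hold 1522 records

This agrees with the totals in the output: 24173056 = 1523 x 15872 for
chains 0-9 and 24157184 = 1522 x 15872 for chains 105-114. As a
cross-check, 5557283182 blocks x 2048 bytes = 2778641591 clusters x 4096
bytes = 11381315956736 bytes, the Size reported above, so the superblock
geometry itself is consistent. Chain 73 should therefore end at record
1521; record numbers above two million can only mean its group
descriptor list loops back on itself, which is what Joseph concludes
below.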
Hi Michael,

It seems that a dead loop occurs in chain 73. You have formatted with a
2K block size and a 4K cluster size, so each chain should have 1523 or
1522 records. But at first glance I cannot figure out which block goes
wrong, because the output you pasted indicates that all blocks are
different. So I suggest you investigate all the blocks which belong to
chain 73 and try to find out whether there is a loop there.

BTW, have you backed up the metadata using o2image?

Thanks,
Joseph

On 2016/3/24 16:40, Michael Ulbrich wrote:
> [...]
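A practical way to act on Joseph's suggestion is to walk chain 73's
group descriptors while remembering every block number already visited;
the first repeat marks where the chain bites its own tail. The Python
sketch below is hypothetical: read_next_group() stands for whatever
mechanism reads the bg_next_group field of the descriptor at a given
block, for example a small wrapper around debugfs.ocfs2's
'group <block#>' command, assuming its descriptor dump exposes the
next-group pointer.

    # Sketch: find the loop in one allocation chain by walking its group
    # descriptors and stopping at the first repeated block number.
    # read_next_group() is a hypothetical helper, not part of
    # ocfs2-tools: it must return the bg_next_group field of the group
    # descriptor at block `blkno` (0 terminates a healthy chain).

    def find_loop(first_blkno, read_next_group):
        seen = {}                    # block number -> record index
        blkno, idx = first_blkno, 0
        while blkno != 0:
            if blkno in seen:
                # Record idx - 1 points back at record seen[blkno].
                return seen[blkno], idx
            seen[blkno] = idx
            blkno = read_next_group(blkno)
            idx += 1
        return None                  # chain terminates normally, no loop

    # For chain 73, first_blkno would be 2583263232 (record 0 above).
    # With ~1522 legitimate records this needs only a few thousand
    # descriptor reads, instead of the millions debugfs printed before
    # it was stopped.

And as Joseph says, back up the metadata first; something along the
lines of 'o2image /dev/drbd1 /root/drbd1.o2i' captures the filesystem
metadata (not file contents) so the damaged state can be examined
offline later.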