Keith Keller
2014-Jun-02 02:43 UTC
Re: [long] major problems on fs; e2fsck running out of memory
Hi Bodo and Ted,

Thank you both for your responses; they confirm what I thought might be the case. Knowing that, I can try to proceed with your suggestions. I do have some follow-up questions for you:

On Sun, Jun 01, 2014 at 09:05:09PM -0400, Theodore Ts'o wrote:
> Unfortunately, there has been a huge number of bug fixes for ext4's
> online resize since 2.6.32 and 1.42.11. It's quite possible that you
> hit one of them.

Would this scenario be explained by these bugs? I'd expect that if a resize2fs failed, it would report a problem pretty quickly. (But perhaps that's the nature of some of these bugs.)

> Well, actually it's not quite that simple. There are multiple passes
> to e2fsck, and the first pass is estimated to be 70% of the total
> e2fsck run. So 51.8% reported by the progress means e2fsck had gotten
> 74% of the way through pass 1. So that would mean that it had got
> through about inodes associated to about 3.9TB into the file system.

Aha! Thanks for the clarification. That's certainly well beyond the original fs size.

> That being said, it's pretty clear that portions of the inode table
> and block group descriptor was badly corrupted. So I suspect there
> isn't going to be much that can be done to try to repair the file
> system completely. If there are specific files you need to recover,
> I'd suggest trying to recover them first before trying to do anything
> else. The good news is that probably around 75% of your files can
> probably be recovered.

So, now when I try to mount, I get an error:

# mount -o ro -t ext4 /dev/mapper/vg1--sdb-lv_vz /vz/
mount: Stale NFS file handle

That's clearly a spurious error, so I checked dmesg:

# dmesg|tail
[159891.219387] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42252 failed (36703!=0)
[159891.219586] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42253 failed (51517!=0)
[159891.219786] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42254 failed (51954!=0)
[159891.220025] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42496 failed (37296!=0)
[159891.220225] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42497 failed (31921!=0)
[159891.220451] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42498 failed (2993!=0)
[159891.220650] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42499 failed (59056!=0)
[159891.220850] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42500 failed (28571!=22299)
[159891.225762] EXT4-fs (dm-0): get root inode failed
[159891.227436] EXT4-fs (dm-0): mount failed

and before that there are many other "checksum failed" errors. When I try a rw mount I get these messages instead:

[160052.031554] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 0 failed (43864!=0)
[160052.031782] EXT4-fs (dm-0): group descriptors corrupted!

Are there any other options I can try to force the mount so I can get to the changed files? If that'll be challenging, I'll just sacrifice those files, but if it'd be relatively straightforward I'd like to make the attempt.

Thanks again!

--keith

-- 
kkeller@wombat.san-francisco.ca.us
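With many "checksum failed" lines scrolling past, it can help to tally them to see how much of the descriptor table is affected. The following is a minimal sketch, not part of e2fsprogs: it assumes the exact message format shown in the dmesg output above, and the helper names `summarize_checksum_failures` and `group_ranges` are hypothetical. It deliberately reports the two numbers in parentheses only as "first" and "second" values rather than asserting which one the kernel computed and which one it read from disk.

```python
import re
import sys
from typing import Dict, Iterable, List, Tuple

# Matches kernel lines such as:
# [159891.219387] EXT4-fs (dm-0): ext4_check_descriptors: Checksum for group 42252 failed (36703!=0)
CSUM_RE = re.compile(
    r"EXT4-fs \([^)]+\): ext4_check_descriptors: "
    r"Checksum for group (?P<group>\d+) failed \((?P<first>\d+)!=(?P<second>\d+)\)"
)


def group_ranges(groups: List[int]) -> List[Tuple[int, int]]:
    """Collapse a sorted list of group numbers into inclusive (start, end) runs."""
    runs: List[Tuple[int, int]] = []
    for g in groups:
        if runs and g == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], g)
        else:
            runs.append((g, g))
    return runs


def summarize_checksum_failures(lines: Iterable[str]) -> dict:
    """Tally failed group-descriptor checksums found in kernel log lines."""
    # group number -> second value in "(first!=second)"; a repeated group keeps the last value
    failures: Dict[int, int] = {}
    for line in lines:
        m = CSUM_RE.search(line)
        if m:
            failures[int(m.group("group"))] = int(m.group("second"))
    ordered = sorted(failures)
    return {
        "failed_groups": len(ordered),
        "includes_group_0": 0 in failures,
        "second_value_zero": sum(1 for v in failures.values() if v == 0),
        "runs": group_ranges(ordered),
    }


if __name__ == "__main__":
    # Usage: dmesg | python3 summarize_csum.py
    print(summarize_checksum_failures(sys.stdin))
```

Fed the ten dmesg lines quoted above, this would report eight failed groups in two runs (42252-42254 and 42496-42500), seven of them with 0 as the second value. Grouping into runs makes it easier to see whether the damage is scattered or confined to a contiguous region, such as the groups added by a resize.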
Keith Keller
2014-Jun-02 02:56 UTC
Re: [long] major problems on fs; e2fsck running out of memory
Hi again all,

I apologize for not asking this in my first message; I just remembered the question after sending.

On Sun, Jun 01, 2014 at 07:43:12PM -0700, Keith Keller wrote:
> On Sun, Jun 01, 2014 at 09:05:09PM -0400, Theodore Ts'o wrote:
> > Unfortunately, there has been a huge number of bug fixes for ext4's
> > online resize since 2.6.32 and 1.42.11. It's quite possible that you
> > hit one of them.
>
> Would this scenario be explained by these bugs? I'd expect that if a
> resize2fs failed, it would report a problem pretty quickly. (But
> perhaps that's the nature of some of these bugs.)

I have a very similar second server which has undergone a similar chain of events: an initial ~2.5TB fs followed by a resize later. I believe that it has been fsck'd since the resize (but don't quote me on that). Am I likely to run into this issue with this fs? And if I do, what should I do differently (e.g., use the latest e2fsck right away; don't e2fsck, get files off quickly, and mke2fs; something else)?

--keith

-- 
kkeller@wombat.san-francisco.ca.us
Theodore Ts'o
2014-Jun-02 03:24 UTC
Re: [long] major problems on fs; e2fsck running out of memory
On Sun, Jun 01, 2014 at 07:43:12PM -0700, Keith Keller wrote:
>
> That's clearly a spurious error, so I checked dmesg:
>
> [159891.225762] EXT4-fs (dm-0): get root inode failed
> [159891.227436] EXT4-fs (dm-0): mount failed

The "get root inode failed" is rather unfortunate.

Try running "debugfs /dev/dm-0" and then use the "stat /" command.

You can use debugfs to look at the file system and recover individual files without needing to mount it. However, if the root directory has been compromised, that makes using debugfs quite a bit more difficult. You can look at inodes by inode number by surrounding them with angle brackets. For example, if you want to look at inode 12345, you could say "stat <12345>", and if inode 12345 is a directory, you could list it via "ls <12345>", etc. See the debugfs man page for more details.

- Ted
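The angle-bracket inode syntax also works non-interactively, which is handy when poking at many inode numbers. This is a minimal sketch, assuming debugfs from e2fsprogs is on the PATH and the script runs as root: it relies on debugfs's `-R request` option (run a single request, then exit) and on debugfs opening the device read-only unless `-w` is given. The helper names `inode_spec`, `debugfs_argv` and `debugfs_request` are hypothetical, and `/dev/dm-0` is simply the device from this thread.

```python
import subprocess
from typing import List


def inode_spec(ino: int) -> str:
    """debugfs refers to an inode by number when it is wrapped in angle brackets."""
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    return f"<{ino}>"


def debugfs_argv(device: str, request: str) -> List[str]:
    """Build a one-shot debugfs command line; without -w the device is opened read-only."""
    return ["debugfs", "-R", request, device]


def debugfs_request(device: str, request: str) -> str:
    """Run a single debugfs request and return its stdout and stderr combined."""
    result = subprocess.run(
        debugfs_argv(device, request),
        capture_output=True,
        text=True,
        check=False,  # debugfs errors are reported in the returned text, not raised
    )
    return result.stdout + result.stderr


if __name__ == "__main__":
    dev = "/dev/dm-0"  # device from this thread; adjust for your system
    print(debugfs_request(dev, f"stat {inode_spec(12345)}"))
    print(debugfs_request(dev, f"ls {inode_spec(12345)}"))
```

Keeping the argument list separate from the call makes it easy to log or review exactly what will be run against a damaged device before running it.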
Keith Keller
2014-Jun-02 03:54 UTC
Re: [long] major problems on fs; e2fsck running out of memory
On Sun, Jun 01, 2014 at 11:24:51PM -0400, Theodore Ts'o wrote:
>
> The "get root inode failed" is rather unfortunate.

Heh, I like your understatement. :)

I think this helps answer part of the questions in my second email: I should probably try to preserve changes made since the last backup before getting too deep into a tricky e2fsck. At one point the fs was still mountable, so I could have tried to copy files off first. (In a physical failure scenario it's exactly what I'd have done, but I wasn't thinking of that in this case.)

> Try running "debugfs /dev/dm-0"
>
> and then use the "stat /" command.

No happiness:

# ./e2fsprogs-1.42.10/debugfs/debugfs /dev/dm-0
debugfs 1.42.10 (18-May-2014)
debugfs:  stat /
stat: A block group is missing an inode table while reading inode 2

My hunch is that it would take a large and lucky effort to get anything useful off this fs. Does that seem like a reasonable guess?

--keith

-- 
kkeller@wombat.san-francisco.ca.us
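The failing lookup is for inode 2, which ext2/3/4 reserve for the root directory. Inode numbers start at 1 and are handed out to block groups in fixed-size chunks, so the group that holds a given inode follows directly from the superblock's "Inodes per group" value (printed by `dumpe2fs -h`). The sketch below uses a hypothetical helper `inode_location`, and the 8192 in the example is an assumed value, not one taken from this file system. Whatever the real value, inode 2 always falls in block group 0, which is consistent with both the "missing an inode table" error here and the "Checksum for group 0 failed" message from the rw mount attempt.

```python
from typing import Tuple

ROOT_INO = 2  # ext2/3/4 reserve inode 2 for the root directory


def inode_location(ino: int, inodes_per_group: int) -> Tuple[int, int]:
    """Return (block_group, index_within_group) for a 1-based inode number."""
    if ino < 1 or inodes_per_group < 1:
        raise ValueError("inode numbers and inodes_per_group must be positive")
    # Inode numbering is 1-based, so shift to 0-based before dividing.
    return divmod(ino - 1, inodes_per_group)


if __name__ == "__main__":
    ipg = 8192  # assumed; read the real value from `dumpe2fs -h` ("Inodes per group")
    group, index = inode_location(ROOT_INO, ipg)
    print(f"inode {ROOT_INO} is entry {index} of block group {group}")
```

The same arithmetic can be used the other way around: given the list of groups with bad descriptors, it tells you which inode-number ranges are likely unreachable through debugfs.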
Eric Sandeen
2014-Jun-02 15:51 UTC
Re: [long] major problems on fs; e2fsck running out of memory
On 6/1/14, 9:43 PM, Keith Keller wrote:
> Hi Bodo and Ted,
>
> Thank you both for your responses; they confirm what I thought might be
> the case. Knowing that I can try to proceed with your suggestions. I
> do have some followup questions for you:
>
>
> On Sun, Jun 01, 2014 at 09:05:09PM -0400, Theodore Ts'o wrote:
>> Unfortunately, there has been a huge number of bug fixes for ext4's
>> online resize since 2.6.32 and 1.42.11. It's quite possible that you
>> hit one of them.
>
> Would this scenario be explained by these bugs? I'd expect that if a
> resize2fs failed, it would report a problem pretty quickly. (But
> perhaps that's the nature of some of these bugs.)

Well, for what it's worth, there have been several resize fixes shipped in RHEL6/CentOS6, so it's not just vanilla 1.42.11 or 2.6.32. But we walk a fine line between too much churn and risk on one side and fixing the serious problems on the other, so it's possible that you hit an unfixed case. I think it's fairly hard to know without a reproducer.

Your corruption looks bad enough that I tend to agree with Bodo that it may be some more fundamental underlying storage problem.

However, some semi-recent fixes, for example:

  resize2fs: reserve all metadata blocks for flex_bg file systems

have yet to make it into RHEL6 (they will soon...)

-Eric
Bodo Thiesen
2014-Jun-02 20:52 UTC
Re: [long] major problems on fs; e2fsck running out of memory
* Keith Keller <kkeller@wombat.san-francisco.ca.us> wrote:

Hi Keith

> I have a very similar second server which has undergone a similar chain
> of events, an initial ~2.5tb fs followed by a resize later. I believe
> that it has been fsck'd since the resize (but don't quote me on that).
> Am I likely to run into this issue with this fs? And if I do, what
> steps should I do differently (e.g., use the latest e2fsck right away;
> don't e2fsck, get files off quickly, and mke2fs; something else)?

umount, and then run e2fsck -f -n -C 0 (the -C 0 is only for the progress bar).

If it reports the fs to be clean, you should be safe. A handful of errors like "b_size wrong" or "deleted inode has zero dtime" and the like, in low numbers, is OK; to be sure, you might want to post that output here and ask before removing the -n to fix those errors.

If it reports tons of errors, or includes invalid blocks or checksum errors: mount -o ro, back up everything, and then mke2fs.

Regards, Bodo
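Bodo's read-only check can be scripted so the result is interpreted consistently on each server. The e2fsck man page documents its exit status as a sum of bit flags: 0 no errors, 1 errors corrected, 2 errors corrected and the system should be rebooted, 4 errors left uncorrected, 8 operational error, 16 usage or syntax error, 32 canceled by the user, 128 shared library error. With -n nothing gets corrected, so any problems found show up as 4. This is a minimal sketch; the function names are hypothetical, it has to run as root against an unmounted device, and the exit status is only a first filter: the full output still needs reading (and posting to the list, as Bodo suggests) before dropping -n.

```python
import subprocess
from typing import List

# Exit-status bits as documented in the e2fsck(8) man page.
E2FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(code: int) -> List[str]:
    """Translate an e2fsck exit status into the conditions it reports (empty list = clean)."""
    return [msg for bit, msg in E2FSCK_EXIT_BITS.items() if code & bit]


def readonly_check(device: str) -> int:
    """Run a forced check that changes nothing: e2fsck -f -n -C 0 DEVICE (unmount first)."""
    result = subprocess.run(["e2fsck", "-f", "-n", "-C", "0", device], check=False)
    return result.returncode


if __name__ == "__main__":
    status = readonly_check("/dev/mapper/vg1--sdb-lv_vz")  # example device from this thread
    problems = decode_e2fsck_status(status)
    print("clean" if not problems else "; ".join(problems))
```

Decoding the bits rather than comparing against a single value matters because e2fsck can set several at once, for example 4 together with 8 if it hits an operational error partway through.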