Hi Ted and all,

I have a couple of questions near the end of this message, but first I have
to describe my problem in some detail.

The power failure on Thursday did something evil to my ext3 file system
(box running RH9+patches, ext3, /dev/md0, raid5 driver, 400GB f/s using
3x200GB IDE drives and one hot-spare).  The f/s got badly corrupted, and
the symptoms are very similar to what Eddy described here:

    https://www.redhat.com/archives/ext3-users/2003-July/msg00015.html

That is, nearly everything I try results in an error such as

    "Invalid argument while checking ext3 journal for /dev/md0"

Ted answered here:

    https://www.redhat.com/archives/ext3-users/2003-July/msg00035.html

and suggested the last-ditch approach of using mke2fs -S to reinitialize the
superblock and group descriptors.  After trying all sorts of "safe" methods
to recover the files, I tried the -S option as follows:

------------------------------------------------------------------------------
# mke2fs -j -b 4096 -S /dev/md0
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
49790976 inodes, 99570816 blocks
4978540 blocks (5.00%) reserved for the super user
First data block=0
3039 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
        78675968

Creating journal (8192 blocks): mke2fs: File exists
        while trying to create journal
------------------------------------------------------------------------------

Note that the above command ran too fast for me.  It felt as if it didn't
actually write any info to the f/s.  Indeed, I next ran this command:

------------------------------------------------------------------------------
# e2fsck -b 98304 -B 4096 /dev/md0
e2fsck 1.32 (09-Nov-2002)
e2fsck: Invalid argument while checking ext3 journal for /dev/md0
------------------------------------------------------------------------------

And once again got this error wrt the journal.

Note that before I even tried this -S procedure, I tried to simply turn off
the has_journal bit using tune2fs: it didn't help.  (I'm willing to lose the
info in the journal, as long as I can get back the rest of my large f/s.)
But tune2fs and friends gave me a chicken-and-egg error about the invalid
argument wrt the journal while trying to turn it off (duhh).

At this point I began to suspect that there was something awfully wrong
with the journal inode, and that maybe, just maybe, my superblocks and
group descriptors were still intact.  Next, I tried to reinitialize the
superblocks and group descriptors WITHOUT a journal (i.e., telling mke2fs
to make a plain ext2 f/s):

------------------------------------------------------------------------------
# mke2fs -b 4096 -S /dev/md0
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
49790976 inodes, 99570816 blocks
4978540 blocks (5.00%) reserved for the super user
First data block=0
3039 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
        78675968

Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 34 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
------------------------------------------------------------------------------
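(Aside, for anyone reading this in the archives: if all you want is the
list of backup superblock locations, without writing a single byte to the
device, mke2fs has a dry-run mode.  A minimal sketch, assuming your mke2fs
supports the -n flag and using the same block size as above:

    # -n = dry run: print what mke2fs would do, including the backup
    # superblock locations, without actually writing to the device
    mke2fs -n -b 4096 /dev/md0

The printed backup locations are the ones you can then feed to
"e2fsck -b".)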
Bingo.  This time I got no error, and the command took a couple of seconds
longer, indicating to me that it actually did write something to the disk
(or maybe it wrote more than when I tried "-S -j").

Now I was able to start "e2fsck -b 71663616 -B 4096 /dev/md0".  It's been
running for a couple of hours already.  Of course, it's discovering all
sorts of wonderful new events and spewing messages I've never even seen
before.  1/2 a :-)

Anyway, my hypothesis now is that the f/s in question may have just had a
really, really bad journal inode on it that was preventing anything else
from happening, and that perhaps I shouldn't have tried "mke2fs -S" above
had I been able to just nuke the pesky journal (that might have prevented
the further corruption that fsck is now "fixing").

The good news is that prior to experimentation, I made a dd backup of
/dev/md0 (400GB) onto a file on another file server (1.5T), so I can dd it
back onto my real /dev/md0 if need be.  Alternatively, I can make a second
copy of that backup file, use losetup on the second copy, and then
experiment.

Questions:

1. Is there any reason why I couldn't experiment with e2fsprogs binaries on
   a f/s dd image mounted over /dev/loopN?  I.e., will it behave the same
   as a disk device as far as e2fsprogs are concerned?  (A concrete sketch
   of what I have in mind is in the P.S. below.)

2. If my assertion is correct that most of my f/s is intact but the journal
   is FUBAR, I need to find a way to force fsck to ignore the journal no
   matter what.  Is there such a tool, or an option to some tool?  Is there
   a way I could simply scan the disk and truncate the journal file, or
   turn off the has_journal bit w/o touching the rest of the f/s?

Any suggestions are welcome.

Thanks,
Erez.
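P.S. Here's roughly the loop-device experiment I have in mind.  A sketch
only: the paths are made up, and the backup-superblock number is one of the
ones mke2fs printed above:

    # work on a second copy, so the pristine dd backup stays untouched
    cp /backup/md0.img /backup/md0-scratch.img

    # attach the copy to a loop device
    losetup /dev/loop0 /backup/md0-scratch.img

    # let e2fsck loose on the loop device, using a backup superblock
    e2fsck -b 32768 -B 4096 /dev/loop0

    # detach when done
    losetup -d /dev/loop0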
On Mon, Aug 18, 2003 at 12:39:46PM -0400, Erez Zadok wrote:
> The power failure on Thursday did something evil to my ext3 file system
> (box running RH9+patches, ext3, /dev/md0, raid5 driver, 400GB f/s using
> 3x200GB IDE drives and one hot-spare).  The f/s got badly corrupted, and
> the symptoms are very similar to what Eddy described here:
>
>     https://www.redhat.com/archives/ext3-users/2003-July/msg00015.html
>
> That is, nearly everything I try results in an error such as
>
>     "Invalid argument while checking ext3 journal for /dev/md0"

What probably happened is that the power failed while you were writing out
the inode table, and the memory failed before the DMA engine and hard drive
did, since DRAM tends to be more sensitive to voltage drops than other
parts of the system.  As a result, random garbage got scribbled all over
the disk.  (Ted's observation: PC-class hardware is sh*t.)

Normally, this isn't a problem, since the ext3 journal contains full
backups of recently written data blocks.  (As opposed to filesystems that
use soft updates, or logically journaled filesystems, which are even more
fragile in the face of cheap hardware that scribbles random garbage on
power failure.)  However, this is not true when the first part of the inode
table is scribbled upon, such that the journal inode cannot be found.

Given that this sort of failure has been reported at least two or three
times now, it's clear we need to address this vulnerability, probably by
keeping a backup copy of the journal inode (or at least the journal data
blocks) in the superblock, so it can survive this particular lossage mode.

> Ted answered here:
>
>     https://www.redhat.com/archives/ext3-users/2003-July/msg00035.html
>
> and suggested the last-ditch approach of using mke2fs -S to reinitialize
> the superblock and group descriptors.  After trying all sorts of "safe"
> methods to recover the files, I tried the -S option as follows:
>
> # mke2fs -j -b 4096 -S /dev/md0
....
> Creating journal (8192 blocks): mke2fs: File exists
>         while trying to create journal
> ------------------------------------------------------------------------------

Yeah, what happened here is that the -S option does not clear the inode
table.  So when it tried to create the journal inode, it found that there
was something there already (but probably garbage) and then bombed out.

> And once again got this error wrt the journal.  Note that before I even
> tried this -S procedure, I tried to simply turn off the has_journal bit
> using tune2fs: it didn't help.  (I'm willing to lose the info in the
> journal, as long as I can get back the rest of my large f/s.)  But
> tune2fs and friends gave me a chicken-and-egg error about the invalid
> argument wrt the journal while trying to turn it off (duhh).

You could have turned it off using debugfs, but up until now it's not
something that I've encouraged, because of concerns that there might be
real data loss if it were too easy for users to disable the journal.
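(For reference, the usual tune2fs way to drop the journal -- presumably
the command you ran -- is the sketch below.  It runs into the
chicken-and-egg problem because tune2fs must first open and validate the
very journal it is being asked to remove:

    # drop the has_journal feature; on a healthy f/s this discards the journal
    tune2fs -O ^has_journal /dev/md0

With a garbaged journal inode, that validation step is exactly what dies
with "Invalid argument".)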
> Now I was able to start "e2fsck -b 71663616 -B 4096 /dev/md0".  It's been
> running for a couple of hours already.  Of course, it's discovering all
> sorts of wonderful new events and spewing messages I've never even seen
> before.  1/2 a :-)

Yup.  Some of the damage was caused by not replaying the journal before
running e2fsck, and some was probably done by the power failure causing
garbage to be scribbled on the disk.

> Anyway, my hypothesis now is that the f/s in question may have just had a
> really, really bad journal inode on it that was preventing anything else
> from happening, and that perhaps I shouldn't have tried "mke2fs -S" above
> had I been able to just nuke the pesky journal (that might have prevented
> the further corruption that fsck is now "fixing").

Your hypothesis was right.  Whether you nuked the journal by using debugfs
or by using mke2fs -S probably wouldn't have made any difference, however.

> The good news is that prior to experimentation, I made a dd backup of
> /dev/md0 (400GB) onto a file on another file server (1.5T), so I can dd
> it back onto my real /dev/md0 if need be.  Alternatively, I can make a
> second copy of that backup file, use losetup on the second copy, and then
> experiment.
>
> Questions:
>
> 1. Is there any reason why I couldn't experiment with e2fsprogs binaries
>    on a f/s dd image mounted over /dev/loopN?  I.e., will it behave the
>    same as a disk device as far as e2fsprogs are concerned?

No reason.  The e2fsprogs binaries don't need to operate on a block device.
You can just point them at a dd image directly.

> 2. If my assertion is correct that most of my f/s is intact but the
>    journal is FUBAR, I need to find a way to force fsck to ignore the
>    journal no matter what.  Is there such a tool, or an option to some
>    tool?  Is there a way I could simply scan the disk and truncate the
>    journal file, or turn off the has_journal bit w/o touching the rest of
>    the f/s?

You can use debugfs's features command to turn off the has_journal bit as
follows:

    debugfs -w /dev/hdaXX
    debugfs:  features ^has_journal
    debugfs:  features ^needs_recovery
    debugfs:  quit

Hmm.... this will work unless the group descriptors are so badly damaged
that debugfs refuses to touch the superblock.  You can open the filesystem
in catastrophic mode, but right now, as a safety precaution, you're not
allowed to open it in read/write mode while in catastrophic mode.  I could
remove this restriction if we add some more safety checks to prevent
debugfs from doing more damage when opened in read/write catastrophic
mode; at the moment, debugfs has been written with a "first do no harm"
principle.
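To make that concrete, a sketch (flag behavior as of the e2fsprogs of this
era; device name taken from this thread):

    # -c = catastrophic mode: skip reading the group descriptors and
    # bitmaps, so even a badly damaged f/s can be opened -- read-only
    debugfs -c /dev/md0

    # combining it with -w (read/write) is what the safety check refuses
    debugfs -w -c /dev/md0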
Ultimately, though, it's probably more important to add a backup copy of
the journal inode, to avoid needing to play games like this in the first
place, and to allow e2fsck to recover from these situations automatically.

						- Ted

In message <20030821190811.GC1040@matchmail.com>, Mike Fedyk writes:
> On Thu, Aug 21, 2003 at 12:54:37PM -0400, Erez Zadok wrote:
> > Cool.  Is there an ext3 patch that'll support the backup inode?  Are
> > there any issues of moving b/t the two modes of ext3 (back/forw
> > compatibility, etc.)?
>
> There's no need to support it in the kernel.  The inode number is kept in
> the superblock, and that's updated at mkfs and tune2fs time, not from the
> kernel.
>
> Also, there isn't a second inode; it's just that the inode number is
> being kept in the superblocks too.

How does the kernel know to write the journal data first to some data block
belonging to inode X, and then to another data block of inode Y?  Both X
and Y are journal inodes, right?  Will there be a reserved inum other than
8 for the backup journal?  Is there some magic by which the kernel can
identify any number of special journal inodes?

And while we're at it, why only one backup journal inode?  Why not several?
If it's good enough to have several copies of the superblocks etc., then
why not the journal (for those willing to pay the performance penalty)?

Erez.
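P.S. To ground my first question: if I understand Mike correctly, the
journal's inode number is already recorded in the superblock and is visible
via dumpe2fs.  A sketch, with the normal value shown as a comment:

    # -h prints just the superblock summary, without walking the groups
    dumpe2fs -h /dev/md0 | grep -i journal
    # Journal inode:            8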