Hi Ted and all,

I have a couple of questions near the end of this message, but first I have
to describe my problem in some detail.

The power failure on Thursday did something evil to my ext3 file system
(box running RH9+patches, ext3, /dev/md0, raid5 driver, 400GB f/s using
3x200GB IDE drives and one hot-spare).  The f/s got badly corrupted, and
the symptoms are very similar to what Eddy described here:

    https://www.redhat.com/archives/ext3-users/2003-July/msg00015.html

That is, nearly everything I try results in an error such as

    "Invalid argument while checking ext3 journal for /dev/md0"

Ted answered here:

    https://www.redhat.com/archives/ext3-users/2003-July/msg00035.html

and suggested the last-ditch approach of using mke2fs -S to reinitialize the
superblock and group descriptors.  After trying all sorts of "safe" methods
to recover the files, I tried the -S option as follows:

------------------------------------------------------------------------------
# mke2fs -j -b 4096 -S /dev/md0
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
49790976 inodes, 99570816 blocks
4978540 blocks (5.00%) reserved for the super user
First data block=0
3039 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
        78675968

Creating journal (8192 blocks): mke2fs: File exists
        while trying to create journal
------------------------------------------------------------------------------

Note that the above command ran too fast for me.  It felt as if it didn't
actually write any info to the f/s.  Indeed, I next ran this command:

------------------------------------------------------------------------------
# e2fsck -b 98304 -B 4096 /dev/md0
e2fsck 1.32 (09-Nov-2002)
e2fsck: Invalid argument while checking ext3 journal for /dev/md0
------------------------------------------------------------------------------

And once again got this error wrt the journal.

Note that before I even tried this -S procedure, I tried to simply turn off
the has_journal bit using tune2fs: it didn't help.  (I'm willing to lose the
info in the journal, as long as I can get back the rest of my large f/s.)
But tune2fs and friends gave me a chicken-and-egg error about the invalid
argument wrt the journal while trying to turn it off (duhh).

At this point I began to suspect that there was something awfully wrong
with the journal inode, and that maybe, just maybe, my superblocks and
group descriptors were still intact.  Next, I tried to reinitialize the
superblocks and group descriptors WITHOUT a journal (i.e., telling mke2fs
to make a plain ext2 f/s):

------------------------------------------------------------------------------
# mke2fs -b 4096 -S /dev/md0
mke2fs 1.32 (09-Nov-2002)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
49790976 inodes, 99570816 blocks
4978540 blocks (5.00%) reserved for the super user
First data block=0
3039 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
        78675968

Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 34 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
------------------------------------------------------------------------------
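(Aside, for anyone reading this in the archives: if all you want is the
list of backup superblock locations, without writing a single byte to the
device, mke2fs has a dry-run mode.  A minimal sketch, assuming your mke2fs
supports the -n flag and using the same block size as above:

    # -n = dry run: print what mke2fs would do, including the backup
    # superblock locations, without actually writing to the device
    mke2fs -n -b 4096 /dev/md0

The printed backup locations are the ones you can then feed to
"e2fsck -b".)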
Bingo.  This time I got no error, and the command took a couple of seconds
longer, indicating to me that it actually did write something to the disk
(or maybe it wrote more than when I tried "-S -j").

Now I was able to start "e2fsck -b 71663616 -B 4096 /dev/md0".  It's been
running for a couple of hours already.  Of course, it's discovering all
sorts of wonderful new events and spewing messages I've never even seen
before.  1/2 a :-)

Anyway, my hypothesis now is that the f/s in question may have just had a
really, really bad journal inode on it that was preventing anything else
from happening, and that perhaps I shouldn't have tried "mke2fs -S" above
had I been able to just nuke the pesky journal (that might have prevented
the further corruption that fsck is now "fixing").

The good news is that prior to experimentation, I made a dd backup of
/dev/md0 (400GB) onto a file on another file server (1.5T), so I can dd it
back onto my real /dev/md0 if need be.  Alternatively, I can make a second
copy of that backup file, use losetup on the second copy, and then
experiment.

Questions:

1. Is there any reason why I couldn't experiment with e2fsprogs binaries on
   a f/s dd image mounted over /dev/loopN?  I.e., will it behave the same
   as a disk device as far as e2fsprogs are concerned?  (A concrete sketch
   of what I have in mind is in the P.S. below.)

2. If my assertion is correct that most of my f/s is intact but the journal
   is FUBAR, I need to find a way to force fsck to ignore the journal no
   matter what.  Is there such a tool, or an option to some tool?  Is there
   a way I could simply scan the disk and truncate the journal file, or
   turn off the has_journal bit w/o touching the rest of the f/s?

Any suggestions are welcome.

Thanks,
Erez.
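P.S. Here's roughly the loop-device experiment I have in mind.  A sketch
only: the paths are made up, and the backup-superblock number is one of the
ones mke2fs printed above:

    # work on a second copy, so the pristine dd backup stays untouched
    cp /backup/md0.img /backup/md0-scratch.img

    # attach the copy to a loop device
    losetup /dev/loop0 /backup/md0-scratch.img

    # let e2fsck loose on the loop device, using a backup superblock
    e2fsck -b 32768 -B 4096 /dev/loop0

    # detach when done
    losetup -d /dev/loop0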
On Mon, Aug 18, 2003 at 12:39:46PM -0400, Erez Zadok wrote:
> The power failure on Thursday did something evil to my ext3 file system
> (box running RH9+patches, ext3, /dev/md0, raid5 driver, 400GB f/s using
> 3x200GB IDE drives and one hot-spare).  The f/s got badly corrupted, and
> the symptoms are very similar to what Eddy described here:
>
>     https://www.redhat.com/archives/ext3-users/2003-July/msg00015.html
>
> That is, nearly everything I try results in an error such as
>
>     "Invalid argument while checking ext3 journal for /dev/md0"

What probably happened is that the power failed while you were writing out
the inode table, and the memory failed before the DMA engine and hard drive
did, since DRAM tends to be more sensitive to voltage drops than other
parts of the system.  As a result, random garbage got scribbled all over
the disk.  (Ted's observation: PC-class hardware is sh*t.)

Normally, this isn't a problem, since the ext3 journal contains full
backups of recently written data blocks.  (As opposed to filesystems that
use soft updates, or logically journaled filesystems, which are even more
fragile in the face of cheap hardware that scribbles random garbage on
power failure.)  However, this is not true when the first part of the inode
table is scribbled upon, such that the journal inode cannot be found.

Given that this sort of failure has been reported at least two or three
times now, it's clear we need to address this vulnerability, probably by
keeping a backup copy of the journal inode (or at least the journal data
blocks) in the superblock, so it can survive this particular lossage mode.

> Ted answered here:
>
>     https://www.redhat.com/archives/ext3-users/2003-July/msg00035.html
>
> and suggested the last-ditch approach of using mke2fs -S to reinitialize
> the superblock and group descriptors.  After trying all sorts of "safe"
> methods to recover the files, I tried the -S option as follows:
>
> # mke2fs -j -b 4096 -S /dev/md0
....
> Creating journal (8192 blocks): mke2fs: File exists
>         while trying to create journal
> ------------------------------------------------------------------------------

Yeah, what happened here is that the -S option does not clear the inode
table.  So when it tried to create the journal inode, it found that there
was something there already (but probably garbage) and then bombed out.

> And once again got this error wrt the journal.  Note that before I even
> tried this -S procedure, I tried to simply turn off the has_journal bit
> using tune2fs: it didn't help.  (I'm willing to lose the info in the
> journal, as long as I can get back the rest of my large f/s.)  But
> tune2fs and friends gave me a chicken-and-egg error about the invalid
> argument wrt the journal while trying to turn it off (duhh).

You could have turned it off using debugfs, but up until now it's not
something that I've encouraged, because of concerns that there might be
real data loss if it were too easy for users to disable the journal.
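(For reference, the usual tune2fs way to drop the journal -- presumably
the command you ran -- is the sketch below.  It runs into the
chicken-and-egg problem because tune2fs must first open and validate the
very journal it is being asked to remove:

    # drop the has_journal feature; on a healthy f/s this discards the journal
    tune2fs -O ^has_journal /dev/md0

With a garbaged journal inode, that validation step is exactly what dies
with "Invalid argument".)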
> Now I was able to start "e2fsck -b 71663616 -B 4096 /dev/md0".  It's been
> running for a couple of hours already.  Of course, it's discovering all
> sorts of wonderful new events and spewing messages I've never even seen
> before.  1/2 a :-)

Yup.  Some of the damage was caused by not replaying the journal before
running e2fsck, and some was probably done by the power failure causing
garbage to be scribbled on the disk.

> Anyway, my hypothesis now is that the f/s in question may have just had a
> really, really bad journal inode on it that was preventing anything else
> from happening, and that perhaps I shouldn't have tried "mke2fs -S" above
> had I been able to just nuke the pesky journal (that might have prevented
> the further corruption that fsck is now "fixing").

Your hypothesis was right.  Whether you nuked the journal by using debugfs
or by using mke2fs -S probably wouldn't have made any difference, however.

> The good news is that prior to experimentation, I made a dd backup of
> /dev/md0 (400GB) onto a file on another file server (1.5T), so I can dd
> it back onto my real /dev/md0 if need be.  Alternatively, I can make a
> second copy of that backup file, use losetup on the second copy, and then
> experiment.
>
> Questions:
>
> 1. Is there any reason why I couldn't experiment with e2fsprogs binaries
>    on a f/s dd image mounted over /dev/loopN?  I.e., will it behave the
>    same as a disk device as far as e2fsprogs are concerned?

No reason.  The e2fsprogs binaries don't need to operate on a block device.
You can just point them at a dd image directly.

> 2. If my assertion is correct that most of my f/s is intact but the
>    journal is FUBAR, I need to find a way to force fsck to ignore the
>    journal no matter what.  Is there such a tool, or an option to some
>    tool?  Is there a way I could simply scan the disk and truncate the
>    journal file, or turn off the has_journal bit w/o touching the rest of
>    the f/s?

You can use debugfs's features command to turn off the has_journal bit as
follows:

    debugfs -w /dev/hdaXX
    debugfs:  features ^has_journal
    debugfs:  features ^needs_recovery
    debugfs:  quit

Hmm.... this will work unless the group descriptors are so badly damaged
that debugfs refuses to touch the superblock.  You can open the filesystem
in catastrophic mode, but right now, as a safety precaution, you're not
allowed to open it in read/write mode while in catastrophic mode.  I could
remove this restriction if we add some more safety checks to prevent
debugfs from doing more damage when opened in read/write catastrophic
mode; at the moment, debugfs has been written with a "first do no harm"
principle.
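To make that concrete, a sketch (flag behavior as of the e2fsprogs of this
era; device name taken from this thread):

    # -c = catastrophic mode: skip reading the group descriptors and
    # bitmaps, so even a badly damaged f/s can be opened -- read-only
    debugfs -c /dev/md0

    # combining it with -w (read/write) is what the safety check refuses
    debugfs -w -c /dev/md0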
Ultimately, though, it's probably more important to add a backup copy of
the journal inode, to avoid needing to play games like this in the first
place, and to allow e2fsck to recover from these situations automatically.

						- Ted

In message <20030821190811.GC1040@matchmail.com>, Mike Fedyk writes:
> On Thu, Aug 21, 2003 at 12:54:37PM -0400, Erez Zadok wrote:
> > Cool.  Is there an ext3 patch that'll support the backup inode?  Are
> > there any issues of moving b/t the two modes of ext3 (back/forw
> > compatibility, etc.)?
>
> There's no need to support it in the kernel.  The inode number is kept in
> the superblock, and that's updated at mkfs and tune2fs time, not from the
> kernel.
>
> Also, there isn't a second inode; it's just that the inode number is
> being kept in the superblocks too.

How does the kernel know to write the journal data first to some data block
belonging to inode X, and then to another data block of inode Y?  Both X
and Y are journal inodes, right?  Will there be a reserved inum other than
8 for the backup journal?  Is there some magic by which the kernel can
identify any number of special journal inodes?

And while we're at it, why only one backup journal inode?  Why not several?
If it's good enough to have several copies of the superblocks etc., then
why not the journal (for those willing to pay the performance penalty)?

Erez.
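P.S. To ground my first question: if I understand Mike correctly, the
journal's inode number is already recorded in the superblock and is visible
via dumpe2fs.  A sketch, with the normal value shown as a comment:

    # -h prints just the superblock summary, without walking the groups
    dumpe2fs -h /dev/md0 | grep -i journal
    # Journal inode:            8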