thr3ads.net - Ext3 users - Smashing EXT3 for fun and profit (or: how to loose all your data) [May 2005]

Hans Yperman

2005-May-12 22:35 UTC

Smashing EXT3 for fun and profit (or: how to loose all your data)

Hello everyone,

I've just lost my whole EXT3 Linux partition to what was probably a
bug.  For your reading pleasure, and in the hope that there is enough
information here to fix this problem in the future, here is the story
of a violent ending:

This tragic history actually starts on Windows: MS Word had wiped out
an important file on a floppy, and I got the task of retrieving what
was possible.  Using Linux, I made an image with dd and put it on the
now extinct EXT3 partition.  I used an undelete program, and then
mounted the image with a loopback device:
mount -o loop /tmp/image.img /floppy
As it turns out, the undeleter managed to screw up the FAT, and the
loopback device complains about reading past the end of the device.
After fixing the floppy on another computer, I come back to the Linux
computer.  The console is full of error messages.

What happened?  A first bug: Linux remounted the loopback device
read-only because of the bad FAT on the image.  BUT this did not work
out right: not only the loopback device, but the whole EXT3 partition
was now read-only.  Every little write action results in an error,
hence all the messages.  I did not really think much of it at that
point, and just did a
mount -o remount,rw /

At this point, I am already screwed, but I don't realize it yet: the
computer works completely normally from here on.  The problem happens
the next time I boot: fsck complains about problems (weird, fsck is
not supposed to run for EXT3).  Specifically, fsck complains about
double-allocated blocks, does a pass 1B and 1C (I'd never seen these
before either), dumps pages and pages and pages of block numbers,
gets very very veeeeryyy slow, and crashes.  I restart fsck.  This
time it starts asking me tons of yes/no questions because it wants to
know what to do with the double-allocated blocks.  I answer yes to them
all (there is no real right answer anyhow) and reboot.

And that was it: init starts, and complains about not having an
/etc/inittab (and asks me which runlevel to start; never seen that
before either).  Then it crashes.  Booting with Knoppix reveals lots
and lots of damaged files.  Everything that was cached seems to be
damaged, and some random files are also dead (my guess is ext3 screwed
up while updating atimes or something like that).  Game over.

I guess these 2 facts need fixing:
1) loopback devices should not pass errors over to their underlying filesystems.
2) ext3 suicidally allows remounting read-write when parts of its data
are invalid.

Now I don't complain much.  I have a 1 day old backup of my home
directory (thanks, unison).  I lost all my tweaks to /etc, but, well,
the hard drive image was copied/resized from computer to computer to
computer, and initially started its life under Linux 2.0.35 on a
Pentium 133 MHz.  A rewrite was probably a good idea anyway.  I lost
all my MP3s, but a very nice girl promised to help me re-rip them
all from my CDs.  (Thanks to ext3 I get to spend some time with a very
sexy girl.  Lots of it by talking and laughing while we wait for lame
to end.  I actually start to think my hard drive should get erased
more often ;-)  ).

Other people might not like losing a whole partition, so I mail this
sad story to you all.  A bit of advice: if you ever see ext3
complaining about being read-only, press the reset button.  It might
save your partition.

I did not test my claim of the loopback being the bug, as I am busy
reinstalling right now (on EXT2 this time).

Have a nice day, everyone,

Hans.

Joseph D. Wagner

2005-May-13 19:55 UTC

> I guess these 2 facts need fixing:
> 1) loopback devices should not pass errors over
> to their underlying filesystems.
I have a test partition set up for these circumstances.  I'll try to
reproduce the read-only remount spreading to the underlying file system
when the file system on the loopback device has the error.  However, I will
have to double-check with the file system designers.  There may be a good
reason it behaves this way.
> 2) ext3 suicidally allows remounting read-write
> when parts of its data are invalid.
When you are logged in as root, it will let you do whatever suicidal -- or imho
stupid -- things you tell it to do.  That is not going to change.

It actually takes something serious to bring down a file system mid-stride, not
just an atime update.  In other words, by the time Linux is remounting your file
system as read-only, something is already fubar.  The remount as read-only is
really only a stop-gap measure to prevent further damage while you save your
work -- on other partitions -- and reboot.

If all you have is one honkin' / (root) partition, you may just want to
change that behavior to panic.  After all, if you only have one partition,
there's nowhere else to save your work.
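
A minimal sketch of how that error policy could be inspected and changed,
assuming e2fsprogs is installed.  The device name /dev/hda1 is only an
example, and the helper name errors_behavior is hypothetical; it reads
`tune2fs -l` output on stdin so it can be tried without touching a real device.

```shell
#!/bin/sh
# Sketch only: print the value of the "Errors behavior" field found in
# `tune2fs -l` (or `dumpe2fs -h`) output supplied on stdin.
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Illustrative usage, as root (the device name is an assumption):
#   tune2fs -l /dev/hda1 | errors_behavior   # show the current policy
#   tune2fs -e panic /dev/hda1               # panic on errors from now on
# tune2fs -e accepts continue, remount-ro or panic.  The same policy can
# also be requested per mount with the ext2/ext3 option errors=panic.
```

Whether panic is preferable to remount-ro depends on the machine; with a
single root partition there is little left to do after a read-only remount
except reboot and check the filesystem.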

So long as you're redoing your partitions, be sure to separate out /tmp,
/var, and just to be safe /home too, so next time all you lose is the one bad
partition.

Joseph D. Wagner

Theodore Ts'o

2005-May-14 02:28 UTC

On Fri, May 13, 2005 at 12:35:16AM +0200, Hans Yperman wrote:
> This tragic history actually starts on Windows: MS Word had wiped out
> an important file on a floppy, and I got the task of retrieving what
> was possible.  Using Linux, I made an image with dd and put it on the
> now extinct EXT3 partition.  I used an undelete program, and then
> mounted the image with a loopback device:
> mount -o loop /tmp/image.img /floppy
> As it turns out, the undeleter managed to screw up the FAT, and the
> loopback device complains about reading past the end of the device.
> After fixing the floppy on another computer, I come back to the Linux
> computer.  The console is full of error messages.
What version of the kernel are you using?  What undelete program were
you using?  Most undelete programs don't require that you mount the
filesystem; in fact, they often require that you *don't* mount it.
> What happened?  A first bug: Linux remounted the loopback device
> read-only because of the bad FAT on the image.  BUT this did not work
> out right: not only the loopback device, but the whole EXT3 partition
> was now read-only.  Every little write action results in an error,
> hence all the messages.  I did not really think much of it at that
> point, and just did a
> mount -o remount,rw /
Without the logs I can only guess, but it sounds like the ext3
filesystem got corrupted, and so it was remounted read-only.  How this
happened is not clear, and you didn't give us enough information to
determine that; but it's consistent with e2fsck displaying errors.
> At this point, I am already screwed, but I don't realize it yet: the
> computer works completely normally from here on.  The problem happens
> the next time I boot: fsck complains about problems (weird, fsck is
> not supposed to run for EXT3).
When the kernel discovers filesystem corruption, it marks the
filesystem as containing errors, and remounts it read-only.  When fsck
next runs, it notes the fact that the filesystem has problems, and tries
to fix it.
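
A small sketch of how such a read-only remount could be spotted from a
script, assuming the Linux /proc/mounts format (device, mount point, type,
options).  The function name mount_is_ro is hypothetical, and the optional
second argument exists only so the check can be tried against a sample file.

```shell
#!/bin/sh
# Sketch only: succeed (exit 0) if the given mount point is currently
# listed with the "ro" option, fail (exit 1) otherwise.
mount_is_ro() {
    mnt=$1
    table=${2:-/proc/mounts}
    awk -v m="$mnt" '
        $2 == m { opts = $4 }      # the last matching entry wins
        END {
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") exit 0
            exit 1
        }' "$table"
}

# Illustrative usage:
#   if mount_is_ro /; then
#       echo "/ is read-only: save the kernel messages before doing anything else"
#   fi
```

This only reports the mount state; it says nothing about why the kernel
remounted the filesystem, which is what the kernel log is for.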
> Specifically, fsck complains about
> double-allocated blocks, does a pass 1B and 1C (I'd never seen these
> before either), dumps pages and pages and pages of block numbers,
> gets very very veeeeryyy slow, and crashes.  I restart fsck.  This
> time it starts asking me tons of yes/no questions because it wants to
> know what to do with the double-allocated blocks.  I answer yes to them
> all (there is no real right answer anyhow) and reboot.
What version of e2fsck are you running?  It must be an ancient one if
it got really slow like that.  You wouldn't be running Debian
Obsolete^H^H^H^H^H^H^H Stable, would you?
> And that was it: init starts, and complains about not having an
> /etc/inittab (and asks me which runlevel to start; never seen that
> before either).  Then it crashes.  Booting with Knoppix reveals lots
> and lots of damaged files.  Everything that was cached seems to be
> damaged, and some random files are also dead (my guess is ext3 screwed
> up while updating atimes or something like that).  Game over.
The filesystem was probably screwed up much earlier than that.
Probably something went wrong when the undelete program was run, or
perhaps it happened because you remounted the filesystem read-write
after errors were uncovered, but it's going to be hard to reconstruct
without a lot more details.  (What specific messages were printed by
the kernel describing the errors, exactly what version of the kernel,
e2fsprogs, and undelete program you were using, etc.)

I will say that while remounting a filesystem read/write after errors
is dangerous, the fact that e2fsck displayed pages and pages of block
numbers tends to indicate that there was something more that went
wrong.  Merely remounting a filesystem read/write might result in
some multiply claimed blocks, which passes 1b/1c/1d are designed to
resolve, but how many you get depends on how many files were written
and how badly corrupted the block allocation bitmaps were.

Assuming that you didn't run the system for very long before you
rebooted, or didn't write a lot of files during this interim, it seems
somewhat unlikely that it would have resulted in "pages and pages and
pages" of block numbers.  That would tend to argue that portions of
the inode table got written to the wrong location, which is generally
caused by a hardware error.  It might have been caused by the undelete
program, but that seems hard to believe.  But then again, I don't know
which undelete program you used, and it does seem very surprising that
the undelete program would work with a mounted filesystem, so that
part sounds like another user error (but not one that would be
expected to cause major filesystem corruption).  So the bottom line is
I can't really tell you what could have happened with the limited
facts that you've given me.
> I guess these 2 facts need fixing:
> 1) loopback devices should not pass errors over to their underlying
> filesystems.
Loopback devices don't pass errors back over to their underlying
filesystems.  
> 2) ext3 suicidally allows remounting read-write when parts of its data
> are invalid.
Linux will allow you to do many things that might be, well,
ill-advised.  When the kernel printed all of the warnings, it warned
you that the filesystem had errors.  Remounting it read/write was a
really bad idea --- but then again, so is running the command
"dd if=/dev/zero of=/dev/hda1" as root.

> Other people might not like losing a whole partition, so I mail this
> sad story to you all.  A bit of advice: if you ever see ext3
> complaining about being read-only, press the reset button.  It might
> save your partition.
Or run e2fsck manually yourself; there are a number of things that you
can do.  Blindly remounting the filesystem read/write is certainly not
one of them.  Saving all of the error messages from the kernel
describing the filesystem corruption is a really good idea.  As is
saving the messages from e2fsck, so people can figure out what
happened after the fact.
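
As a rough sketch of that advice, the helper below only prints the commands
one might run; it does not execute them.  The name plan_fs_triage, the device
/dev/hda1 and the log directory are all illustrative assumptions, and the log
directory should be on a different, healthy filesystem (a USB stick, say).

```shell
#!/bin/sh
# Sketch only: print (do not run) commands that save the kernel messages
# and perform a read-only filesystem check with its output kept in a log.
plan_fs_triage() {
    dev=$1
    logdir=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    # Save the kernel's error messages first, before they scroll away.
    echo "dmesg > $logdir/kernel-$stamp.log"
    # e2fsck -f forces a check; -n opens the filesystem read-only and
    # answers "no" to every question, so nothing is modified.
    echo "e2fsck -fn $dev 2>&1 | tee $logdir/e2fsck-$stamp.log"
}

# Illustrative usage:
#   plan_fs_triage /dev/hda1 /mnt/usb
```

A check with -n on a mounted filesystem can report spurious problems, so
running it from a rescue system such as Knoppix, with the partition
unmounted, is likely to give more trustworthy output.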

The one good thing is that you kept good backups, so you didn't lose
that much; I definitely commend that.  :-)

						- Ted
