Hans Yperman
2005-May-12 22:35 UTC
Smashing EXT3 for fun and profit (or: how to loose all your data)
Hello everyone,

I've just lost my whole EXT3 Linux partition to what was probably a bug. For your reading pleasure, and in the hope that there is enough information here to fix this problem in the future, here is the story of a violent ending.

This tragic history actually starts on Windows: MS Word had wiped out an important file on a floppy, and I got the task of retrieving what was possible. Using Linux, I made an image with dd, and put it on the now extinct EXT3 partition. I used an undelete program, and then mounted the image with a loopback device:

    mount -o loop /tmp/image.img /floppy

As it turns out, the undeleter managed to screw up the FAT, and the loopback device complains about reading past the end of the device. After fixing the floppy on another computer, I come back to the Linux computer. The console is full of error messages.

What happened? A first bug: Linux remounted the loopback device read-only because of the bad FAT on the image. BUT this did not work out right: not only the loopback device, but the whole EXT3 partition was now read-only. Every little write action results in an error, hence all the messages. I did not really think much of it at that point, and just did a

    mount -o remount,rw /

At this point, I am already screwed, but I don't realize it yet: the computer works completely normally from here on. The problem happens the next time I boot: fsck complains about problems (weird, fsck is not supposed to run for EXT3). Specifically, fsck complains about double-allocated blocks, does a pass 1B and 1C (I'd never seen these before either), dumps pages and pages and pages of block numbers, gets very very veeeeryyy slow, and crashes. I restart fsck. This time it starts asking me tons of yes/no questions because it wants to know what to do with the double-allocated blocks. I answer yes to them all (there is no real right answer anyhow) and reboot.

And that was it: init starts, and complains about not having an /etc/inittab (and asks me which runlevel to start.
Never seen that before either). Then it crashes. Booting with Knoppix reveals lots and lots of damaged files. Everything that was cached seems to be damaged, and some random files are also dead (my guess is ext3 screwed up while updating atimes or something like that). Game over.

I guess these 2 facts need fixing:
1) loopback devices should not pass errors over to their underlying filesystems.
2) ext3 suicidally allows remounting read-write when parts of its data are invalid.

Now I don't complain much. I have a 1-day-old backup of my home directory (thanks, unison). I lost all my tweaks to /etc, but, well, the hard drive image was copied/resized from computer to computer to computer, and initially started its life under Linux 2.0.35 on a 133 MHz Pentium. A rewrite was probably a good idea anyway. I lost all my MP3s, but a very nice girl promised to help me re-rip them all from my CDs. (Thanks to ext3 I get to spend some time with a very sexy girl. Lots of it by talking and laughing while we wait for lame to end. I actually start to think my hard drive should get erased more often ;-) ).

Other people might not like losing a whole partition, so I mail this sad story to you all. A bit of advice: if you ever see ext3 complaining about being read-only, press the reset button. It might save your partition.

I did not test my claim of the loopback being the bug, as I am busy reinstalling right now (on EXT2 this time).

Have a nice day, everyone,
Hans.
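
The symptom described above, a root filesystem that has quietly gone read-only, can be noticed by looking at the options column of a mounts table such as /proc/mounts. The following is only a minimal sketch under that assumption; the function name `is_readonly` is made up for illustration and is not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Succeed (exit 0) if MOUNTPOINT is listed with the "ro" option in a
# mounts table laid out like /proc/mounts (device, mount point, type,
# options, ...). Hypothetical helper, for illustration only.
# Usage: is_readonly MOUNTPOINT [MOUNTS_FILE]
is_readonly() {
    awk -v mp="$1" '
        $2 == mp { opts = $4 }     # if listed more than once, last entry wins
        END { exit (("," opts ",") ~ /,ro,/) ? 0 : 1 }
    ' "${2:-/proc/mounts}"
}

# Example:
#   if is_readonly /; then echo "root filesystem is read-only"; fi
```

If the check succeeds for a filesystem that was mounted read-write at boot, the kernel log is the place to look for the reason before doing anything else.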
Joseph D. Wagner
2005-May-13 19:55 UTC
Smashing EXT3 for fun and profit (or: how to loose all your data)
> I guess these 2 facts need fixing:
> 1) loopback devices should not pass errors over
> to their underlying filesystems.

I have a test partition set up for these circumstances. I'll try to reproduce the read-write/read-only error spreading to an underlying file system when the loopback file system has the error. However, I will have to double check with the file system designers. There may be a good reason it behaves this way.

> 2) ext3 suicidally allows remounting read-write
> when parts of its data are invalid.

When you are logged in as root, it will let you do whatever suicidal -- or imho stupid -- things you tell it to do. That is not going to change.

It actually takes something serious to bring down a file system mid-stride, not just an atime update. In other words, by the time Linux is remounting your file system as read-only, something is already fubar. The remount as read-only is really only a stop-gap measure to prevent further damage while you save your work -- on other partitions -- and reboot.

If all you have is one honkin' / (root) partition, you may just want to change that behavior to panic. After all, if you only have 1 partition, there's nowhere else to save your work.

So long as you're redoing your partitions, be sure to separate out /tmp, /var, and just to be safe /home too, so next time all you lose is the one bad partition.

Joseph D. Wagner
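
For reference, the on-error behavior mentioned above is recorded in the ext2/ext3 superblock and shows up in the output of `tune2fs -l` as an "Errors behavior:" line. The sketch below only extracts that line; the function name `errors_behavior` and the DEVICE placeholder are assumptions for illustration, not anything taken from the poster's system.

```shell
#!/bin/sh
# Print the on-error policy recorded in an ext2/ext3 superblock, given
# the text produced by `tune2fs -l DEVICE` on standard input.
# Hypothetical helper, for illustration only.
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Typical use, as root (DEVICE is a placeholder, not a real device name):
#   tune2fs -l DEVICE | errors_behavior
#   tune2fs -e panic DEVICE    # panic on errors instead of remounting read-only
# The same policy can also be chosen per mount with the option errors=panic.
```

Whether panicking is preferable to a read-only remount depends on the machine; on a single-partition system it at least prevents carrying on with a filesystem the kernel already considers damaged.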
Theodore Ts'o
2005-May-14 02:28 UTC
Smashing EXT3 for fun and profit (or: how to loose all your data)
On Fri, May 13, 2005 at 12:35:16AM +0200, Hans Yperman wrote:
> This tragic history actually starts on Windows: MS Word had wiped out
> an important file on a floppy, and I got the task of retrieving what
> was possible. Using Linux, I made an image with dd, and put it on the
> now extinct EXT3 partition. I used an undelete program, and then
> mounted the image with a loopback device:
>     mount -o loop /tmp/image.img /floppy
> As it turns out, the undeleter managed to screw up the FAT, and the
> loopback device complains about reading past the end of the device.
> After fixing the floppy on another computer, I come back to the Linux
> computer. The console is full of error messages.

What version of the kernel are you using? What undelete program were you using? Most undelete programs don't require that you mount the filesystem; in fact, they often require that you *don't* mount it.

> What happened? A first bug: Linux remounted the loopback device
> read-only because of the bad FAT on the image. BUT this did not work
> out right: not only the loopback device, but the whole EXT3 partition
> was now read-only. Every little write action results in an error,
> hence all the messages. I did not really think much of it at that
> point, and just did a
>     mount -o remount,rw /

Without the logs it's hard to be sure, but it sounds like the ext3 filesystem got corrupted, and so it was remounted read-only. How this happened is not clear, and you didn't give us enough information to determine that; but it's consistent with e2fsck displaying errors.

> At this point, I am already screwed, but I don't realize it yet: the
> computer works completely normally from here on. The problem happens
> the next time I boot: fsck complains about problems (weird, fsck is
> not supposed to run for EXT3).

When the kernel discovers filesystem corruption, it marks the filesystem as containing errors, and remounts it read-only.
When fsck next runs, it notes that the filesystem has problems, and tries to fix them.

> Specifically, fsck complains about
> double-allocated blocks, does a pass 1B and 1C (I'd never seen these
> before either), dumps pages and pages and pages of block numbers,
> gets very very veeeeryyy slow, and crashes. I restart fsck. This
> time it starts asking me tons of yes/no questions because it wants to
> know what to do with the double-allocated blocks. I answer yes to them
> all (there is no real right answer anyhow) and reboot.

What version of e2fsck are you running? It must be an ancient one if it got really slow like that. You wouldn't be running Debian Obsolete^H^H^H^H^H^H^H Stable, would you?

> And that was it: init starts, and complains about not having an
> /etc/inittab (and asks me which runlevel to start. Never seen that
> before either). Then it crashes. Booting with Knoppix reveals lots
> and lots of damaged files. Everything that was cached seems to be
> damaged, and some random files are also dead (my guess is ext3 screwed
> up while updating atimes or something like that). Game over.

The filesystem was probably screwed up much earlier than that. Probably something went wrong when the undelete program was run, or perhaps it happened because you remounted the filesystem read-write after errors were uncovered, but it's going to be hard to reconstruct without a lot more details. (What specific messages were printed by the kernel describing the errors, exactly what version of the kernel, e2fsprogs, and undelete program you were using, etc.)

I will say that while remounting a filesystem read/write after errors is dangerous, the fact that e2fsck displayed pages and pages of block numbers tends to indicate that there was something more that went wrong.
Merely remounting a filesystem read/write might result in some multiply-claimed blocks, which passes 1b/1c/1d are designed to resolve, but how many you have depends on how many files were written and how badly corrupted the block allocation bitmaps were. Assuming that you didn't run the system for very long before you rebooted, or didn't write a lot of files during this interim, it seems somewhat unlikely that it would have resulted in "pages and pages and pages" of block numbers. That would tend to argue that portions of the inode table got written to the wrong location, which is generally caused by a hardware error. It might have been caused by the undelete program, but that seems hard to believe. But then again, I don't know which undelete program you used, and it does seem very surprising that the undelete program would work with a mounted filesystem, so that part sounds like another user error (but not one that would be expected to cause major filesystem corruption).

So the bottom line is I can't really tell you what could have happened with the limited facts that you've given me.

> I guess these 2 facts need fixing:
> 1) loopback devices should not pass errors over to their underlying filesystems.

Loopback devices don't pass errors back over to their underlying filesystems.

> 2) ext3 suicidally allows remounting read-write when parts of its data
> are invalid.

Linux will allow you to do many things that might be, well, ill-advised. When the kernel printed all of the warnings, it warned you that the filesystem had errors. Remounting it read/write was a really bad idea --- but then again, so is running the command "dd if=/dev/zero of=/dev/hda1" as root.

> Other people might not like losing a whole partition, so I mail this
> sad story to you all. A bit of advice: if you ever see ext3
> complaining about being read-only, press the reset button. It might
> save your partition.

Or run e2fsck manually yourself; there are a number of things that you can do.
Blindly remounting the filesystem read/write is certainly not one of them.

Saving all of the error messages from the kernel describing the filesystem corruption is a really good idea, as is saving the messages from e2fsck, so people can figure out what happened after the fact.

The one good thing is that you kept good backups, so you didn't lose that much; I definitely commend that. :-)

- Ted
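
The advice above about saving kernel and e2fsck messages could be followed with something like the sketch below. It is only an illustration under stated assumptions: the helper name `keep_log`, the log directory, and DEVICE are all placeholders, and the log directory is assumed to live on a different, healthy filesystem (a USB stick, another partition) rather than on the damaged one.

```shell
#!/bin/sh
# Run a command and keep a copy of everything it prints (stdout and
# stderr) in LOGDIR/NAME.log, while still showing it on the terminal.
# Hypothetical helper, for illustration only.
# Usage: keep_log LOGDIR NAME COMMAND [ARGS...]
keep_log() {
    logdir=$1; name=$2; shift 2
    mkdir -p "$logdir" || return 1
    "$@" 2>&1 | tee "$logdir/$name.log"
}

# Typical use, as root (paths and DEVICE are placeholders):
#   keep_log /mnt/usb kernel dmesg
#   keep_log /mnt/usb e2fsck e2fsck -n DEVICE   # -n: open read-only, answer "no" to everything
```

Running e2fsck with -n first gives a report of what it would change without touching the disk, which is also exactly the kind of output worth attaching to a bug report.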