Mats Ahlgren
2007-Mar-18 01:42 UTC
Frequent metadata corruption with ext3 + hard power-off
Hello. I'm having serious issues with ext3; any insight would be greatly appreciated.

_____ Overview:

I believe ext3 is supposed to be recoverable in the case of a power failure by replaying the log. However, on two separate computers (running different operating systems, too), this has been anything but the case.

_____ Specifics:

Sometimes my kernel will hard-freeze and I'll have to do a hard reboot. When this happens, fsck will sometimes insist on running and find some orphaned inodes, which it will proceed to put in the /lost+found directory.

This is unacceptable: The last time this happened, random files in my operating system were plucked from the file system and stuffed in lost+found, corrupting the OS and forcing a reinstall. Another time, files I had moved a minute before the crash (a final project) were orphaned and put in lost+found, effectively destroying the project.

Why should a lost+found folder even be necessary when the file hierarchy is guaranteed to be consistent?

In response to these problems, I changed the ext3 journaling mode to "journal" rather than "ordered" (frankly, it seems deeply disturbing that "ordered" is the default). Since then, I've once had to hard-reboot and yet again found files in the /lost+found folder.

Might anyone know why ext3 is not fulfilling its promise of an always-consistent file system?

_____ Other interacting issues:

I'm running RAID1 (mirroring) on one computer, but I've had the same issues on another computer without RAID.

(In response to "you shouldn't hard-reboot your computer": I realize that most computers are not meant to be hard-rebooted, but I don't have a sysrq key and xmodmapping it has been difficult. I also realize that kernels shouldn't crash, but what's a person to do if the computer doesn't respond to ctrl-alt-f1 and doesn't leave any messages in the logs...)

(In response to "maybe your drive is defective": This is not a problem with a defective drive; I've tried multiple drives.)

(In response to "you should back up your data": Periodic backups clearly help, but it's ridiculous to restore a system from backup every week because a hard-freeze corrupted your filesystem...)

Any insight would be greatly appreciated. These problems have been making me look for other file systems (such as zfs, which unfortunately I can't use to boot; or reiser4, which also makes a filesystem-is-always-consistent guarantee). I would prefer to use ext3, but I've never had these sorts of problems with old Mac OS, OS X, or Windows.

Thank you,
Mats
Theodore Tso
2007-Mar-18 13:33 UTC
Frequent metadata corruption with ext3 + hard power-off
It sounds like you have a disk which is doing very aggressive write caching. If you are using a new enough kernel (2.6.9 or greater should have this), adding "barrier=1" to your mount options should help. We should probably make this the default at this point... - Ted
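To check which ext3 filesystems are configured without the suggested option, something like the following Python sketch may help. It scans fstab-style text (device, mount point, type, options) and lists ext3 mount points whose options do not include "barrier=1". The function name and the sample entries are made up for illustration; only the "barrier=1" option itself comes from the message above.

```python
def ext3_mounts_without_barrier(fstab_text):
    """Return mount points of ext3 entries whose options lack barrier=1.

    Expects fstab-style lines: device, mount point, fs type, options, ...
    """
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete entry
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type == "ext3" and "barrier=1" not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical sample entries, for illustration only.
SAMPLE = """\
# device    mount  type  options                      dump pass
/dev/md0    /      ext3  defaults,data=journal        0    1
/dev/sda2   /home  ext3  defaults,barrier=1           0    2
proc        /proc  proc  defaults                     0    0
"""

if __name__ == "__main__":
    print(ext3_mounts_without_barrier(SAMPLE))
```

To run it against a real system, pass in the contents of /etc/fstab, for example `ext3_mounts_without_barrier(open("/etc/fstab").read())`. This only inspects the configuration; it does not tell you whether the drive actually honors cache flushes.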
Mats Ahlgren
2007-May-18 13:48 UTC
Frequent metadata corruption with ext3 + hard power-off
Hello,

After 2 months of usage, all such system-destroying problems have disappeared after disabling write caching and setting data=journal (instead of ordered). Just thought I should let everyone know. Thank you, Ted.

If anyone has any insight into what was going on, I'd appreciate it. Namely, I'm confused: I would guess caching simply delays the time data gets to disk, and perhaps makes it more likely that data is written out of the order in which it was given? But how could this cause a problem on a journaled filesystem? If one is (theoretically) only appending to the journal, checksumming/hashing to detect consistent journal entries on failure (since the last checkpoint), and only replaying consistent journal entries (which are idempotent)... then, assuming all of the above works, how could caching cause massive corruption of the directory tree? (Is the above an accurate model for ext3?)

Also, does anyone think the journaling mode being 'ordered' instead of 'journal' had anything to do with it?

Sincerely,
Mats

On Sunday 18 March 2007 09:33:59 Theodore Tso wrote:
> It sounds like you have a disk which is doing very aggressive write
> caching. If you are using a new enough kernel (2.6.9 or greater
> should have this), adding "barrier=1" to your mount options should
> help. We should probably make this the default at this point...
>
> - Ted
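On the question of how a write cache could defeat a journal: one commonly given explanation is that the journal's safety rests on the commit record reaching the disk only after the blocks it covers. As far as I know, ext3's journal does not checksum its entries, so replay trusts the commit record alone. If a drive's cache reorders writes and power is lost partway through, the commit record can be on the platter while some journal blocks still hold stale contents. The toy Python model below enumerates the crash points where that happens. It is a sketch under these assumptions, not ext3's actual code; the slot names and block contents are invented.

```python
from itertools import permutations

# Journal slots as they sit on disk before the transaction: leftovers
# from an older transaction (hypothetical contents).
STALE = {"slot0": "old-dir-block", "slot1": "old-inode-table"}
# What the new transaction wants to log in those slots.
NEW = {"slot0": "new-dir-block", "slot1": "new-inode-table"}


def disk_after_crash(write_order, persisted):
    """Persist the first `persisted` writes of `write_order`, then lose power."""
    journal = dict(STALE)
    journal["commit"] = False
    for name in write_order[:persisted]:
        if name == "commit":
            journal["commit"] = True
        else:
            journal[name] = NEW[name]
    return journal


def replay_is_safe(journal):
    """Replay trusts the commit record; without checksums it cannot
    tell a stale slot from a freshly written one."""
    if not journal["commit"]:
        return True  # transaction is ignored, old metadata stays as it was
    return all(journal[slot] == NEW[slot] for slot in NEW)


def unsafe_crash_points(orders):
    """Return (order, persisted) pairs where replay would copy stale blocks."""
    bad = []
    for order in orders:
        for persisted in range(len(order) + 1):
            if not replay_is_safe(disk_after_crash(order, persisted)):
                bad.append((order, persisted))
    return bad


WRITES = ("slot0", "slot1", "commit")

# Reordering cache: any permutation of the writes may reach the platter.
reordered = unsafe_crash_points(list(permutations(WRITES)))

# Barrier: the commit record is written only after the slots are durable.
with_barrier = unsafe_crash_points(
    [order for order in permutations(WRITES) if order[-1] == "commit"]
)

if __name__ == "__main__":
    print(len(reordered), "unsafe crash points with reordering")
    print(len(with_barrier), "unsafe crash points with a barrier")
```

In this model every unsafe case has the commit record persisted ahead of at least one slot, and forcing the commit record to be last removes all of them. That would be consistent with both "barrier=1" and disabling the write cache helping, though the model says nothing about whether data=ordered versus data=journal played a part.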