David Clunie
2005-May-15 13:56 UTC
Intermittent ext3 corruption on external firewire Micronet 1.5Tb RAID on FC3
Hi,

I have a Firewire connected Micronet 1.5TB RAID with a single large ext3 filesystem on one partition on a dual Xeon system.

I am checking out from an extremely large cvs repository (don't ask) to this drive over the course of many days, and intermittently I get bad blocks and the filesystem goes read-only. This is not related to any power failure or anything similar. The RAID is currently about 40% full; this started to happen around the 15% mark as I recall.

I checked the RAID firmware setup, found that caching was set to write-back, and changed it to write-through to see if that would help (since I gather the Linux kernel presumes write-through, though why it should make a difference in the absence of a reboot or power failure I don't understand).

This reduced the frequency of the error from once a night to once every couple of nights; interestingly, mostly at about 04:03 AM or so. Looking at cron.daily, only mrtg and sa seem to be starting up at about that time. I suspect the timing is related to a change in the pattern of disk activity rather than anything else.

I have no reason to suspect that there is anything actually wrong with the RAID itself, which just appears as a really big firewire external disk. It is new, however, so this can't be ruled out.

My next step is to just turn off journaling and see if doing this with just ext2 works OK. Journaling doesn't seem to be doing much good, as I am stuck regularly running ordinary fscks with all these errors anyway!

I just thought I would ask if anyone else has had a similar experience, and whether such issues are known to be with ext3, or the firewire interface, or both together.

PS. I did actually create the partition and did the mkfs on an AMD64 FC3 system at a different site, though that is not the system to which the RAID is currently connected. I just mention that in case this makes a difference, but I presume an fsck would have noticed and fixed anything fundamentally wrong in this regard.
David

May 15 04:03:30 localhost kernel: Aborting journal on device sdd1.
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1): ext3_journal_start_sb: Detected aborted journal
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63343526: bad block 165510584
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1) in start_transaction: Journal has aborted
May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1) in start_transaction: Journal has aborted
May 15 04:03:30 localhost kernel: inode_doinit_with_dentry: getxattr returned 5 for dev=sdd1 ino=63343526
May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63343381: bad block 141623810
May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1): ext3_xattr_get: inode 63947123: bad block 203323361

Linux localhost.localdomain 2.6.9-1.667smp #1 SMP Tue Nov 2 14:59:52 EST 2004 i686 i686 i386 GNU/Linux
Joseph D. Wagner
2005-May-15 22:48 UTC
Intermittent ext3 corruption on external firewire Micronet 1.5Tb RAID on FC3
> May 15 04:03:30 localhost kernel: EXT3-fs error (device sdd1):
> ext3_xattr_get: inode 63343526: bad block 165510584
> May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1):
> ext3_xattr_get: inode 63343381: bad block 141623810
> May 15 04:03:34 localhost kernel: EXT3-fs error (device sdd1):
> ext3_xattr_get: inode 63947123: bad block 203323361

These errors cannot be caused by a bug in the file system. It is possible, although highly unlikely, that a bug in the device driver could generate these errors. The most likely cause is that there actually are bad blocks on your new 1.5TB file system.

Do us all a favor and run:

badblocks -v -b block_size /dev/device

And let us know about the results.

Joseph D. Wagner
Hi all,

I had an ext3 filesystem with writeback. Yesterday my system crashed, and now when I try to mount it, it gives me "Invalid argument". This is the command line:

#mount -t ext3 /dev/hda1 /mnt/home

I tried debugging it and later found out that it was complaining about the journal inode. Is there any way to recover my files? I did clone the disk and, after a few tries, mounted it as ext2, but there was nothing in it.

Any help or pointers will be appreciated.

Thanks,
anand
Andreas Dilger
2005-May-17 06:04 UTC
Intermittent ext3 corruption on external firewire Micronet 1.5Tb RAID on FC3
On May 15, 2005 09:56 -0400, David Clunie wrote:
> I have a Firewire connected Micronet 1.5TB RAID with a single
> large ext3 filesystem on one partition on a dual Xeon system.

For some kernels (maybe even current ones) it is possible that there is a problem with IO beyond 1 TB. What I would do (if you don't mind overwriting the disk, presumably not if it is just new and doesn't contain important data) is to write a small test program to write the byte offset at the start of every 4kB block on the disk, then read them all back and verify it is correct. This will tell you if there is aliasing in the block device (possibly e.g. an int used instead of __u32 or sector_t).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
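
A minimal sketch of the kind of test program described above, written in Python for brevity. It is not Andreas's program; the function names (device_size, write_offsets, verify_offsets) and the 8-byte little-endian stamp format are assumptions made for illustration. It stamps each 4 kB block with its own byte offset and then reads every stamp back, so a block that returns some other block's offset would point to aliasing. Note that write_offsets destroys whatever is on the target.

```python
import os
import struct

BLOCK_SIZE = 4096  # 4 kB blocks, as suggested in the post


def device_size(path):
    """Return the size in bytes of a regular file or block device."""
    with open(path, "rb") as f:
        return f.seek(0, os.SEEK_END)


def write_offsets(path, size, block_size=BLOCK_SIZE):
    """Stamp the start of each block with its own byte offset.

    The stamp is an unsigned 64-bit little-endian integer (an assumed
    format). This OVERWRITES data on the target.
    """
    with open(path, "r+b") as f:
        for offset in range(0, size - block_size + 1, block_size):
            f.seek(offset)
            f.write(struct.pack("<Q", offset))
        f.flush()
        os.fsync(f.fileno())  # push the stamps out of the page cache


def verify_offsets(path, size, block_size=BLOCK_SIZE):
    """Return (expected, found) pairs for every block with a wrong stamp."""
    mismatches = []
    with open(path, "rb") as f:
        for offset in range(0, size - block_size + 1, block_size):
            f.seek(offset)
            (found,) = struct.unpack("<Q", f.read(8))
            if found != offset:
                mismatches.append((offset, found))
    return mismatches
```

Against the real device this would be called as root with something like write_offsets("/dev/sdd1", device_size("/dev/sdd1")) followed by verify_offsets with the same arguments. The read-back is only meaningful if it actually reaches the disk, so it is probably best run after unmounting and reconnecting the device (or otherwise dropping cached pages). For reference, 2^31 sectors of 512 bytes is exactly 1 TiB, which is where a signed 32-bit sector count would overflow.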