tweeks
2006-Sep-05 20:53 UTC
IO lockups and ext3 readonly filecorruption on RHEL4 (pre and post U4)
Has anyone been seeing IO lockup problems on EL4?

I've tried multiple IO scheduler options (elevator=) at boot, and I'm seeing the same behavior regardless. It is independent of hardware: whitebox ATA, HA enclosure with dedicated SCSI, megaraid RAID hardware, Dell 2850s... same behavior:

A semi-busy system will suddenly go into some kind of IO la-la land where nothing can be written to disk for >1 hour. Of course when this happens, the ext3 kernel module freaks out and remounts all the filesystems as read-only. Then when the system is rebooted, if the system is allowed to fsck, the journal is hosed and the filesystem eats itself.

Moving them off the RH kernel altogether seems to fix the problem, but I have not found a way to reproduce the problem yet (burn-in and stress testing don't seem to make it appear), so real re-testing is difficult at best.

It's become so big of a problem that we're moving some customers that require rock solid systems either over to RHEL3, or off RH and over to SLES or another distro with a non-RH kernel.

Just the ext3 problem (minus the IO lockup part) can be seen in other BZ tickets:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175877
(when the filesystem fills up)

Has anyone seen this type of IO lockup + ext3 corruption on RHEL4? Can you reproduce it?

Tweeks
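
Since the problem is hard to reproduce on demand, one option is to poll for the symptom and catch it early. The following is only a minimal Python sketch, not something from the original post: it assumes the usual /proc/mounts layout (device, mount point, filesystem type, options), and the function name readonly_ext3_mounts is made up for illustration.

```python
def readonly_ext3_mounts(mounts_text):
    """Return (device, mountpoint) pairs for ext3 filesystems mounted read-only.

    mounts_text is assumed to be in /proc/mounts format:
        device mountpoint fstype options dump pass
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        # "ro" must be a whole option, not a substring of another option.
        if fstype == "ext3" and "ro" in options.split(","):
            hits.append((device, mountpoint))
    return hits


# Typical use on a live system (hypothetical, run from cron or a loop):
#     with open("/proc/mounts") as f:
#         for device, mountpoint in readonly_ext3_mounts(f.read()):
#             print(device, "on", mountpoint, "is read-only")
```

Run periodically, a check like this would at least give a timestamp for when the remount happened, which could then be lined up against whatever the kernel logged around that time.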
Wolber, Richard C
2006-Sep-05 21:19 UTC
IO lockups and ext3 readonly filecorruption on RHEL4 (pre and post U4)
We're using the same systems with the same OS (well okay, actually CentOS 4) and aren't seeing the same thing.

2.6.14.3 #1 SMP PREEMPT Thu Dec 8 10:34:08 PST 2005 i686 i686 i386 GNU/Linux

..Chuck..
Christian
2006-Sep-06 00:34 UTC
IO lockups and ext3 readonly filecorruption on RHEL4 (pre and post U4)
On Tue, 5 Sep 2006, tweeks wrote:
> Has anyone been seeing IO lockup problems on EL4?

not using RHEL here, but...

> A semi-busy system will suddenly go into some kind of IO la-la land where
> nothing can be written to disk for >1hour.

ok, so ext3 will remount the fs to RO. this would happen if a panic() occurs? is there anything related in the logs? (if /var is RO too, try to set up a loghost).

> Then when the system is rebooted, if the system is allowed to fsck, the
> journal is hosed and the filesystem eats itself.

could you be more specific? what does fsck.ext3 say? is there something in lost+found? remember to use the latest version of e2fsprogs. have you tried a vanilla kernel yet?

-- 
BOFH excuse #289: Interference between the keyboard and the chair.
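
For the "anything related in the logs?" question, a rough filter over the collected syslog (or loghost) text can narrow things down. This is a hedged Python sketch only: the patterns in DEFAULT_PATTERNS are assumptions about what ext3 and the block layer tend to log, not an authoritative list, and suspicious_lines is a hypothetical helper name.

```python
import re

# Assumed patterns; adjust them to what your kernel actually logs.
DEFAULT_PATTERNS = (r"EXT3-fs", r"I/O error", r"read-only", r"journal")


def suspicious_lines(log_text, patterns=DEFAULT_PATTERNS):
    """Return the log lines that match any of the given regex patterns."""
    matcher = re.compile("|".join(patterns), re.IGNORECASE)
    return [line for line in log_text.splitlines() if matcher.search(line)]


# Typical use (hypothetical path, wherever the loghost stores messages):
#     with open("/var/log/messages") as f:
#         for line in suspicious_lines(f.read()):
#             print(line)
```

The matching is deliberately loose, so expect false positives; the point is just to pull out the lines worth posting to the list along with the fsck.ext3 output.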