Hi folks,

I'm in really big trouble with ext3. At about every second reboot, files on my ext3 filesystem have changed! In most cases I notice that sshd didn't start, and on examination I find that /usr/sbin/sshd or /lib/libutil-x.y.so has changed. But when I reboot again, everything seems OK. I once looked into the binary and found parts of the syslog in it!!! Horror! And this is a usual symptom. Other times I have found lost files and directories. And what about the losses I don't find immediately?

I have daily backups, so I could restore, but... It's not what I expect from a filesystem these days. I had been using reiserfs for more than a year without problems. But I need extended attributes, and anyway I didn't plan to switch my root filesystem in the middle of a project. But I cannot put the server into production until this is fixed.

It seems this happens during reboot only, and a second reboot mostly fixes it, at least for /usr/sbin/sshd, but not always. But a remote server without sshd is bad enough!

Has this happened to any of you?

About my config:

Kernel: 2.4.19 with minimal patches (evms and vserver).
Using EVMS.
Glibc 2.3.1.
Gentoo 1.4-rc2 distribution.

Any help or experience would be welcome!

Thanks in advance,
viktor at neotek dot hu
Hi Viktor!

I use a 2.4.20 kernel with all the patches from http://www.zipworld.com.au/~akpm/linux/ext3/

So far, everything works fine, with external journal and with the normal internal one (no data=journal). You might try that if you want to upgrade your kernel.

Bye, Norman.

-- 
Norman Schmidt          Institut für Physikal. u. Theoret. Chemie
cand. chem.             Friedrich-Alexander-Universitaet
schmidt@naa.net         Erlangen-Nuernberg
On Monday, 03 February 2003 08:50, Bodrogi Viktor wrote:
> Kernel: 2.4.19 with minimal patches (evms and vserver).
> Using EVMS.
> Glibc 2.3.1.
> Gentoo 1.4-rc2 distribution.
>
> Every help and experience would help!

I've given up on Gentoo RCs for different reasons...

-- 
Irmund Thum
On Mon, Feb 03, 2003 at 07:50:50AM -0000, Bodrogi Viktor wrote:
> Hi folks,
>
> I'm in really big trouble with ext3.
> At about every second reboot I have files changed on my ext3 filesystem!
> In most cases I realize that sshd didn't start, and after examination
> I found that /usr/sbin/sshd or /lib/libutil-x.y.so changed.
> But when I reboot, everything seems OK.
> I looked once into the binary, and find parts of syslog in it!!!

This sounds like a hardware problem. It's likely that incorrect blocks are getting read into the page cache. So when you look at the file, you see incorrect data. When you reboot, that clears the page cache, and the file then looks OK again.

> It seems this happens during reboot only, and a second reboot mostly fixes
> it, at least with /usr/sbin/sshd, but not always. But even without sshd it's
> too bad to have a remote server!

If the problem is related to whether your system is warm booted versus cold booted, that might explain why the second reboot fixes things for you.

> Had anyone of you this happening?

Every time I've heard of anything like this, it's turned out to be a hardware problem.

Good luck!!

- Ted
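One way to test the page-cache theory would be to compare the suspect binaries against a known-good copy after each boot. The following is only a minimal sketch: the helper names `sha256_of` and `changed_files` and the backup location are assumptions for illustration, not anything mentioned in the thread. If the same file differs on one boot and matches again on the next, without having been rewritten, that would point towards bad reads and away from on-disk corruption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(live_root, backup_root, relative_paths):
    """List the relative paths whose live copy differs from the backup copy.

    A file that is missing on either side also counts as a difference.
    Both roots are hypothetical; pass whatever matches the local layout.
    """
    changed = []
    for rel in relative_paths:
        live = Path(live_root) / rel
        backup = Path(backup_root) / rel
        if not live.is_file() or not backup.is_file():
            changed.append(rel)
        elif sha256_of(live) != sha256_of(backup):
            changed.append(rel)
    return changed
```

For example, `changed_files("/", "/mnt/backup", ["usr/sbin/sshd"])` would report whether the live sshd binary differs from the copy under the (assumed) backup mount point `/mnt/backup`.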
Hi!

> This sounds like a hardware problem. It's likely that incorrect
> blocks are getting read into the page cache. So when you look at the
> file, you see incorrect data. When you reboot, that clears the page
> cache, and file then looks OK again.

That's interesting. I forgot to mention (sorry, it's an important piece of information) that I have RAID 1 (mirrored disks), so a HW problem is less likely. And I have a reiserfs partition on the mirror too, without any problems.

Anyway, do you have an idea how to test for HW errors?

Thanks for the answers!
viktor at neotek dot hu
On Mon, Feb 03, 2003 at 10:53:08PM -0000, Bodrogi Viktor wrote:
> Seems interesting.
> I forgot to mention (yes, sorry, it's important piece of information),
> that I have RAID 1 (mirrored disks), so HW problem is less possible.
> And I have reiserfs partition on the mirror too, without any problem.

RAID protects you against disk failures. It does not protect you from cable problems causing data corruption, or your RAID controller going insane. Unfortunately a lot of people seem to believe that just because they have RAID, they are immune from hardware problems, and then stop doing backups. I usually hear from them after they've gotten screwed, and when they ask if I can perform miracles....

In any case, the scenarios I described (a controller/cable problem, or incorrectly configured IDE DMA settings) are all still possible with RAID; RAID does not help you prevent these sorts of problems.

As for your not noticing the problem with reiserfs, that could be because you've been lucky, and the block addresses causing the problem do not (yet) contain data. But the symptoms you've described sound very much like hardware-induced errors.

> Anyway, do you have an idea how to test for HW errors?

Well, if you have a scratch partition that's not being used, you can try using the badblocks program. Try using the -w option, which will do a read/write test. This doesn't do a random access test, so it might not detect any problems, though.

I'd suggest checking your internal cabling, and replacing the controller cable if it looks dubious. Make sure everything is well plugged in, too.

Good luck!

- Ted
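Since `badblocks -w` works through the device in order, a random-access check would have to be done separately. The following is a rough sketch of what such a test could look like, not a replacement for badblocks: the function names and default sizes are assumptions chosen for illustration. Like `badblocks -w`, it overwrites its target, so it should only ever be pointed at a scratch file or scratch partition. Each block carries its own index, so a block that comes back from the wrong place shows up as a mismatch.

```python
import os
import random


def block_pattern(index, block_size):
    """Build a block whose content encodes its own index."""
    stamp = index.to_bytes(8, "little")
    return (stamp * (block_size // 8 + 1))[:block_size]


def random_access_test(path, num_blocks=1024, block_size=4096, seed=0):
    """Write all blocks in random order, then read them back in another random order.

    Returns the sorted indices of blocks that did not read back as written.
    WARNING: this overwrites `path`; use a scratch file or partition only.
    """
    rng = random.Random(seed)
    order = list(range(num_blocks))
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        rng.shuffle(order)
        for index in order:
            os.pwrite(fd, block_pattern(index, block_size), index * block_size)
        os.fsync(fd)
        # Ask the kernel to drop cached pages so the reads are more likely
        # to hit the disk. This is advisory only and not available everywhere.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        rng.shuffle(order)
        bad = []
        for index in order:
            data = os.pread(fd, block_size, index * block_size)
            if data != block_pattern(index, block_size):
                bad.append(index)
        return sorted(bad)
    finally:
        os.close(fd)
```

An empty result does not prove the hardware is fine: the reads may still be served from cache, and an intermittent fault may simply not occur during the run. Repeating the test with different seeds and a working set larger than RAM would make it more meaningful.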
Hi!

This morning I booted and, what a horror, found a bad superblock on /var! fsck reported nothing, but mount said bad superblock. That's the best thing that can happen after a project's due date but before finishing it, isn't it? So I decided to switch to reiserfs, which has performance advantages too.

After about the fifth reboot I could mount /var, and copied it to a new partition together with the root partition. And, terribly, I had the same problem with sshd startup, this time without any change to the binary, according to a diff against a probably-good backup (who can be sure after all this...).

So the conclusion is that this possibly has nothing to do with ext3. It's not openssh, because I had problems with other files/dirs, too... Maybe it's evms? Maybe it's the kernel? It's a stock 2.4.19, only with the evms and vserver patches. I don't think it's a distro problem...

So sorry for talking about this on the ext3 list! Thanks for all the help!

viktor

more comments below...

> > Seems interesting.
> > I forgot to mention (yes, sorry, it's important piece of information),
> > that I have RAID 1 (mirrored disks), so HW problem is less possible.
> > And I have reiserfs partition on the mirror too, without any problem.
>
> Raid protects you against disk failures. It does not protect you from
> cable problems causing data corruption, or your RAID controller going
> insane. Unfortunately a lot of people seem to believe that just
> because they have RAID, they are immune from hardware problems, and
> then stop doing backups. I usually hear from them after they've
> gotten screwed, and when they ask if I can perform miracles....

Yes, RAID is completely different from backup. RAID doesn't protect you from rm -fr / ;))

> In any case, the scenario I described (a controller/cable problem, or
> an incorrectly configured IDE DMA settings) are all still possible
> with RAID; RAID does not help you prevent these sorts of problems.

It's SW RAID-1; the disks are on the same controller, but on different buses / cables. Am I right that in this case HW errors are *very* unlikely? That would mean that there are exactly the same bit errors at exactly the same time on different cables/disks...

> As far as your not noticing the problem with reiserfs that could be
> because you've been lucky, and not noticed because the block addresses
> causing the problem do not (yet) contain data. But the symptoms
> you've described sound very much like hardware induced errors.
>
> > Anyway, do you have an idea how to test for HW errors?
>
> Well, if you have a scratch partition that's not being used, you can
> try using the badblocks program. Try using the -w option, which will
> do a read/write test. This doesn't do a random access test, so it
> might not detect any problems, though.
>
> I'd suggest checking your internal cabling, and replacing the
> controller cable if it looks dubious. Making everything is well
> plugged in, too.

I use the most expensive, twisted, shielded, etc. cables, well plugged in, at least visually...

Thanks for all the answers!
viktor
> As I said earlier, it's probably a hardware problem, or perhaps a
> combination of hardware and kernel (i.e., the kernel tries to be too
> aggressive with the IDE DMA configuration, as Stephen conjectured).
>
> > > In any case, the scenario I described (a controller/cable problem, or
> > > an incorrectly configured IDE DMA settings) are all still possible
> > > with RAID; RAID does not help you prevent these sorts of problems.
> >
> > It's SW RAID-1, disks are on the same controller,
> > but different buses / cables.
> > Am I right, that in this case HW errors are *very* unlikely?
> > That would mean that there are exactly the same bits of errors at exactly
> > the same time on different cables/disks...
>
> Nope, you're incorrect here. When you read from a SW-RAID-1 array,
> the Software Raid driver picks one or the other disk (whichever one is
> available) and reads from that particular disk. It does *not*
> read the block from both disks, and compare the blocks read from both
> disks to make sure they are identical, as you seem to believe.

Yes, I did believe that. So only one more question (and answer) would be useful for all of us: do you know if there is a mode switch for a RAID-1 setup (in my case evms-raid) to do this comparison? It would make sense as an option for debugging, and also for high-availability production.

viktor
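The comparison asked about here can at least be approximated offline, outside the RAID driver. The following is a minimal sketch under several assumptions: the function name `compare_mirrors` is invented for illustration, both mirror members can be read directly, and nothing is writing to the array while the comparison runs. Any per-disk RAID metadata stored on the members would be expected to differ, so mismatches in that region would not by themselves indicate corruption.

```python
def compare_mirrors(path_a, path_b, block_size=4096, max_report=100):
    """Compare two files or devices block by block.

    Returns the block numbers that differ, stopping after `max_report`
    mismatches. A length difference shows up as a mismatch in the tail.
    """
    mismatches = []
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        block = 0
        while len(mismatches) < max_report:
            data_a = a.read(block_size)
            data_b = b.read(block_size)
            if not data_a and not data_b:
                break  # both sides exhausted
            if data_a != data_b:
                mismatches.append(block)
            block += 1
    return mismatches
```

With a faulty cable or controller, the same block might compare differently from one run to the next, so running the comparison more than once and comparing the reported block numbers could be more telling than a single pass.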
> > Do You know about if there is a mode switch for RAID-1 setup (my case is
> > evms-raid) to do this comparison?
> > This makes sense as an option for debugging and for high availability
> > production also.
>
> No there isn't, on any RAID systems that I'm aware of.

This really breaks my confidence in RAID-1 mirrors. Would the situation be better with a four-disk RAID-5? As I imagine it, it should be... I prefer definite errors to unknown failures. Then it would show up as a disk error, not as random segfaults.

If this phenomenon is a HW error, should it be logged anywhere? I didn't find anything in syslog...

I will stop this thread now, really, it is getting off the list's topic...

Thanks for all your help!
viktor