thr3ads.net - Ext3 users - Ext3 strangeness data loss [Feb 2003]

Bodrogi Viktor

2003-Feb-03 07:50 UTC

Ext3 strangeness data loss

Hi folks,

I'm in really big trouble with ext3.
At about every second reboot I find files changed on my ext3 filesystem!
In most cases I notice that sshd didn't start, and on examination
I find that /usr/sbin/sshd or /lib/libutil-x.y.so has changed.
But when I reboot, everything seems OK.
I once looked into the binary and found parts of syslog in it!!!
Horror!
And this is a usual symptom.
Other times I found lost files and directories.
And what about the losses I don't find immediately?
I have daily backups, so I could restore, but...
It's not what I expect from a filesystem these days...
I had been using reiserfs for more than a year without problems.
But I need extended attributes, and anyway I didn't plan to switch my root
filesystem in the middle of a project. But I cannot put the server into
production until this is fixed.
It seems this happens only during reboot, and a second reboot mostly fixes
it, at least for /usr/sbin/sshd, but not always. And a remote server
without a working sshd is bad enough on its own!

Has this happened to any of you?

About my config:

Kernel: 2.4.19 with minimal patches (evms and vserver).
Using EVMS.
Glibc 2.3.1.
Gentoo 1.4-rc2 distribution.

Any help or experience would be appreciated!

Thanks in advance,
viktor at neotek dot hu

Norman Schmidt

2003-Feb-03 11:19 UTC

Re: Ext3 strangeness data loss

Hi Viktor!

I use a 2.4.20 kernel with all the patches from 
http://www.zipworld.com.au/~akpm/linux/ext3/
So far, everything works fine, with external journal and with the normal 
internal one (no data=journal).

You might try that if you want to upgrade your kernel.

Bye, Norman.

-- 
--

Norman Schmidt          Institut für Physikal. u. Theoret. Chemie
cand. chem.             Friedrich-Alexander-Universitaet
schmidt@naa.net         Erlangen-Nuernberg

i.t

2003-Feb-03 12:01 UTC

Re: Ext3 strangeness data loss

On Monday, 03 February 2003 08:50, Bodrogi Viktor wrote:
> Kernel: 2.4.19 with minimal patches (evms and vserver).
> Using EVMS.
> Glibc 2.3.1.
> Gentoo 1.4-rc2 distribution.
>
> Every help and experience would help!

I've given up on gentoo rc's for different reasons...
-- 
 . ___
 |  |  Irmund     Thum
 |  |

Theodore Ts'o

2003-Feb-03 20:05 UTC

Re: Ext3 strangeness data loss

On Mon, Feb 03, 2003 at 07:50:50AM -0000, Bodrogi Viktor wrote:
> Hi folks,
> 
> I'm in really big trouble with ext3.
> At about every second reboot I have files changed on my ext3 filesystem!
> In most cases I realize that sshd didn't start, and after examination
> I found that /usr/sbin/sshd or /lib/libutil-x.y.so changed.
> But when I reboot, everything seems OK.
> I looked once into the binary, and find parts of syslog in it!!!
This sounds like a hardware problem.  It's likely that incorrect
blocks are getting read into the page cache.  So when you look at the
file, you see incorrect data.  When you reboot, that clears the page
cache, and file then looks OK again.
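One way to check for this kind of corruption is to hash the files as the running system sees them and compare against a known-good copy. The following is only a minimal sketch: the function names and the backup layout (a directory that mirrors absolute paths) are assumptions made for illustration, not something described in this thread.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(live_paths, backup_root):
    """Compare each live file with its copy under backup_root.

    Returns the live paths whose contents differ from the backup copy,
    or whose backup copy is missing.
    """
    mismatched = []
    for live in map(Path, live_paths):
        # Assumed (hypothetical) layout: the backup mirrors absolute paths,
        # e.g. /usr/sbin/sshd is stored as <backup_root>/usr/sbin/sshd.
        backup = Path(backup_root) / live.relative_to(live.anchor)
        if not backup.is_file() or sha256_of(live) != sha256_of(backup):
            mismatched.append(str(live))
    return mismatched
```

The reads go through the page cache, so the digest reflects what the kernel currently holds for the file. If a file mismatches on one boot and matches again on the next without having been rewritten, that would be consistent with a bad read rather than bad data on disk.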
> It seems this happens during reboot only, and a second reboot mostly fixes
> it, at least with /usr/sbin/sshd, but not always. But even without sshd it's
> too bad to have a remote server!
If the problem is related to whether your system is warm booted
versus cold booted, that might explain why the second reboot fixes
things for you.
> Had anyone of you this happening?
Every time I've heard of anything like this, it's turned out to be a
hardware problem.

Good luck!!

						- Ted

Bodrogi Viktor

2003-Feb-03 22:53 UTC

Re: Ext3 strangeness data loss

Hi!
> 
> This sounds like a hardware problem.  It's likely that incorrect
> blocks are getting read into the page cache.  So when you look at the
> file, you see incorrect data.  When you reboot, that clears the page
> cache, and file then looks OK again.
> 
Seems interesting.
I forgot to mention (yes, sorry, it's an important piece of information)
that I have RAID 1 (mirrored disks), so a HW problem is less likely.
And I have a reiserfs partition on the mirror too, without any problems.

Anyway, do you have an idea how to test for HW errors?

thanks for the answers!

viktor at neotek dot hu

Theodore Ts'o

2003-Feb-04 04:42 UTC

Re: Ext3 strangeness data loss

On Mon, Feb 03, 2003 at 10:53:08PM -0000, Bodrogi Viktor wrote:
>
> Seems interesting.
> I forgot to mention (yes, sorry, it's important piece of information),
> that I have RAID 1 (mirrored disks), so HW problem is less possible.
> And I have reiserfs partition on the mirror too, without any problem.
Raid protects you against disk failures.  It does not protect you from
cable problems causing data corruption, or your RAID controller going
insane.  Unfortunately a lot of people seem to believe that just
because they have RAID, they are immune from hardware problems, and
then stop doing backups.  I usually hear from them after they've
gotten screwed, and when they ask if I can perform miracles....

In any case, the scenarios I described (a controller/cable problem, or
incorrectly configured IDE DMA settings) are all still possible
with RAID; RAID does not help you prevent these sorts of problems.

As for your not noticing the problem with reiserfs, that could be
because you've been lucky, and the block addresses
causing the problem do not (yet) contain data.  But the symptoms
you've described sound very much like hardware-induced errors.
> Anyway, do you have an idea how to test for HW errors?
Well, if you have a scratch partition that's not being used, you can
try using the badblocks program.  Try using the -w option, which will
do a read/write test.  This doesn't do a random access test, so it
might not detect any problems, though.
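A random-access variant of such a read/write test could look roughly like the sketch below. It is not badblocks and makes several assumptions: the function names are made up, it runs against a scratch file or scratch partition whose contents may be destroyed, and the read-back may simply be served from the page cache unless the test region is larger than RAM or the cache is otherwise bypassed.

```python
import hashlib
import os
import random

BLOCK_SIZE = 4096


def block_pattern(index):
    """Deterministic payload unique to each block, so a block that comes
    back from the wrong address is detected, not just a flipped bit."""
    seed = hashlib.sha256(index.to_bytes(8, "little")).digest()  # 32 bytes
    return seed * (BLOCK_SIZE // len(seed))


def write_blocks(path, num_blocks, seed=0):
    """Write num_blocks patterned blocks to path in random order.
    Destroys whatever path contained before."""
    order = list(range(num_blocks))
    random.Random(seed).shuffle(order)
    with open(path, "wb") as handle:
        for index in order:
            handle.seek(index * BLOCK_SIZE)
            handle.write(block_pattern(index))
        handle.flush()
        os.fsync(handle.fileno())  # ask the kernel to push the data to the device


def verify_blocks(path, num_blocks, seed=1):
    """Read the blocks back in a different random order and return the
    sorted indices of blocks whose contents do not match the pattern."""
    order = list(range(num_blocks))
    random.Random(seed).shuffle(order)
    bad = []
    with open(path, "rb") as handle:
        for index in order:
            handle.seek(index * BLOCK_SIZE)
            if handle.read(BLOCK_SIZE) != block_pattern(index):
                bad.append(index)
    return sorted(bad)
```

Writing a different pattern to every block is a deliberate choice: with the symptom described in this thread (contents of one file showing up inside another), the interesting failure is data landing at or coming from the wrong address, which a single repeated test pattern would not reveal.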

I'd suggest checking your internal cabling, and replacing the
controller cable if it looks dubious.  Make sure everything is well
plugged in, too.

Good luck!

						- Ted

Bodrogi Viktor

2003-Feb-04 12:47 UTC

Re: Ext3 strangeness data loss

Hi!

This morning I booted and, what a horror, found a bad superblock on /var!
fsck reported nothing, but mount said bad superblock.
It's the best thing that can happen after the project's due date but before
finishing it, isn't it?
So I decided to switch to reiserfs, which has performance advantages too.
After about the fifth reboot I could mount /var, and copied it to a new
partition together with the root partition.
And, terribly, I had the same problem with /usr/sbin/sshd at startup, this time
without any change to the binary, according to a diff against a probably-good
backup (who can be sure after all this...).

So the conclusion is that possibly this has nothing to do with ext3.
It's not openssh, because I had problems with other files/dirs, too...
Maybe it's evms?
Maybe it's the kernel?
It's a stock 2.4.19, with only the evms and vserver patches.
I don't think it's a distro problem...

So sorry for talking about this on the ext3 list!

Thanks for all help!

viktor

more comments below...
> > 
> > Seems interesting.
> > I forgot to mention (yes, sorry, it's important piece of information),
> > that I have RAID 1 (mirrored disks), so HW problem is less possible.
> > And I have reiserfs partition on the mirror too, without any problem.
> 
> Raid protects you against disk failures.  It does not protect you from
> cable problems causing data corruption, or your RAID controller going
> insane.  Unfortunately a lot of people seem to believe that just
> because they have RAID, they are immune from hardware problems, and
> then stop doing backups.  I usually hear from them after they've
> gotten screwed, and when they ask if I can perform miracles....
Yes, RAID is completely different from backup.
RAID doesn't protect you from rm -fr / ;))
> 
> In any case, the scenario I described (a controller/cable problem, or
> an incorrectly configured IDE DMA settings) are all still possible
> with RAID; RAID does not help you prevent these sorts of problems.
It's SW RAID-1; the disks are on the same controller,
but on different buses / cables.
Am I right that in this case HW errors are *very* unlikely?
That would mean exactly the same bit errors occurring at exactly
the same time on different cables/disks...
> As far as your not noticing the problem with reiserfs that could be
> because you've been lucky, and not noticed because the block addresses
> causing the problem do not (yet) contain data.  But the symptoms
> you've described sound very much like hardware induced errors.
> 
> > Anyway, do you have an idea how to test for HW errors?
> 
> Well, if you have a scratch partition that's not being used, you can
> try using the badblocks program.  Try using the -w option, which will
> do a read/write test.  This doesn't do a random access test, so it
> might not detect any problems, though.
> 
> I'd suggest checking your internal cabling, and replacing the
> controller cable if it looks dubious.  Making everything is well
> plugged in, too.
> 
I use the most expensive, twisted, shielded, etc. cables, plugged in well, at
least visually...

Thanks for all answers!

viktor

Bodrogi Viktor

2003-Feb-04 15:55 UTC

Re: Ext3 strangeness data loss

> As I said earlier, it's probably a hardware problem, or perhaps a
> combination of hardware and kernel (i.e., the kernel tries to be too
> aggressive with the IDE DMA configuration, as Stephen conjectured).
> 
> > > In any case, the scenario I described (a controller/cable problem, or
> > > an incorrectly configured IDE DMA settings) are all still possible
> > > with RAID; RAID does not help you prevent these sorts of problems.
> > 
> > It's SW RAID-1, disks are on the same controller,
> > but different buses / cables.
> > Am I right, that in this case HW errors are *very* unlikely?
> > That would mean that there are exactly the same bits of errors at exactly
> > the same time on different cables/disks...
> 
> Nope, you're incorrect here.  When you read from a SW-RAID-1 array,
> the Software Raid driver picks one or the other disk (whichever one is
> available) and reads from that particular disk.  It does *not*
> read the block from both disks, and compare the blocks read from both
> disks to make sure they are identical, as you seem to believe.
Yes, I did believe that.
So just one more question (and answer) that would be useful for all of us:
Do you know whether there is a mode switch for a RAID-1 setup (in my case
evms-raid) to do this comparison?
It would make sense as an option for debugging, and for high-availability
production too.
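Lacking such a switch, the comparison can at least be approximated offline by reading both mirror members and reporting where they differ. This is only a rough sketch under assumptions: the function name is made up, the array is assumed to be stopped or otherwise quiescent, both members are assumed to be readable as ordinary files, and any per-member RAID metadata may legitimately differ.

```python
def compare_mirrors(path_a, path_b, block_size=4096):
    """Return the byte offsets of blocks that differ between two mirror
    members. Blocks present in only one member count as differing."""
    mismatches = []
    offset = 0
    with open(path_a, "rb") as member_a, open(path_b, "rb") as member_b:
        while True:
            block_a = member_a.read(block_size)
            block_b = member_b.read(block_size)
            if not block_a and not block_b:
                break  # both members exhausted
            if block_a != block_b:
                mismatches.append(offset)
            offset += block_size
    return mismatches
```

A mismatch only says that the two copies disagree, not which of them is correct, so it is a debugging aid rather than a repair tool.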

viktor

Bodrogi Viktor

2003-Feb-04 16:43 UTC

Re: Ext3 strangeness data loss

> > Do You know about if there is a mode switch for RAID-1 setup (my case is
> > evms-raid) to do this comparision?
> > This makes sense as an option for debuging and for high availability
> > production also.
> 
> No there isn't, on any RAID systems that I'm aware of.
This really shakes my confidence in RAID-1 mirrors.
Would the situation be better with a four-disk RAID-5?
I imagine it should...
I prefer definite errors to unknown failures.
Then it would show up as a disk error, not as random segfaults.

If this phenomenon is a HW error, should it be logged anywhere?
I didn't find anything in syslog...

I will stop this thread now, really, it's getting off-topic for the list...

Thanks for all your help!

viktor
