On Thu, Apr 16, 2009 at 07:53:59AM -0400, Kyle Brandt wrote:
> On several of my servers I seem to have a high rate of server crashes due to
> file system errors. So I have some questions related to this:
>
> Is there any Mean Time Between Failures (MTBF) data for the ext3
> file-system?
>
> Does increased partition size cause a higher risk of the partition being
> corrupted? If so, is there any data on the ratio between partition size and
> the likelihood of failure?

The probability of these sorts of filesystem problems is going to be
dominated by hardware-induced corruption, so it doesn't make a lot of
sense to talk about MTBF without having a specific hardware context in
mind. If you have lousy memory, or a lousy disk controller cable, or a
cable connector which is loose, then corruption will happen often. If
you are located some place where there is a strong alpha particle
source, then you will have a much greater chance of bit flips. If you
use ECC memory, and do very careful hardware selection, with
enterprise-quality disks that trade off disk capacity for a much
stronger level of ECC codes, then of course the MTBF will be much
longer.

(For example, there was the infamous story in the early 1990s when
Sun had a spate of bad memory; I think it was ultimately traced to
radioactive contamination of the ceramic materials used to make their
memory chips. The contamination emitted alpha particles that caused
"bit flips", which made their customers rather antsy, especially since
Sun tried to deny there was even a problem for quite some time.)
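
To make the point about MTBF concrete, here is a rough sketch of how
an MTBF figure turns into a chance of seeing a failure. It assumes a
constant failure rate (an exponential model) and independent machines.
Both are simplifying assumptions, and the function names are just
illustrative. Any numbers you feed it have to come from your own
hardware, since there is no generic ext3 figure to plug in.

```python
import math


def failure_probability(mtbf_hours, interval_hours):
    """Chance of at least one failure within interval_hours.

    Assumes a constant failure rate, so the time to failure is
    exponentially distributed with mean mtbf_hours.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if interval_hours < 0:
        raise ValueError("interval_hours must be non-negative")
    return 1.0 - math.exp(-interval_hours / mtbf_hours)


def fleet_failure_probability(mtbf_hours, interval_hours, machines):
    """Chance that at least one of `machines` servers fails.

    Assumes the servers fail independently of each other, which is
    not true if they share a bad batch of memory or disks.
    """
    if machines < 0:
        raise ValueError("machines must be non-negative")
    survive_one = 1.0 - failure_probability(mtbf_hours, interval_hours)
    return 1.0 - survive_one ** machines
```

Under this model the same interval gives a very different answer
depending on the MTBF you assume, which is why the hardware context
matters more than the filesystem.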

So if you are having a high rate of server crashes, the first thing I
would do is to make sure you have the latest distribution updates;
it's possible the crashes are caused by a kernel bug that has since
been fixed, but it's somewhat unlikely. The next thing I would do is
take one of the crashing machines offline and run a 36-48 hour memory
test on it.

> Does ext3 on hardware raid (10) increase the possibility of file system
> corruption?

No, it shouldn't --- unless you have a buggy or otherwise dodgy
hardware raid controller.

- Ted