thr3ads.net - CentOS - [CentOS] pagecache corruption on Tyan S3870 [Mar 2007]

Dan Halbert

2007-Mar-01 04:21 UTC

[CentOS] pagecache corruption on Tyan S3870

A couple of months ago I reported some problems with a batch of Tyan 
K8SSA (S3870) based machines. We are continuing to have an odd problem 
with these boxes, and if anyone has seen something similar elsewhere, 
I'd appreciate hearing about it.

These boxes are running CentOS 4.4 x86_64 with kernel 
2.6.9-42.0.3.ELsmp. They are dual Opteron 265s (dual core) with 4x2GB 
DIMMs. The DIMMs used to be mixed sizes, but Tyan recommended making 
them all the same, and the vendor made the substitutions. We have also 
clocked the memory down from 400 MHz to 266 MHz, also on the advice of Tyan.

The symptom is that some large (700MB to >1GB) files opened for read and 
then closed show corruption in the pagecache. One or more 4k blocks in a 
file will be completely trashed. It's as if a random page of other data 
is substituted. A reboot or a flush of the pagecache fixes the problem, 
so it's only in the pagecache, not on disk. We are doing regular MD5 
checksums of the files, which shows up the problem, in addition to 
having our application crash from time to time.
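
For anyone who wants to narrow a whole-file MD5 mismatch down to individual 4k blocks, here is a minimal sketch in Python. It is not the checksum script used on these machines; the function names and the idea of keeping a per-block reference list are assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # the 4k block size described in the post


def block_digests(path, block_size=BLOCK_SIZE):
    """Return one MD5 hex digest per block of the file at `path`."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.md5(block).hexdigest())
    return digests


def changed_blocks(reference, current):
    """Return the indices of blocks whose digests differ between two runs.

    A block that exists in only one of the two lists counts as changed.
    """
    length = max(len(reference), len(current))
    return [
        i
        for i in range(length)
        if i >= len(reference) or i >= len(current) or reference[i] != current[i]
    ]
```

Recording `block_digests()` once while the file is known to be good, and comparing against a later read with `changed_blocks()`, would show which 4k blocks differ and whether the bad blocks move around between runs.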

We have some older Tyan motherboards that don't show this problem. At 
this point it seems it is either a hardware problem or a kernel 
motherboard-support problem, but it's pretty baffling.

Thanks,
Dan

DamianS

2007-Mar-01 10:40 UTC

What you should be doing is swapping the CPUs and RAM modules from
board to board.
With the help of someone who can do some statistical analysis for you,
you can quickly pinpoint whether the problem resides in the
motherboards, or some CPUs or RAM modules, or combinations thereof.

Presumably, since the servers are so prone to error at the moment, they
will not be doing anything important, allowing you to easily swap stuff
around.
If you can include in this trial some identical servers that seem to
be working fine, this will greatly speed up the process of apportioning
blame.
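
As a rough illustration of the bookkeeping such a swap trial needs, here is a minimal Python sketch that tallies failure rates per component. The record format and the labels in the usage note are hypothetical, and a simple tally like this is only a starting point, not a real statistical analysis.

```python
from collections import defaultdict


def failure_rates(trials):
    """Tally failures per component across swap trials.

    `trials` is an iterable of (components, failed) pairs, where
    `components` is a dict such as {"board": "B1", "cpu": "C3", "dimm": "D2"}
    (hypothetical labels) and `failed` is True if corruption was seen.

    Returns {(kind, label): (failures, runs, failure_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [failures, runs]
    for components, failed in trials:
        for kind, label in components.items():
            entry = counts[(kind, label)]
            entry[1] += 1
            if failed:
                entry[0] += 1
    return {key: (f, n, f / n) for key, (f, n) in counts.items()}
```

Each test run would be logged as one pair, for example `({"board": "B1", "dimm": "D1"}, True)`. A component whose failure rate stays high no matter what it is paired with is the first suspect.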

Scott Silva

2007-Mar-01 17:10 UTC

Dan Halbert spake the following on 2/28/2007 8:21 PM:
> A couple of months ago I reported some problems with a batch of Tyan
> K8SSA (S3870) based machines. We are continuing to have an odd problem
> with these boxes, and if anyone has seen something similar elsewhere,
> I'd appreciate hearing about it.
> 
> These boxes are running Centos 4.4 x86_64 with kernel
> 2.6.9-42.0.3.ELsmp. They are dual Opteron 265's (dual core) with 4x2GB
> DIMM's. The DIMMs used to be mixed sizes, but Tyan recommended making
> them all the same, and the vendor made the substitutions. We have also
> clocked the memory down from 400 MHz to 266 MHz, also on the advice of
> Tyan.
> 
> The symptom is that some large (700MB to >1GB) files opened for read and
> then closed show corruption in the pagecache. One or more 4k blocks in a
> file will be completely trashed. It's as if a random page of other data
> is substituted. A reboot or a flush of the pagecache fixes the problem,
> so it's only in the pagecache, not on disk. We are doing regular MD5
> checksums of the files, which shows up the problem, in addition to
> having our application crash from time to time.
> 
> We have some older Tyan motherboards that don't show this problem. At
> this point it seems it is either a hardware problem or a kernel
> motherboard-support problem, but it's pretty baffling.
> 
> Thanks,
> Dan

Have you tried a newer kernel to see if it changes the problem?



Dan Halbert

2007-Mar-06 22:08 UTC

To follow up on issues we are having with the Tyan S3870 (K8SSA) Opteron
motherboards:

We actually saw another problem with these boxes, but only with i386 Linux
(CentOS, FC6, etc.). A certain compute-intensive application that also read
about 10MB of data files would get wrong answers when several instances were run
in parallel. (Interestingly, a yum update I ran on the box also got occasional
strange errors.)

This was an easier error to check for, since I could reproduce the error in a
few minutes.

After systematically trying many different memory swaps and BIOS settings
(including memory timings), I discovered that booting with "noapic"
fixed the problem above. We haven't yet completed an x86_64 pagecache
corruption test with "noapic", but I am pretty suspicious that these
problems are related. Running with maxcpus=1 also fixes the problem, which
suggests it's an SMP-related problem.

I'll report back one more time if noapic fixes our pagecache problems.

Tyan updated the BIOS for this board a few versions back to fix a booting
problem with x86_64 Red Hat. Some people worked around that problem with noapic.
I wonder if the BIOS still has some problems...

Thanks for all your suggestions,
Dan

Dan Halbert

2007-Mar-28 23:52 UTC

Dan Halbert wrote:
> A couple of months ago I reported some problems with a batch of Tyan K8SSA
> (S3870) based machines.
> ...
> The symptom is that some large (700MB to >1GB) files opened for read and
> then closed show corruption in the pagecache. One or more 4k blocks in a
> file will be completely trashed...
> A reboot or a flush of the pagecache fixes the problem,
> so it's only in the pagecache, not on disk.

One more followup on this, for posterity. (I don't like unanswered questions
in mailing-list archives.) It turns out this problem seems to be the same one
reported in this kernel bug: http://bugzilla.kernel.org/show_bug.cgi?id=7768. It
has also been discussed on LKML.

The bug was reported on AMD boards with Nvidia chipsets; ours have ServerWorks
chipsets, but the problem appears to be the same. AMD is working on this. The
current workaround is to boot with "iommu=soft".
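
To check which of the boot-time workarounds mentioned in this thread (noapic, maxcpus=1, iommu=soft) a given box is running with, a small Python sketch like the following could parse the kernel command line. The function names are made up for illustration. The comma split on the iommu value is an assumption that the option may carry several comma-separated settings.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


def workarounds_present(cmdline):
    """Return the workarounds from this thread found on the command line."""
    params = parse_cmdline(cmdline)
    found = []
    if params.get("noapic") is True:
        found.append("noapic")
    if params.get("maxcpus") == "1":
        found.append("maxcpus=1")
    iommu = params.get("iommu")
    # Assumption: iommu= may hold several comma-separated settings.
    if isinstance(iommu, str) and "soft" in iommu.split(","):
        found.append("iommu=soft")
    return found
```

On a running Linux system the input string can be read from /proc/cmdline, for example with `workarounds_present(open("/proc/cmdline").read())`.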

Dan
