A couple of months ago I reported some problems with a batch of Tyan K8SSA (S3870) based machines. We are continuing to have an odd problem with these boxes, and if anyone has seen something similar elsewhere, I'd appreciate hearing about it.

These boxes are running CentOS 4.4 x86_64 with kernel 2.6.9-42.0.3.ELsmp. They are dual Opteron 265s (dual core) with 4x2GB DIMMs. The DIMMs used to be mixed sizes, but Tyan recommended making them all the same, and the vendor made the substitutions. We have also clocked the memory down from 400 MHz to 266 MHz, also on the advice of Tyan.

The symptom is that some large (700MB to >1GB) files opened for read and then closed show corruption in the pagecache. One or more 4k blocks in a file will be completely trashed. It's as if a random page of other data is substituted. A reboot or a flush of the pagecache fixes the problem, so the corruption is only in the pagecache, not on disk. We are doing regular MD5 checksums of the files, which is how the problem shows up, in addition to our application crashing from time to time.

We have some older Tyan motherboards that don't show this problem. At this point it seems to be either a hardware problem or a kernel motherboard-support problem, but it's pretty baffling.

Thanks,
Dan
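Since the damage shows up as whole 4k blocks, a whole-file MD5 only says that something is wrong, not where. A minimal sketch of a per-block check follows. It assumes a 4 KiB page size, and the function names `block_digests` and `changed_blocks` are made up for illustration; they are not from any tool used in this thread.

```python
import hashlib

BLOCK_SIZE = 4096  # assumption: 4 KiB pages, matching the "4k blocks" described above


def block_digests(path, block_size=BLOCK_SIZE):
    """Return a list of MD5 hex digests, one per block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.md5(block).hexdigest())
    return digests


def changed_blocks(baseline, current):
    """Return the indices of blocks whose digest differs from the baseline.

    A block that exists in only one of the two lists counts as changed.
    """
    length = max(len(baseline), len(current))
    return [
        i for i in range(length)
        if i >= len(baseline) or i >= len(current) or baseline[i] != current[i]
    ]
```

The idea would be to record `block_digests()` once while the file is known to be good, then rerun it later and pass both lists to `changed_blocks()`. The returned indices multiplied by the block size give the byte offsets of the suspect pages.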
What you should be doing is swapping the CPUs and RAM modules from board to board. With the help of someone who can do some statistical analysis for you, you can quickly pinpoint whether the problem resides in the motherboards, in particular CPUs or RAM modules, or in some combination thereof. Presumably, since the servers are so prone to error at the moment, they are not doing anything important, so you can easily swap parts around. If you can include in this trial some identical servers that seem to be working fine, this will greatly speed up the process of apportioning blame.
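As a rough illustration of the bookkeeping such a swap trial needs, here is a small sketch that tallies how often each individual part was present in a failing run. The record layout and the name `failure_rates` are assumptions made for this example, and raw rates like these are only a first look, not a proper significance test.

```python
from collections import defaultdict


def failure_rates(trials):
    """Tally failures per component.

    `trials` is an iterable of (components, failed) pairs, where `components`
    is a dict such as {"board": "B1", "cpu": "C3", "ram": "R2"} (hypothetical
    labels) and `failed` is True if corruption was seen in that run.

    Returns {(kind, ident): (failures, runs, failure_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [failures, runs]
    for components, failed in trials:
        for kind, ident in components.items():
            counts[(kind, ident)][1] += 1
            if failed:
                counts[(kind, ident)][0] += 1
    return {
        key: (fails, runs, fails / runs)
        for key, (fails, runs) in counts.items()
    }
```

A part whose failure rate stays high no matter which board it is moved to is a suspect; if the failures follow the boards instead, the CPUs and RAM are probably fine.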
Dan Halbert spake the following on 2/28/2007 8:21 PM:
> A couple of months ago I reported some problems with a batch of Tyan
> K8SSA (S3870) based machines. We are continuing to have an odd problem
> with these boxes, and if anyone has seen something similar elsewhere,
> I'd appreciate hearing about it.
>
> These boxes are running CentOS 4.4 x86_64 with kernel
> 2.6.9-42.0.3.ELsmp. They are dual Opteron 265s (dual core) with 4x2GB
> DIMMs. The DIMMs used to be mixed sizes, but Tyan recommended making
> them all the same, and the vendor made the substitutions. We have also
> clocked the memory down from 400 MHz to 266 MHz, also on the advice of
> Tyan.
>
> The symptom is that some large (700MB to >1GB) files opened for read and
> then closed show corruption in the pagecache. One or more 4k blocks in a
> file will be completely trashed. It's as if a random page of other data
> is substituted. A reboot or a flush of the pagecache fixes the problem,
> so it's only in the pagecache, not on disk. We are doing regular MD5
> checksums of the files, which shows up the problem, in addition to
> having our application crash from time to time.
>
> We have some older Tyan motherboards that don't show this problem. At
> this point it seems it is either a hardware problem or a kernel
> motherboard-support problem, but it's pretty baffling.
>
> Thanks,
> Dan

Have you tried a newer kernel to see if it changes the problem?

--
MailScanner is like deodorant...
You hope everybody uses it, and you notice quickly if they don't!!!!
To follow up on issues we are having with the Tyan S3870 (K8SSA) Opteron motherboards:

We actually saw another problem with these boxes, but only with i386 Linux (CentOS, FC6, etc.). A certain compute-intensive application that also read about 10MB of data files would get wrong answers when several instances were run in parallel. (Interestingly, a yum update I ran on the box also got occasional strange errors.) This was an easier error to check for, since I could reproduce it in a few minutes.

After systematically trying many different memory swaps and BIOS settings (including memory timings), I discovered that booting with "noapic" fixed the problem above. We haven't yet completed an x86_64 pagecache corruption test with "noapic", but I strongly suspect that these problems are related. Running with maxcpus=1 also fixes the problem, which confirms it's an SMP-related problem. I'll report back one more time if noapic fixes our pagecache problems.

Tyan updated the BIOS for this board a few versions back to fix a booting problem with x86_64 Red Hat. Some people worked around that problem with noapic. I wonder if the BIOS still has some problems...

Thanks for all your suggestions,
Dan
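For anyone wanting a quick reproduction check along the same lines, the sketch below runs several identical read-and-hash jobs at once and reports whether they all produce the same answer. It is only a stand-in for the real application, which is not shown in this thread; the names `digest_file` and `parallel_results_agree` are invented for the example. It uses threads rather than separate processes, relying on `hashlib` releasing the interpreter lock while hashing large buffers.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest_file(path, rounds=1):
    """Read the file `rounds` times and return one MD5 hex digest over all reads."""
    h = hashlib.md5()
    for _ in range(rounds):
        with open(path, "rb") as f:
            # Read in 1 MiB chunks until EOF.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return h.hexdigest()


def parallel_results_agree(path, workers=4, rounds=1):
    """Run `workers` identical jobs concurrently.

    Returns (all_agree, results), where `results` is the list of digests.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: digest_file(path, rounds), range(workers)))
    return len(set(results)) == 1, results
```

On a healthy machine every job should return the same digest. Any disagreement between jobs reading the same unchanged file would point at the kind of parallel-only corruption described above.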
Dan Halbert wrote:
> A couple of months ago I reported some problems with a batch of Tyan K8SSA (S3870) based machines.
>...
> The symptom is that some large (700MB to >1GB) files opened for read and
> then closed show corruption in the pagecache. One or more 4k blocks in a
> file will be completely trashed...
> A reboot or a flush of the pagecache fixes the problem,
> so it's only in the pagecache, not on disk.

One more follow-up on this, for posterity. (I don't like unanswered questions in mailing-list archives.)

It turns out this problem seems to be the same one reported in this kernel bug:
http://bugzilla.kernel.org/show_bug.cgi?id=7768
It has also been discussed on LKML. The bug was reported on AMD/Nvidia boards; we have AMD/ServerWorks, but the problem appears to be the same. AMD is working on this. The current workaround is to boot with "iommu=soft".

Dan
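With several boot options tried over the course of this thread, it is easy to lose track of which ones a given box was actually booted with. A minimal sketch for checking that follows, assuming the kernel command line is read from /proc/cmdline. The helper names are made up for illustration, and the parser does not handle quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_workarounds(cmdline):
    """Return which of the options mentioned in this thread are present."""
    params = parse_cmdline(cmdline)
    found = []
    if "noapic" in params:
        found.append("noapic")
    if params.get("maxcpus") == "1":
        found.append("maxcpus=1")
    # Assumption: iommu= may carry several comma-separated options.
    if "soft" in (params.get("iommu") or "").split(","):
        found.append("iommu=soft")
    return found
```

On a running box this could be called as `active_workarounds(open("/proc/cmdline").read())`, for example from the same job that runs the checksum tests, so that each result is logged together with the options that were in effect.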