We have been seeing failures with CentOS 4.4 i386 (not x86_64) running compute-intensive programs on Tyan K8SRE (S2891) motherboards with Opteron 265s. This motherboard is used in the Tyan barebones box GT24 (B2881). We have these boards populated with 8GB of RAM, consisting of mixed 2GB and 1GB sticks.

The symptom is that CPU-bound programs (may or may not be related to floating point) fail randomly and intermittently, with wrong answers or segfaults. Running several in parallel seems to make the failures more likely. We have not seen any kernel crashes. It is not hard to reproduce the problem with some internal programs we have; it takes only a few minutes.

This is using a completely-up-to-date-as-of-yesterday CentOS 4.4 i386; hugemem or not doesn't make a difference. We have seen this on many boxes, so it's not bad memory. We do NOT see this problem if we run CentOS 4.4 x86_64 on the same boxes, using the same 32-bit test executables. We also don't see this problem on some slightly older boxes with Tyan K8SD motherboards running CentOS 4.4 i386 (also Opteron 265s, with 8GB of 1GB DIMMs).

We have been looking at BIOS settings, but haven't seen anything that stands out. memtest86 does not show errors.

Thanks for any suggestions of what this issue might be,
Dan
Dan Halbert wrote:
> We have been seeing failures with CentOS 4.4 i386 (not x86_64) running
> compute-intensive programs on Tyan K8SRE (S2891) motherboards,
> running Opteron 265's. This motherboard is used in the Tyan barebones
> box GT24 (B2881). We have these boards populated with 8GB of RAM,
> consisting of mixed 2GB and 1GB sticks.
>
> The symptom is that CPU-bound programs (may or may not be related to
> floating point) fail randomly and intermittently, with wrong answers
> or segfaults. Running several in parallel seems to make the failures
> more likely. We have not seen any kernel crashes. It is not hard to
> reproduce the problem with some internal programs we have; it takes
> only a few minutes.

as an FPU test, try this from a user account:

    mkdir mprime
    cd mprime
    wget ftp://mersenne.org/gimps/mprime2414.tar.gz
    tar xzvf mprime2414.tar.gz
    ./mprime -A0 -t &
    ./mprime -A1 -t &

(if you have two dual-core Opterons, do this twice more with -A2 and -A3)

this will HAMMER the cpu/cache/memory bus with intensive FPU operations. let it run all night on an otherwise idle box, and note any errors spewed to the terminal. each instance will use about 16MB of RAM, and will be executing FPU/SSE operations at near peak speed. it auto-nices itself to minimize the impact on the rest of the system. your CPUs will run hotter than they've ever run before :)

hey, I thought mixing DIMM sizes was verboten on Opterons?
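The per-core launch described above can be sketched as a small loop. This is only a sketch: it assumes `./mprime` sits in the current directory and accepts `-A<n> -t` exactly as used in the message, and the function names `mprime_cmds` and `start_mprime` are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumes ./mprime is in the current directory and takes
# the -A<n> -t options exactly as shown in the message above.
# The function names here are hypothetical helpers, not part of mprime.

# mprime_cmds N: dry run, print the command line for each of N instances.
mprime_cmds() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "./mprime -A$i -t"
        i=$((i + 1))
    done
}

# start_mprime N: start N torture-test instances in the background,
# numbered 0..N-1, with their output left on the terminal.
start_mprime() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        ./mprime -A"$i" -t &
        i=$((i + 1))
    done
}
```

To start one instance per core on Linux, something like `start_mprime "$(grep -c ^processor /proc/cpuinfo)"` should work; running `mprime_cmds 4` first shows what would be launched.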
have you tried single sourcing the RAM in one of those machines? I think the mixed capacities are causing issues.

Dan Halbert wrote:
> We have these boards populated with 8GB of RAM, consisting of
> mixed 2GB and 1GB sticks.
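To see at a glance whether a box has mixed DIMM capacities, the memory-device records from `dmidecode` can be summarized. A rough sketch, assuming the usual `Size: <number> <unit>` lines in that output; the helper name `dimm_sizes` is hypothetical.

```shell
#!/bin/sh
# Sketch only. dimm_sizes is a hypothetical helper name.
# Reads dmidecode memory-device output on stdin and prints how many
# populated DIMMs there are of each size. Empty slots (for example
# "Size: No Module Installed") are skipped because their second field
# is not a number.
dimm_sizes() {
    awk '$1 == "Size:" && $2 ~ /^[0-9]+$/ { count[$2 " " $3]++ }
         END { for (s in count) print count[s] " x " s }' | sort
}
```

Run it as root with `dmidecode --type 17 | dimm_sizes`. Older dmidecode releases may not support `--type`; piping the full `dmidecode` output may also work, but could pick up unrelated lines. More than one line of output means the capacities are mixed.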
Dan Halbert wrote:
> We have been seeing failures with CentOS 4.4 i386 (not x86_64) running
> compute-intensive programs on Tyan K8SRE (S2891) motherboards...

I talked with a friend who's been using a lot of Thunder T8WE boards with dual Opterons for structural engineering systems software. He says:

They are VERY PICKY about memory, I have learned. Do *NOT*, I repeat, do *NOT* cheap out on your memory. I have hosed a perfectly good system that took almost a week of troubleshooting due to sh**ty memory corrupting the RAID.

We now only buy memory that is certified for Supermicro and Tyan systems. It doesn't cost a whole lot more than the cheap stuff.

memtest might not catch issues that are multiprocessor related.
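Since the failures in this thread show up as wrong answers when several CPU-bound programs run at once, one cheap cross-check is to run identical deterministic jobs in parallel and compare their outputs. The sketch below is only an illustration: the `job` function is a stand-in awk loop, not a real stress test, and both function names are hypothetical. Swap in a real workload that prints a deterministic result.

```shell
#!/bin/sh
# Sketch only. job and parallel_check are hypothetical helper names.

# job: a deterministic CPU-bound task; every correct run prints the
# same single line. Replace this with a real workload.
job() {
    awk 'BEGIN { s = 0
                 for (i = 1; i <= 200000; i++) s = (s * 31 + i) % 1000003
                 print s }'
}

# parallel_check N: run N copies of job at the same time. Succeeds only
# if all N copies produced output and every output line is identical.
parallel_check() {
    n=$1
    dir=$(mktemp -d) || return 2
    i=0
    while [ "$i" -lt "$n" ]; do
        job > "$dir/out.$i" &
        i=$((i + 1))
    done
    wait    # block until every background copy has finished
    total=$(cat "$dir"/out.* | wc -l | tr -d ' ')
    distinct=$(cat "$dir"/out.* | sort -u | wc -l | tr -d ' ')
    rm -rf "$dir"
    [ "$total" -eq "$n" ] && [ "$distinct" -eq 1 ]
}
```

For example, `parallel_check 4 || echo "MISMATCH"` in a loop overnight would flag any run where one copy crashed or disagreed with the others.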