I recently upgraded to kernel 2.6.18-194.3.1.el5, and within several days the machine crashed with the following error (repeating in mcelog):

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 MISC 41
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
Memory address parity error
Memory corrected error count (CORE_ERR_CNT): 911
Memory transaction Tracker ID (RTId): 41
Memory DIMM ID of error: 0
Memory channel ID of error: 0
Memory ECC syndrome: 0
STATUS ea10e3c0008000b0 MCGSTATUS 0

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 MISC 41
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
Memory address parity error
Memory corrected error count (CORE_ERR_CNT): 7970
Memory transaction Tracker ID (RTId): 41
Memory DIMM ID of error: 0
Memory channel ID of error: 0
Memory ECC syndrome: 0
STATUS ea17c880008000b0 MCGSTATUS 0

Every time the error occurs, the only values that change are CORE_ERR_CNT and STATUS.

Since this appears to be a memory error, I have run memtest86+ many times, but it does not report any errors.

When I revert to the older kernels listed below and test, the error above is generated only once (after boot) and is not reported again. It definitely did not cause a kernel panic or crash the machine.

CentOS-5.4 (2.6.18-164.15.1.el5)
CentOS (2.6.18-164.9.1.el5)
CentOS (2.6.18-164.el5)

Would this error indicate a motherboard or CPU problem? How can I diagnose it? Or is there something funny with the kernel?

Hardware:
Supermicro X8DTL-iF motherboard
Intel Server Xeon E5502 1.86GHz Nehalem
8GB RAM, Kingston DDR3-1333 w/ Parity w/ Thermal Sensor

I have read a Bugzilla note about mcelog not supporting Nehalem processors when decoding errors. I think this is fixed in CentOS 5.5, but maybe there is still a bug?
https://bugzilla.redhat.com/show_bug.cgi?id=473392
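For anyone who wants to cross-check mcelog's decoding against the raw STATUS values quoted in this post, here is a minimal Python sketch. It is not mcelog's own code. The flag bits (63 down to 57) and the memory controller error code format in the low 16 bits follow the architectural IA32_MCi_STATUS layout as I understand it from Intel's documentation. Two things are assumptions on my part, chosen because they reproduce the numbers in the log: the corrected error count is read as the 14 bits 51:38 with bit 52 treated as a sticky overflow bit, and bit 23 is read as the Nehalem bank 8 address parity flag. The name decode_status is hypothetical.

```python
# Sketch only: decode a raw MCi_STATUS value such as the ones in the log.
# Bit positions 63..57 follow the architectural IA32_MCi_STATUS layout.
FLAGS = [
    (63, "Entry valid"),
    (62, "Error overflow"),
    (61, "Uncorrected error"),
    (60, "Error enabled"),
    (59, "MCi_MISC register valid"),
    (58, "MCi_ADDR register valid"),
    (57, "Processor context corrupt"),
]

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_TRANSACTION = {
    0: "Generic undefined request",
    1: "Memory read error",
    2: "Memory write error",
    3: "Address/Command error",
    4: "Memory scrubbing error",
}


def decode_status(status):
    """Return a dict describing one 64-bit MCi_STATUS value."""
    result = {
        "flags": [name for bit, name in FLAGS if (status >> bit) & 1],
        # Assumption: 14-bit count in bits 51:38, bit 52 = sticky overflow.
        "corrected_count": (status >> 38) & 0x3FFF,
        "count_overflow": bool((status >> 52) & 1),
        # Assumption: bit 23 is the Nehalem bank 8 address parity flag.
        "address_parity": bool((status >> 23) & 1),
    }
    mca_code = status & 0xFFFF
    if (mca_code & 0xFF80) == 0x0080:  # memory controller error format
        result["transaction"] = MEM_TRANSACTION.get((mca_code >> 4) & 0x7, "reserved")
        result["channel"] = mca_code & 0xF
    return result


if __name__ == "__main__":
    for raw in (0xEA10E3C0008000B0, 0xEA17C880008000B0):
        print(hex(raw), decode_status(raw))
```

Under those assumptions the two STATUS values decode to corrected counts of 911 and 7970, an Address/Command error on channel 0, and the same four flags mcelog printed, so the decoding in the log looks self-consistent rather than garbled.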
Hi Eric,

(2010/06/22 13:11), Eric Deis wrote:
> Transaction: Address/Command error

This is a motherboard (memory controller) problem, *not* a DIMM problem, and memtest can't detect this kind of error. Your data transfers (reads/writes) are sometimes hitting bit errors, and the Nehalem CPU's error detection feature (MCE) is reporting them.

Try a new motherboard. If your board always reports this error under the latest kernel, it is time to buy hardware from a certified vendor. Supermicro's board is not certified hardware, but in any case the message simply indicates a hardware problem.

Tsuyoshi.
Eric,

On Tuesday, June 22, 2010 you wrote:
> I have recently upgraded to 2.6.18-194.3.1.el5 and within several days
> the machine crashed with the following error (repeating in mcelog):
[...]
> Would this error indicate a motherboard or CPU problem? How can I
> diagnose? or is there something funny with the Kernel?

I ran across the same problem some time ago. At that time the kernel had been updated to recognize a new chipset (the one I use). The errors appeared with the new kernel, but of course they had been there (undetected) before. I checked the CPU, the memory and finally the main board, and it turned out that my TYAN board was faulty.

You may want to google the error message. I found that the exact error string appears for the first time in the new kernel version.

best regards
---
Michael Schumacher
PAMAS Partikelmess- und Analysesysteme GmbH
Dieselstr.10, D-71277 Rutesheim
Tel +49-7152-99630
Fax +49-7152-996333
Geschäftsführer: Gerhard Schreck
Handelsregister B Stuttgart HRB 252024
On Tuesday 22 June 2010, Eric Deis wrote:
> I have recently upgraded to 2.6.18-194.3.1.el5 and within several days
> the machine crashed with the following error (repeating in mcelog):

I'm guessing the old kernel just didn't notice.

The MCEs below indicate bad hardware. Since the DIMMs are a lot easier to debug, I'd suggest you start there (but it could be the system board too). Try running with half your DIMMs, then the other half.

/Peter

> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 2 BANK 8 MISC 41
...
On 06/22/10 12:21 AM, Peter Kjellstrom wrote:
> On Tuesday 22 June 2010, Eric Deis wrote:
>
>> I have recently upgraded to 2.6.18-194.3.1.el5 and within several days
>> the machine crashed with the following error (repeating in mcelog):
>>
> I'm guessing the old kernel just didn't notice.
>
> The below MCEs indicate bad hardware. Since the DIMMs are a lot easier to
> debug I'd suggest you start there (but it could be the systemboard too). Try
> running with half you DIMMs then the other half.
>

And on Nehalem (Xeon 5500, 5600), the memory controller is in the CPUs, so they are suspect too.

First, however, I'd see if there's a BIOS flash upgrade for the mainboard. These sometimes include microcode fixes for specific Intel CPUs, and may also have updated memory timing parameters.