thr3ads.net - CentOS - [CentOS] New kernel causes hardware error? [Jun 2010]

Eric Deis

2010-Jun-22 04:11 UTC

[CentOS] New kernel causes hardware error?

I have recently upgraded to 2.6.18-194.3.1.el5 and within several days 
the machine crashed with the following error (repeating in mcelog):

MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 MISC 41
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
Memory address parity error
Memory corrected error count (CORE_ERR_CNT): 911
Memory transaction Tracker ID (RTId): 41
Memory DIMM ID of error: 0
Memory channel ID of error: 0
Memory ECC syndrome: 0
STATUS ea10e3c0008000b0 MCGSTATUS 0
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 MISC 41
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
Processor context corrupt
MCA: MEMORY CONTROLLER AC_CHANNEL0_ERR
Transaction: Address/Command error
Memory address parity error
Memory corrected error count (CORE_ERR_CNT): 7970
Memory transaction Tracker ID (RTId): 41
Memory DIMM ID of error: 0
Memory channel ID of error: 0
Memory ECC syndrome: 0
STATUS ea17c880008000b0 MCGSTATUS 0

Every time the error occurs, the only values that change are 
CORE_ERR_CNT and STATUS.
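
The flags mcelog prints can be cross-checked against the raw STATUS value. The following is a minimal Python sketch, assuming the architectural MCi_STATUS flag layout described in the Intel SDM (bits 63 down to 57). The 14-bit count field starting at bit 38 is an assumption, chosen only because it reproduces the CORE_ERR_CNT values shown above. `decode_status` is a hypothetical helper name and is not part of mcelog.

```python
# Sketch: decode flag bits and code fields from a raw MCi_STATUS value.
# Flag positions follow the architectural layout in the Intel SDM.
FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status(status):
    """Return the set flags and the code fields of a 64-bit status word."""
    flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,            # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,  # bits 31:16
        # Assumption: 14-bit field from bit 38, picked to match the
        # CORE_ERR_CNT values in the mcelog output above.
        "corrected_count": (status >> 38) & 0x3FFF,
    }


# The two STATUS values from the records above.
for raw in ("ea10e3c0008000b0", "ea17c880008000b0"):
    decoded = decode_status(int(raw, 16))
    print(raw, "count =", decoded["corrected_count"])
    for flag in decoded["flags"]:
        print("   ", flag)
```

For both values the set flags are VAL, OVER, UC, MISCV and PCC, which agrees with the "Error overflow", "Uncorrected error", "MCi_MISC register valid" and "Processor context corrupt" lines in the mcelog output.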

Since this appears to be a memory error, I have run memtest86+ many 
times. However, it has not reported any errors.

When I revert to older kernels (below) and test, the error above is 
generated only once (after boot) and is not reported again. It 
definitely did not cause a kernel panic or crash the machine.
CentOS-5.4 (2.6.18-164.15.1.el5)
CentOS (2.6.18-164.9.1.el5)
CentOS (2.6.18-164.el5)

Would this error indicate a motherboard or CPU problem? How can I 
diagnose it? Or is there something funny with the kernel?

Hardware:
Supermicro X8DTL-iF motherboard.
Intel Server Xeon E5502 1.86GHz Nehalem
8GB Ram Kingston DDR3-1333 w/ Parity w/ Thermal Sensor

I have read a note on Bugzilla about mcelog not supporting Nehalem 
processors during error decoding. I think this is fixed in CentOS 5.5, 
but maybe there is still a bug?
https://bugzilla.redhat.com/show_bug.cgi?id=473392

Tsuyoshi Nagata

2010-Jun-22 05:52 UTC

Hi Eric,

(2010/06/22 13:11), Eric Deis wrote:
> Transaction: Address/Command error

This is a motherboard (memory controller) problem.
It is *not* a DIMM problem (memtest cannot detect this error).
Your data transfers (reads/writes) sometimes hit bit errors.
Detecting them is a feature of the Nehalem CPU (MCE).

Try a new motherboard. If your board always reports this error
with the latest kernel, it is time to buy hardware from a certified vendor.

Supermicro's board is not certified hardware, but
this just indicates a hardware problem.

Tsuyoshi.

Michael Schumacher

2010-Jun-22 06:24 UTC

Eric,

On Tuesday, June 22, 2010 you wrote:
> I have recently upgraded to 2.6.18-194.3.1.el5 and within several days
> the machine crashed with the following error (repeating in mcelog):
[...]
> Would this error indicate a motherboard or CPU problem? How can I 
> diagnose? or is there something funny with the Kernel?
I ran across the same problem some time ago. At that time the kernel
was updated to recognize a new chipset (the one I use). The errors
appeared with the new kernel, but of course they were there
(undetected) before. I checked the CPU, the memory and finally the main
board, and it turned out that my TYAN board was faulty.
You may want to google the error message. I found that the exact error
string appeared for the first time in the new kernel version.


best regards
---
Michael Schumacher
PAMAS Partikelmess- und Analysesysteme GmbH
Dieselstr.10, D-71277 Rutesheim
Tel +49-7152-99630
Fax +49-7152-996333
Geschäftsführer: Gerhard Schreck
Handelsregister B Stuttgart HRB 252024

Peter Kjellstrom

2010-Jun-22 07:21 UTC

On Tuesday 22 June 2010, Eric Deis wrote:
> I have recently upgraded to 2.6.18-194.3.1.el5 and within several days
> the machine crashed with the following error (repeating in mcelog):
I'm guessing the old kernel just didn't notice.

The MCEs below indicate bad hardware. Since the DIMMs are a lot easier to 
debug, I'd suggest you start there (but it could be the system board too). Try
running with half your DIMMs, then the other half.
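
To compare runs with each half of the DIMMs, it can help to tally the mcelog records from each run. The following is a rough Python sketch, assuming the log lines look exactly like the output quoted in this thread. `parse_records` and `summarize` are hypothetical helper names, and the sample text is an abbreviated copy of the two records from the first post. The reported DIMM and channel IDs may not pinpoint the faulty part for every error type.

```python
import re

# Two abbreviated records copied from the mcelog output in the first post.
SAMPLE = """\
CPU 2 BANK 8 MISC 41
Memory corrected error count (CORE_ERR_CNT): 911
Memory DIMM ID of error: 0
Memory channel ID of error: 0
STATUS ea10e3c0008000b0 MCGSTATUS 0
CPU 2 BANK 8 MISC 41
Memory corrected error count (CORE_ERR_CNT): 7970
Memory DIMM ID of error: 0
Memory channel ID of error: 0
STATUS ea17c880008000b0 MCGSTATUS 0
"""

# Assumed line formats, taken from the output shown in this thread.
FIELDS = {
    "count": re.compile(r"\(CORE_ERR_CNT\): (\d+)"),
    "dimm": re.compile(r"Memory DIMM ID of error: (\d+)"),
    "channel": re.compile(r"Memory channel ID of error: (\d+)"),
}


def parse_records(text):
    """Return one dict per record; a record ends at its STATUS line."""
    records, current = [], {}
    for line in text.splitlines():
        for key, pattern in FIELDS.items():
            match = pattern.search(line)
            if match:
                current[key] = int(match.group(1))
        if line.startswith("STATUS"):
            records.append(current)
            current = {}
    return records


def summarize(records):
    """Map (channel, dimm) to (number of records, highest count seen)."""
    summary = {}
    for record in records:
        key = (record.get("channel"), record.get("dimm"))
        seen, peak = summary.get(key, (0, 0))
        summary[key] = (seen + 1, max(peak, record.get("count", 0)))
    return summary


print(summarize(parse_records(SAMPLE)))
```

Running the same tally on the log from each half of the DIMMs gives a like-for-like comparison of how often the error was reported and against which channel.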

/Peter
> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 2 BANK 8 MISC 41...

John R Pierce

2010-Jun-22 07:27 UTC

On 06/22/10 12:21 AM, Peter Kjellstrom wrote:
> On Tuesday 22 June 2010, Eric Deis wrote:
>
>> I have recently upgraded to 2.6.18-194.3.1.el5 and within several days
>> the machine crashed with the following error (repeating in mcelog):
>>
> I'm guessing the old kernel just didn't notice.
>
> The below MCEs indicate bad hardware. Since the DIMMs are a lot easier to
> debug I'd suggest you start there (but it could be the systemboard too). Try
> running with half your DIMMs then the other half.
>
And on Nehalem (Xeon 5500, 5600), the memory controller is in the CPUs, 
so they are suspect too.

First, however, I'd see if there's a BIOS flash upgrade for the 
mainboard. These sometimes have microcode fixes for various specific 
Intel CPUs, and may also have updated memory timing parameters.
