thr3ads.net - CentOS - [CentOS] Cant find out MCE reason (CPU 35 BANK 8) [Mar 2011]


Vladimir Budnev

2011-Mar-21 14:51 UTC

[CentOS] Cant find out MCE reason (CPU 35 BANK 8)

Hello community.

We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon
E5630 and 8x Kingston KVR1333D3D4R9S/4G.

For some time we have been getting lots of MCEs in mcelog and we can't find
out the reason. A typical MCE message looks like:

CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f00006141 ADDR 807044840
STATUS cc0055000001009f MCGSTATUS 0

Decoded with mcelog --ascii --cpu p4 (because there is no Xeon 56xx in the list):

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 53 BANK 8 TSC 1982d8f72b1f
MISC e1742eac00006242 ADDR 7ffd78a80
MCG status:
MCi status:
Error overflow
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
STATUS cc0002000001009f MCGSTATUS 0
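
The flags in the decoded output can be read straight from the STATUS value. The following is a minimal sketch, not mcelog's own code: it assumes the architectural IA32_MCi_STATUS layout (flag bits 63 to 57, MCA error code in bits 15:0) and the memory controller error-code form 000F 0000 1MMM CCCC. The corrected-error count in bits 52:38 is only meaningful on CPUs that report it. The names decode_status, FLAG_BITS and MEMORY_OPS are made up for this illustration.

```python
# Flag bits of an IA32_MCi_STATUS value (bit position, short name).
FLAG_BITS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # error overflow: an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Memory transaction type (the MMM field) in a memory controller error code.
MEMORY_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its main fields."""
    mca_code = status & 0xFFFF
    fields = {
        "flags": [name for bit, name in FLAG_BITS if (status >> bit) & 1],
        "mca_code": mca_code,
        "model_code": (status >> 16) & 0xFFFF,
        # Bits 52:38; assumes the CPU reports corrected-error counts here.
        "corrected_count": (status >> 38) & 0x7FFF,
        "memory_error": None,
    }
    # Memory controller errors have the form 000F 0000 1MMM CCCC.
    if (mca_code & 0xEF80) == 0x0080:
        op = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        # Channel 1111 means "channel not specified".
        fields["memory_error"] = (op, None if channel == 0xF else channel)
    return fields


if __name__ == "__main__":
    # The two STATUS values quoted in this message.
    for raw in ("cc0055000001009f", "cc0002000001009f"):
        print(raw, decode_status(int(raw, 16)))
```

For both STATUS values above this yields the VAL, OVER, MISCV and ADDRV flags and a memory read error with no channel specified, which agrees with the mcelog decode. UC is clear, so under this reading these are corrected errors.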

The main question: is it possible to find out exactly which piece of hardware
causes these messages?
First we thought that, according to

/* A machine check record */
struct mce {
        __u64 status;   /* bank status register */
        __u64 misc;     /* misc register (always 0 right now) */
        __u64 addr;     /* address or 0 */
        __u64 mcgstatus; /* global MC status register */
        __u64 rip;      /* Program counter or 0 for silent error */
        __u64 tsc;      /* cpu time stamp counter */
        __u64 res1;     /* for future extension */
        __u64 res2;     /* ditto. */
        __u8 cs;        /* code segment */
        __u8 bank;      /* machine check bank */
        __u8 cpu;       /* cpu that raised the error */
        __u8 finished; /* entry is valid */
        __u32 pad;
};

cpu is the CPU that raised the exception, but we have 2 quad-core CPUs with
HT, so there should be at most 16 logical CPUs, and in the logs we see 53 etc.
So now we are not sure what the cpu value is :) Does anyone know what the CPU
number means exactly?

One more interesting thing is the following output:
[root at zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
32
33
34
35
50
51
52
53

Those numbers are always the same.
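
The same check can also count how often each CPU/BANK pair occurs, which the shell pipeline above does not show. This is a minimal sketch that assumes every record starts with a line of the form "CPU <n> BANK <n> ...", as in the messages quoted earlier. The name count_by_cpu_and_bank is hypothetical, and SAMPLE stands in for the contents of /var/log/mcelog.

```python
import re
from collections import Counter

# Matches the first line of a record, e.g. "CPU 51 BANK 8 TSC 8511e3ca77dc".
RECORD = re.compile(r"^CPU (\d+) BANK (\d+)", re.MULTILINE)


def count_by_cpu_and_bank(log_text):
    """Return a Counter mapping (cpu, bank) to the number of records seen."""
    return Counter((int(cpu), int(bank)) for cpu, bank in RECORD.findall(log_text))


# Stand-in for the log file; the record lines are taken from this message.
SAMPLE = """\
CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f00006141 ADDR 807044840
STATUS cc0055000001009f MCGSTATUS 0
CPU 53 BANK 8 TSC 1982d8f72b1f
MISC e1742eac00006242 ADDR 7ffd78a80
STATUS cc0002000001009f MCGSTATUS 0
"""

if __name__ == "__main__":
    for (cpu, bank), n in sorted(count_by_cpu_and_bank(SAMPLE).items()):
        print(f"CPU {cpu} BANK {bank}: {n} record(s)")
```

To run it against the real log, replace SAMPLE with the text read from /var/log/mcelog.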

OK. Suppose we have a problem in RAM. Since I don't really know what those
CPU numbers mean, we suppose that cpu+bank can point to the faulty hardware.
Is that possible?
According to our "broken RAM theory" we suppose that those numbers,
32,33,34,35 and 50,51,52,53, indicate some symmetric problem with the RAM,
the slots, or something else. Is that correct?
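
One possible reading, which nothing in this thread confirms, is that the cpu field holds an APIC ID rather than a 0..15 logical CPU index. The sketch below only does the arithmetic for that hypothesis. The split into 1 SMT bit, 4 core bits and the remaining bits for the package is an assumption about this CPU family, and split_apic_id, SMT_BITS and CORE_BITS are illustrative names.

```python
SMT_BITS = 1   # assumed: 1 bit selects the hyperthread
CORE_BITS = 4  # assumed: 4 bits select the core within a package


def split_apic_id(apic_id):
    """Split a presumed APIC ID into (package, core, thread)."""
    thread = apic_id & ((1 << SMT_BITS) - 1)
    core = (apic_id >> SMT_BITS) & ((1 << CORE_BITS) - 1)
    package = apic_id >> (SMT_BITS + CORE_BITS)
    return package, core, thread


if __name__ == "__main__":
    # The eight cpu values that show up in the log.
    for cpu in (32, 33, 34, 35, 50, 51, 52, 53):
        package, core, thread = split_apic_id(cpu)
        print(f"cpu {cpu}: package {package}, core {core}, thread {thread}")
```

Under these assumptions all eight values land in package 1, as two threads each on cores 0, 1, 9 and 10. If the hypothesis holds, the numbers would cover every logical CPU of one socket rather than two separate groups. Comparing against the APIC IDs the kernel reports for each processor would be one way to check it.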

Thanks in advance.
_______________________________________________
CentOS mailing list
CentOS at centos.org
http://lists.centos.org/mailman/listinfo/centos

m.roth at 5-cent.us

2011-Mar-21 15:12 UTC


Vladimir Budnev wrote:
> Hello community.
>
> We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon
> E5630 and 8x Kingston KVR1333D3D4R9S/4G.
>
> For some time we have been getting lots of MCEs in mcelog and we can't
> find out the reason.
The only thing that shows there (when it shows, since sometimes it doesn't
seem to) is a hardware error. You *WILL* be replacing hardware, sometime
soon, like yesterday.

"Normal" is not: *ANYTHING* here is Bad News. First, you've got
DIMMs failing.  CPU 53, assuming this system doesn't have 53+ physical CPUs,
means that you have x-core systems, so you need to divide by x, so that if
it's a 12-core system with 6 physical chips, that would make it DIMM 8
associated with that physical CPU.
<snip>
> One more interesting thing is the following output:
> [root at zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
> 32
> 33
> 34
> 35
> 50
> 51
> 52
> 53
>
> Those numbers are always the same.
Bad news: you have *two* DIMMs failing, one associated with the physical
CPU that has core 53, and another associated with the physical CPU that
has cores 32-35.

Talk to your OEM support to help identify which banks need replacing,
and/or find a motherboard diagram.

          mark, who has to deal *again* with one machine with the same
problem....
