Hello community.

We are running CentOS 4.8 on a SuperMicro SYS-6026T-3RF with 2x Intel Xeon E5630 and 8x Kingston KVR1333D3D4R9S/4G.

For some time we have been seeing lots of MCEs in mcelog, and we can't find out the reason.

An "ordinary" MCE message looks like this:

    CPU 51 BANK 8 TSC 8511e3ca77dc
    MISC 274d587f00006141 ADDR 807044840
    STATUS cc0055000001009f MCGSTATUS 0

Decoded with mcelog --ascii --cpu p4 (because there is no Xeon 56xx in the list):

    HARDWARE ERROR. This is *NOT* a software problem!
    Please contact your hardware vendor
    CPU 53 BANK 8 TSC 1982d8f72b1f
    MISC e1742eac00006242 ADDR 7ffd78a80
    MCG status:
    MCi status:
    Error overflow
    MCi_MISC register valid
    MCi_ADDR register valid
    MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
    Transaction: Memory read error
    STATUS cc0002000001009f MCGSTATUS 0

The main question: is it possible to find out exactly which piece of hardware causes these messages?

At first we thought that, according to

    /* A machine check record */
    struct mce {
        __u64 status;    /* bank status register */
        __u64 misc;      /* misc register (always 0 right now) */
        __u64 addr;      /* address or 0 */
        __u64 mcgstatus; /* global MC status register */
        __u64 rip;       /* Program counter or 0 for silent error */
        __u64 tsc;       /* cpu time stamp counter */
        __u64 res1;      /* for future extension */
        __u64 res2;      /* dito. */
        __u8 cs;         /* code segment */
        __u8 bank;       /* machine check bank */
        __u8 cpu;        /* cpu that raised the error */
        __u8 finished;   /* entry is valid */
        __u32 pad;
    };

cpu is the CPU that raised the exception. But we have 2 quad-core CPUs with HT, so the maximum CPU number should be 16, and in the logs we see 53 etc. So now we are not sure what the cpu value is :) Does anyone know what the CPU number means exactly?

One more interesting thing is the following output:

    [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
    32
    33
    34
    35
    50
    51
    52
    53

Those numbers are always the same.

OK. Suppose we have a problem with the RAM. Since I don't really know what those CPU numbers mean, we suppose that cpu+bank can point to the problem hardware. Is that possible?
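One way to test the cpu+bank idea is to tally how often each CPU/BANK pair occurs and which ADDR values go with it. The following is only a rough sketch: it assumes the raw mcelog text format shown in this post (CPU, BANK, then an optional hex ADDR), and the names tally_mce and SAMPLE are made up for illustration; they are not part of mcelog.

```python
import re
from collections import Counter

# Start of a record, e.g. "CPU 51 BANK 8".
HEADER_RE = re.compile(r"\bCPU\s+(\d+)\s+BANK\s+(\d+)")
# Hex address field, e.g. "ADDR 807044840". The leading \b keeps the
# decoded line "MCi_ADDR register valid" from matching.
ADDR_RE = re.compile(r"\bADDR\s+([0-9a-fA-F]+)\b")


def tally_mce(text):
    """Count records per (cpu, bank) and collect the addresses seen."""
    counts = Counter()
    addrs = {}
    headers = list(HEADER_RE.finditer(text))
    for i, head in enumerate(headers):
        # A record runs until the next "CPU n BANK m" header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        key = (int(head.group(1)), int(head.group(2)))
        counts[key] += 1
        addr = ADDR_RE.search(text, head.end(), end)
        if addr:
            addrs.setdefault(key, set()).add(int(addr.group(1), 16))
    return counts, addrs


# The two records quoted in this post, used as sample input.
SAMPLE = """\
CPU 51 BANK 8 TSC 8511e3ca77dc
MISC 274d587f00006141 ADDR 807044840
STATUS cc0055000001009f MCGSTATUS 0
CPU 53 BANK 8 TSC 1982d8f72b1f
MISC e1742eac00006242 ADDR 7ffd78a80
STATUS cc0002000001009f MCGSTATUS 0
"""

counts, addrs = tally_mce(SAMPLE)
for (cpu, bank), n in sorted(counts.items()):
    seen = ", ".join(hex(a) for a in sorted(addrs.get((cpu, bank), ())))
    print(f"CPU {cpu} BANK {bank}: {n} record(s), ADDR {seen or 'none'}")
```

On a real system the contents of /var/log/mcelog would be passed in instead of SAMPLE. If the same few addresses keep recurring for one pair, that would be something concrete to take to the hardware vendor.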
According to our "broken RAM theory", we suppose that the numbers 32, 33, 34, 35 and 50, 51, 52, 53 indicate some symmetric problem with the RAM, the slots, or something else. Is that correct?

Thanks in advance.

_______________________________________________
CentOS mailing list
CentOS at centos.org
http://lists.centos.org/mailman/listinfo/centos
m.roth at 5-cent.us
2011-Mar-21 15:12 UTC
[CentOS] Cant find out MCE reason (CPU 35 BANK 8)
Vladimir Budnev wrote:
> Hello community.
>
> We are running, Centos 4.8 on SuperMicro SYS-6026T-3RF with 2xIntel Xeon
> E5630 and 8xKingston KVR1333D3D4R9S/4G
>
> For some time we have lots of MCE in mcelog and we cant find out the
> reason.

The only thing that shows there (when it shows, since sometimes it doesn't seem to) is a hardware error. You *WILL* be replacing hardware, sometime soon, like yesterday.

"Normal" is not: *ANYTHING* here is Bad News. First, you've got DIMMs failing. CPU 53, assuming this system doesn't have 53+ physical CPUs, means that you have x-core systems, so you need to divide by x, so that if it's a 12-core system with 6 physical chips, that would make it DIMM 8 associated with that physical CPU.

<snip>
> One more interesting thins is the following output:
> [root@zuno]# cat /var/log/mcelog |grep CPU|sort|awk '{print $2}'|uniq
> 32
> 33
> 34
> 35
> 50
> 51
> 52
> 53
>
> Those numbers are always the same.

Bad news: you have *two* DIMMs failing, one associated with the physical CPU that has core 53, and another associated with the physical CPU that has cores 32-35. Talk to your OEM support to help identify which banks need replacing, and/or find a motherboard diagram.

mark, who has to deal *again* with one machine with the same problem....
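The STATUS values from the original post can also be unpacked by hand as a cross-check on the mcelog decode. This is a sketch under stated assumptions, not what mcelog itself does: it assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL bit 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57, MCA error code in bits 15:0), the memory controller error code format 0000 0000 1MMM CCCC, and a corrected-error count in bits 52:38, which is only meaningful on CPUs that implement it. The name decode_status is illustrative.

```python
# Assumed architectural flag positions in IA32_MCi_STATUS (Intel SDM).
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}

# MMM sub-field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    code = status & 0xFFFF
    result = {
        "flags": {name: bool((status >> bit) & 1)
                  for name, bit in FLAG_BITS.items()},
        "mca_code": code,
        # Bits 52:38; assumption: the CPU reports a corrected-error count here.
        "corrected_count": (status >> 38) & 0x7FFF,
    }
    # Memory controller errors have bit 7 set and bits 15:8 clear.
    if code & 0xFF80 == 0x0080:
        channel = code & 0xF
        result["memory_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        # CCCC == 1111 means the channel is not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


# The two STATUS values quoted in the original post.
for value in (0xCC0055000001009F, 0xCC0002000001009F):
    info = decode_status(value)
    set_flags = [name for name, on in info["flags"].items() if on]
    print(hex(value), set_flags, info.get("memory_op"),
          "channel", info.get("channel"),
          "count", info["corrected_count"])
```

For cc0002000001009f this gives OVER, MISCV and ADDRV set, UC clear, and a memory read on an unspecified channel, which agrees with the "Error overflow", "MCi_MISC register valid", "MCi_ADDR register valid" and "RD_CHANNELunspecified_ERR" lines in the mcelog decode.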