Hi all,

one of our production servers, 11.2p3, is logging this every couple of minutes:

Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0

The address and core vary, but it is always bank 12.

Applications seem to be unaffected; we use ECC memory, of course.

Is the OS able to work around these errors and merely notifying us, or is in-memory
data already getting corrupted?

I'm at a bit of a loss identifying which DIMM might be the cause, so I contacted Supermicro
support. They answered:

> We can't really answer this, we do not know how various OS's map the memory slots.
> Our advise is always to look at IPMI, but if that doesn't log any issues then we're not sure you're looking at a hardware issue.
>
> But assuming the OS looks at the ranks of a module as a bank and you use dual rank memory then it should logically point at DIMMC2.

They are right about the IPMI (I told them when opening the case): there's nothing at all
in the event log.

Can they be correct that it might not even be a hardware issue?
If not, how can I be sure which DIMM is to blame? Spare parts are ready, but I'd like to
keep the maintenance break short and outside regular business hours.

I'll attach a dmesg.boot. The hardware is an X10DRW-NT mainboard in a SYS-1028R-WTNRT server platform.

Thanks for any hints,
Patrick
-- 
punkt.de GmbH           Internet - Dienstleistungen - Beratung
Kaiserallee 13a         Tel.: 0721 9109-0 Fax: -100
76133 Karlsruhe         info at punkt.de       http://punkt.de
AG Mannheim 108285      Gf: Juergen Egeling

Attachment: dmesg-boot.txt
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20181119/cc204869/attachment.txt>
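The observation that the address and core vary while the bank stays the same can be checked mechanically over a longer stretch of log. The following is only a minimal sketch: it assumes the kernel message format shown in the post, and the names BANK_RE and tally_banks are made up for this illustration.

```python
import re
from collections import Counter

# Matches kernel lines of the form quoted in the post, e.g.
#   Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
# (BANK_RE and tally_banks are hypothetical names, not part of any tool.)
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")


def tally_banks(lines):
    """Count how many MCA records were logged per bank number."""
    counts = Counter()
    for line in lines:
        match = BANK_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0",
        "Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3",
    ]
    for bank, count in sorted(tally_banks(sample).items()):
        print(f"bank {bank}: {count} record(s)")
```

Feeding it the lines of the system log instead of the two sample lines would show whether any bank other than 12 ever appears.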
19.11.2018 20:10, Patrick M. Hausen wrote:

> one of our production servers, 11.2p3 is logging this every couple of minutes:
>
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0

[...]

> Can they be correct that it might not even be a hardware issue?

Use the sysutils/mcelog port (or package) to decode such MCA logs with the
"mcelog --no-dmi --ascii" command. For your logs, it reports:

> Hardware event. This is not a software error.
> CPU 0 BANK 12
> MISC 0 ADDR 0
> MCG status:
> MemCtrl: Corrected patrol scrub error
> STATUS cc00010c000800c3 MCGSTATUS 0
> MCGCAP 7000c16 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 79
> (Fields were incomplete)

This looks like a hardware memory error that was corrected by ECC, so there is no data corruption (yet).
You had better replace a module in BANK 12 of CPU 0.
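To make the mcelog result less of a black box, the status word can also be picked apart by hand. The sketch below assumes the IA32_MCi_STATUS bit layout and the memory-controller error-code pattern (0000 0000 1MMM CCCC) from the Intel SDM; decode_status and the two tables are names invented for this example, and it covers only the fields relevant to this thread.

```python
# Flag bit positions in IA32_MCi_STATUS (per the Intel SDM).
FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (clear means corrected)
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MISC register valid
    "ADDRV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
}

# MMM field of a memory-controller error code (0000 0000 1MMM CCCC).
MEM_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_status(status):
    """Decode the flag bits and, if present, the memory-controller code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    code = status & 0xFFFF  # architectural MCA error code, bits 15:0
    result = {"flags": flags, "mca_code": code}
    if (code & 0xFF80) == 0x0080:  # memory-controller error pattern
        result["operation"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # 0b1111 means "channel not specified"
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    decoded = decode_status(0xCC00010C000800C3)
    print(decoded)
```

For the status value from the log, UC is clear and OVER is set, and the low 16 bits decode to a memory scrubbing error on channel 3, which agrees with the kernel's "COR ... OVER MS channel 3 memory error" line and with mcelog's "Corrected patrol scrub error".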
On Mon, Nov 19, 2018 at 02:10:00PM +0100, Patrick M. Hausen wrote:
> Hi all,
>
> one of our production servers, 11.2p3 is logging this every couple of minutes:
>
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
[...]

From what I have understood so far about these things, it is always good to
start the investigation by checking the power supply. Take a
multimeter and plug it into a Molex (PATA) power connector while the box is
running. I have no idea where to plug it in if your power supply has no
PATA cables, though.

HTH

-- 
Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature.      **
** As the answer, master did "rm -rif" on the programmer's home    **
** directory. And then the C programmer became enlightened...      **
**                                                                 **
** Tomasz Rola          mailto:tomasz_rola at bigfoot.com           **
On Mon, Nov 19, 2018 at 06:58:59PM +0100, Tomasz Rola wrote:
> From what I understood so far about those things, it is always good to
> start investigation from checking one's power supply. Have a
> multimeter and plug it into molex PATA power supply while the box is
> working. I have no idea where to plug it in case there is no PATA
> cables in your power supply, though.

Ugh, I mean: measure the voltages and how much they differ from the
expected +3.3 V, +5 V and +12 V. A deviation of up to 10% might be OK; more
than that might not be. Voltages might slip when the box is under load,
the drives are working, etc.

If your power supply is a modular one, it might be possible to plug
the multimeter into the supply's unused sockets.

-- 
Regards,
Tomasz Rola
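The rule of thumb above boils down to simple arithmetic. As a rough sketch only: the nominal rails of 3.3 V, 5 V and 12 V and the 10% limit are taken as assumptions from the message, the readings are invented for illustration, and rail_deviation and check_rails are hypothetical helper names.

```python
# Nominal rail voltages assumed for this sketch.
NOMINAL_RAILS = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}


def rail_deviation(nominal, measured):
    """Signed deviation of a multimeter reading from nominal, in percent."""
    return (measured - nominal) / nominal * 100.0


def check_rails(readings, limit_percent=10.0):
    """Map each rail to (deviation in percent, within limit?)."""
    report = {}
    for rail, measured in readings.items():
        deviation = rail_deviation(NOMINAL_RAILS[rail], measured)
        report[rail] = (deviation, abs(deviation) <= limit_percent)
    return report


if __name__ == "__main__":
    # Invented example readings, not measurements from the server in question.
    readings = {"+5V": 4.9, "+12V": 10.5}
    for rail, (deviation, ok) in check_rails(readings).items():
        verdict = "ok" if ok else "suspicious"
        print(f"{rail}: {deviation:+.1f}% ({verdict})")
```

Taking the readings both at idle and under load, as suggested, would show whether a rail sags only when the drives and CPUs are busy.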
On 19.11.18 at 14:10, Patrick M. Hausen wrote:
> one of our production servers, 11.2p3 is logging this every couple of minutes:
>
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
[...]
> Address and core varies but it is always bank 12.
[...]
> If not how can I be sure which DIMM is to blame? Spare parts are ready but I'd like to
> have a rather short maintenance break outside regular business hours.

Hi Patrick,

we had a similar experience with one of our servers (HP DL380 G7):
tons of MCA errors concerning a single memory bank. The bank number
did not correspond to a specific memory slot (HP labels them A to I
for each CPU). Neither iLO nor the mcelog output was of any help to me.

We did not notice any data loss, but to get rid of these annoying
messages I did the following: After taking the server out of
production, I removed pairs of memory modules until the MCA messages
stopped. The last pair removed then contained the problematic module.
Re-adding one module of that pair (a 50 percent chance of picking the
defective one) then showed which of the two it was. After replacing
that module, the server no longer complained about memory problems.

There should definitely be a more sophisticated method of identifying
problematic memory modules. Perhaps someone on the list can shed some
light on this kind of error.

-- 
Sincerely
Alfred Bartsch
Data-Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
fon: +49 451 490010 fax: +49 451 4900123
Amtsgericht Lübeck, HRB 318 BS
Geschäftsführer: Wilfried Paepcke, Dr. Andreas Longwitz, Dr. Hans-Martin Rasch, Dr. Uwe Szyszka
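The elimination procedure described above can be written down as a small simulation. This is a sketch under loud assumptions: exactly one module is faulty, errors_logged is a hypothetical stand-in for "boot with this set of modules and watch for MCA messages", the module names are placeholders, and the practical constraint that the machine needs some memory left to boot is ignored.

```python
def find_faulty_module(pairs, errors_logged):
    """Locate a single faulty module by removing pairs until errors stop.

    pairs         -- list of (module_a, module_b) tuples, in removal order
    errors_logged -- hypothetical callback: takes the list of installed
                     modules and returns True if MCA errors still appear
    """
    installed = [module for pair in pairs for module in pair]
    if not errors_logged(installed):
        return None  # nothing to find: no errors with everything installed

    for pair in pairs:
        for module in pair:
            installed.remove(module)
        if not errors_logged(installed):
            # Errors stopped, so the faulty module is in the pair just removed.
            first, second = pair
            # Re-add one of the two; if the errors come back, it is that one.
            if errors_logged(installed + [first]):
                return first
            return second
    return None


if __name__ == "__main__":
    # Placeholder module names; "m4" plays the defective module here.
    pairs = [("m1", "m2"), ("m3", "m4"), ("m5", "m6")]
    print(find_faulty_module(pairs, lambda installed: "m4" in installed))
```

Each call to errors_logged corresponds to one reboot plus a waiting period, which is why this approach costs more downtime than a method that maps the reported bank or address to a slot directly.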