thr3ads.net - freebsd stable - Memory error logged in /var/log/messages [Nov 2018]

Patrick M. Hausen

2018-Nov-19 13:10 UTC

Memory error logged in /var/log/messages

Hi all,

One of our production servers, running 11.2p3, is logging this every couple of minutes:

Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0

The address and core vary, but it is always bank 12.
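
An observation like this can be checked over the whole log by tallying the MCA records per bank and per CPU/channel. The following is only a minimal Python sketch, not a FreeBSD tool: the name `tally_mca` and both regular expressions are assumptions modeled solely on the log lines shown above, and other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the kernel messages quoted in this thread (assumption).
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-f]+)")
CPU_RE = re.compile(r"MCA: CPU (\d+) .*channel (\d+) memory error")

def tally_mca(lines):
    """Count MCA records per bank and per (cpu, channel) pair."""
    banks, channels = Counter(), Counter()
    for line in lines:
        bank = BANK_RE.search(line)
        if bank:
            banks[int(bank.group(1))] += 1
            continue
        cpu = CPU_RE.search(line)
        if cpu:
            channels[(int(cpu.group(1)), int(cpu.group(2)))] += 1
    return banks, channels

# Two of the lines from the report above, used as sample input.
SAMPLE = [
    "Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error",
    "Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3",
]

banks, channels = tally_mca(SAMPLE)
print("records per bank:", dict(banks))
print("records per (cpu, channel):", dict(channels))
```

On a live system the sample list would be replaced by an open file handle for /var/log/messages.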

Applications seem to be unaffected; we use ECC memory, of course.

Is the OS able to work around these errors and merely notifying us, or is
in-memory data already getting corrupted?

I'm at a bit of a loss identifying which DIMM might be the cause, so I contacted
Supermicro support. They answered:

> We can't really answer this, we do not know how various OS's map the memory slots.
> Our advice is always to look at IPMI, but if that doesn't log any issues then
> we're not sure you're looking at a hardware issue.
>
> But assuming the OS looks at the ranks of a module as a bank and you use dual
> rank memory then it should logically point at DIMMC2.

They are right about the IPMI (I told them when opening the case): there's
nothing at all in the event log.

Can they be correct that it might not even be a hardware issue?
If not, how can I be sure which DIMM is to blame? Spare parts are ready, but I'd
like to have a rather short maintenance break outside regular business hours.

I'll attach a dmesg.boot. HW is an X10DRW-NT mainboard, SYS-1028R-WTNRT server
platform.

Thanks for any hints,
Patrick
-- 
punkt.de GmbH			Internet - Dienstleistungen - Beratung
Kaiserallee 13a			Tel.: 0721 9109-0 Fax: -100
76133 Karlsruhe			info at punkt.de	http://punkt.de
AG Mannheim 108285		Gf: Juergen Egeling
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: dmesg-boot.txt
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20181119/cc204869/attachment.txt>

Eugene Grosbein

2018-Nov-19 13:38 UTC

19.11.2018 20:10, Patrick M. Hausen wrote:
> Can they be correct that it might not even be a hardware issue?
Use the sysutils/mcelog port (or package) to decode such MCA logs
with the "mcelog --no-dmi --ascii" command. For your logs, it reports:
> Hardware event. This is not a software error.
> CPU 0 BANK 12
> MISC 0 ADDR 0
> MCG status:
> MemCtrl: Corrected patrol scrub error
> STATUS cc00010c000800c3 MCGSTATUS 0
> MCGCAP 7000c16 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 79
> (Fields were incomplete)
This looks like a hardware memory error that was corrected by ECC, so there is
no data corruption (yet).
You had better replace the affected module; the log points at CPU 0, MCA bank 12,
channel 3.
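
For readers who want to see where such a decode comes from, the Status value can also be unpacked by hand. The sketch below follows the architectural IA32_MCi_STATUS layout described in the Intel SDM (flag bits 63 to 57, MCA error code in bits 15:0, memory controller codes of the form 0000 0000 1MMM CCCC). The function and table names are made up for illustration, and this is not a description of how mcelog is implemented. In MCA terms a "bank" is a set of machine-check registers in the CPU, not a DIMM slot number.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEMORY_OPS = {0: "generic", 1: "read", 2: "write",
              3: "address/command", 4: "scrub"}

def decode_status(status):
    """Return the set flags and, for memory controller errors, op and channel."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF                      # MCA error code, bits 15:0
    result = {"flags": flags, "mca_code": code}
    if code & 0xFF80 == 0x0080:                 # memory controller error pattern
        result["memory_op"] = MEMORY_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel  # 1111 = unspecified
    return result

# Value taken from the "Bank 12, Status" line in the original report.
print(decode_status(0xCC00010C000800C3))
```

For the value from this thread, the UC flag is clear (a corrected error), OVER is set, and the error code decodes to a memory scrubbing error on channel 3. That agrees with the "MS channel 3" text in the kernel message and with the "Corrected patrol scrub error" line from mcelog.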

Tomasz Rola

2018-Nov-19 17:58 UTC

On Mon, Nov 19, 2018 at 02:10:00PM +0100, Patrick M. Hausen wrote:
> Hi all,
> 
> one of our production servers, 11.2p3 is logging this every couple of minutes:
> 
> Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
> Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
> Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
> Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
> Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
> Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0
[...]

From what I have understood so far about these things, it is always good to
start the investigation by checking the power supply. Take a multimeter and
connect it to a Molex (PATA) power connector while the box is running. I have
no idea where to connect it if your power supply has no PATA cables, though.

HTH

-- 
Regards,
Tomasz Rola

--
** A C programmer asked whether computer had Buddha's nature.      **
** As the answer, master did "rm -rif" on the programmer's home    **
** directory. And then the C programmer became enlightened...      **
**                                                                 **
** Tomasz Rola          mailto:tomasz_rola at bigfoot.com             **

Tomasz Rola

2018-Nov-19 18:03 UTC

On Mon, Nov 19, 2018 at 06:58:59PM +0100, Tomasz Rola wrote:
> From what I understood so far about those things, it is always good to
> start investigation from checking one's power supply. Have a
> multimeter and plug it into molex PATA power supply while the box is
> working. I have no idea where to plug it in case there is no PATA
> cables in your power supply, though.

Ugh, I mean: measure the voltages and see how much they differ from the
nominal +3.3 V, +5 V and +12 V. The ATX specification allows +/-5% on these
rails; more than that might not be OK. Voltages may sag when the box is
under load, drives are working, etc.

If your power supply is a modular one, it might be possible to connect the
multimeter to one of the supply's unused sockets.
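
The arithmetic behind such a check is simple enough to sketch. The Python snippet below is only an illustration: the function name `check_rail` and the readings in the usage lines are invented, and the default tolerance of 5% is the ATX figure for the positive rails (pass `tolerance=0.10` to apply the looser 10% rule of thumb instead).

```python
# Nominal positive rails of an ATX power supply, in volts.
NOMINAL_RAILS = (3.3, 5.0, 12.0)

def check_rail(measured, tolerance=0.05):
    """Match a reading to the nearest nominal rail and test its deviation.

    Returns (nominal, relative_deviation, within_tolerance).
    """
    nominal = min(NOMINAL_RAILS, key=lambda volts: abs(volts - measured))
    deviation = (measured - nominal) / nominal
    return nominal, deviation, abs(deviation) <= tolerance

# Hypothetical multimeter readings, not measurements from this thread.
for reading in (11.2, 5.1, 3.28):
    nominal, deviation, ok = check_rail(reading)
    print(f"{reading:6.2f} V vs {nominal:4.1f} V: {deviation:+.1%} "
          f"{'ok' if ok else 'out of tolerance'}")
```

Readings should be taken both at idle and under load, since a rail that is fine at idle may still sag when the drives and CPUs are busy.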

-- 
Regards,
Tomasz Rola


Alfred Bartsch

2018-Nov-20 09:08 UTC

On 19.11.18 at 14:10, Patrick M. Hausen wrote:
> Hi all,
Hi Patrick,
we had a similar experience with one of our servers (HP DL380 G7): tons
of MCA errors concerning a single memory bank. This bank number did not
correspond to any specific memory slot (HP labels them A to I for
each CPU). Neither iLO nor the mcelog output was of any help to me.
We did not notice any data loss, but to get rid of these annoying
messages, I did the following:
After taking the server out of production, I removed pairs of memory
modules until the MCA messages stopped. The last pair removed therefore
contained the problematic module. Re-adding one module of that pair
gave a 50 percent chance of picking the defective one, and either
outcome identified it. After replacing this module, the server no
longer complained about memory problems.
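
The elimination procedure described above can be written down as a small simulation. This is a sketch under stated assumptions, not a diagnostic tool: `find_faulty_module` and the `errors_present` callback are hypothetical names, and on real hardware each call to the callback stands for "boot with exactly these modules installed and watch the log for MCA messages".

```python
def find_faulty_module(pairs, errors_present):
    """Remove module pairs one by one until the errors stop, then
    re-add one module of the last removed pair to tell the two apart.

    pairs          -- list of (module, module) tuples, as installed
    errors_present -- callback taking the list of installed modules and
                      returning True if MCA errors are still logged
    """
    installed = list(pairs)
    while installed:
        suspect_pair = installed.pop()            # pull one pair
        remaining = [m for pair in installed for m in pair]
        if not errors_present(remaining):         # errors stopped
            first, second = suspect_pair
            # Re-add one module: if errors return it is the bad one,
            # otherwise its partner must be.
            if errors_present(remaining + [first]):
                return first
            return second
    return None                                   # no faulty module found

# Simulated machine: slot names and the faulty slot are invented.
pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
faulty = "B2"
print(find_faulty_module(pairs, lambda modules: faulty in modules))
```

A real machine cannot boot with no memory at all, so in practice the final pair is implicated by elimination rather than by removing it; the simulation ignores that constraint.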

There should definitely be a more sophisticated method of identifying
problematic memory modules. Perhaps there is someone on the list who is
able to shed some light on this kind of error.

-- 
Sincerely
Alfred Bartsch
Data-Service GmbH
Beethovenstr. 2A
23617 Stockelsdorf
fon: +49 451 490010 fax: +49 451 4900123
Amtsgericht Lübeck, HRB 318 BS
Geschäftsführer: Wilfried Paepcke, Dr. Andreas Longwitz, Dr. Hans-Martin
Rasch, Dr. Uwe Szyszka
