thr3ads.net - CentOS - [CentOS] Kernel:[Hardware Error]: [Aug 2017]

Fred Smith

2017-Aug-12 19:50 UTC

[CentOS] Kernel:[Hardware Error]:

I had a series of kernel hardware error reports today while I was away 
from my computer:

Message from syslogd at fcshome at Aug 12 10:12:24 ...
 kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.

Message from syslogd at fcshome at Aug 12 10:12:24 ...
 kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd at fcshome at Aug 12 10:12:24 ...
 kernel:[Hardware Error]: CPU:2 (15:2:0) MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98444000010c0176

Message from syslogd at fcshome at Aug 12 10:12:24 ...
 kernel:[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV

I've never seen anything like that before.

The CPU is:

	$ cat /proc/cpuinfo
	processor	: 0
	vendor_id	: AuthenticAMD
	cpu family	: 21
	model		: 2
	model name	: AMD FX(tm)-6300 Six-Core Processor
	stepping	: 0
	microcode	: 0x600084f
	cpu MHz		: 1400.000
	cache size	: 2048 KB
	physical id	: 0
	siblings	: 6
	core id		: 0
	cpu cores	: 3
	apicid		: 16
	initial apicid	: 0
	fpu		: yes
	fpu_exception	: yes
	cpuid level	: 13
	wp		: yes
	flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold bmi1
	bogomips	: 7023.90
	TLB size	: 1536 4K pages
	clflush size	: 64
	cache_alignment	: 64
	address sizes	: 48 bits physical, 48 bits virtual
	power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro


It's a six-core AMD; the output above is the entry for one of the cores.

Any clues on how to figure out what caused the errors, and/or how to
mitigate them?

Thanks!

Fred
-- 
-------------------------------------------------------------------------------
 .----    Fred Smith   /              
( /__  ,__.   __   __ /  __   : /     
 /    /  /   /__) /  /  /__) .+'           Home: fredex at fcshome.stoneham.ma.us
/    /  (__ (___ (__(_ (___ / :__                                 781-438-5471 
-------------------------------- Jude 1:24,25 ---------------------------------

Steven Tardy

2017-Aug-12 21:51 UTC

> On Aug 12, 2017, at 3:50 PM, Fred Smith <fredex at fcshome.stoneham.ma.us> wrote:
> 
> I had a series of kernel hardware error reports today while I was away 
> from my computer:
> 
> Message from syslogd at fcshome at Aug 12 10:12:24 ...
> kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.
> 
> Message from syslogd at fcshome at Aug 12 10:12:24 ...
> kernel:[Hardware Error]: Error Status: Corrected error, no action required.
> 
> Message from syslogd at fcshome at Aug 12 10:12:24 ...
> kernel:[Hardware Error]: CPU:2 (15:2:0) MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98444000010c0176
> 
> Message from syslogd at fcshome at Aug 12 10:12:24 ...
> kernel:[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
> 
> never saw anything like that before.
> 
> Any clues to figure out the errors, and/or mitigate?
> 
> thanks!
> 
> Fred

MC == Machine check exception.
The important part of an MC report is the "status" code.
For Intel CPUs, one can decode it with the "Intel 64 and IA-32
Architectures Software Developer's Manual" (a roughly 4000-page PDF).
I'm unsure, but it looks like AMD uses similar MC codes.
Luckily Linux does some of the heavy lifting and decodes the status to
"cache hierarchy error L2 data eviction".
The next most important part is the "corrected" bit: the kernel reports
"Corrected error, no action required", so the hardware fixed this one
by itself.
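
As an illustration of what that decoding involves, here is a minimal
Python sketch that pulls apart the status value from the report above.
The bit positions (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58,
PCC=57, plus AMD's CECC=46 and UECC=45) and the "0000 0001 RRRR TTLL"
layout of a memory-hierarchy error code are assumptions based on the
common x86 machine-check layout, and should be checked against the
vendor manual for the CPU in question. The function and table names
are made up for this example.

```python
# Sketch only: bit positions and field names below are assumptions taken
# from the common x86 machine-check status layout, not from this thread.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context corrupt
    46: "CECC",   # corrected by ECC (AMD)
    45: "UECC",   # uncorrectable ECC error (AMD)
}

# Short names in the style the kernel prints for memory-hierarchy errors.
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TX = ["INSN", "DATA", "GEN", "RESV"]
LEVEL = ["RESV", "L1", "L2", "L3/GEN"]


def decode_status(status):
    """Decode a 64-bit machine-check status value into a dict."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF  # low 16 bits carry the error code
    result = {
        "flags": flags,
        "corrected": not (status >> 61) & 1,  # UC bit clear
        "error_code": code,
    }
    # Memory-hierarchy errors have the form 0000 0001 RRRR TTLL.
    if (code & 0xFF00) == 0x0100:
        rrrr = (code >> 4) & 0xF
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV"
        result["tx"] = TX[(code >> 2) & 0x3]
        result["level"] = LEVEL[code & 0x3]
    return result
```

Calling decode_status(0x98444000010c0176) gives level "L2", tx "DATA",
mem_tx "EV" and a set CECC flag, which agrees with what the kernel
printed.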

Now what does that really mean?
*shrug*, it could be any of: firmware, drivers, overheating, poor CPU
seating, poor DIMM seating, a faulty motherboard, a faulty CPU, or a
faulty DIMM.

Hope that doesn't confuse too much. (:

Chris Murphy

2017-Aug-12 22:03 UTC

On Sat, Aug 12, 2017 at 1:50 PM, Fred Smith <fredex at fcshome.stoneham.ma.us> wrote:
> I had a series of kernel hardware error reports today while I was away
> from my computer:
>
> Message from syslogd at fcshome at Aug 12 10:12:24 ...
>  kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.
>
> Message from syslogd at fcshome at Aug 12 10:12:24 ...
>  kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Cosmic ray corrupted data in the L2 cache, and ECC detected and
corrected it? Whatever it was, it's working as intended.


-- 
Chris Murphy

Fred Smith

2017-Aug-12 23:24 UTC

On Sat, Aug 12, 2017 at 05:51:33PM -0400, Steven Tardy wrote:
> 
> > On Aug 12, 2017, at 3:50 PM, Fred Smith <fredex at fcshome.stoneham.ma.us> wrote:
> > 
> > I had a series of kernel hardware error reports today while I was away
> > from my computer:
> > 
> > Message from syslogd at fcshome at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: MC2 Error: VB Data ECC or parity error.
> > 
> > Message from syslogd at fcshome at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: Error Status: Corrected error, no action required.
> > 
> > Message from syslogd at fcshome at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: CPU:2 (15:2:0) MC2_STATUS[-|CE|MiscV|-|-|-|-|CECC]: 0x98444000010c0176
> > 
> > Message from syslogd at fcshome at Aug 12 10:12:24 ...
> > kernel:[Hardware Error]: cache level: L2, tx: DATA, mem-tx: EV
> > 
> > never saw anything like that before.
> > 
> > Any clues to figure out the errors, and/or mitigate?
> > 
> > thanks!
> > 
> > Fred
> 
> MC == Machine check exception.
> The important part of a MC is the "status" code.
> One can use the Intel doc "Architecture Software Developers Manual" to decode this (4000 page .pdf).
> Unsure but it looks like AMD does similar MC codes.
> Luckily Linux does some heavy lifting and decodes to "cache hierarchy error L2 data eviction".
> The next most important part is the "corrected" bit.
> 
> Now what does that really mean?
> *shrug*, could be firmware/drivers/overheating/poor-CPU-seating/DIMM-seating/faulty-motherboard/faulty-CPU/faulty-DIMM.

Well, overheating is possible. We don't live in the cleanest possible
house, AND we have cats, so in general I open up this box twice a year
and vacuum out the house dirt and cat fuzzies. I'm probably overdue for
this task.

This is the first one of these I've had, and I hope it's the last, but
a little preventive maintenance (PM) is in order either way.
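
One way to keep an eye on whether these reports recur is to tally the
kernel's status lines from the system log. Below is a rough Python
sketch, assuming the line format matches the one quoted in this thread
and that an uncorrected error would show "UE" where this report shows
"CE". The function name and the pattern are made up for this example,
and the log lines are passed in as plain strings rather than read from
any particular file.

```python
import re
from collections import Counter

# Matches the kernel's status line as quoted in this thread, e.g.
# "kernel:[Hardware Error]: CPU:2 (15:2:0) MC2_STATUS[-|CE|...|CECC]: 0x..."
# The exact format is an assumption based on that single example.
STATUS_LINE = re.compile(
    r"\[Hardware Error\]: CPU:(?P<cpu>\d+) \([0-9a-f:]+\) "
    r"MC(?P<bank>\d+)_STATUS\[(?P<flags>[^\]]*)\]: (?P<status>0x[0-9a-f]+)"
)


def tally_hardware_errors(lines):
    """Count status lines per (cpu, bank, corrected/uncorrected)."""
    counts = Counter()
    for line in lines:
        match = STATUS_LINE.search(line)
        if match is None:
            continue  # not a status line; skip the other report lines
        flags = match.group("flags").split("|")
        kind = "corrected" if "CE" in flags else "uncorrected"
        counts[(int(match.group("cpu")), int(match.group("bank")), kind)] += 1
    return counts
```

A count that keeps growing on the same CPU and bank would point toward
a hardware problem rather than a one-off event.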

Thanks for the reply.

Fred

> Hope that doesn't confuse too much. (:
-- 
---- Fred Smith -- fredex at fcshome.stoneham.ma.us
-----------------------------
                    The Lord detests the way of the wicked 
                  but he loves those who pursue righteousness.
----------------------------- Proverbs 15:9 (niv) -----------------------------
