thr3ads.net - freebsd stable - MCA messages after upgrade to 8.2-BETA1 [Dec 2010]

Miroslav Lachman

2010-Dec-22 12:41 UTC

MCA messages after upgrade to 8.2-BETA1

Hi,
the machine in question was upgraded from 7.3 to FreeBSD 8.2-BETA1 i386
GENERIC.
After this upgrade, I get the following messages in /var/log/messages every
hour. The machine is almost idle (used for testing only).

Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 1, Status 0xd400400000000853
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source IRD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x2a1c9440
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 2, Status 0xd000400000000863
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source PREFETCH Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 4, Status 0xdc0e400200000813
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source RD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x2cac9678
Dec 21 12:42:26 kavkaz kernel: MCA: Misc 0xe00d0fff00000000
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 1
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 1 COR OVER BUSLG Source DRD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x23649640
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 1, Status 0xd400400000000853
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 1
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 1 COR OVER BUSLG Source IRD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x2a1c9440
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 2, Status 0xd000400000000863
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 1
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 1 COR OVER BUSLG Source PREFETCH Memory

Can somebody tell me what these messages mean?

Miroslav Lachman

John Baldwin

2010-Dec-22 14:59 UTC

On Wednesday, December 22, 2010 7:41:25 am Miroslav Lachman wrote:
> Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
> Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
> Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
> Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
> Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0

You are getting corrected ECC errors in your RAM.  You see them once an hour
because we poll the machine check registers once an hour.  If this happens
constantly you might have a DIMM that is dying?

% ~/mcelog --ascii < foo.txt 
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache 
ADDR 236493c0 
  Data cache ECC error (syndrome 1c)
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             data read mem transaction
             memory access, level generic'
STATUS d40e400000000833 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 1 instruction cache 
ADDR 2a1c9440 
  Instruction cache ECC error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             instruction fetch mem transaction
             memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 2 bus unit 
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             prefetch mem transaction
             memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge 
MISC e00d0fff00000000 ADDR 2cac9678 
  Northbridge RAM ECC error
  ECC syndrome = 1c
       bit33 = err cpu1
       bit46 = corrected ecc error
       bit59 = misc error valid
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             generic read mem transaction
             memory access, level generic'
STATUS dc0e400200000813 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 0 data cache 
ADDR 23649640 
  Data cache ECC error (syndrome 1c)
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             data read mem transaction
             memory access, level generic'
STATUS d40e400000000833 MCGSTATUS 0
MCGCAP 105 APICID 1 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 1 instruction cache 
ADDR 2a1c9440 
  Instruction cache ECC error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             instruction fetch mem transaction
             memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0
MCGCAP 105 APICID 1 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 2 bus unit 
  L2 cache ECC error
  Bus or cache array error
       bit46 = corrected ecc error
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
             prefetch mem transaction
             memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCGCAP 105 APICID 1 SOCKETID 0 
CPUID Vendor AMD Family 15 Model 67


-- 
John Baldwin
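
The status words in the kernel messages can also be decoded by hand, which shows where "COR OVER BUSLG Source DRD Memory" comes from. The following is a minimal Python sketch, not the kernel's or mcelog's actual code. The architectural bit positions (valid, overflow, uncorrected, address/misc valid) and the bus error code layout 0000 1PPT RRRR IILL are the standard machine-check ones; the corrected-ECC bit 46 and the syndrome in bits 54:47 are assumed from the mcelog output above for this AMD family 15 CPU. The function name and the dictionary keys are made up for illustration.

```python
# Bit positions in a machine-check bank status register.
VAL, OVER, UC, MISCV, ADDRV = 63, 62, 61, 59, 58
AMD_CECC = 46  # assumption: AMD-specific "corrected ecc error" bit, as shown by mcelog

# Field encodings of a bus error code (0000 1PPT RRRR IILL).
PARTICIPATION = {0: "Source", 1: "Responder", 2: "Observer", 3: "Generic"}
REQUESTS = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
            5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
MEM_IO = {0: "Memory", 1: "Reserved", 2: "I/O", 3: "Generic"}
LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_status(status):
    """Decode the fields of one bank status word into a dictionary."""
    code = status & 0xFFFF  # low 16 bits: MCA error code
    info = {
        "valid": bit(status, VAL),
        "overflow": bit(status, OVER),        # printed as OVER
        "uncorrected": bit(status, UC),       # False is printed as COR
        "misc_valid": bit(status, MISCV),
        "addr_valid": bit(status, ADDRV),
        "corrected_ecc": bit(status, AMD_CECC),
        "syndrome": (status >> 47) & 0xFF,    # assumption: AMD layout, bits 54:47
        "error_code": code,
    }
    # Bus errors have the top five bits of the error code equal to 00001.
    if (code & 0xF800) == 0x0800:
        info["bus"] = {
            "participation": PARTICIPATION[(code >> 9) & 0x3],
            "timeout": bit(code, 8),
            "request": REQUESTS.get((code >> 4) & 0xF, "reserved"),
            "mem_io": MEM_IO[(code >> 2) & 0x3],
            "level": LEVELS[code & 0x3],
        }
    return info


if __name__ == "__main__":
    for status in (0xd40e400000000833, 0xd400400000000853,
                   0xd000400000000863, 0xdc0e400200000813):
        print(hex(status), decode_status(status))
```

For the first status word, 0xd40e400000000833, this gives a corrected error with overflow set, syndrome 0x1c, and a bus error with request DRD on memory at the generic level, which agrees with both the kernel line and the mcelog decoding above.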

Miroslav Lachman

2010-Dec-23 19:00 UTC

John Baldwin wrote:
> On Wednesday, December 22, 2010 7:41:25 am Miroslav Lachman wrote:
>> Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
>> Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>> Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
>> Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
>> Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0
>
> You are getting corrected ECC errors in your RAM.  You see them once an hour
> because we poll the machine check registers once an hour.  If this happens
> constantly you might have a DIMM that is dying?

Yes, it happens constantly. Does "Bank" in this context mean a DIMM socket
or something else? If it is a DIMM socket, then it means all modules are
dying at the same time :(

Thank you for the mcelog output. BTW, do you have any timeline for releasing
a port of mcelog?

Miroslav Lachman
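
Going by the mcelog output quoted earlier in the thread, a bank looks like an error-reporting unit inside the CPU, not a DIMM socket: banks 0, 1, 2 and 4 are decoded there as data cache, instruction cache, bus unit and northbridge. The Python sketch below counts "MCA: Bank" lines per bank in a log excerpt and labels them with those unit names. The mapping is taken only from that mcelog output for this CPU, and the names BANK_UNITS, count_banks and describe are hypothetical helpers for illustration.

```python
import re
from collections import Counter

# Bank-to-unit names as decoded by mcelog in this thread (AMD family 15 CPU).
# Banks not seen in the thread are deliberately left out.
BANK_UNITS = {
    0: "data cache",
    1: "instruction cache",
    2: "bus unit",
    4: "northbridge",
}

# Matches kernel log lines such as "MCA: Bank 0, Status 0xd40e400000000833".
BANK_LINE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")


def count_banks(lines):
    """Count how many machine-check records each bank produced."""
    counts = Counter()
    for line in lines:
        match = BANK_LINE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def describe(bank):
    """Return the unit name for a bank, or a placeholder if unknown."""
    return BANK_UNITS.get(bank, "unknown unit")


if __name__ == "__main__":
    sample = [
        "Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833",
        "Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0",
        "Dec 21 12:42:26 kavkaz kernel: MCA: Bank 4, Status 0xdc0e400200000813",
    ]
    for bank, count in sorted(count_banks(sample).items()):
        print(f"bank {bank} ({describe(bank)}): {count} record(s)")
```

Run over a longer stretch of /var/log/messages, a count like this would show whether the same banks report on every hourly poll.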

Dan Langille

2010-Dec-23 19:39 UTC

On 12/22/2010 9:57 AM, John Baldwin wrote:
> On Wednesday, December 22, 2010 7:41:25 am Miroslav Lachman wrote:
>> Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
>> Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>> Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
>> Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
>> Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0
>
> You are getting corrected ECC errors in your RAM.  You see them once an hour
> because we poll the machine check registers once an hour.  If this happens
> constantly you might have a DIMM that is dying?

John:

I take it these ECC errors *may* have been happening for some time. 
What has changed is the OS now polls for the errors and reports them.

-- 
Dan Langille - http://langille.org/

Alan Cox

2010-Dec-24 19:43 UTC

2010/12/23 Dan Langille <dan@langille.org>
> On 12/22/2010 9:57 AM, John Baldwin wrote:
>> On Wednesday, December 22, 2010 7:41:25 am Miroslav Lachman wrote:
>>> Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
>>> Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
>>> Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
>>> Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
>>> Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0
>>
>> You are getting corrected ECC errors in your RAM.  You see them once an hour
>> because we poll the machine check registers once an hour.  If this happens
>> constantly you might have a DIMM that is dying?
>
> John:
>
> I take it these ECC errors *may* have been happening for some time. What
> has changed is the OS now polls for the errors and reports them.

Yes, we enabled MCA by default in 8.1-RELEASE.

Alan
