Hi,

the machine in question was upgraded from 7.3 to FreeBSD 8.2-BETA1 i386 GENERIC.

After this upgrade, I get the following messages in /var/log/messages every hour. The machine is almost idle (for testing only).

Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 1, Status 0xd400400000000853
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source IRD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x2a1c9440
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 2, Status 0xd000400000000863
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source PREFETCH Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 4, Status 0xdc0e400200000813
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 0
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source RD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x2cac9678
Dec 21 12:42:26 kavkaz kernel: MCA: Misc 0xe00d0fff00000000
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 1
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 1 COR OVER BUSLG Source DRD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x23649640
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 1, Status 0xd400400000000853
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 1
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 1 COR OVER BUSLG Source IRD Memory
Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x2a1c9440
Dec 21 12:42:26 kavkaz kernel: MCA: Bank 2, Status 0xd000400000000863
Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105, Status 0x0000000000000000
Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33, APIC ID 1
Dec 21 12:42:26 kavkaz kernel: MCA: CPU 1 COR OVER BUSLG Source PREFETCH Memory

Can somebody tell me what these messages are?

Miroslav Lachman

On Wednesday, December 22, 2010 7:41:25 am Miroslav Lachman wrote:
> Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
> Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105,
> Status 0x0000000000000000
> Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33,
> APIC ID 0
> Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
> Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0

You are getting corrected ECC errors in your RAM. You see them once an hour because we poll the machine check registers once an hour. If this happens constantly you might have a DIMM that is dying?

% ~/mcelog --ascii < foo.txt
mcelog: Cannot open /dev/mem for DMI decoding: Permission denied
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 0 data cache
ADDR 236493c0
Data cache ECC error (syndrome 1c)
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS d40e400000000833 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 1 instruction cache
ADDR 2a1c9440
Instruction cache ECC error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
instruction fetch mem transaction
memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 2 bus unit
L2 cache ECC error
Bus or cache array error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
prefetch mem transaction
memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
MISC e00d0fff00000000 ADDR 2cac9678
Northbridge RAM ECC error
ECC syndrome = 1c
bit33 = err cpu1
bit46 = corrected ecc error
bit59 = misc error valid
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
STATUS dc0e400200000813 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 0 data cache
ADDR 23649640
Data cache ECC error (syndrome 1c)
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
data read mem transaction
memory access, level generic'
STATUS d40e400000000833 MCGSTATUS 0
MCGCAP 105 APICID 1 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 1 instruction cache
ADDR 2a1c9440
Instruction cache ECC error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
instruction fetch mem transaction
memory access, level generic'
STATUS d400400000000853 MCGSTATUS 0
MCGCAP 105 APICID 1 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 2 bus unit
L2 cache ECC error
Bus or cache array error
bit46 = corrected ecc error
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
prefetch mem transaction
memory access, level generic'
STATUS d000400000000863 MCGSTATUS 0
MCGCAP 105 APICID 1 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67

--
John Baldwin
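
For readers who want to check the Status values by hand, here is a minimal Python sketch that pulls out the flag bits and the bus-error fields from a 64-bit status word. It is not the kernel's or mcelog's decoder. The flag positions (bits 46, 59 and 62) are the ones mcelog names in the output above; the remaining flag bits and the 0000 1PPT RRRR IILL bus-error layout are assumed from the generic x86 machine-check status format. The function name decode_mca_status and the table names are hypothetical.

```python
# Sketch only: field layout assumed from the generic x86 machine-check
# status format; bit 46 (corrected ECC) is taken from the mcelog output.
REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
PARTICIPATION = ["SRC", "RES", "OBS", "GEN"]
MEM_IO = ["Memory", "reserved", "I/O", "Generic"]
LEVELS = ["L0", "L1", "L2", "LG"]


def bit(value, n):
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_mca_status(status):
    """Split a 64-bit MCA bank status word into named fields."""
    code = status & 0xFFFF  # low 16 bits: MCA error code
    info = {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),       # "OVER" in the kernel line
        "uncorrected": bit(status, 61),    # clear means "COR"
        "misc_valid": bit(status, 59),
        "addr_valid": bit(status, 58),
        "corrected_ecc": bit(status, 46),
        "error_code": code,
    }
    # Bus error codes match the pattern 0000 1PPT RRRR IILL.
    if code & 0xF800 == 0x0800:
        rrrr = (code >> 4) & 0xF
        info["kind"] = "BUS"
        info["participation"] = PARTICIPATION[(code >> 9) & 0x3]
        info["timeout"] = bit(code, 8)
        info["request"] = REQUESTS[rrrr] if rrrr < len(REQUESTS) else "reserved"
        info["mem_io"] = MEM_IO[(code >> 2) & 0x3]
        info["level"] = LEVELS[code & 0x3]
    return info


if __name__ == "__main__":
    # Status values copied from the log in this thread.
    for status in (0xD40E400000000833, 0xD400400000000853,
                   0xD000400000000863, 0xDC0E400200000813):
        print(hex(status), decode_mca_status(status))
```

Running it on the four Status values from the log gives one dictionary per bank, which can be compared field by field against the mcelog output above.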

John Baldwin wrote:
> On Wednesday, December 22, 2010 7:41:25 am Miroslav Lachman wrote:
>> Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
>> Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105,
>> Status 0x0000000000000000
>> Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33,
>> APIC ID 0
>> Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
>> Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0
>
> You are getting corrected ECC errors in your RAM. You see them once an hour
> because we poll the machine check registers once an hour. If this happens
> constantly you might have a DIMM that is dying?

Yes, it happens constantly. Does "Bank" in this context mean a DIMM socket or something else? If it is a DIMM socket, then it means all modules are dying at the same time :(

Thank you for the mcelog output. BTW, do you have any time plan for releasing a port of mcelog?

Miroslav Lachman

On 12/22/2010 9:57 AM, John Baldwin wrote:
> On Wednesday, December 22, 2010 7:41:25 am Miroslav Lachman wrote:
>> Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
>> Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105,
>> Status 0x0000000000000000
>> Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33,
>> APIC ID 0
>> Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD Memory
>> Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0
>
> You are getting corrected ECC errors in your RAM. You see them once an hour
> because we poll the machine check registers once an hour. If this happens
> constantly you might have a DIMM that is dying?

John:

I take it these ECC errors *may* have been happening for some time. What has changed is the OS now polls for the errors and reports them.

--
Dan Langille - http://langille.org/

2010/12/23 Dan Langille <dan@langille.org>

> On 12/22/2010 9:57 AM, John Baldwin wrote:
>
>> On Wednesday, December 22, 2010 7:41:25 am Miroslav Lachman wrote:
>>
>>> Dec 21 12:42:26 kavkaz kernel: MCA: Bank 0, Status 0xd40e400000000833
>>> Dec 21 12:42:26 kavkaz kernel: MCA: Global Cap 0x0000000000000105,
>>> Status 0x0000000000000000
>>> Dec 21 12:42:26 kavkaz kernel: MCA: Vendor "AuthenticAMD", ID 0x40f33,
>>> APIC ID 0
>>> Dec 21 12:42:26 kavkaz kernel: MCA: CPU 0 COR OVER BUSLG Source DRD
>>> Memory
>>> Dec 21 12:42:26 kavkaz kernel: MCA: Address 0x236493c0
>>
>> You are getting corrected ECC errors in your RAM. You see them once an
>> hour because we poll the machine check registers once an hour. If this
>> happens constantly you might have a DIMM that is dying?
>
> John:
>
> I take it these ECC errors *may* have been happening for some time. What
> has changed is the OS now polls for the errors and reports them.

Yes, we enabled MCA by default in 8.1-RELEASE.

Alan