thr3ads.net - CentOS - [CentOS] ECC memory errors [Apr 2013]


Peter Peltonen

2013-Apr-29 08:17 UTC

[CentOS] ECC memory errors

I started receiving messages like this a few days ago on one of my
servers:

Message from syslogd@ at Mon Apr 29 08:02:55 2013 ...
server1 kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels
"-":
(Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x2 (Aliased
Uncorrectable Non-Mirrored Demand Data ECC))

I've never had ECC memory fail on me before, so now I am wondering about
the following:

* The server is running CentOS 5.7 and is acting as Xen dom0. Is there any
possibility this could be a kernel issue and upgrading would help, or would
upgrading at this point just cause more trouble?

* Is there now a possibility that my data could get corrupted? Should I
shut down the server as soon as possible, or can I keep running until I
replace the memory?

* This server has been running for several years in a datacenter without
problems. In your experience, are problems like this most likely caused
by a failing motherboard or by the memory?

Regards,
Peter

mark

2013-Apr-29 11:59 UTC

On 04/29/13 04:17, Peter Peltonen wrote:
> I started to receive this kind of messages a few days ago on one of my
> servers:
>
> Message from syslogd@ at Mon Apr 29 08:02:55 2013 ...
> server1 kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-":
> (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x2 (Aliased
> Uncorrectable Non-Mirrored Demand Data ECC))
>
> I've never had ECC memory to fail on me before, so now I am wondering the
> following:
>
> * The server is running CentOS 5.7 and is acting as Xen dom0. Is there any
> possibility this could be a kernel issue and upgrading would help, or would
> upgrading at this point just cause more trouble?

Not in my experience.

> * Is there now a possibility that my data can get corrupt: should I
> shutdown the server as soon as possible or can I keep running until I
> replace the memories?

Maybe - I'm just not sure. You need to replace the memory ASAP; order it,
and schedule a maintenance window with all your users *now*.

> * This server has been running for several years in a datacenter without
> problems: what are your experiences, are these kind of problems most likely
> caused by a failing motherboard or the memories?

DIMM went bad. No big thing. Your only problem may be to identify which
one, he says, about to go into work to do just that.

	mark


-- 
"Stock traders are a superstitious and cowardly lot", to paraphrase
the Batman
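
On the question of working out which DIMM is involved: the kernel message quoted in this thread already names a memory controller, a row, and a channel pair. The following is only a rough sketch of how such lines could be tallied from a saved log; the function name `tally_ue_events` and the regular expression are illustrative assumptions fitted to the one UE message shown here, and other EDAC drivers or kernel versions may word their messages differently.

```python
import re
from collections import Counter

# Pattern fitted to the UE line quoted in this thread (an assumption;
# other EDAC drivers and kernels may format the message differently).
UE_PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): UE row (?P<row>\d+), "
    r"channel-a=\s*(?P<chan_a>\d+)\s+channel-b=\s*(?P<chan_b>\d+)"
)


def tally_ue_events(lines):
    """Count uncorrectable-error reports per (controller, row, channel pair)."""
    counts = Counter()
    for line in lines:
        m = UE_PATTERN.search(line)
        if m:
            key = (int(m["mc"]), int(m["row"]),
                   int(m["chan_a"]), int(m["chan_b"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Sample input: the message from the original post, joined onto one line.
    sample = [
        'server1 kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-":',
    ]
    for (mc, row, a, b), n in tally_ue_events(sample).items():
        print(f"MC{mc} row {row} channels {a}/{b}: {n} UE report(s)")
```

If every report lands on the same row and channel pair, that narrows the search to one pair of modules, though mapping a row number to a physical slot still depends on the motherboard.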

Peter Peltonen

2013-Apr-29 12:41 UTC

Hi,

On Mon, Apr 29, 2013 at 2:59 PM, mark <m.roth at 5-cent.us> wrote:
>
> DIMM went bad. No big thing. Your only problem may be to identify which
> one, he says, about to go into work to do just that.
>
Thanks for your response and suggestions.

About identifying the faulty DIMM: Is the memtest provided on the CentOS 5
installation disk the best tool for this purpose? And do I need to switch
ECC off in the BIOS while I test the memory?

The EDAC error message reports problems with bank 0. Can I trust this? I
tried installing edac-utils to get more information, but after installation
it only produces a segmentation fault:

# edac-util --report=simple
Segmentation fault

# edac-util -s
Segmentation fault

# rpm -qv edac-utils
edac-utils-0.9-6.el5

Regards,
Peter
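
When `edac-util` itself will not run, the per-row counters that the EDAC drivers export through sysfs can be read directly. This is a minimal sketch, assuming the legacy csrow layout under `/sys/devices/system/edac/mc` (with `ce_count` and `ue_count` files in each `csrowN` directory) is present on the running kernel; whether a CentOS 5 Xen dom0 kernel exposes it is not confirmed in this thread, and the function name `read_csrow_counts` is made up for illustration.

```python
from pathlib import Path

# Legacy EDAC sysfs location; assumed to exist on the machine in question.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_csrow_counts(root=EDAC_ROOT):
    """Return {(mc, csrow): {"ce": int, "ue": int}} from EDAC sysfs files."""
    results = {}
    for csrow in sorted(Path(root).glob("mc*/csrow*")):
        try:
            ce = int((csrow / "ce_count").read_text().strip())
            ue = int((csrow / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip rows whose counters are missing or unreadable
        results[(csrow.parent.name, csrow.name)] = {"ce": ce, "ue": ue}
    return results


if __name__ == "__main__":
    # Prints nothing if the sysfs tree is absent or empty.
    for (mc, csrow), c in read_csrow_counts().items():
        print(f"{mc}/{csrow}: corrected={c['ce']} uncorrected={c['ue']}")
```

A row whose uncorrected count keeps climbing between runs would be the first candidate for replacement.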

Vipul Agarwal

2013-Apr-29 13:50 UTC

On Mon, Apr 29, 2013 at 1:41 PM, Peter Peltonen <peter.peltonen at gmail.com> wrote:
> Hi,
>
> On Mon, Apr 29, 2013 at 2:59 PM, mark <m.roth at 5-cent.us> wrote:
>
> >
> > DIMM went bad. No big thing. Your only problem may be to identify which
> > one, he says, about to go into work to do just that.
> >
>
> Thanks for your response and suggestions.
>
> About identifying the faulty DIMM: Is the memtest provided on the CentOS5
> installation disk best tool for this purpose? And do I need to switch ECC
> off from BIOS while I test the memories?
>
> The EDAC error msg reports problems with bank0. Can I trust this? I tried
> installing edac-utils to get more information, but after installation it
> only generates segmentation fault:
>
> # edac-util --report=simple
> Segmentation fault
>
> # edac-util -s
> Segmentation fault
>
> # rpm -qv edac-utils
> edac-utils-0.9-6.el5
>
> Regards,
> Peter
>
Hi Peter

One of my old HP DL585 servers had a similar issue, but it turned out that
the DIMM slots were at fault. The server chassis had a few LEDs blinking
red for those DIMM slots, indicating that they were faulty. I removed the
memory from those slots and re-inserted it into spare DIMM slots, and
everything has been working fine since then.

Regards,
Vipul

Peter Peltonen

2013-May-02 21:14 UTC

Replying to myself:

On Mon, Apr 29, 2013 at 3:41 PM, Peter Peltonen <peter.peltonen at gmail.com> wrote:
> The EDAC error msg reports problems with bank0. Can I trust this? I tried
> installing edac-utils to get more information, but after installation it
> only generates segmentation fault:
>
> # edac-util --report=simple
> Segmentation fault
>

Replacing the first memory pair made the error messages go away.

edac-util still segfaults, though. But as the system seems to be otherwise
stable, I probably will not investigate this further.

Regards,
Peter
