I started receiving messages like this a few days ago on one of my servers:

Message from syslogd@ at Mon Apr 29 08:02:55 2013 ...
server1 kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-": (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x2 (Aliased Uncorrectable Non-Mirrored Demand Data ECC))

I've never had ECC memory fail on me before, so now I am wondering the following:

* The server is running CentOS 5.7 and is acting as a Xen dom0. Is there any possibility this could be a kernel issue that upgrading would fix, or would upgrading at this point just cause more trouble?

* Is there now a possibility that my data could get corrupted? Should I shut down the server as soon as possible, or can I keep it running until I replace the memory?

* This server has been running for several years in a datacenter without problems. In your experience, are these kinds of problems more likely caused by a failing motherboard or by the memory modules?

Regards,
Peter
On 04/29/13 04:17, Peter Peltonen wrote:
> I started to receive this kind of messages a few days ago on one of my
> servers:
>
> Message from syslogd@ at Mon Apr 29 08:02:55 2013 ...
> server1 kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 labels "-":
> (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x2 (Aliased
> Uncorrectable Non-Mirrored Demand Data ECC))
>
> I've never had ECC memory to fail on me before, so now I am wondering the
> following:
>
> * The server is running CentOS 5.7 and is acting as Xen dom0. Is there any
> possibility this could be a kernel issue and upgrading would help, or would
> upgrading at this point just cause more trouble?

Not in my experience.

> * Is there now a possibility that my data can get corrupt: should I
> shutdown the server as soon as possible or can I keep running until I
> replace the memories?

Maybe - I'm just not sure. You need to replace the memory asap; order it, and schedule a maintenance window with all your users *now*.

> * This server has been running for several years in a datacenter without
> problems: what are your experiences, are these kind of problems most likely
> caused by a failing motherboard or the memories?

DIMM went bad. No big thing. Your only problem may be to identify which one, he says, about to go into work to do just that.

        mark
--
"Stock traders are a superstitious and cowardly lot", to paraphrase the Batman
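As a starting point for narrowing down which DIMMs are involved, the repeated kernel messages can be tallied by memory controller, row, and channel. The following is a minimal sketch in Python 3, not a definitive tool: the regular expression is modeled only on the single UE line quoted in this thread, other EDAC drivers or kernel versions may format the message differently, and the function names `parse_ue_line` and `tally_ue` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: pattern derived from the one UE line quoted in this thread.
UE_PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): UE row (?P<row>\d+),\s*"
    r"channel-a=\s*(?P<chan_a>\d+)\s+channel-b=\s*(?P<chan_b>\d+)"
)


def parse_ue_line(line):
    """Return (mc, row, chan_a, chan_b) as ints, or None if the line does not match."""
    match = UE_PATTERN.search(line)
    if match is None:
        return None
    return tuple(int(match.group(k)) for k in ("mc", "row", "chan_a", "chan_b"))


def tally_ue(lines):
    """Count uncorrectable-error reports per (mc, row, chan_a, chan_b)."""
    counts = Counter()
    for line in lines:
        key = parse_ue_line(line)
        if key is not None:
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # The message from the original post, used as sample input.
    sample = (
        'server1 kernel: EDAC MC0: UE row 0, channel-a= 0 channel-b= 1 '
        'labels "-": (Branch=0 DRAM-Bank=0 RDWR=Read RAS=0 CAS=0, UE Err=0x2 '
        '(Aliased Uncorrectable Non-Mirrored Demand Data ECC))'
    )
    for key, count in tally_ue([sample]).items():
        print(key, count)
```

In practice the input would be the lines of the system log rather than a single sample string. If every report points at the same row and channel pair, that is the first place to look, although how a row maps to a physical slot depends on the motherboard.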
Hi,

On Mon, Apr 29, 2013 at 2:59 PM, mark <m.roth at 5-cent.us> wrote:
>
> DIMM went bad. No big thing. Your only problem may be to identify which
> one, he says, about to go into work to do just that.
>

Thanks for your response and suggestions.

About identifying the faulty DIMM: is the memtest provided on the CentOS 5 installation disk the best tool for this purpose? And do I need to switch ECC off in the BIOS while I test the memory?

The EDAC error message reports problems with bank 0. Can I trust this? I tried installing edac-utils to get more information, but after installation it only produces a segmentation fault:

# edac-util --report=simple
Segmentation fault

# edac-util -s
Segmentation fault

# rpm -qv edac-utils
edac-utils-0.9-6.el5

Regards,
Peter
On Mon, Apr 29, 2013 at 1:41 PM, Peter Peltonen <peter.peltonen at gmail.com> wrote:
> Hi,
>
> On Mon, Apr 29, 2013 at 2:59 PM, mark <m.roth at 5-cent.us> wrote:
> >
> > DIMM went bad. No big thing. Your only problem may be to identify which
> > one, he says, about to go into work to do just that.
> >
>
> Thanks for your response and suggestions.
>
> About identifying the faulty DIMM: Is the memtest provided on the CentOS5
> installation disk best tool for this purpose? And do I need to switch ECC
> off from BIOS while I test the memories?
>
> The EDAC error msg reports problems with bank0. Can I trust this? I tried
> installing edac-utils to get more information, but after installation it
> only generates segmentation fault:
>
> # edac-util --report=simple
> Segmentation fault
>
> # edac-util -s
> Segmentation fault
>
> # rpm -qv edac-utils
> edac-utils-0.9-6.el5
>
> Regards,
> Peter

Hi Peter,

One of my old HP DL585s had a similar issue, but it turned out that the DIMM slots were at fault. The server chassis had a few LEDs blinking red for those DIMM slots, indicating that they were faulty. I removed the memory from those slots and re-inserted it into the spare DIMM slots, and everything has been working fine since then.

Regards,
Vipul
Replying to myself:

On Mon, Apr 29, 2013 at 3:41 PM, Peter Peltonen <peter.peltonen at gmail.com> wrote:
> The EDAC error msg reports problems with bank0. Can I trust this? I tried
> installing edac-utils to get more information, but after installation it
> only generates segmentation fault:
>
> # edac-util --report=simple
> Segmentation fault
>

Replacing the first memory pair made the error messages go away. edac-util still segfaults, though. But as the system seems to be otherwise stable, I probably will not investigate this further.

Regards,
Peter
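When edac-util itself crashes, the per-row error counters the kernel keeps may still be readable directly from sysfs. The sketch below is written in Python 3 and assumes the csrow-based EDAC sysfs layout (`/sys/devices/system/edac/mc/mc*/csrow*/` with `ce_count`, `ue_count`, and `ch*_dimm_label` files) is present on the system. Whether that layout is available depends on the kernel and on whether an EDAC driver is loaded, and the stock Python on CentOS 5 is older than Python 3. The helper names are made up for illustration.

```python
import glob
import os


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except (IOError, OSError):
        return None


def to_int(text):
    """Convert a counter string to int, or None if missing or malformed."""
    return int(text) if text is not None and text.isdigit() else None


def collect_counts(root="/sys/devices/system/edac/mc"):
    """Collect corrected/uncorrected error counts for every csrow under root."""
    rows = []
    for csrow in sorted(glob.glob(os.path.join(root, "mc*", "csrow*"))):
        label_files = sorted(glob.glob(os.path.join(csrow, "ch*_dimm_label")))
        labels = [read_text(p) for p in label_files]
        rows.append({
            "mc": os.path.basename(os.path.dirname(csrow)),
            "csrow": os.path.basename(csrow),
            "ce_count": to_int(read_text(os.path.join(csrow, "ce_count"))),
            "ue_count": to_int(read_text(os.path.join(csrow, "ue_count"))),
            "labels": [label for label in labels if label],
        })
    return rows


if __name__ == "__main__":
    found = collect_counts()
    if not found:
        print("No EDAC csrow entries found (driver not loaded or layout differs).")
    for row in found:
        print(row["mc"], row["csrow"],
              "CE:", row["ce_count"], "UE:", row["ue_count"],
              "labels:", ", ".join(row["labels"]) or "-")
```

A row whose counters keep climbing is the natural candidate for replacement. The labels are only as useful as what the driver or administrator has filled in; in the message quoted at the start of the thread the labels field was empty ("-").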