Hi,

On one of our servers running Xen, we see many instances like this in /var/log/messages on dom0:

Feb 2 17:45:11 maradona kernel: [172988.068048] MCE_DOM0_LOG: enter dom0 mce vIRQ handler
Feb 2 17:45:11 maradona kernel: [172988.068050] MCE_DOM0_LOG: No more urgent data
Feb 2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr cf839a00, state cc0035400001009f]
Feb 2 17:45:11 maradona kernel: [172988.068059] MCE_DOM0_LOG: No more nonurgent data

It is always CPU8, BANK12. And the server will sometimes just abruptly reboot after logging this.

Does it mean that MCE messages are logged by Xen in /var/log/messages and that there is a problem with this CPU? Do you know how I can dig further and find what the problem is?

Meanwhile, I have set dom0_max_vcpus=1 dom0_vcpus_pin, so my understanding is that dom0 only uses cpu0. Does it mean that the problem reported is about a CPU that is only used by a domU?

Thanks a lot for any light on this,
Sylvain

On Fri, Feb 03, 2012 at 11:37:27AM +0800, Sylvain Chevalier wrote:
> Hi,
>
> On one of our servers running xen, we see many instances like this in
> /var/log/messages on dom0:
>
> Feb 2 17:45:11 maradona kernel: [172988.068048] MCE_DOM0_LOG: enter dom0 mce vIRQ handler
> Feb 2 17:45:11 maradona kernel: [172988.068050] MCE_DOM0_LOG: No more urgent data
> Feb 2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr cf839a00, state cc0035400001009f]
> Feb 2 17:45:11 maradona kernel: [172988.068059] MCE_DOM0_LOG: No more nonurgent data
>
> it is always CPU8, BANK12. And the server will sometimes just abruptly
> reboot after logging this.

> Does it mean that MCE messages are logged by xen in /var/log/messages
> and that there is a problem with this cpu? Do you know how I can dig
> further and find what the problem is?

Betcha it is the ram in that bank.

I'm getting similar errors in a server that I just swapped out, only my MCE errors say:

(XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.
(XEN) Bank 4: dc0c4000fe080813[c008000401000000] at 363fe9000

(this is on my serial console, not /var/log/messages)

'non-fatal, correctable incident on cpu0, Bank 4' sure sounds a lot like it's a correctable ECC error. The crash would then be explained by an uncorrectable ecc error (commonly in failing ram, you get correctable errors, then an uncorrectable error.)

Now, this was on an ancient garbage nvidia mcp55 motherboard and nothing like the kernel EDAC/bluesmoke module works with it, xen or no.

The counter evidence to that theory is that the motherboard system event log (accessed through the bios setup screen) doesn't show any errors.

Now, like I said, this server was in production, so I drove a spare in to the co-lo, swapped the hard drives and brought it back up. (It took a lot longer than it should have, as this server hadn't been touched in years, and somehow the good disks didn't end up with bootloaders. By 'somehow' I mean, "I am an idiot and did not install bootloaders when I replaced bad disks". Then I didn't bring my rescue CD, and the DHCP/tftp PXE server I would have used to boot it into rescue mode was on the server that was down. It took all day when it should have taken about as long as it takes to get up to the 14th floor of Market Post Tower.)

Anyhow, I'm delaying diagnostics on my bad server until tomorrow; I'd bet lunch that if I turn ECC off and run memtest, I'll find a bad RAM module.
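
One way to dig further into the line from the first message is to decode its "state" value bit by bit. The following is only a minimal sketch, and it rests on assumptions the thread does not confirm: that "state" is the raw MCi_STATUS register of bank 12, that the CPU is an Intel part using the architectural layout (flag bits 63 to 57, corrected error count in bits 52:38, MCA error code in bits 15:0), and that the corrected error count field is actually implemented on that CPU. The function name decode_mci_status is made up for illustration.

```python
# Sketch: decode an MCi_STATUS value such as the "state" field in the dom0 log.
# Assumption: Intel architectural layout. The thread does not say which CPU
# the server has, so treat the Intel-specific fields as tentative.

FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten (more than one error seen)
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting of this error was enabled in MCi_CTL
    59: "MISCV",  # MCi_MISC holds additional information
    58: "ADDRV",  # MCi_ADDR holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    return {
        "flags": flags,
        # A valid record without UC means the hardware corrected the error.
        "corrected": "VAL" in flags and "UC" not in flags,
        # Bits 52:38: corrected error count (Intel, only if implemented).
        "corrected_count": (status >> 38) & 0x7FFF,
        # Bits 31:16: model-specific error code.
        "model_code": (status >> 16) & 0xFFFF,
        # Bits 15:0: architectural MCA error code.
        "mca_code": mca_code,
        # Pattern 0000 0000 1MMM CCCC marks a memory controller error (Intel).
        "memory_controller": (mca_code & 0xFF80) == 0x0080,
    }


if __name__ == "__main__":
    result = decode_mci_status(0xCC0035400001009F)
    for key, value in result.items():
        print(f"{key}: {value}")
```

Under those assumptions, the logged value cc0035400001009f has VAL, OVER, MISCV and ADDRV set and UC clear, an MCA error code of 0x009f (memory controller pattern) and a corrected error count field of 213. That would be consistent with repeated corrected memory errors, as guessed above, but it does not prove which module is bad.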

Hi both,

you might wanna throw 30 minutes into setting up an OMD Nagios instance (www.omdistro.org), adding the affected servers to the check_mk config and grabbing my Linux ECC error check plugin from the community exchange (http://exchange.check-mk.org).

I *really really hope* I got everything right and it will be able to detect 1-bit/2-bit ECC errors once the CPUs report them.

The error

>> Feb 2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr cf839a00, state cc0035400001009f]

is as descriptive as anything that isn't a real big-iron Unix box can get. (Of course, then you'd have better ECC and a page deallocation table anyway, and all this would not be causing problems.)

My assumption is that Xen properly forwards MCEs. There was a presentation by Intel on the topic at one of the last XenSummits. I wasn't there but read through it some time ago. I guess you'll be able to find it.

If needed I can do a short walkthrough of the setup. I just wanna avoid this looking like an advertisement. It's not my fault there's no other good ECC check plugin for Nagios :)

2012/2/3 Luke S. Crawford <lsc@prgmr.com>:
> On Fri, Feb 03, 2012 at 11:37:27AM +0800, Sylvain Chevalier wrote:
>> Hi,
>>
>> On one of our servers running xen, we see many instances like this in
>> /var/log/messages on dom0:
>>
>> Feb 2 17:45:11 maradona kernel: [172988.068048] MCE_DOM0_LOG: enter dom0 mce vIRQ handler
>> Feb 2 17:45:11 maradona kernel: [172988.068050] MCE_DOM0_LOG: No more urgent data
>> Feb 2 17:45:11 maradona kernel: [172988.068056] [CPU8, BANK12, addr cf839a00, state cc0035400001009f]
>> Feb 2 17:45:11 maradona kernel: [172988.068059] MCE_DOM0_LOG: No more nonurgent data
>>
>> it is always CPU8, BANK12. And the server will sometimes just abruptly
>> reboot after logging this.
>
>> Does it mean that MCE messages are logged by xen in /var/log/messages
>> and that there is a problem with this cpu? Do you know how I can dig
>> further and find what the problem is?
>
> Betcha it is the ram in that bank.
>
> I'm getting similar errors in a server that I just swapped out, only my
> MCE errors say:
>
> (XEN) MCE: The hardware reports a non fatal, correctable incident occured on CPU 0.
> (XEN) Bank 4: dc0c4000fe080813[c008000401000000] at 363fe9000
>
> (this is on my serial console, not /var/log/messages)
>
> 'non-fatal, correctable incident on cpu0, Bank 4' sure sounds a lot
> like it's a correctable ECC error. The crash would then be explained
> by an uncorrectable ecc error (commonly in failing ram, you get correctable
> errors, then an uncorrectable error.)

bingo :>

> Now, this was on an ancient garbage nvidia mcp55 motherboard and nothing
> like the kernel EDAC/bluesmoke module works with it, xen or no.
>
> The counter evidence to that theory is that the motherboard system event
> log (accessed through the bios setup screen) doesn't show any errors.

MCEs are often logged while nothing shows up in iLO or other vendor management logs. I guess this is because Intel / AMD decide when the CPU raises an MCE/EDAC event, whereas the hardware vendors might even be slightly inclined not to replace parts immediately because of a single PCI CRC error (these aren't even checked in Linux by default... lol).

Flo

--
the purpose of libvirt is to provide an abstraction layer hiding all xen features added since 2006 until they were finally understood and copied by the kvm devs.
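
For readers who want a rough idea of what an ECC check for Nagios can look like, here is a minimal sketch. It is not the check_mk plugin mentioned above, whose internals are not described in this thread. It assumes the kernel's EDAC driver is loaded and exposes per-controller ce_count and ue_count files under /sys/devices/system/edac/mc, which, as noted above, is not the case on every board and is not confirmed to work in a Xen dom0. The function names read_edac_counts and nagios_state are made up for illustration.

```python
# Sketch: turn the EDAC corrected/uncorrected error counters into a
# Nagios-style state. Assumption: an EDAC driver is loaded and populates
# /sys/devices/system/edac/mc/mc*/{ce_count,ue_count}.
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} for each mcN."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def nagios_state(counts):
    """Map counters to (exit_code, message): 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN."""
    if not counts:
        return 3, "UNKNOWN - no EDAC memory controllers found"
    total_ce = sum(ce for ce, _ in counts.values())
    total_ue = sum(ue for _, ue in counts.values())
    summary = f"{total_ce} corrected, {total_ue} uncorrected ECC errors"
    if total_ue > 0:
        return 2, f"CRITICAL - {summary}"
    if total_ce > 0:
        return 1, f"WARNING - {summary}"
    return 0, f"OK - {summary}"
```

A wrapper script would call nagios_state(read_edac_counts()), print the message and exit with the returned code. Treating any corrected error as a warning is a design choice made here because of the pattern described above, where correctable errors tend to precede an uncorrectable one.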