Hi all,

we are running a CentOS 5.2 Xen virtualization system with the latest CentOS packages and a couple of VMs on a Dell PowerEdge. "Sometimes" the whole machine freezes with nothing in the log files and nothing on the console. "Sometimes" really means we cannot determine why or when: sometimes the machine was idle with just one VM, sometimes quite busy with a couple of VMs.

Has anybody had the same experience? If yes, any hints on how to resolve it or how to trace the cause?

Thanks.

On Fri, Apr 3, 2009 at 4:18 PM, Maros TIMKO <timko at pobox.sk> wrote:
> we are running CentOS 5.2 Xen virtualization system with the latest CentOS
> packages with couple of VMs on DELL PowerEdge. "Sometimes" the whole machine
> freezes without anything in log files, anything on the console. "Sometimes"
> really means we cannot define why or when. Sometimes the machine was idle
> with just one VM, sometimes quite busy with couple of VMs.
>
> Has anybody had the same experience? If yes, any hints on how to resolve it
> or how to trace the cause?

The complete freezing of a machine like that sounds like a hardware issue to me, most likely the memory. Does the machine unfreeze after a while, or do you have to power cycle the server when it happens? I would suggest running a memtest.

Regards,
Tim

--
Tim Verhoeven - tim.verhoeven.be at gmail.com - 0479 / 88 11 83

Hoping the problem magically goes away by ignoring it is the "microsoft approach to programming" and should never be allowed. (Linus Torvalds)

I've had to deal with issues like these in the past, and I can say they always suck. Normally, the whole OS freezes due to a hardware issue, and isolating the cause is extremely time consuming.

If it happens on a regular basis (e.g. every 60 or maybe 90 days), the most likely culprit is the DRAC card. There is a known issue where a virtual USB floppy or CD device spontaneously disappears from the OS, causing an OS freeze. I believe there is a kernel parameter to pass and a firmware upgrade to apply.

When addressing any hardware issue, the default response from the vendor will always be "have you upgraded the BIOS and firmware on all the cards?" In general, that will be your first step. The next canned reply will be "do you have any third-party cards or equipment?" (external USB drives, third-party memory, unsupported cards, etc.). If so, you will be told to remove them or you're on your own.

Check the controller card logs (BIOS, DRAC, RAID controller, etc.) and run the Dell diagnostic tools on the server. Make sure you run a full check on the memory. (You might also try swapping memory DIMM positions to see if the behavior changes.)

The dmesg log is your friend. Investigate setting up net-dump to create a crash dump file on a remote system.

A remote monitoring system that collects system logs and SNMP traps and performs active monitoring can be useful in identifying any events that lead up to the system freeze (e.g. memory slowly leaking away, processor spiking, etc.). If you have a DRAC or BMC card, configure it with an IP address and have it send SNMP traps to a monitoring system.

Pay attention to any physical changes that coincide with the freeze (e.g. fans running full bore, which normally means some instruction ran into a loop).

Just a note: you really want your Xen system to be running bare bones. Do not install any unnecessary packages; they just complicate your troubleshooting in this instance.
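
As a rough illustration of the monitoring suggestion above, here is a minimal shell sketch that prints one sample of the 1-minute load average and free memory, which could then be sent to syslog and forwarded to a remote host. The function name `sample_line`, the `freeze-watch` tag and the 60-second interval are assumptions made for the example, not something from the thread.

```shell
#!/bin/sh
# Sketch: print one sample line with load average and free memory.
# The two file arguments default to the usual Linux /proc paths; they are
# parameters only so the function can be tried against sample files.
sample_line() {
    loadavg_file=${1:-/proc/loadavg}
    meminfo_file=${2:-/proc/meminfo}
    # The first field of /proc/loadavg is the 1-minute load average.
    load=$(cut -d' ' -f1 "$loadavg_file")
    # MemFree is reported in kB.
    memfree=$(awk '/^MemFree:/ {print $2}' "$meminfo_file")
    printf 'load1=%s memfree_kb=%s\n' "$load" "$memfree"
}

# Hypothetical usage (tag and interval are assumptions): log a sample to
# syslog once a minute so a remote syslog host keeps the last readings.
# while :; do logger -t freeze-watch "$(sample_line)"; sleep 60; done
```

If syslog is forwarded to another machine, the last samples before a freeze survive even though the frozen host cannot write them to disk.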

Configuring the server to send syslog messages to tty12 (or to a serial console monitored from another system) can sometimes be helpful to see what the last write was supposed to be (if the disk is dying before a write). Add the following to syslog.conf and leave your console on tty12 (since you won't be able to change it after a freeze).

    # Log everything to tty12
    *.*    /dev/tty12

I thought I read that the PAE kernel is superfluous (since 5.x), but maybe that is with CentOS 5.3.

_______________________________________________
CentOS-virt mailing list
CentOS-virt at centos.org
http://lists.centos.org/mailman/listinfo/centos-virt
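
The tty12 suggestion above can be sketched as a small shell helper that appends the rule only if it is not already present. The function name `add_tty12_rule` is made up for this example, and the default path `/etc/syslog.conf` is an assumption about a stock CentOS 5 install.

```shell
#!/bin/sh
# Sketch: append the "log everything to tty12" rule to a syslog
# configuration file unless an equivalent line is already there.
# The path is a parameter; /etc/syslog.conf is assumed as the default.
add_tty12_rule() {
    conf=${1:-/etc/syslog.conf}
    # Look for an uncommented "*.* /dev/tty12" rule (any whitespace between).
    if ! grep -Eq '^\*\.\*[[:space:]]+/dev/tty12[[:space:]]*$' "$conf"; then
        printf '# Log everything to tty12\n*.*\t/dev/tty12\n' >> "$conf"
    fi
}

# Hypothetical usage, as root:
# add_tty12_rule /etc/syslog.conf
```

The syslog daemon has to reload its configuration before the rule takes effect (on CentOS 5 this is probably `service syslog restart`), and the console should then be switched to tty12 with Alt+F12 and left there.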

Hi all,

thanks to all for the valuable replies. It seems we have identified the issue. We confirmed that it is not hardware related, as it has been reproduced on different machines and platforms, with different BIOS versions.

We are running a system performance/statistics collector that regularly executes the "xentop" command on Dom0. This is causing the issue. If we execute:

    xentop -b -d 0.1 > /dev/null

in multiple instances, it freezes the system. It was reproduced on a CentOS 5.3 (kernel-xen-2.6.18-128.1.6.el5) system. A bug has been filed for this issue:
http://bugs.centos.org/view.php?id=3454

With regards,

Tino
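
For anyone who wants to try the reproduction on a test box, the "multiple instances" step might be scripted roughly as below. This is only a sketch: the helper name `run_parallel` and the instance count of 4 are assumptions, since the report says only "multiple instances". On an affected Dom0 it is expected to freeze the machine, so do not run it on a production host.

```shell
#!/bin/sh
# Sketch: start COUNT background copies of a command, discard their
# output, and wait for all of them to exit.
# WARNING: with the xentop command from the report this is expected to
# freeze an affected Dom0; use a test machine only.
run_parallel() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" > /dev/null 2>&1 &
        i=$((i + 1))
    done
    wait
}

# Hypothetical invocation as root (the count of 4 is an assumption):
# run_parallel 4 xentop -b -d 0.1
```

Because xentop in batch mode keeps running until interrupted, the call does not return on its own; stop it with Ctrl-C if the system survives.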

Hey,

I'm wondering whether your problem is related to mine. Earlier today I had to restart one of the domUs on one of our systems. I used xm shutdown instead of xm destroy and then ran xm list to determine whether the domU had shut down. Upon issuing xm list a second time, the entire server crashed and rebooted. I've checked the logs and have yet to find anything.

I've attached a transcript of the commands as I executed them on the server. The system is running CentOS 5.3 x64 w/Xen (kernel 2.6.18-128.1.6.el5xen).

Any thoughts?

Thanks,
Matt

--
Mathew S. McCarrell
Clarkson University '10
mccarrms at gmail.com
mccarrms at clarkson.edu

2009/4/7 Maros Timko <timkom at gmail.com>
> We are running a system performance/statistics collector that executes
> "xentop" command on Dom0 regularly. This is causing issues. If we execute:
> xentop -b -d 0.1 > /dev/null
> in multiple instances, it will freeze the system.
> It was reproduced on CentOS 5.3 (kernel-xen-2.6.18-128.1.6.el5) system.
> There is created a bug for this issue:
> http://bugs.centos.org/view.php?id=3454
-------------- next part --------------
[mccarrms@isengard ~]$ ssh xen1
Last login: Mon Apr 27 11:26:22 2009 from isengard.cslabs.clarkson.edu
[mccarrms@xen1 ~]$ sudo xm list

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

Password:
Sorry, try again.
Password:
Name         ID Mem(MiB) VCPUs  State    Time(s)
Domain-0      0     9899     8  r-----   12458.2
atp          11      255     1  -b----   72748.8
auth          1      127     1  -b----     121.7
autoguilt     2      255     1  -b----     523.2
dukr          3      255     1  -b----     770.4
list          5      255     1  -b----     191.4
management    6      255     1  -b----     517.4
osp1          7      255     1  -b----      70.4
osp2          8      255     1  -b----      68.8
tremulous     9      255     1  -b----  397287.4
[mccarrms@xen1 ~]$ xm console atp
ERROR Internal error: Could not obtain handle on privileged command interface (13 = Permission denied)
Error: Most commands need root access. Please try again as root.
[mccarrms@xen1 ~]$ sudo xm console atp
Out of Memory: Kill process 2626 (TreeLimitedRun) score 96585 and children.
Out of memory: Killed process 2627 (spectrum).
Out of Memory: Kill process 2694 (TreeLimitedRun) score 96565 and children.
Out of memory: Killed process 2695 (spectrum).
Out of Memory: Kill process 2914 (TreeLimitedRun) score 96210 and children.
Out of memory: Killed process 2915 (spectrum).
Out of Memory: Kill process 3014 (TreeLimitedRun) score 96153 and children.
Out of memory: Killed process 3015 (spectrum).
Out of Memory: Kill process 3018 (TreeLimitedRun) score 96177 and children.
Out of memory: Killed process 3019 (spectrum).
Out of Memory: Kill process 4466 (spectrum) score 189626 and children.
Out of memory: Killed process 4466 (spectrum).
Out of Memory: Kill process 6324 (TreeLimitedRun) score 96129 and children.
Out of memory: Killed process 6325 (spectrum).
Out of Memory: Kill process 10680 (TreeLimitedRun) score 96147 and children.
Out of memory: Killed process 10681 (spectrum).
Out of Memory: Kill process 10800 (TreeLimitedRun) score 96167 and children.
Out of memory: Killed process 10801 (spectrum).
Out of Memory: Kill process 10852 (TreeLimitedRun) score 96159 and children.
Out of memory: Killed process 10853 (spectrum).
Out of Memory: Kill process 10856 (TreeLimitedRun) score 96218 and children.
Out of memory: Killed process 10857 (spectrum).
[mccarrms@xen1 ~]$ xm shutdown atp
ERROR Internal error: Could not obtain handle on privileged command interface (13 = Permission denied)
Error: Most commands need root access. Please try again as root.
[mccarrms@xen1 ~]$ sudo xm shutdown atp
[mccarrms@xen1 ~]$ sudo xm list
Name         ID Mem(MiB) VCPUs  State    Time(s)
Domain-0      0     9899     8  r-----   12459.1
atp          11      255     1  -b----   72749.6
auth          1      127     1  -b----     121.7
autoguilt     2      255     1  -b----     523.2
dukr          3      255     1  -b----     770.4
list          5      255     1  -b----     191.4
management    6      255     1  -b----     517.4
osp1          7      255     1  -b----      70.4
osp2          8      255     1  -b----      68.8
tremulous     9      255     1  -b----  397287.4
[mccarrms@xen1 ~]$ sudo xm list