Hi all,

This is my first post to the list; I hope someone out there can help!

I am running Xen 3.0.3 with a CentOS 5.2 based Dom0 (kernel-xen-2.6.18-92.1.22.el5).

Recently I have noticed complete system lockups on a few different servers. Neither Dom0 nor any of the guests responds to pings, and connecting a keyboard and monitor to the system only shows a blank screen. Nothing is written to the logs at the time of the lockup.

The problem is very difficult to reproduce and seems random. Sometimes it happens after a system has been left running for a few weeks; other times it can happen after a reboot. I have tried taxing the system by running various scripts, rebooting numerous times, and creating/destroying a few guests, but have not been able to trigger it. It seems like a hardware issue, but it has been reproduced on a few different machines.

For a while (clutching at straws) I thought it was due to the clock change for daylight saving time, so I tried changing the time backwards and forwards, but this had no effect.

Has anyone else out there seen a problem like this? Is there any way to diagnose it when it does happen? (It is very frustrating to have a hung system that you cannot access for any information.)

If anyone wants further info, or has ideas on what I could try, please let me know.

Regards,
Paraic.

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
On Fri, Apr 03, 2009 at 03:56:28PM +0100, Paraic Gallagher wrote:
> I am running xen 3.0.3, with CentOS 5.2 based Dom0
> (kernel-xen-2.6.18-92.1.22.el5)
> Recently I have noticed some complete system lockups on a few different
> servers. Neither Dom0 or any of the guests respond to pings, connecting a
> keyboard and monitor to the system only shows a blank screen. Nothing is
> written to logs at time of lockup.

I have seen similar issues with one of my servers, and I have yet to nail down the cause.

Specs:
Distro: Debian Etch
Kernel: 2.6.18-6-xen-amd64
CPU: 2x Quad-Core AMD Opteron(tm) Processor 2350
Memory: 16G
Disk: 3ware 9650LE with 8-drive RAID6
Xen: 3.2 (from the Debian repo)

All VMs are LVM backed. I am not running any HVM guests.

For a while I was seeing "softlockup on cpu" messages scrolling on the console and thought that might be the cause. Unfortunately, after I updated the kernel the errors went away, but I have had another lockup since then.

I've found a fairly consistent pattern, although the timing is not predictable.

A VM typically goes unresponsive first. If left unchecked for long enough, the host will lock up. If I catch it in time, I have had limited success running xm destroy on the domU. Most of the time, running xm destroy on the domU causes the host to lock up immediately, requiring a hard reboot.

The most recent lockup was a bit different from what I had seen in the past.

The domU locked up (no output on the domU console). xm destroy locked dom0. I rebooted with a remote power strip, and dom0 and all domUs came back up. Nothing in the logs, as usual. Ten minutes later dom0 was locked again. I drove to the datacenter, and about 30-45 minutes after the lock the machine became responsive again (according to the monitoring server); I was able to display a website running on a VM. Then the machine went unresponsive again, and did not respond to physical console access either. Another hard reboot and things are OK.

That was the first time I had ever had so many lockups so close together. Typically the lockups seem to be 1-2 weeks apart.

I have even tried setting up netconsole on dom0 to try to catch kernel errors, with no success.

--
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
2009/4/3 Nick Anderson <nick@anders0n.net>
> I have seen similar issues with one of my servers. I have yet to nail
> down the issue.
>
> Specs:
> Distro: Debian Etch
> Kernel: 2.6.18-6-xen-amd64
> CPU: 2x Quad-Core AMD Opteron(tm) Processor 2350
> Memory: 16G
> Disk: 3ware 9650LE with 8 drive Raid6
> Xen: 3.2 (from debian repo)
>
> All vms are LVM backed. Not running any HVM guests.

Thanks for the response. After searching the net for a few weeks with no luck in finding similar issues, I was beginning to think I was going crazy!

Some further details. I have seen the issue on two types of servers:

Servers: Dell PE 1950 and 2950
CPU: 2x quad-core Intel Xeon E5410 @ 2.33GHz
Memory: 4G and 16G
Disk: PERC 6/i 1.11, 2x250 RAID1, ST3250620NS Rev: 3BKT

All VMs are LVM backed on this system, except for Dom0.

> A VM typically goes unresponsive first. If left unchecked for long
> enough the host will lock. If caught in time I have had limited
> success running xm destroy on the domU. Most of the time running xm
> destroy on the domU causes the host to lock immediately requiring a
> hard reboot.
>
> That was the first time I had ever had so many lockups so close
> together. Typically the lockups seem to be 1-2 weeks apart.
>
> I have even tried setting up netconsole on dom0 to try to catch kernel
> errors with no success.

From the description this seems to be quite a similar problem; however, I haven't noticed the guest VMs locking up prior to Dom0. Something to keep an eye on. Are you running a particular load on the system at the time, or is it somewhat idle? In my case it seems to be idle before the lockup.

rgds,
Paraic.
On Fri, Apr 03, 2009 at 04:59:33PM +0100, Paraic Gallagher wrote:
> Thanks for the response. After searching net for few weeks with no luck
> in finding similar issues was beginning to think I was going crazy!

I'm feeling the same way. I just can't seem to find anything that points to an answer.

> All vms are LVM backed on this system except for Dom0.
> This seems to be quite a similar problem from the description, however I
> haven't noticed the guest vms locking up prior to Dom0. Something to keep
> an eye on. Are you running a particular load on the system at the time or
> is it somewhat idle? Seems to be idle in my case before lockup.

The domUs are definitely not overloaded. I have had the lockups at different loads and times, but it most frequently happens during the night (so at lower-load times). I have had it occur in the middle of the day as well.

When this first started, I didn't notice the VMs locking either. I tightened up my monitoring so that I would get faster notification of any one thing becoming unresponsive, and that helped me log into the dom0 in time to start looking at the domU that was locked.

Also, this is not specific to any one domU. At first I thought it was, but time has proved me wrong, and I have had each domU lock up similarly.

All of my domUs currently have only a single vcpu, but at one point one of them (the one I thought was the only domU locking) had 2 vcpus. This had no bearing on the lockups as far as I could tell.

Again, the only thing I have correlated was the appearance of "softlockup detected on cpu" messages on the domU consoles during an episode.

I'm still baffled by how the server was unresponsive for 45 minutes, then became responsive again for about 10 minutes, only to lock up again and be unresponsive from the dom0 physical console.

--
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
On Fri, Apr 3, 2009 at 11:44 PM, Nick Anderson <nick@anders0n.net> wrote:
> Again the only thing I have correlated was the appearance of
> "softlockup detected on cpu" msgs on the domU consoles during an
> episode.

As a general suggestion, you might want to try:
- upgrading to CentOS 5.3
- setting up a syslog server
- setting dom0 and domU to log to the syslog server
- dedicating a physical core to dom0

Regards,
Fajar
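
The remote-logging and dom0-core suggestions above could be sketched roughly as follows. This is only a sketch under stated assumptions: it assumes the classic sysklogd syntax used on CentOS 5 and the xm toolstack, the host name loghost.example.com is a placeholder, and the function names are made up for illustration. It prints the suggested settings rather than applying them.

```shell
#!/bin/sh
# Sketch only: prints suggested settings, it does not change the system.
# LOGHOST is a placeholder for your own syslog server (assumption).
LOGHOST="${LOGHOST:-loghost.example.com}"

# sysklogd rule that forwards every facility/priority to a remote host (UDP).
# (hypothetical helper name)
syslog_forward_rule() {
    printf '*.*\t@%s\n' "$1"
}

# xm commands that limit dom0 to one VCPU and pin that VCPU to physical CPU 0.
# (hypothetical helper name)
dom0_pin_commands() {
    echo "xm vcpu-set Domain-0 1"
    echo "xm vcpu-pin Domain-0 0 0"
}

echo "# append to /etc/syslog.conf on dom0 and on each Linux domU:"
syslog_forward_rule "$LOGHOST"
echo "# run on dom0 as root:"
dom0_pin_commands
```

With sysklogd the receiving server also has to be started with -r to accept remote messages. Pinning dom0 only dedicates the core if the guests are kept off it as well, for example with a cpus = "1-7" line in each domU config on an 8-core host; that range is an assumption about the core count, not something stated in the thread.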
Hi,

I just want to tell you that I have the same issue on one server!

Hardware: Fujitsu Siemens PRIMERGY TX200 S4
CPU: Intel Xeon Dual Quad E5405 2GHz
Hardware RAID: LSI Logic / Symbios Logic MegaRAID SAS 1078 with SAS HDDs

I'm running 4 guests on it:
- Win2003
- Win2003
- Gentoo Linux
- Windows XP Prof

Xen 3.3.0 is running on Gentoo with a 2.6.18-xen-r12 kernel.

The system hangs roughly every 3-4 weeks, as far as I can tell. This server is quite new (from Nov 2008), and ServerView doesn't report any hardware problems, so from that point of view the hardware seems to be OK.

When the server hangs, it does not respond to any kind of input. The network is not working (no ping to dom0 or to any of the guests), and the keyboard/monitor on the server itself does not respond to anything either. Black screen, nothing more. A hard reset is the only way to get the system back to life.

/var/log/messages shows nothing. It's like disconnecting the power cable. I have no idea and no hints about this problem.

At the moment I have a cron job running which collects some system information from dom0 every minute. I hope that the very last run (just before the next crash happens) will show something that points me to the problem. However, I currently have no clue which kind of information will be helpful for this purpose. I currently log the following things every minute:
- dmesg
- free
- netstat -lnp
- ps aux
- w
- vgdisplay
- lvdisplay

Any hints about other information that could be helpful? Any Xen-related commands?

Interesting that you use LVM too. I also use LVM for my guests, and I use the snapshot functionality on a daily basis to back up the server to tape. dom0 is running on a normal partition. I use lvm 2.02.36.

Regards,
Martin
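
A once-a-minute collection job like the one Martin describes might look roughly like the sketch below. It is not his actual script: the function names and the output layout are made up for illustration, and the three xm commands (xm list, xm info, xm dmesg) are added on the assumption that the xm toolstack is in use. Each command's output goes into one timestamped file, and sync is called at the end so that the last snapshot has a better chance of reaching the disk before a hard lockup.

```shell
#!/bin/sh
# Sketch of a per-minute dom0 snapshot job (hypothetical names throughout).

# Append a header and the output of one command to the snapshot file.
# A missing or failing command is recorded instead of aborting the run.
run_logged() {
    logfile="$1"; shift
    {
        echo "=== $* ==="
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1 || echo "(exit status $?)"
        else
            echo "(command not found: $1)"
        fi
    } >> "$logfile"
}

# Write one timestamped snapshot into directory $1 and print its path.
take_snapshot() {
    snapdir="$1"
    mkdir -p "$snapdir" || return 1
    snapfile="$snapdir/snap-$(date +%Y%m%d-%H%M%S).log"
    # The commands listed in the message above.
    run_logged "$snapfile" dmesg
    run_logged "$snapfile" free
    run_logged "$snapfile" netstat -lnp
    run_logged "$snapfile" ps aux
    run_logged "$snapfile" w
    run_logged "$snapfile" vgdisplay
    run_logged "$snapfile" lvdisplay
    # Xen-side state; assumes the xm toolstack (an addition, not from the post).
    run_logged "$snapfile" xm list
    run_logged "$snapfile" xm info
    run_logged "$snapfile" xm dmesg
    # Flush to disk so the newest snapshot is more likely to survive a reset.
    sync
    echo "$snapfile"
}

# Take a snapshot only when an output directory is given as the first argument.
if [ -n "${1:-}" ]; then
    take_snapshot "$1"
fi
```

Saved under a path of your choosing (for example /usr/local/sbin/xen-snapshot.sh, a made-up name), it could be called from cron with the target directory as its only argument. One file per minute adds up quickly, so old snapshots would need pruning.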
On Sat, Apr 04, 2009 at 05:00:57PM +0200, Martin Fernau wrote:
> Hi,
> I just want to tell you that I've the same issue for one server!
> hints about other informations which could be helpful? any xen related

sar, from the sysstat package, collects invaluable information.

--
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
On Sat, Apr 4, 2009 at 4:00 PM, Martin Fernau <m.fernau@cps-net.de> wrote:
> I just want to tell you that I've the same issue for one server!
> [...]
> The xen 3.3.0 is running on a gentoo with a 2.6.18-xen-r12 kernel.

I had the same problem with the 2.6.18-xen-rX Gentoo kernels, so I made my own ebuild and patches from the openSUSE Xen patches. You can get it from http://code.google.com/p/gentoo-xen-kernel/downloads/list

Andy
I tried this before. I ran your kernel for a few months, but it changed nothing: I had freezes with that kernel too, in the same way.

On Sunday, 5 April 2009 14:29:20, Andrew Lyon wrote:
> I had the same problem with the 2.6.18-xen-rX Gentoo kernels, so I
> made my own ebuild and patches from the openSUSE Xen patches, you can
> get it from http://code.google.com/p/gentoo-xen-kernel/downloads/list
>
> Andy
Over the last year, I've experienced a couple of sources of lockups.

The first was resolved by going to the stock xen 2.6.18.8 kernel compiled from source. (I had been using the Debian etch kernel; I found commentary online describing the same symptoms on Ubuntu, Redhat, and CentOS, each with their distro-specific kernel.)

This one tended to result in kernel oops messages--soft IRQ lockups, as I recall. The lockup would start with a domU and within a few minutes would kill the dom0 too. The fastest way to trigger this one was to create and shut down domUs, although I don't recall that being the only way.

The second, with the stock kernel, was an errant USB hub attached to a xen host. Removing the hub resolved the issue. These were complete, sudden lockups of the dom0 and all domUs -- basically everything. Higher traffic over the USB port would trigger this lockup.

So, for those who haven't tried the stock xen kernel, and are able to try it (based on driver support, etc.), it might help.

--t

On Apr 5, 2009, at 1:20 PM, Martin Fernau wrote:
> I tried this before. I had your kernel a few months but this changed
> nothing.
> I had freezed with this kernel too in the same way.
By "stock xen 2.6.18.8 kernel", do you mean the original kernel from "http://www.xen.org/download/"? I currently use the xen-kernel 2.6.18-r12 from my distro, so I could give it a try...

How did you notice these kernel oops and/or soft IRQ lockups? I'm not able to discover _any_ abnormal events on my system, as all logfiles are clean. There must be a way to debug this...

The only USB device I currently have attached to my dom0 is a Smart-UPS system. I don't know if this really could kill the whole machine, as the communication between dom0 and this UPS should be very, very low.

We must find a way to diagnose these lockups. Is there any debug-log functionality we could enable in Xen to start tracking down this problem?

I'm afraid that these lockups could become a knockout criterion for Xen on professional servers in the future...

On Sunday, 5 April 2009 22:33:36, thomas morgan wrote:
> Over the last year, I've experienced a couple of sources of lockups.
>
> The first was resolved by going to the stock xen 2.6.18.8 kernel
> compiled from source (had been using the Debian etch kernel; found
> commentary online describing the same symptoms on Ubuntu, Redhat, and
> CentOS though, each with their distro-specific kernel).
>
> This one tended to result in kernel oops messages--soft IRQ lockups as
> I recall. Lockup would start with a domU and within a few minutes
> would kill the dom0 too. The fastest way to trigger this one was to
> create and shutdown domU's, although I don't recall that being the
> only way.
>
> The second, with the stock kernel, was an errant USB hub attached to a
> xen host. Removing the hub resolved the issue. These were complete,
> sudden lockups of the dom0 and all domUs -- basically everything.
> Higher traffic over the USB port would trigger this lockup.
>
> So, for those who haven't tried the stock xen kernel, and are able to
> try it (based on driver support, etc.), it might help.
>
> --t
This problem occurred again this weekend on one of my servers. No response to input or to pings to any domain, and just a blank screen when a keyboard and monitor were connected. It had been running for around 1 week. There was no load running on the system: a CentOS 5.2 Dom0, one CentOS 5.2 domU, and one RHEL 4.1 domU. No errors were written to syslog around the time of the lockup.

It is a Dell PE 1950, and I had the console redirected to Serial Over LAN. I had SysRq enabled on the system and attempted to get some further debugging information using those keys. However, the system did not respond. I pressed Ctrl-A to switch the input to Xen, got this screen, and triggered a crash dump.

(XEN) *** Serial input -> Xen (type 'CTRL-a' three times to switch input to DOM0).
(XEN) 'h' pressed -> showing installed handlers
(XEN) key '%' (ascii '25') => Trap to xendbg
(XEN) key 'C' (ascii '43') => trigger a crashdump
(XEN) key 'H' (ascii '48') => dump heap info
(XEN) key 'N' (ascii '4e') => NMI statistics
(XEN) key 'R' (ascii '52') => reboot machine
(XEN) key 'a' (ascii '61') => dump timer queues
(XEN) key 'd' (ascii '64') => dump registers
(XEN) key 'h' (ascii '68') => show this message
(XEN) key 'i' (ascii '69') => dump interrupt bindings
(XEN) key 'm' (ascii '6d') => memory info
(XEN) key 'n' (ascii '6e') => trigger an NMI
(XEN) key 'q' (ascii '71') => dump domain (and guest debug) info
(XEN) key 'r' (ascii '72') => dump run queues
(XEN) key 't' (ascii '74') => display multi-cpu clock info
(XEN) key 'u' (ascii '75') => dump numa info
(XEN) key 'z' (ascii '7a') => print ioapic info

Does this mean the hypervisor is still active but all guests, including Dom0, are hosed? Is there something of value to look for in the Xen menu?

In this thread, three people have reported repeated system lockups on various hardware, with no real warning or logging information, and no solution other than a hard reset of the system. Is anyone aware of a bug ID for this problem, or should a bug be raised?

Is there some other information I can provide from my setup which would be useful to diagnose the problem?

regards,
Paraic.

2009/4/6 Martin Fernau <m.fernau@cps-net.de>
> We must find a way to discover these lockups. Are there any debug-log
> functionality we could enable in xen to start to discover this problem?
On Apr 6, 2009, at 1:44 AM, Martin Fernau wrote:
> With "stock xen 2.6.18.8 kernel" you mean the original kernel from
> "http://www.xen.org/download/"? I currently use the xen-kernel
> 2.6.18-r12 from my distro. So I could give it a try...

Yes, from www.xen.org.

> How did you get notice of these kernel oops and/or soft IRQ lockups?
> I'm not able to discover _any_ abnormal events on my system as all
> logfiles are clean. There must be a way to debug this...

As often as I managed to get it to crash, I only occasionally saw the error messages on the serial console. It depended on whether I was connected to the console and whether I was watching. Even then, I only saw them sometimes.

> The only USB device I currently have attached to my dom0 is a Smart-UPS
> system. I don't know if this really could kill the whole machine, as
> the communication between dom0 and this UPS should be very, very low.

Then I'd doubt it's USB in your case.

Also, in case it helps others track anything: the kernel oops (soft IRQ lockups) were on some Dell 1950 IIIs. The USB issue was on some HP DL380 G4s.

--t
It is possible to reproduce the issue I was seeing by running xentop continuously for a few minutes, like this:

# xentop -b -d 0.1 > /dev/null

From examination of the crash dump generated after yesterday's lockup, it appeared that xentop was the active process at the time of the crash. Xentop was being used on the system to gather some performance statistics. It was possible to reproduce the same issue on CentOS 5.3 (kernel-xen-2.6.18-128.1.6.el5), and on different hardware.
On Tue, Apr 07, 2009 at 12:03:10AM -0600, thomas morgan wrote:
> As often as I managed to get it to crash, I occasionally saw the error
> messages on the serial console. It depended on if I was connected to the
> console and if I was watching. Even then, I only saw them sometimes.

I haven't seen any errors on the domU or dom0 console since updating to the latest etch xen kernel. But I still seem to be getting the lockups every couple of weeks.

I don't have any USB devices attached.

> Also, in case it helps others track anything, the kernel oops: soft IRQ
> lockups were on some Dell 1950 III's. The USB issue was on some HP DL380

This is on a Supermicro H8DM8-2 with 2x Quad-Core AMD Opteron(tm) Processor 2350.

-- 
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
I used to get the same lockups with any kernel on my Supermicro X7DWA-N. After I changed the graphics card from a 9600GT with 256 MB to an 8600GT with 512 MB, I have not had a lockup since.

The lockups described in this thread sound very similar. I also used to get softlockup messages, and a couple of times the system locked dead for several minutes and then was suddenly responsive again.

I doubt you have nvidia cards in your servers, but I thought I might as well add some info to this thread.

Andy
It would be interesting to know whether sar data was captured during this time. From this you could track whether any process creation or destruction was occurring.

It might also be worth adding a cron entry to append the output of lsof to a file every N minutes (perhaps with logrotate enabled) to see if you can capture what changed in the running system when this "lockup" occurred. It is also worth collecting ps output every minute.

On Apr 3, 2009, at 11:59 AM, Paraic Gallagher wrote:
> 2009/4/3 Nick Anderson <nick@anders0n.net>
> > On Fri, Apr 03, 2009 at 03:56:28PM +0100, Paraic Gallagher wrote:
> > > I am running xen 3.0.3, with CentOS 5.2 based Dom0
> > > (kernel-xen-2.6.18-92.1.22.el5)
> > > Recently I have noticed some complete system lockups on a few different
> > > servers. Neither Dom0 or any of the guests respond to pings, connecting a
> > > keyboard and monitor to the system only shows a blank screen. Nothing is
> > > written to logs at time of lockup.
> >
> > I have seen similar issues with one of my servers. I have yet to nail
> > down the issue.
> >
> > Specs:
> > Distro: Debian Etch
> > Kernel: 2.6.18-6-xen-amd64
> > CPU: 2x Quad-Core AMD Opteron(tm) Processor 2350
> > Memory: 16G
> > Disk: 3ware 9650LE with 8 drive Raid6
> > Xen: 3.2 (from debian repo)
> >
> > All vms are LVM backed. Not running any HVM guests.
>
> Thanks for the response. After searching the net for a few weeks with no
> luck in finding similar issues, I was beginning to think I was going crazy!
>
> Just with some further details.
> I have seen the issue on two types of servers, Dell PE 1950 and 2950:
> 2x Quad core Intel Xeon E5410@2.33GHz
> Memory 4G and 16G
> Disk, PERC 6/i 1.11, 2x250 Raid1, ST3250620NS Rev: 3BKT
>
> All vms are LVM backed on this system except for Dom0.
>
> > For a while I was seeing softlockup on cpu scrolling on the console
> > and thought that may have caused it. Unfortunately after updating the
> > kernel the errors went away and I have had another lockup since then.
> >
> > I've found a fairly set pattern though no time periods to predict.
> >
> > A VM typically goes unresponsive first. If left unchecked for long
> > enough the host will lock. If caught in time I have had limited
> > success running xm destroy on the domU. Most of the time running xm
> > destroy on the domU causes the host to lock immediately requiring a
> > hard reboot.
> >
> > The most recent lockup was a bit different than what I had in the
> > past.
> >
> > The domU locked up (no output on domU console). xm destroy locked
> > dom0. I rebooted with a remote power strip. dom0 and all domUs came
> > back up. Nothing in logs as usual. 10 minutes later dom0 was locked
> > again. I drove to the datacenter and about 30-45 minutes after the
> > lock the machine became responsive again (according to monitoring
> > server) I was able to display a website running on a vm. Then the
> > machine went unresponsive again. Not responding to physical console
> > access either. Another hard reboot and things are ok.
> >
> > That was the first time I had ever had so many lockups so close
> > together. Typically the lockups seem to be 1-2 weeks apart.
> >
> > I have even tried setting up netconsole on dom0 to try to catch kernel
> > errors with no success.
>
> This seems to be quite a similar problem from the description; however, I
> haven't noticed the guest vms locking up prior to Dom0. Something to keep
> an eye on.
>
> Are you running a particular load on the system at the time, or is it
> somewhat idle? It seems to be idle in my case before lockup.
>
> rgds,
> Paraic.
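The periodic ps and lsof capture suggested above could look roughly like the following. This is a minimal sketch, not a tested production script: the function name `snapshot`, the log path, and the script location in the cron comment are all hypothetical, and the chosen ps columns are just one reasonable selection.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped ps listing (and an lsof listing,
# if lsof is installed) to a log file. Name and paths are illustrative only.
snapshot() {
    log="$1"
    {
        printf '===== %s =====\n' "$(date '+%Y-%m-%d %H:%M:%S')"
        if command -v ps >/dev/null 2>&1; then
            # Best effort: a failing ps must not abort the snapshot.
            ps -eo pid,ppid,stat,etime,args 2>/dev/null || true
        fi
        if command -v lsof >/dev/null 2>&1; then
            # -n and -P skip host and port name lookups, which keeps it fast.
            lsof -n -P 2>/dev/null || true
        fi
    } >> "$log"
}

# Example cron.d entry (assumed script path), once a minute:
# * * * * * root . /usr/local/sbin/snapshot.sh && snapshot /var/log/snapshot.log
```

Because a hard lockup can lose data that has not yet reached the disk, the last snapshot before a freeze may be missing; shipping the log to another host is one way around that. The log grows quickly, so rotating it (as suggested above) is advisable.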
On Tue, Apr 21, 2009 at 08:30:32AM -0400, Peter Booth wrote:
> It would be interesting to know whether sar data was captured during
> this time. From this you could track whether there was any process
> creation or destruction occurring.

I just had another lockup this weekend.

Sar (from the host)
12:35:01 PM   all    0.00    0.00     0.00     0.00    0.01   99.99
12:45:01 PM   all    0.00    0.00     0.00     0.00    0.01   99.99
12:55:01 PM   all    0.00    0.00     0.00     0.00    0.01   99.99
01:05:01 PM   all    0.00    0.00     0.00     0.00    0.01   99.99
01:15:01 PM   all    0.00    0.00     0.00     0.00    0.01   99.99
Average:      all    0.00    0.00     0.00     0.00    0.01   99.98

01:25:53 PM   LINUX RESTART

01:35:02 PM   CPU   %user   %nice  %system  %iowait  %steal   %idle
01:45:01 PM   all    0.00    0.00     0.00     0.00    0.01   99.99
01:55:01 PM   all    0.00    0.00     0.00     0.00    0.01   99.99
02:05:01 PM   all    0.00    0.00     0.00     0.00    0.01   99.99

sar -b
11:55:01 AM   12.22    0.90   11.32    12.90   257.89
12:05:01 PM   13.97    0.49   13.48     7.68   331.48
12:15:01 PM   18.88    7.30   11.59   161.74   260.17
12:25:01 PM   14.34    1.10   13.23    16.53   438.73
12:35:01 PM    9.01    0.43    8.58     6.96   208.50
12:45:01 PM    8.47    0.35    8.12     5.23   186.03
12:55:01 PM   10.00    1.09    8.91    19.22   245.17
01:05:01 PM   11.89    1.82   10.06    27.76   279.90
01:15:01 PM   10.06    0.34    9.72     5.23   214.62
Average:      17.55    6.12   11.43   385.87   369.74

01:25:53 PM   LINUX RESTART

01:35:02 PM     tps    rtps    wtps  bread/s  bwrtn/s
01:45:01 PM   19.01    7.19   11.83   113.49   273.91
01:55:01 PM   12.23    2.44    9.79    37.42   239.82
02:05:01 PM   16.89    2.79   14.10    47.93   422.02
02:15:01 PM   17.09    1.92   15.17    26.93   495.01
02:25:01 PM   13.91    3.42   10.49   164.83   282.82
02:35:01 PM   12.47    2.05   10.42    28.45   256.32
02:45:01 PM   13.67    1.81   11.87    31.78   340.39

sar -c
12:45:01 PM    0.02
12:55:01 PM    0.02
01:05:01 PM    0.02
01:15:01 PM    0.02
Average:       0.03

01:25:53 PM   LINUX RESTART

01:35:02 PM  proc/s
01:45:01 PM    0.02
01:55:01 PM    0.02

sar -q
12:55:01 PM         0       147     0.00     0.00      0.00
01:05:01 PM         0       147     0.07     0.03      0.01
01:15:01 PM         0       147     0.00     0.00      0.00
Average:            0       147     0.00     0.00      0.00

01:25:53 PM   LINUX RESTART

01:35:02 PM   runq-sz  plist-sz  ldavg-1  ldavg-5  ldavg-15
01:45:01 PM         0       147     0.00     0.00      0.00
01:55:01 PM         0       147     0.00     0.00      0.00

sar -r
01:05:01 PM   7312568   1878856    20.44    175416     66532   1044184          0      0.00         0
01:15:01 PM   7311948   1879476    20.45    175416     66544   1044184          0      0.00         0
Average:      7328126   1863298    20.27    175403     67011   1044184          0      0.00         0

01:25:53 PM   LINUX RESTART

01:35:02 PM kbmemfree kbmemused %memused kbbuffers  kbcached kbswpfree  kbswpused  %swpused  kbswpcad
01:45:01 PM   8620940    570484     6.21     64136     36012   1044184          0      0.00         0
01:55:01 PM   8619824    571600     6.22     64972     36028   1044184          0      0.00         0
02:05:01 PM   8618204    573220     6.24     65800     36040   1044184          0      0.00         0
==============================================================

Perhaps I have missed something, but to me that all looks fine. I should set up something to log ps. In my guests, though, I see steal pushed through the roof, and it has been like that for days ahead of the lockup. I've noticed the steal during lockups before, but either I neglected to look back several days or I forgot what I saw. I didn't recall steal being at 100% as far back as my logs go.

12:55:01 PM   CPU   %user   %nice  %system  %iowait  %steal   %idle
01:05:01 PM   all    0.00    0.00     0.00     0.00  100.00    0.00
01:15:01 PM   all    0.00    0.00     0.00     0.00  100.00    0.00
Average:      all    0.00    0.00     0.00     0.00  100.00    0.00

01:27:49 PM   LINUX RESTART

01:35:01 PM   CPU   %user   %nice  %system  %iowait  %steal   %idle
01:45:01 PM   all    4.04    0.00     1.80     0.64    0.02   93.50
01:55:01 PM   all    4.10    0.00     1.76     0.31    0.02   93.80
02:05:01 PM   all    5.45    0.00     2.47     0.23    0.02   91.83
02:15:01 PM   all    7.03    0.00     3.22     0.22    0.02   89.51
02:25:01 PM   all    4.82    0.00     2.31     0.18    0.01   92.6

-- 
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
Some thoughts:

0. Do you have the default behavior where the guests' independent wallclocks are disabled?

1. I have observed visible performance differences in a VM when %steal goes above 1%. It sounds like you have 8 cores. How many VMs do you have? What are their weights and caps?

2. The system default of collecting sar data every ten minutes is pretty unhelpful for problem diagnosis. I routinely adjust this interval to five seconds, which, at the expense of a lot of disk space, gives a historical dataset that is useful for forensics.
On Tue, Apr 21, 2009 at 12:31:04PM -0400, Peter Booth wrote:
> Some thoughts:
> 0. Do you have the default behavior where the guests independent
> wallclocks are disabled?

Yes, I did not change anything relating to that.

> 1. I have observed visible performance differences from a VM when %steal
> goes above 1%.
> It sounds like you have 8 cores.
> How many VMs do you have?
> What are their weights and caps?

I do have 8 cores (2 quad-core AMD). I have 3 domUs, each with a single vcpu. No pinning, weights or caps, just the single vcpu per domU.

> 2. The system default of collecting sar every ten minutes is pretty
> unhelpful for problem diagnosis. I routinely adjust this to interval to
> five seconds, which for the expense of a lot of disk space, gives a
> historical dataset that is useful for forensics.

I might try taking it down to the 1 minute mark first. Should I do this in a guest as well as the host?

-- 
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
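One way to get finer-grained sar history, as discussed above, is to have cron run sysstat's `sa1` collector every minute with an interval and a sample count. The sketch below only prints such a cron line; the helper name `sa1_cron_entry` is hypothetical, and the default `sa1` path is an assumption that differs between distributions (check the existing entry in /etc/cron.d/sysstat on the machine in question).

```shell
#!/bin/sh
# Assumed location of sa1; override SA1 if your distribution installs it
# elsewhere.
SA1="${SA1:-/usr/lib64/sa/sa1}"

# sa1_cron_entry INTERVAL: print a cron.d line that runs sa1 once a minute,
# taking 60/INTERVAL samples spaced INTERVAL seconds apart.
sa1_cron_entry() {
    interval="$1"
    case "$interval" in
        ''|*[!0-9]*|0*)
            echo "interval must be a positive integer" >&2
            return 2 ;;
    esac
    [ "$interval" -le 60 ] && [ $((60 % interval)) -eq 0 ] || {
        echo "interval must divide 60 evenly" >&2
        return 2
    }
    echo "* * * * * root $SA1 $interval $((60 / interval))"
}
```

For example, `sa1_cron_entry 5` yields a line that takes twelve 5-second samples per minute, and `sa1_cron_entry 60` yields the one-minute granularity mentioned above. Running the same collection in both the host and a guest would let the two views be lined up by timestamp.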
Nick,

Two points:

1. Your domU was already hosed at 12:55pm, with the 100% steal. This means that the VM was ready for CPU and the hypervisor scheduler didn't allocate a pcpu. It's worth configuring an alert to fire when %steal gets high.

2. How can we explain that, when the host isn't busy, the scheduler isn't giving the vcpu some free CPU time? I thought that, unless a domain has been configured to have additional vcpus, it can't be scheduled on more than one pcpu. So if, at most, three pcpus (cores) are being used by domUs, and the Dom0 is idle, what is happening on the other CPUs? mpstat -P ALL would help a little.

Peter
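The %steal alert mentioned in point 1 could start from something as simple as the filter below. It is a minimal sketch under stated assumptions: the function name `steal_alert` and the message format are made up for illustration, and the input is assumed to be `sar -u` style text that includes its header row (the %steal column is located from that header, so samples that appear before any header are ignored).

```shell
#!/bin/sh
# steal_alert THRESHOLD: read `sar -u` style text on stdin and print one
# line for every sample whose %steal value exceeds THRESHOLD.
steal_alert() {
    awk -v limit="$1" '
        {
            # Locate the %steal column from the header row.
            for (i = 1; i <= NF; i++)
                if ($i == "%steal") { col = i; next }
        }
        # Summary rows use a different leading layout, so skip them.
        $1 == "Average:" { next }
        col && NF >= col && $col ~ /^[0-9.]+$/ && $col + 0 > limit + 0 {
            print "ALERT: %steal " $col " in sample: " $0
        }
    '
}
```

Inside a guest, something like `LC_ALL=C sar -u | steal_alert 5` could be run from cron and its output mailed or logged; the threshold of 5 is arbitrary and only for illustration.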
Nick,

How many vcpus does Dom0 have?

Peter
On Thu, Apr 23, 2009 at 04:27:49PM -0400, Peter Booth wrote:
> How many vcpus does Dom0 have?

I did not limit dom0 so it has access to all 8 cores.

-- 
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
On Thu, Apr 23, 2009 at 04:27:49PM -0400, Peter Booth wrote:
> Nick,
> How many vcpus does Dom0 have?

~# xm vcpu-list
Name        ID  VCPU  CPU  State   Time(s)  CPU Affinity
Domain-0     0     0    2  -b-      3603.1  any cpu
Domain-0     0     1    3  -b-        46.7  any cpu
Domain-0     0     2    6  -b-       380.0  any cpu
Domain-0     0     3    5  -b-       140.6  any cpu
Domain-0     0     4    0  -b-        46.4  any cpu
Domain-0     0     5    1  -b-        41.2  any cpu
Domain-0     0     6    5  -b-        34.7  any cpu
Domain-0     0     7    3  r--        69.9  any cpu
domU1        1     0    1  -b-     14154.7  any cpu
domU2        2     0    4  -b-     98047.2  any cpu
domU3        3     0    0  -b-      8588.2  any cpu

-- 
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
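For hosts with many domains, the listing above can be summarised into a vcpu count per domain. The following is only an illustrative sketch: the function name `vcpus_per_domain` is hypothetical, and it assumes the column layout shown in the `xm vcpu-list` output above, with the domain name in the first column and one row per vcpu.

```shell
#!/bin/sh
# vcpus_per_domain: read `xm vcpu-list` output on stdin and print
# "<domain> <vcpu count>" for each domain, in first-seen order.
vcpus_per_domain() {
    awk '
        NR == 1 && $1 == "Name" { next }   # skip the header row
        NF >= 3 {
            if (!($1 in count)) order[++n] = $1
            count[$1]++
        }
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}
```

Typical use would be `xm vcpu-list | vcpus_per_domain`.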
Nick,

When you got the "soft lockup" errors were they always with CPU0?
Do you have any nesting in your LVM definitions?
Is dom0 making use of LVM?
Does the host have a wireless card?
What graphics card does it have?

There is a Xen 2.6.18 kernel bug that might be related to what you see, as well as a patch:
http://www.mail-archive.com/debian-kernel@lists.debian.org/msg40893.html

I appreciate that this freezing occurs when the system is quiet, but it seems as if the hypervisor CPU scheduler might play some part in this, such that the guest vcpus don't get scheduled on a pcpu. If it were me I would try limiting dom0 to four vcpus, not as a long-term configuration, but to see whether the problem recurs when we know that each domU should have an available CPU.
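Limiting dom0 to four vcpus for such an experiment can be done at runtime with `xm vcpu-set`. The wrapper below is a hedged sketch rather than a recommended procedure: the function name `limit_dom0_vcpus` and the `DRY_RUN` switch are conventions of this example only (xm itself has no dry-run mode), and it assumes dom0 is listed under the name Domain-0, as in the `xm vcpu-list` output earlier in the thread.

```shell
#!/bin/sh
# limit_dom0_vcpus N: cap Domain-0 at N vcpus using `xm vcpu-set`.
# With DRY_RUN=1 the command is printed instead of executed.
limit_dom0_vcpus() {
    n="$1"
    case "$n" in
        ''|*[!0-9]*)
            echo "usage: limit_dom0_vcpus <count>" >&2
            return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "xm vcpu-set Domain-0 $n"
    else
        xm vcpu-set Domain-0 "$n"
    fi
}

# A setting that survives reboots would instead go on the hypervisor line
# of the boot loader configuration, for example dom0_max_vcpus=4.
```

Running `DRY_RUN=1 limit_dom0_vcpus 4` first shows what would be executed; `xm vcpu-list` afterwards should confirm how many dom0 vcpus remain online.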
On Thu, Apr 23, 2009 at 10:01:02PM -0400, Peter Booth wrote:
> Nick,
> When you got the "soft lockup" errors were they always with CPU0?

Yes, I believe so. But since I updated to the latest Debian patched kernel I haven't seen that soft lockup error.

> Do you have any nesting in your LVM definitions?

No.

> Is dom0 making use of LVM?

No, it is not.

> Does the host have a wireless card?

No.

> What graphics card does it have?

Just the onboard graphics:
01:05.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)

> There is a Xen 2.6.18 kernel bug that might be related to what you see,
> as well as a patch:
> http://www.mail-archive.com/debian-kernel@lists.debian.org/msg40893.html

Yes, I think that patch is in the latest kernel update, which made the soft lockup messages go away. Oddly, everyone else who noted the soft lockup errors described them only as an annoyance. I don't recall ever seeing anyone link the messages with any actual undesirable behavior beyond the messages themselves.

> I appreciate that this freezing occurs when the system is quiet, but it
> seems as if the hypervisor CPU scheduler might play some part in this,
> such that the guest vcpus don't get scheduled on a pcpu. If it were me I
> would try limiting dom0 to four vcpus - not as a long-term
> configuration, but to see whether the problem recurred when we know that
> each domU should have an available CPU.

I was considering that. I also thought it interesting that each of the domUs is running on vcpu0. I figured they would automatically distribute to a free vcpu.

-- 
Nick Anderson <nick@anders0n.net>
http://www.cmdln.org
Hello,

did you find out anything helpful about this problem? I just had this freeze again this morning. The server froze and no guests were working any more, after 5 weeks of normal running. That seems to be the interval at which I hit this problem.

I would really appreciate anything helpful about this! I'm totally stuck with this problem...

Martin
Martin,

You and Nick both see this problem on a system with eight cores. You have 4 domUs, Nick has 3 domUs.

I think a reasonable hypothesis is that this is some kind of resource starvation/livelock/deadlock scenario. Some questions that might help show how similar your scenario and Nick's are:

1. How many vcpus does each of your domUs have?
2. Do you define pinning, cap or weight for your domUs?
3. Does sar on the guests show a high %steal before the problem occurred?
4. Do you limit the number of vcpus that Dom0 has? If not, I would suggest that you try this and see if the problem occurs when you know that there should be free vcpus for each of your domUs.

Peter

On May 5, 2009, at 2:42 AM, Martin Fernau wrote:
> [...]
On Tue, May 05, 2009 at 09:00:06AM -0400, Peter Booth wrote:
> [...]
> 3. Does sar on the guests show a high %steal before the problem occurred?
> [...]

I started monitoring %steal with my Zenoss install. This weekend I had another odd occurrence. My steal did not jump, but my host did not spin out of control either. One VM with Apache on it just went nutty. Any time Apache was running (and there was traffic), Apache would start hogging the CPU. But the CPU time was being spent in system, not in user. I restarted the virtual machine several times to no avail. Ultimately I decided to try rebooting the host. After the host was rebooted my problems went away again. I can only suspect that something went wonky with Xen's network driver.

I wrote a little wrapper for sar if anyone is interested. All it does is return the single statistic you might be looking for, measured over a 1 second period. I just stick it in my SNMP configuration as exec lines so I can easily expose the sar metrics to Zenoss.

--
Nick Anderson <nick@anders0n.net> http://www.cmdln.org
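For readers who want to see where figures such as %steal and %system come from, here is a minimal Python sketch. It assumes a Linux guest whose /proc/stat aggregate "cpu" line carries at least the eight counters user, nice, system, idle, iowait, irq, softirq and steal; the function name and the sample numbers are made up for illustration and are not taken from anyone's system in this thread.

```python
def cpu_percentages(before, after):
    """Return per-state CPU percentages between two aggregate 'cpu' lines
    taken from /proc/stat at two points in time."""
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    # Skip the leading "cpu" label and keep the first eight counters.
    first = [int(x) for x in before.split()[1:9]]
    second = [int(x) for x in after.split()[1:9]]
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return {name: 0.0 for name in names}
    return {name: 100.0 * d / total for name, d in zip(names, deltas)}


# Made-up sample lines, for illustration only.
sample_before = "cpu  1000 0 500 8000 100 0 0 400"
sample_after = "cpu  1100 0 900 8300 100 0 0 500"
result = cpu_percentages(sample_before, sample_after)
for name, value in result.items():
    print(f"{name:8s} {value:6.2f}%")
```

On a live guest the two lines would be read from /proc/stat a second or so apart. A steal share that climbs before a hang would point towards the guest waiting for a physical CPU, while a high system share with low steal (as described above) would point elsewhere.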
Peter,

first of all, thank you for your support in this situation! To answer your questions:

> 1. How many vcpus does each of your domUs have?

This varies. Mostly I've set 'vcpus' to 2.

> 2. Do you define pinning, cap or weight for your domUs?

No. They are not restricted in this way. I just limited the CPU cores available to the domains (cpus=...).

> 3. Does sar on the guests show a high %steal before the problem
> occurred ?

I didn't have this tool running at the time of the crash. I also have 3 Windows guests, which can't run this program.

> 4. Do you limit the number of vcpus that Dom0 has? If not, I would
> suggest that you try this and see if the problem occurs when you know
> that there should be free vcpus for each of your domUs.

I haven't limited Dom0 so far. Because I have 8 cores for 4 guests I could do the following:

- Assign CPU 1 to dom0 -> "(dom0-cpus 1)" in xend-config.sxp
- 2,3 to Guest1 (Windows 2003)
- 4,5,6 to Guest2 (Windows 2003)
- 7 to Guest3 (Gentoo Linux)
- 8 to Guest4 (Windows XP)

With these settings every domain uses its own processor(s). Do I need to change vcpus too? I'm afraid that if I change this back to 1, Windows will want to activate itself again...

On Tuesday, 5 May 2009, Peter Booth wrote:
> [...]
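A small Python sketch can sanity-check an allocation like Martin's before it goes into the config files. Two assumptions are baked in and should be verified against your own Xen version: physical CPUs are numbered from 0 (so eight cores are 0-7, and the plan is shifted down by one here), and `(dom0-cpus N)` is understood to set the number of CPUs dom0 uses rather than a CPU id. The helper names `check_pinning` and `cpus_option` are hypothetical.

```python
def check_pinning(plan, num_pcpus):
    """Return a list of problems found in a {domain: [cpu ids]} pinning plan."""
    problems = []
    owner = {}
    for domain, cpus in plan.items():
        for cpu in cpus:
            if not 0 <= cpu < num_pcpus:
                problems.append(f"{domain}: CPU {cpu} is outside 0-{num_pcpus - 1}")
            elif cpu in owner:
                problems.append(f"CPU {cpu} is shared by {owner[cpu]} and {domain}")
            else:
                owner[cpu] = domain
    return problems


def cpus_option(cpus):
    """Format a CPU list as a comma-separated string for a domU 'cpus' setting."""
    return ",".join(str(c) for c in sorted(cpus))


# Martin's proposal, shifted to zero-based CPU ids (an assumption, see above).
plan = {
    "Domain-0": [0],
    "Guest1": [1, 2],
    "Guest2": [3, 4, 5],
    "Guest3": [6],
    "Guest4": [7],
}
problems = check_pinning(plan, num_pcpus=8)
print("problems:", problems or "none")
for domain, cpus in plan.items():
    if domain != "Domain-0":
        # One possible choice: as many vcpus as pinned CPUs.
        print(f'{domain}: cpus = "{cpus_option(cpus)}", vcpus = {len(cpus)}')
```

Setting vcpus equal to the number of pinned CPUs is only one option; under this plan the two-vcpu Windows guests would keep at least two CPUs, so their vcpu count would not have to drop to 1.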
Nick,

I would appreciate it if you could send me your little script. I could use it on my Linux box too.

Thanks,
Martin

On Tuesday, 5 May 2009, Nick Anderson wrote:
> [...] I wrote a little wrapper for
> sar if anyone is interested. All it does is return the single
> statistic for a 1 second period that you might be looking for. I just
> stick it in my snmp as exec lines so I can easily expose the sar
> metrics for zenoss.
On Tue, May 05, 2009 at 05:41:03PM +0200, Martin Fernau wrote:
> Nick,
> I would appreciate if you could send me your little script. I could use it in
> my linux box too.

Like I said, it just wraps sar, so you need to have sysstat installed.

Right now it only supports the default columns from plain old sar output. It's easy to extend, though.

You would run it with something like sarget iowait, sarget steal or sarget system.

--
Nick Anderson <nick@anders0n.net> http://www.cmdln.org
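The script itself is not reproduced in this message, so what follows is only a rough Python sketch of what such a wrapper could look like. It is not Nick's sarget. It assumes that `sar 1 1` prints a header row with %-prefixed column names followed by an "Average:" row, as sysstat does for CPU statistics; the sample text and its numbers are made up.

```python
import os
import subprocess


def parse_sar(output, metric):
    """Return one CPU metric (for example 'steal') from the text of `sar 1 1`."""
    columns = None
    for line in output.splitlines():
        fields = line.split()
        if columns is None:
            # The header is the first line that carries %-prefixed column names.
            names = [f.lstrip("%") for f in fields if f.startswith("%")]
            if names:
                columns = names
            continue
        if fields and fields[0] == "Average:":
            # Align from the right so the timestamp format does not matter.
            values = fields[-len(columns):]
            return float(values[columns.index(metric)])
    raise ValueError("no Average line found in sar output")


def sarget(metric):
    """Run sar for one 1-second sample and return the requested metric."""
    env = dict(os.environ, LC_ALL="C")  # keep labels and decimal points predictable
    out = subprocess.run(["sar", "1", "1"], capture_output=True,
                         text=True, check=True, env=env).stdout
    return parse_sar(out, metric)


# Made-up sample output, for illustration only.
SAMPLE = """\
Linux 2.6.18 (examplehost)   05/05/09

10:00:01   CPU   %user   %nice   %system   %iowait   %steal   %idle
10:00:02   all    1.00    0.00      2.00      0.50     0.25   96.25
Average:   all    1.00    0.00      2.00      0.50     0.25   96.25
"""
print(parse_sar(SAMPLE, "steal"))
```

On a host with sysstat installed, `sarget("iowait")`, `sarget("steal")` or `sarget("system")` would mirror the command-line usage described above. Printing the value from a small script is what an SNMP exec line would pick up.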
Hi!

I have the same issue! Every 1-2 weeks the server goes black: no pings, no ssh... nothing!

- Xen 3.3.1 (Debian etch 40r6) 64 bit
- dom0: Linux kernel 2.6.18
- domU:
  - Windows 2008, 64 bit installation, 4 cores, 20GB memory
  - Windows XP, 32 bit, 2 cores, 2GB memory
  - Debian lenny: 64 bit, 3 cores, 8GB
  - Debian lenny: 64 bit, 3 cores, 12GB

Hardware details:
Dell PowerEdge 2900 III
2 x Quad Core Xeon X5470 (3.33GHz, 2x6MB, 1333MHz FSB, 120W TDP)
48GB 667MHz FBD (12x4GB dual rank DIMMs)
PERC 5/i integrated RAID Controller, using the megasas driver

Now I have limited dom0 to 1 CPU. Would it be useful to upgrade the kernel (Linux 2.6.18 to Linux 2.6.30) and Xen (3.3.2 to 3.4.0)?

Finally, have you solved the hangs?

Any response will be appreciated.

Thanks!

On Tue, May 5, 2009 at 8:18 PM, Nick Anderson <nick@anders0n.net> wrote:
> [...]
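As a quick piece of arithmetic on this configuration, the Python sketch below adds up the vcpus. It assumes that "cores" in the guest list means vcpus assigned to each domU and that the two quad-core CPUs give eight physical CPUs in total; the dictionary keys are just labels.

```python
# vcpus per guest, taken from the listing above (assumption: "cores" = vcpus).
domu_vcpus = {
    "Windows 2008": 4,
    "Windows XP": 2,
    "Debian lenny (8GB)": 3,
    "Debian lenny (12GB)": 3,
}
dom0_vcpus = 1   # dom0 was limited to 1 CPU
pcpus = 2 * 4    # 2 x quad-core

total_vcpus = sum(domu_vcpus.values()) + dom0_vcpus
oversubscribed = total_vcpus > pcpus
print(f"{total_vcpus} vcpus on {pcpus} pcpus, ratio {total_vcpus / pcpus:.2f}")
```

More vcpus than physical CPUs is a normal setup and does not by itself explain a hang. It only means the guests cannot each have dedicated cores, which matters when testing the starvation hypothesis raised earlier in the thread.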
On Tue, Jun 30, 2009 at 12:57:43PM +0200, Alberto Asuero Arroyo wrote:
> [...]
> Now I have limited dom0 to 1 CPU. Would it be useful to upgrade the kernel
> (Linux 2.6.18 to Linux 2.6.30) and Xen (3.3.2 to 3.4.0)?

Are you running the Debian 2.6.18 kernel, or xenlinux 2.6.18.8?

> Finally, have you solved the hangs?
>
> Any response will be appreciated.

You should set up a serial console to get the error messages and/or backtraces of the crash.

It's pretty hard to debug it otherwise.

-- Pasi
> Are you running the Debian 2.6.18 kernel, or xenlinux 2.6.18.8?

xenlinux 2.6.18.8

I'll connect the serial console at the next hang and post the errors.

Thanks!

On Tue, Jun 30, 2009 at 1:43 PM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> [...]
Another crash after two weeks. I couldn't connect via serial this time (the hypervisor is in another office and I needed to get it running again quickly).

This time I found errors inside a domU:

BUG: soft lockup detected on CPU#1!

Call Trace:
 <IRQ> [<ffffffff8024f5b1>] softlockup_tick+0xd8/0xea
 [<ffffffff8020e595>] timer_interrupt+0x390/0x3ea
 [<ffffffff8024f89c>] handle_IRQ_event+0x4e/0x96
 [<ffffffff8024f98a>] __do_IRQ+0xa6/0x107
 [<ffffffff80232c52>] run_timer_softirq+0x3c/0x202
 [<ffffffff8020c5ac>] do_IRQ+0x44/0x4d
 [<ffffffff8036aab9>] evtchn_do_upcall+0x149/0x203
 [<ffffffff8020a6ee>] do_hypervisor_callback+0x1e/0x2c
 <EOI> [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff802063aa>] hypercall_page+0x3aa/0x1000
 [<ffffffff8020dc1e>] raw_safe_halt+0xb8/0xdc
 [<ffffffff8020925c>] xen_idle+0x6d/0x81
 [<ffffffff80208a20>] cpu_idle+0x51/0x70

On Tue, Jun 30, 2009 at 9:04 PM, Alberto Asuero Arroyo <albertoasuero@gmail.com> wrote:
> [...]
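Since an earlier question in this thread was whether the soft lockup messages always name the same CPU, a short Python sketch for tallying them from saved kernel logs may help. The regular expression matches the message format shown in the trace above; the function name is hypothetical, and the sample text, apart from that message format, is made up.

```python
import re
from collections import Counter

# Message format as it appears in the domU log quoted above.
SOFT_LOCKUP = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")


def count_soft_lockups(log_text):
    """Count soft lockup reports per CPU number in kernel log text."""
    return Counter(int(m.group(1)) for m in SOFT_LOCKUP.finditer(log_text))


# Made-up log excerpt, for illustration only.
sample = """\
BUG: soft lockup detected on CPU#1!
Call Trace:
 <IRQ> [<ffffffff8024f5b1>] softlockup_tick+0xd8/0xea
BUG: soft lockup detected on CPU#0!
BUG: soft lockup detected on CPU#1!
"""
counts = count_soft_lockups(sample)
for cpu, n in sorted(counts.items()):
    print(f"CPU#{cpu}: {n} report(s)")
```

Feeding it the contents of a guest's kernel log (for example the text of /var/log/kern.log, if that is where your distribution writes it) would show whether the reports cluster on one vcpu.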