Hi all,

We are using XEN as the hypervisor for our private cloud. The framework is Eucalyptus, with CentOS 5.4 as the dom0 OS.

Sometimes the dom0 on some machines becomes unresponsive. The symptoms are:
(1) We can't log into dom0 via ssh. After typing the password, it just hangs.
(2) We can ping dom0 successfully.
(3) We can log into domU without problems.

The unresponsive dom0 eventually comes back to life after a while, maybe half an hour or even several hours. Then we can log into dom0 again, and everything works fine except for some weird things:
(1) Some daemons stop logging during the unresponsive period. The log file has a gap.
(2) A daemon dies during the unresponsive period.

We can't find anything suspicious in the system log (the system log has no entries for that period either). I also redirected the console to com1 and set the xen loglvl to all. There are no logs for that period either. I can switch to the xen console by pressing Ctrl+a three times during the unresponsive period, so the xen console is working.

We don't do heavy I/O in dom0; we just run some daemons like snmpd. We are not sure what causes this, but we found a way to reproduce the same symptom: heavy I/O in the VMs.

The following is the test configuration:

Hypervisor
==========
XEN 3.4.2 (also tried 3.4.3)

DOM0
====
CPU: 2 Xeon E5620 (2.4GHz, 6 cores, 12 threads), 1 core dedicated to dom0. (The symptom is much easier to reproduce when only 1 core is dedicated to dom0.)
Memory: 2048M dedicated to dom0. (The node has 24G of memory.)
OS: CentOS 5.4, kernel 2.6.18-164. (I also tried 2.6.18-194, 2.6.18-238, and xenlinux 2.6.18.8.)
Disk: two SATA disks (Seagate ST3500630NS, 500G), sda and sdb. sda holds the dom0 OS's root/swap. sdb is formatted as ext3 and used to store the VM images.

VMs (I use 3 VMs)
=================
CPU: 4 VCPUs
Memory: 1024M
Disks: 3 files on dom0's sdb. They are the root device, swap, and the disk used for I/O. (sda1, sda2, sda3 in the VM)
OS: CentOS 5.4 base image, with the kernel updated to 2.6.18-194.el5xen. (I've also tried xenlinux 2.6.18.8 and 2.6.18-238.)

Tests:
Create an ext3 fs on sda3 and mount sda3 on a folder. Run vdbench filesystem I/O on the mounted folder. The I/O behavior is:
(1) Create 300 files, each 99 MB in size.
(2) Randomly select files and sequentially write random patterns to each file in 64KB blocks. Read the blocks back to verify when done.
(3) There is no rate limit, so the program does I/O as fast as it can.
I can provide the configuration file for the workload if needed.

When running I/O on only 1 VM, dom0 almost stops responding. It's very hard to log in (it rarely succeeds).
When running I/O on 3 VMs, dom0 gets worse. Logging in is not possible (it blocks after typing the password), and the symptom described above appears. I also tried to log in to dom0 on the VGA console; it blocks after typing the password. A session that was already logged in may still work, and I can run the top command there. But once I try to open a file, it blocks.

The files are attached to the VMs with the "file://" method: dom0 uses a loop device to associate the file and attaches the loop device to the VM. From the XEN manual I found that this method is no longer recommended, so I tried the tap:aio method to attach these files to the VMs. dom0 seems fine with this method, but we find that when one VM is doing heavy I/O on its disk, the other VMs can't perform I/O well. They can't even finish booting.

If anything is unclear please tell me, and I will explain in more detail. Any suggestions and thoughts will be valuable to me. Thanks for reading.

Best Regards,
Kiefer Chang

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
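[Editor's sketch] The write-then-verify pattern described in the test section can be approximated in a few lines of Python for readers without vdbench. This is not the vdbench configuration from the post, only a scaled-down imitation under stated assumptions: the function names, file names, and default sizes are hypothetical, and the read-back will usually be served from the page cache, so it exercises the disk far less than the original workload.

```python
import os
import random
import tempfile

BLOCK_SIZE = 64 * 1024  # 64KB blocks, as in the described workload


def pattern_blocks(seed, count):
    """Yield `count` pseudo-random blocks that are reproducible from `seed`."""
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.getrandbits(8 * BLOCK_SIZE).to_bytes(BLOCK_SIZE, "big")


def write_and_verify(path, n_blocks, seed):
    """Sequentially write a random pattern to `path`, then read it back.

    Returns True when every block read back matches what was written.
    """
    with open(path, "wb") as f:
        for block in pattern_blocks(seed, n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push the data out of dom0/domU buffers
    with open(path, "rb") as f:
        for expected in pattern_blocks(seed, n_blocks):
            if f.read(BLOCK_SIZE) != expected:
                return False
    return True


def run_workload(directory, n_files=4, n_blocks=8, rounds=6, seed=0):
    """Create the files, then rewrite randomly chosen files `rounds` times.

    Returns the number of rounds whose verification succeeded.
    The defaults are tiny; the post used 300 files of 99 MB each.
    """
    picker = random.Random(seed)
    paths = [os.path.join(directory, "file%03d" % i) for i in range(n_files)]
    for path in paths:
        write_and_verify(path, n_blocks, seed)  # initial file creation
    verified = 0
    for _ in range(rounds):
        path = picker.choice(paths)  # random file selection
        if write_and_verify(path, n_blocks, picker.getrandbits(32)):
            verified += 1
    return verified
```

Calling run_workload() on a directory inside the mounted test filesystem, with n_files and n_blocks raised toward the sizes in the post, gives a rough stand-in for the unthrottled load.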
Florian Heigl
2011-Apr-21 10:59 UTC
Re: [Xen-users] dom0 hangs when doing heavy I/O on domU
Hi Chang,

2011/4/20 Kiefer Chang <zapchang@gmail.com>:
> Hi all,
> We are using XEN as hypervisor to setup our private cloud.
> The framework is Eucalyptus and using CentOS 5.4 as dom0 OS.
> Sometimes we find some machines' dom0 become unresponsive, the symptoms are:
> (1) We can't log into dom0 via ssh. After typing password, it just stops
> there.
> (2) We can ping dom0 successfully.
> (3) We can log into domU without problem.
> The unresponsive dom0 eventually "alive" after a period of time. Maybe half
> hour or even several hours.

So one of your domUs is thrashing the disks and dom0 can't get enough performance, right?
- are they sharing a disk?
- can you check what I/O scheduler you are using?
  (with cfq you can then use ionice to lower prio on all blkback threads a little. that way dom0 will "win the race")

In general, your dom0 is privileged in terms of IO access rights, but not in IO performance. So if one domU goes crazy, it will affect everything.
... until you take measures :)
I'd suggest you switch to the deadline scheduler and re-test.
dom0 on a different disk media is also very advisable imho.

Flo

--
the purpose of libvirt is to provide an abstraction layer hiding all xen features added since 2006 until they were finally understood and copied by the kvm devs.
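[Editor's sketch] The two checks suggested in this reply (which I/O scheduler is active, and lowering the I/O priority of the blkback threads under cfq) could be scripted roughly as follows. The sysfs path /sys/block/<dev>/queue/scheduler and the ionice options -c, -n and -p are standard Linux; everything else is an assumption for illustration: the helper names are hypothetical, and the code assumes the backend kernel threads show up in `ps -eo pid,comm` with names starting with "blkback", which may differ between kernels.

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets], e.g. 'cfq', or None."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def read_scheduler(device):
    """Read the active I/O scheduler for a block device such as 'sdb'."""
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        return active_scheduler(f.read())


def set_scheduler(device, name):
    """Switch the scheduler (e.g. to 'deadline'). Needs root in dom0."""
    with open("/sys/block/%s/queue/scheduler" % device, "w") as f:
        f.write(name)


def blkback_pids(ps_output):
    """Extract PIDs of blkback threads from `ps -eo pid,comm` output.

    Assumption: the threads' command names start with 'blkback'.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # header line or malformed row
        if fields[1].strip().startswith("blkback"):
            pids.append(int(fields[0]))
    return pids


def ionice_command(pid, priority=7):
    """Build an ionice call: best-effort class (2), lowest priority level 7."""
    return ["ionice", "-c", "2", "-n", str(priority), "-p", str(pid)]
```

Each list returned by ionice_command() can be passed to subprocess.call() as root. Priority levels only take effect under cfq, so switching sdb to deadline and re-nicing the threads are alternatives to test separately rather than together.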
Joost Roeleveld
2011-Apr-21 13:39 UTC
Re: [Xen-users] dom0 hangs when doing heavy I/O on domU
On Thursday 21 April 2011 12:59:17 Florian Heigl wrote:
> So one of your domUs is trashing the disks and dom0 can't get enough
> performance, right?
> - are they sharing a disk?
> - can you check what I/O scheduler you are using?
> (with cfq you can then use ionice to lower prio on all blkback
> threads a little. that way dom0 will "win the race")
>
> In general, your dom0 is privileged in terms of IO access rights, but
> not in IO performance. So if one domU goes crazy, it will affect
> anything.
> ... until you take measures :)
> I'd suggest you switch to deadline scheduler and re-test.
> dom0 on a different disk media is also very advisable imho.

Another possible cause for this is dom0 not being pinned to a "private" CPU core.
I noticed similar issues before I pinned dom0 to cpu0 and configured all the other domUs to use the other 3 cores.

--
Joost
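[Editor's sketch] The layout described here (dom0 on cpu0, domUs on the remaining cores) can be expressed as a small helper that produces the matching settings. The `cpus = "..."` line follows the xm domain config syntax and `xm vcpu-pin` is the xm subcommand for pinning a VCPU; the helper names are hypothetical, and a setup managed through libvirt XML, as mentioned elsewhere in the thread, would need the equivalent setting there instead.

```python
def domu_cpu_range(total_cpus, dom0_cpus=1):
    """Return the physical CPU range left for domUs, e.g. '1-3'.

    dom0 is assumed to own CPUs 0 .. dom0_cpus-1.
    """
    if not 0 < dom0_cpus < total_cpus:
        raise ValueError("dom0 must get at least one CPU and leave one for domUs")
    first, last = dom0_cpus, total_cpus - 1
    return str(first) if first == last else "%d-%d" % (first, last)


def domu_config_line(total_cpus, dom0_cpus=1):
    """Build the `cpus` line for an xm domU config file."""
    return 'cpus = "%s"' % domu_cpu_range(total_cpus, dom0_cpus)


def dom0_pin_commands(dom0_cpus=1):
    """Build `xm vcpu-pin` calls that pin dom0 VCPU n to physical CPU n.

    Pinning at boot is also possible with the hypervisor options
    dom0_max_vcpus=<n> and dom0_vcpus_pin on the Xen line in grub.
    """
    return [["xm", "vcpu-pin", "Domain-0", str(v), str(v)]
            for v in range(dom0_cpus)]
```

For the four-core machine described in this reply, domu_config_line(4) yields the line that keeps guests off cpu0.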
Kiefer Chang
2011-Apr-21 14:56 UTC
Re: [Xen-users] dom0 hangs when doing heavy I/O on domU
Hi Florian,

- are they sharing a disk?
Yes, the images of all 3 VMs are stored on the same disk, sdb. sda is used for the dom0 root filesystem.

- can you check what I/O scheduler you are using?
It defaults to CFQ for sdb. I tried to ionice all blkback processes to class 2 and still had no luck. I also tried the deadline scheduler before.

Right now the cure I have found is to make sdb an LVM physical volume and set up volume groups/logical volumes on it, then attach the logical volumes to the VMs with the "phy" method. The symptom is gone when the 3 VMs perform the same I/O.

I know the XEN manual suggests using the blktap and phy methods for VM storage. But we think it's much easier to manage VM image files than LVM volumes, since we provision VMs by downloading their images from servers.

Thanks!

--
Kiefer Chang
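[Editor's sketch] The three attachment methods compared in this thread differ only in the prefix of the disk specification in an xm domain config, which is itself Python syntax. The helper below builds such entries; the prefixes `file:`, `tap:aio:` and `phy:` are the xm ones, while the function name, image path, and volume group name are hypothetical examples and not taken from the poster's setup.

```python
# xm disk entries have the form "<prefix><backend>,<frontend device>,<mode>"
DISK_PREFIXES = {
    "file": "file:",        # loopback-mounted image file in dom0
    "tap:aio": "tap:aio:",  # image file served through blktap
    "phy": "phy:",          # block device in dom0, e.g. an LVM logical volume
}


def disk_spec(method, backend, frontend, mode="w"):
    """Build one entry for the `disk = [...]` list of an xm domU config."""
    if method not in DISK_PREFIXES:
        raise ValueError("unknown attach method: %r" % method)
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    return "%s%s,%s,%s" % (DISK_PREFIXES[method], backend, frontend, mode)


# Hypothetical paths, shown only to illustrate the three variants.
EXAMPLE_DISKS = [
    disk_spec("file", "/var/lib/xen/images/vm1-data.img", "sda3"),
    disk_spec("tap:aio", "/var/lib/xen/images/vm1-data.img", "sda3"),
    disk_spec("phy", "/dev/vg_vms/vm1-data", "sda3"),
]
```

Only one of the three entries would appear per guest disk. Moving from the first to the third corresponds to the change reported above, where the image files on sdb are replaced by logical volumes on the same disk.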
Kiefer Chang
2011-Apr-21 15:02 UTC
Re: [Xen-users] dom0 hangs when doing heavy I/O on domU
Hi Joost,

I am sure dom0 is pinned to the first core (by configuring the xen parameters in the grub menu), and the domUs won't use the first core because we specify the VCPU-to-CPU mapping in each VM's xml file.

Initially we didn't dedicate a specific core to dom0, so XEN might have allocated cores to dom0 and the domUs dynamically. When we dedicate a single core to dom0, we find the symptom is much easier to reproduce.

Thanks.

--
Kiefer Chang