Hi folks,

we (the company I am working for) are running several dozen virtualisation servers using CentOS 6 + Xen4CentOS as the virtualisation infrastructure.

With the latest versions of all packages installed [1], we see failures in live-migration of stock-CentOS6 HVM guests, leaving a "Domain-Unnamed" on the source host, while the migrated guest runs fine on the target host:

Domain-0              0  2048    64  r-----   6791.5
Domain-Unnamed        1  4099     4  --ps--     94.8

The failure is not consistently reproducible: some guests (of the same type) live-migrate just fine, until eventually some seemingly random guest fails, leaving a "Domain-Unnamed" zombie.

That Domain-Unnamed causes several problems:
- The memory allocated to Domain-Unnamed remains blocked, effectively creating a 'memory leak' on the host
- The DomU behind Domain-Unnamed cannot be restarted on the host, as xm thinks it is already running

I have tried various things to get rid of Domain-Unnamed, all without success:
- multiple "xm destroy"
- restarting xend
- deleting everything regarding Domain-Unnamed in xenstore with xenstore-rm. The removal is successful, but the domain remains. Restarting xend after the deletion restores Domain-Unnamed in xenstore.

So far, the only way to get rid of Domain-Unnamed is a reboot of the virtualisation host. As these hosts are all quad-socket Opteron 6272 machines with 256 GB RAM running dozens of guests, this is highly impractical.

I have seen this behaviour using Xen 4.2.5. The previous 4.2.4 versions did not show this problem; however, we did not use live-migration extensively before that. Before switching to Xen4CentOS, we used to build our own Xen 4.2.2 based on a git repo published by Karanbir Singh. We had several issues with that version, but never observed a "Domain-Unnamed".

Any idea how to resolve this issue would be highly appreciated, as working live-migration is crucial to us.

Regards,
Thomas Weyergraf

Some notes on our config:

1. We still use xm/xend for various reasons
----
2.
Our grub-config for the virtx-hosts is as follows:
----
default=0
timeout=5
#splashimage=(hd0,0)/grub/splash.xpm.gz
#hiddenmenu
title CentOS (xen-4.2.5-37.el6.gz vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
        root (hd0,0)
        kernel /xen-4.2.5-37.el6.gz iommu=1 console=vga,com1 com1=115200,8n1 vga=text-80x25 dom0_mem=2048M,max:2048M
        module /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro xencons=hvc0 console=hvc0 root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
        module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img
title CentOS (vmlinuz-3.10.56-11.el6.centos.alt.x86_64)
        root (hd0,0)
        kernel /vmlinuz-3.10.56-11.el6.centos.alt.x86_64 ro root=/dev/fravirtx68/root rd_NO_LUKS LANG=en_US.UTF-8 KEYBOARDTYPE=pc KEYTABLE=de-latin1-nodeadkeys rd_NO_MD SYSFONT=latarcyrheb-sun16 rd_LVM_LV=fravirtx68/root rd_NO_DM
        module /initramfs-3.10.56-11.el6.centos.alt.x86_64.img

3. A typical guest-config looks like:
----
name = "fraappmgmt05t.test.fra.net-m.internal"
uuid = "3778a443-9194-4c46-adff-211d7fcc24da"
memory = "4096"
vcpus = 4
kernel = "hvmloader"
builder = 'hvm'
disk = [
    'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-56,xvda,w',
    'phy:/dev/disk/by-path/ip-192.168.240.7:3260-iscsi-iqn.1992-08.com.netapp:navfiler21-lun-57,xvdb,w',
]
vif = [ 'mac=00:16:3e:fa:15:4a,bridge=xenbr11' ]
device_model = 'qemu-dm'
serial = 'pty'
xen_platform_pci = 1
on_poweroff = "destroy"
on_crash = "restart"

4.
The xend.log excerpt of the migration process from the source host:
----
[2014-11-06 22:53:01 13499] DEBUG (XendDomainInfo:1795) Storing domain details: {'console/port': '7', 'cpu/3/availability': 'online', 'description': '', 'console/limit': '1048576', 'cpu/2/availability': 'online', 'vm': '/vm/f5139575-984b-4c28-b470-efc042ba2703', 'domid': '1', 'store/port': '6', 'console/type': 'ioemu', 'cpu/0/availability': 'online', 'memory/target': '4194304', 'control/platform-feature-multiprocessor-suspend': '1', 'store/ring-ref': '1044476', 'cpu/1/availability': 'online', 'control/platform-feature-xs_reset_watches': '1', 'image/suspend-cancel': '1', 'name': 'migrating-fraapppeccon06.fra.net-m.internal'}
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423) xc_save: failed to get the suspend evtchn port
[2014-11-06 22:53:01 13499] INFO (XendCheckpoint:423)
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:394) suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:127) In saveInputHandler suspend
[2014-11-06 22:53:34 13499] DEBUG (XendCheckpoint:129) Suspending 1 ...
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:524) XendDomainInfo.shutdown(suspend)
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882) XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] DEBUG (XendDomainInfo:1882) XendDomainInfo.handleShutdownWatch
[2014-11-06 22:53:34 13499] INFO (XendDomainInfo:2079) Domain has shutdown: name=migrating-fraapppeccon06.fra.net-m.internal id=1 reason=suspend.
[2014-11-06 22:53:34 13499] INFO (XendCheckpoint:135) Domain 1 suspended.
[2014-11-06 22:53:35 13499] INFO (image:542) signalDeviceModel:restore dm state to running
[2014-11-06 22:53:35 13499] DEBUG (XendCheckpoint:144) Written done
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:3077) XendDomainInfo.destroy: domid=1
[2014-11-06 22:53:35 13499] ERROR (XendDomainInfo:3091) XendDomainInfo.destroy: domain destruction failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 3086, in destroy
    xc.domain_destroy(self.domid)
Error: (16, 'Device or resource busy')
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2402) Destroying device model
[2014-11-06 22:53:35 13499] INFO (image:619) migrating-fraapppeccon06.fra.net-m.internal device model terminated
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2409) Releasing devices
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vif/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vif, device = vif/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing console/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = console, device = console/0
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vbd/51712
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51712
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:2415) Removing vbd/51728
[2014-11-06 22:53:35 13499] DEBUG (XendDomainInfo:1276) XendDomainInfo.destroyDevice: deviceClass = vbd, device = vbd/51728
[2014-11-06 22:53:36 13499] DEBUG (XendCheckpoint:124) [xc_save]: /usr/lib/xen/bin/xc_save 26 2 0 0 5
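For monitoring, the zombie state above can be spotted directly in "xm list" output: the fifth column is the state field, and an 's' there means the domain has shut down but still exists (and still holds its memory). A minimal sketch of such a check follows; the helper `find_zombie_domains` is illustrative, not part of our tooling.

```python
# Illustrative helper: parse "xm list" output and flag zombie domains.
# A non-Dom0 domain whose state column contains 's' (shut down) but
# which still appears in the list -- like Domain-Unnamed above -- is a
# candidate zombie.

SAMPLE = """\
Name                ID   Mem VCPUs      State   Time(s)
Domain-0             0  2048    64     r-----    6791.5
Domain-Unnamed       1  4099     4     --ps--      94.8
"""

def find_zombie_domains(xm_list_output):
    zombies = []
    for line in xm_list_output.splitlines()[1:]:   # skip header row
        fields = line.split()
        if len(fields) < 6:
            continue
        name, state = fields[0], fields[4]
        if name != "Domain-0" and "s" in state:    # 's' = shut down
            zombies.append(name)
    return zombies

print(find_zombie_domains(SAMPLE))   # ['Domain-Unnamed']
```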
On Sun, Nov 16, 2014 at 12:39 AM, Thomas Weyergraf <T.Weyergraf at virtfinity.de> wrote:
> With the latest versions of all packages installed [1], we see failures in
> live-migration of stock-CentOS6 HVM guests, leaving a "Domain-Unnamed" on
> the source host, while the migrated guest runs fine on the target host.
>
> Domain-0              0  2048    64  r-----   6791.5
> Domain-Unnamed        1  4099     4  --ps--     94.8
>
> The failure is not consistently reproducible: some guests (of the same type)
> live-migrate just fine, until eventually some seemingly random guest fails,
> leaving a "Domain-Unnamed" zombie.

Thanks for this report. It looks like for some reason xend has asked Xen to shut down the domain, but Xen is saying, "Sorry, can't do that yet." That's why restarting xend and removing things from xenstore don't work: xend is just reporting what it sees, and what it sees is a zombie domain that refuses to die. :-)

Do you have a serial port connected to any of your servers?

* If so, could you:
 - Send the output just after you notice a domain in this state
 - Type "Ctrl-A" three times on the console to switch to Xen, then type 'q' (and send the resulting output)

* If not, could you:
 - send the output of "xl dmesg"
 - run "xl debug-keys q" and again take the output of "xl dmesg"?

Can you also run "ps ax | grep qemu" to check whether the qemu instance associated with this domain has actually been destroyed, or whether it is still around?

Also, have you tried running "xl destroy" on the domain to see what happens? xl is stateless, so it can often do things alongside xend. This is not a good idea in general, as the two can frequently end up stepping on each other's toes; but in this case I think it shouldn't be a problem.

Thanks,
-George
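The "ps ax | grep qemu" check above can be scripted when watching many hosts. A minimal sketch, assuming xend starts qemu-dm with "-d <domid>" on its command line (verify that assumption on your own hosts; the helper name is illustrative):

```python
# Sketch of the "ps ax | grep qemu" check: list leftover qemu-dm
# processes, optionally narrowed to one domid. Assumes xend's qemu-dm
# invocation includes "-d <domid>" -- verify on your own hosts.
import subprocess

def leftover_qemu_processes(domid=None):
    out = subprocess.check_output(["ps", "ax"]).decode("utf-8", "replace")
    hits = []
    for line in out.splitlines():
        if "qemu-dm" not in line or "grep" in line:
            continue
        if domid is None or " -d %d" % domid in line:
            hits.append(line.strip())
    return hits

# After a successful migration, the source host should report no
# qemu-dm process for the migrated domid:
print(leftover_qemu_processes(1))
```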
On Sun, Nov 16, 2014 at 12:39 AM, Thomas Weyergraf <T.Weyergraf at virtfinity.de> wrote:
> [snip]
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 3086, in destroy
>     xc.domain_destroy(self.domid)
> Error: (16, 'Device or resource busy')

Actually, looking at this again -- something is definitely weird here. That hypercall shouldn't be able to return anything except error 11 ("EAGAIN") or error 3 ("ESRCH"). Error 16 ("EBUSY") isn't anywhere in the codepath for domain_destroy, and several places will call BUG_ON() if the error returned is *not* EAGAIN.

Are you sure you're running a matched set of hypervisor and tools?

-George
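For reference, the error numbers mentioned above are the standard Linux errno values, easy to confirm with Python's errno module:

```python
# Sanity check of the errno values discussed above (Linux values):
# 16 = EBUSY ("Device or resource busy"), 11 = EAGAIN, 3 = ESRCH.
import errno
import os

for code in (errno.EBUSY, errno.EAGAIN, errno.ESRCH):
    print(code, errno.errorcode[code], os.strerror(code))
```

The "(16, 'Device or resource busy')" pair in the traceback is exactly (EBUSY, strerror(EBUSY)), which is what makes its appearance in the domain_destroy path so surprising.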