Tim O'Donovan
2010-Apr-07 17:04 UTC
[Xen-users] HVM Live Migrations Failing 90% Of The Time
I'm deploying a 2-node Pacemaker/DRBD backed Xen cluster to run a mixture of Linux PVM and Windows HVM VMs. I have this up and running on a pair of development machines, with both automatic and manual failover working perfectly. The live migrations work every time for the PVM and HVM based VMs.

I've replicated the setup onto a pair of high-end live machines, but the live migrations only succeed around 10% of the time for the HVM VMs. PVM live migrations complete every time. The configurations on the development and live machines are identical in every way, except for the physical hardware.

The migrating host errors with the following when the migration fails:

[2010-04-07 14:42:45 6211] DEBUG (XendCheckpoint:103) [xc_save]: /usr/lib64/xen/bin/xc_save 30 18 0 0 5
[2010-04-07 14:42:45 6211] INFO (XendCheckpoint:403) xc_save: could not read suspend event channel
[2010-04-07 14:42:45 6211] WARNING (XendDomainInfo:1617) Domain has crashed: name=migrating-web id=18.
[2010-04-07 14:42:45 6211] DEBUG (XendDomainInfo:2389) XendDomainInfo.destroy: domid=18
[2010-04-07 14:42:45 6211] DEBUG (XendDomainInfo:2406) XendDomainInfo.destroyDomain(18)
[2010-04-07 14:42:48 6211] DEBUG (XendDomainInfo:1939) Destroying device model
[2010-04-07 14:42:48 6211] INFO (XendCheckpoint:403) Saving memory pages: iter 1 10%ERROR Internal error: Error peeking shadow bitmap
[2010-04-07 14:42:48 6211] INFO (XendCheckpoint:403) Warning - couldn't disable shadow mode
Save exit rc=1
[2010-04-07 14:42:48 6211] ERROR (XendCheckpoint:157) Save failed on domain web (18) - resuming.
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/xen/xend/XendCheckpoint.py", line 125, in save
    forkHelper(cmd, fd, saveInputHandler, False)
  File "/usr/lib/python2.5/site-packages/xen/xend/XendCheckpoint.py", line 391, in forkHelper
    raise XendError("%s failed" % string.join(cmd))
XendError: /usr/lib64/xen/bin/xc_save 30 18 0 0 5 failed

With the below also being logged in /var/log/xen/qemu-dm-web.log:

xenstore_process_logdirty_event: key=000000006b8b4567 size=335816
Log-dirty: mapped segment at 0x7fb56c136000
Triggered log-dirty buffer switch

The host that is being migrated to errors with the following:

[2010-04-07 14:42:45 6227] INFO (XendCheckpoint:403) Reloading memory pages: 0%
[2010-04-07 14:42:48 6227] INFO (XendCheckpoint:403) ERROR Internal error: Error when reading batch size
[2010-04-07 14:42:48 6227] INFO (XendCheckpoint:403) Restore exit with rc=1
[2010-04-07 14:42:48 6227] DEBUG (XendDomainInfo:2389) XendDomainInfo.destroy: domid=26
[2010-04-07 14:42:48 6227] DEBUG (XendDomainInfo:2406) XendDomainInfo.destroyDomain(26)
[2010-04-07 14:42:48 6227] ERROR (XendDomainInfo:2418) XendDomainInfo.destroy: xc.domain_destroy failed.
Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/xen/xend/XendDomainInfo.py", line 2413, in destroyDomain
    xc.domain_destroy(self.domid)
Error: (3, 'No such process')

Some basic config details:

Xen version: 3.3.0
Kernel: 2.6.24-27-xen
dom0 OS: Ubuntu 8.04 64-bit
domU OS: Windows 2008 64-bit

VM config for the above example:

name = "web"
kernel = "/usr/lib/xen/boot/hvmloader"
builder = 'hvm'
memory = 10240
shadow_memory = 8
vif = [ 'bridge=eth1' ]
acpi = 1
apic = 1
disk = [ 'phy:/dev/drbd0,hda,w', 'phy:/dev/drbd1,hdb,w' ]
device_model = '/usr/lib64/xen/bin/qemu-dm'
boot = "dc"
sdl = 0
vnc = 1
vncconsole = 1
vncpasswd = 'XXXXXXXXXXXX'
serial = 'pty'
usbdevice = 'tablet'
vcpus = 8
on_poweroff = 'destroy'
on_reboot = 'restart'
on_crash = 'destroy'

The DRBD resources are handled by Jefferson Ogata's qemu-dm.drbd wrapper (http://www.antibozo.net/xen/qemu-dm.drbd) and a slightly modified version of DRBD's block-drbd script. The dom0 machines are allocated 1GB of memory each and are identical, in both software and hardware configurations. Each machine has a total of 24GB of memory.

Thanks

_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
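One detail worth checking in the config above: shadow_memory = 8 is small for a 10 GB, 8-vcpu HVM guest. The xm.cfg guidance for HVM guests is roughly 2 KB of shadow memory per MB of guest RAM, plus a few MB per vcpu, and the log-dirty tracking used during live migration draws on the same pool, which would be consistent with the "Error peeking shadow bitmap" failure. A quick sketch of that sizing guideline (the 2 MB-per-vcpu allowance is my own reading of "a few MB", not a documented figure):

```shell
#!/bin/sh
# Rough sizing per the xm.cfg guideline for HVM guests:
# ~2 KB of shadow memory per MB of guest RAM, plus a few MB per vcpu.
# The 2 MB-per-vcpu figure below is an assumed reading of "a few MB".
memory_mb=10240   # memory = 10240 in the config above
vcpus=8           # vcpus = 8

shadow_mb=$(( (2 * memory_mb) / 1024 + 2 * vcpus ))
echo "suggested shadow_memory = ${shadow_mb} MB (config currently has 8)"
```

By this estimate a shadow_memory in the tens of MB would be closer to the guideline than the configured 8.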
Pasi Kärkkäinen
2010-Apr-07 18:07 UTC
Re: [Xen-users] HVM Live Migrations Failing 90% Of The Time
On Wed, Apr 07, 2010 at 06:04:14PM +0100, Tim O'Donovan wrote:
> I'm deploying a 2-node Pacemaker/DRBD backed Xen cluster to run a
> mixture of Linux PVM and Windows HVM VMs. I have this up and running on
> a pair of development machines, with both automatic and manual failover
> working perfectly. The live migrations work every time for the PVM and
> HVM based VMs.
>
> I've replicated the setup onto a pair of high-end live machines, but the
> live migrations only succeed around 10% of the time for the HVM VMs. PVM
> live migrations complete every time. The configurations on the
> development and live machines are identical in every way, except for the
> physical hardware.
>
<snip>
>
> Some basic config details:
>
> Xen version: 3.3.0
> Kernel: 2.6.24-27-xen
> dom0 OS: Ubuntu 8.04 64-bit
> domU OS: Windows 2008 64-bit
>

You might want to update to a newer Xen version, and also possibly to a newer dom0 kernel version. Those versions are old, and they have known bugs.

-- Pasi
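Whichever versions end up in play, it helps to quantify an intermittent failure rate like this before and after a change. A minimal sketch, not from the thread: the domain and host names are placeholders, `xm migrate --live` is the Xen 3.3 toolstack syntax, and `MIGRATE` can be overridden for a dry run.

```shell
#!/bin/sh
# Drive repeated live migrations back and forth between two hosts and
# report how many round trips succeed. Placeholders: domain "web",
# hosts "node1"/"node2". Override MIGRATE to dry-run the loop itself.
migrate_roundtrips() {
    domain=$1; peer_a=$2; peer_b=$3; runs=$4
    cmd=${MIGRATE:-"xm migrate --live"}   # Xen 3.3 xm syntax
    ok=0; fail=0; i=0
    while [ "$i" -lt "$runs" ]; do
        if $cmd "$domain" "$peer_a" && $cmd "$domain" "$peer_b"; then
            ok=$((ok + 1))
        else
            fail=$((fail + 1))
        fi
        i=$((i + 1))
    done
    echo "succeeded=$ok failed=$fail of $runs round trips"
}

# On the cluster, e.g.: migrate_roundtrips web node1 node2 10
# then compare /var/log/xen/xend.log on both hosts for the failing runs.
```

With the ~10% success rate described above, ten round trips should reproduce the crash quickly and give a baseline to compare against after any version or shadow-memory change.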