Elena V. Titova
2013-Feb-04 07:24 UTC
[PATCH] xend: resume a guest domain after an unsuccessful live migration
Hello.

We use debian sarge, linux-image-3.2.0-3-amd64 and xen-4.1.3 on our
servers.

When a live migration is run, the guest domain may fail to resume on the
destination host and is then destroyed on the source host. This patch
fixes that by resuming the guest domain on the source host when starting
the guest domain on the destination fails.

git diff tools/python/xen/xend/XendCheckpoint.py
diff --git a/tools/python/xen/xend/XendCheckpoint.py b/tools/python/xen/xend/XendCheckpoint.py
index fa09757..6b8765f 100644
--- a/tools/python/xen/xend/XendCheckpoint.py
+++ b/tools/python/xen/xend/XendCheckpoint.py
@@ -163,12 +163,16 @@ def save(fd, dominfo, network, live, dst, checkpoint=False, node=-1,sock=None):
             dominfo.resumeDomain()
         else:
             if live and sock != None:
+                status = os.read(fd, 64)
                 try:
                     sock.shutdown(2)
                 except:
                     pass
                 sock.close()
 
+                if status == "FAIL":
+                    raise XendError("Restore failed")
+
             dominfo.destroy()
             dominfo.testDeviceComplete()
         try:
@@ -351,8 +355,14 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
         if not paused:
             dominfo.unpause()
 
+        if relocating:
+            os.write(fd, "SUCCESS")
+
         return dominfo
     except Exception, exn:
+        if relocating:
+            os.write(fd, "FAIL")
+
         dominfo.destroy()
         log.exception(exn)
         raise exn

-- 
Elena Titova
Ian Campbell
2013-Feb-04 15:39 UTC
Re: [PATCH] xend: resume a guest domain after an unsuccessful live migration
On Mon, 2013-02-04 at 07:24 +0000, Elena V. Titova wrote:
> Hello.
>
> We use debian sarge, linux-image-3.2.0-3-amd64 and xen-4.1.3 on our
> servers.

Do you really mean Sarge? Or did you mean Squeeze or Wheezy? Those
kernel and Xen versions look like Wheezy versions but perhaps you are
using backports.

> When a live migration is run the guest domain may not resume on a
> destination host and is destroyed on a source host.
> This patch fixes it by resuming the guest domain on a source host when a
> start of the guest domain was failed.

xend is supposed to be in maintenance mode so I'm not too sure about
this sort of change.

In particular I'm worried that it might break migration from Xen version
N to version N+1 which is something we try and support.

BTW the xl toolstack already has this functionality so another option
for you may be to switch to that.

> git diff tools/python/xen/xend/XendCheckpoint.py
> diff --git a/tools/python/xen/xend/XendCheckpoint.py b/tools/python/xen/xend/XendCheckpoint.py
> index fa09757..6b8765f 100644
> --- a/tools/python/xen/xend/XendCheckpoint.py
> +++ b/tools/python/xen/xend/XendCheckpoint.py
> @@ -163,12 +163,16 @@ def save(fd, dominfo, network, live, dst, checkpoint=False, node=-1,sock=None):
>              dominfo.resumeDomain()
>          else:
>              if live and sock != None:

This same class of errors isn't possible for non-live?

> +                status = os.read(fd, 64)

The written strings are 7 or 4 bytes, it would be better to choose a
fixed length for all writes and the read I think. That might mean
padding the fail message. Also these protocol strings should be defined
as constants rather than open coded.

Even with that addressed I don't really feel confident enough about xend
internals to Ack a patch like this.

>                  try:
>                      sock.shutdown(2)
>                  except:
>                      pass
>                  sock.close()
>
> +                if status == "FAIL":
> +                    raise XendError("Restore failed")
> +
>              dominfo.destroy()
>              dominfo.testDeviceComplete()
>          try:
> @@ -351,8 +355,14 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
>          if not paused:
>              dominfo.unpause()
>
> +        if relocating:
> +            os.write(fd, "SUCCESS")
> +
>          return dominfo
>      except Exception, exn:
> +        if relocating:
> +            os.write(fd, "FAIL")
> +
>          dominfo.destroy()
>          log.exception(exn)
>          raise exn
>
> --
> Elena Titova
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
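
The fixed-length suggestion above can be sketched as follows. This is only an illustration in modern Python, not code from xend or from the patch: STATUS_LEN, STATUS_OK, STATUS_FAIL, write_status and read_status are hypothetical names, and the 8-byte NUL-padded record is an assumed format. A pipe stands in for the migration socket.

```python
import os

# Hypothetical protocol constants; none of these names exist in xend.
STATUS_LEN = 8
STATUS_OK = b"SUCCESS".ljust(STATUS_LEN, b"\0")
STATUS_FAIL = b"FAIL".ljust(STATUS_LEN, b"\0")


def write_status(fd, status):
    """Write one fixed-length status record, retrying on short writes."""
    if len(status) != STATUS_LEN:
        raise ValueError("status record must be STATUS_LEN bytes")
    data = status
    while data:
        written = os.write(fd, data)
        data = data[written:]


def read_status(fd):
    """Read exactly STATUS_LEN bytes; return None if the peer closed early."""
    buf = b""
    while len(buf) < STATUS_LEN:
        chunk = os.read(fd, STATUS_LEN - len(buf))
        if not chunk:
            # End of file before a full record arrived.
            return None
        buf += chunk
    return buf


if __name__ == "__main__":
    # A pipe plays the role of the connection between the two hosts.
    r, w = os.pipe()
    write_status(w, STATUS_FAIL)
    os.close(w)
    print(read_status(r) == STATUS_FAIL)
    os.close(r)
```

With a fixed record size the reader never has to guess how many bytes to request, and a None result distinguishes a peer that closed the connection without reporting anything.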
Elena V. Titova
2013-Feb-05 12:14 UTC
Re: [PATCH] xend: resume a guest domain after an unsuccessful live migration
On Mon, 04/02/2013 at 15:39 +0000, Ian Campbell wrote:
> >
> > We use debian sarge, linux-image-3.2.0-3-amd64 and xen-4.1.3 on our
> > servers.
>
> Do you really mean Sarge? Or did you mean Squeeze or Wheezy? Those
> kernel and Xen versions look like Wheezy versions but perhaps you are
> using backports.

That was my mistake. I meant debian squeeze with the kernel and xen
from testing.

> > When a live migration is run the guest domain may not resume on a
> > destination host and is destroyed on a source host.
> > This patch fixes it by resuming the guest domain on a source host when a
> > start of the guest domain was failed.
>
> xend is supposed to be in maintenance mode so I'm not too sure about
> this sort of change.
>
> In particular I'm worried that it might break migration from Xen version
> N to version N+1 which is something we try and support.
>
> BTW the xl toolstack already has this functionality so another option
> for you may be to switch to that.
>
> > git diff tools/python/xen/xend/XendCheckpoint.py
> > diff --git a/tools/python/xen/xend/XendCheckpoint.py b/tools/python/xen/xend/XendCheckpoint.py
> > index fa09757..6b8765f 100644
> > --- a/tools/python/xen/xend/XendCheckpoint.py
> > +++ b/tools/python/xen/xend/XendCheckpoint.py
> > @@ -163,12 +163,16 @@ def save(fd, dominfo, network, live, dst, checkpoint=False, node=-1,sock=None):
> >              dominfo.resumeDomain()
> >          else:
> >              if live and sock != None:
>
> This same class of errors isn't possible for non-live?

As I understand it, in a non-live migration I have a saved image of the
VM and can try to resume it on different servers, including the source
server. In a live migration, if resuming the VM fails I am left without
a running VM and its services, although the VM could have continued to
run on the source server.

> > +                status = os.read(fd, 64)
>
> The written strings are 7 or 4 bytes, it would be better to choose a
> fixed length for all writes and the read I think. That might mean
> padding the fail message. Also these protocol strings should be defined
> as constants rather than open coded.
>
> Even with that addressed I don't really feel confident enough about xend
> internals to Ack a patch like this.
>

Thank you for your comments and the advice to use the xl toolstack. We
use xen with the xend toolstack and have some scripts that use xm and
XenAPI. But as I have read, xend is deprecated in Xen 4.1 and will be
removed in a future release, so a switch to xl may be a good idea.

> >                  try:
> >                      sock.shutdown(2)
> >                  except:
> >                      pass
> >                  sock.close()
> >
> > +                if status == "FAIL":
> > +                    raise XendError("Restore failed")
> > +
> >              dominfo.destroy()
> >              dominfo.testDeviceComplete()
> >          try:
> > @@ -351,8 +355,14 @@ def restore(xd, fd, dominfo = None, paused = False, relocating = False):
> >          if not paused:
> >              dominfo.unpause()
> >
> > +        if relocating:
> > +            os.write(fd, "SUCCESS")
> > +
> >          return dominfo
> >      except Exception, exn:
> > +        if relocating:
> > +            os.write(fd, "FAIL")
> > +
> >          dominfo.destroy()
> >          log.exception(exn)
> >          raise exn
> >

-- 
Elena Titova
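
The reasoning above (a failed live migration should leave the guest running on the source) can be sketched as a small decision function. This is an illustration under stated assumptions, not the patch: FakeDomain and finish_live_migration are hypothetical, only the method names resumeDomain and destroy are taken from the diff, and treating a missing or unknown reply as a failure is a design choice of this sketch. The posted patch, by contrast, destroys the source domain unless it reads "FAIL".

```python
class FakeDomain:
    """Hypothetical stand-in for a xend domain object; records actions taken."""

    def __init__(self):
        self.actions = []

    def resumeDomain(self):
        self.actions.append("resume")

    def destroy(self):
        self.actions.append("destroy")


def finish_live_migration(dominfo, status):
    """Destroy the source domain only if the destination reported success.

    Returns True when the migration is considered complete.
    """
    if status == "SUCCESS":
        # The guest is running on the destination; the source copy can go.
        dominfo.destroy()
        return True
    # "FAIL", an empty read, or anything unexpected: there is no saved
    # image to fall back on, so keep the guest alive on the source.
    dominfo.resumeDomain()
    return False


if __name__ == "__main__":
    dom = FakeDomain()
    finish_live_migration(dom, "FAIL")
    print(dom.actions)
```

Defaulting to resume is the conservative choice for live migration, but it also means that a destination which never sends a status, such as a xend without this change, is treated as a failure.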
Ian Campbell
2013-Feb-05 12:31 UTC
Re: [PATCH] xend: resume a guest domain after an unsuccessful live migration
> > > +                status = os.read(fd, 64)
> >
> > The written strings are 7 or 4 bytes, it would be better to choose a
> > fixed length for all writes and the read I think. That might mean
> > padding the fail message. Also these protocol strings should be defined
> > as constants rather than open coded.
> >
> > Even with that addressed I don't really feel confident enough about xend
> > internals to Ack a patch like this.
> >
>
> Thank you for your comments and advice to use xl toolstack. We use xen
> and xend toolstack and have some scripts with xm and XenAPI. But as I
> read xend is deprecated in Xen 4.1 and will be removed in a future
> release and a switch to xl may be a good idea.

If you are using XenAPI then you might also consider switching to the
xapi toolstack (the XCP toolstack), which is now in Debian as a
standalone thing too.

Ian.