Ian Campbell
2011-Feb-16 11:47 UTC
[Xen-devel] [PATCH] libxl: do slow resume after failed migration attempt
# HG changeset patch
# User Ian Campbell <ian.campbell@citrix.com>
# Date 1297856874 0
# Node ID 1728ed4bbec9e82ca13c2639c8e4ef8b4dc231b6
# Parent aa466613328f5de78fdfc968473cb06e948c1f5d
libxl: do slow resume after failed migration attempt

both of the current callers for libxl_domain_resume are calling after
a migration has failed, one is failure to suspend on the sender and
the other is failure to start on the destination, both leading to a
resume attempt on the sender.

However in the first case, failure to suspend, there is no guarantee
that the guest has made it as far as the suspend hypercall and
therefore the fast resume method, which frobs the hypercall return to
indicate a cancelled suspend, cannot safely be used since it will
corrupt %eax/%rax.

For the second case, failure to start on destination, I don't think it
really matters if the resume is fast or slow.

Therefore always use the slow/uncooperative version of xc_domain_resume from
libxl_domain_resume.

This makes a PV domain which failed to suspend (e.g. because the core
Linux PM infrastructure within the guest didn't allow it) recover
gracefully.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

diff -r aa466613328f -r 1728ed4bbec9 tools/libxl/libxl.c
--- a/tools/libxl/libxl.c Tue Feb 15 13:40:50 2011 +0000
+++ b/tools/libxl/libxl.c Wed Feb 16 11:47:54 2011 +0000
@@ -226,7 +226,7 @@ int libxl_domain_resume(libxl_ctx *ctx,
         rc = ERROR_NI;
         goto out;
     }
-    if (xc_domain_resume(ctx->xch, domid, 1)) {
+    if (xc_domain_resume(ctx->xch, domid, 0)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                         "xc_domain_resume failed for domain %u",
                         domid);

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
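
The hazard described in the commit message can be illustrated with a toy model. The sketch below is not Xen or libxc code: `toy_vcpu`, `toy_fast_resume`, `toy_fast_resume_clobbers` and `TOY_SUSPEND_CANCELLED` are hypothetical names, and the value 1 for "suspend cancelled" is an assumption made only for illustration. It models a single point from the message: overwriting the return register is harmless when the guest is waiting for the suspend hypercall's result, and destroys live guest state when the guest never reached that hypercall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a PV guest vCPU; NOT the real Xen vCPU context. */
struct toy_vcpu {
    uint64_t rax;               /* register that carries hypercall results */
    bool in_suspend_hypercall;  /* did the guest reach the suspend hypercall? */
};

/* Assumed marker value meaning "the suspend was cancelled". */
#define TOY_SUSPEND_CANCELLED 1u

/* Fast resume in this model: rewrite the pending hypercall's return value. */
static void toy_fast_resume(struct toy_vcpu *v)
{
    assert(v != NULL);
    v->rax = TOY_SUSPEND_CANCELLED;
}

/*
 * Apply the fast resume and report whether it overwrote a value the guest
 * still needed, i.e. the guest was not blocked in the suspend hypercall and
 * its register contents changed.
 */
static bool toy_fast_resume_clobbers(struct toy_vcpu *v)
{
    assert(v != NULL);
    uint64_t before = v->rax;
    bool waiting_for_result = v->in_suspend_hypercall;

    toy_fast_resume(v);
    return !waiting_for_result && v->rax != before;
}
```

In this model a guest that refused to suspend still holds an unrelated value in `rax`, so `toy_fast_resume_clobbers` reports damage; a guest blocked in the suspend hypercall simply sees the cancellation result it was waiting for.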
Ian Campbell
2011-Feb-16 11:49 UTC
[Xen-devel] Re: [PATCH] libxl: do slow resume after failed migration attempt
On Wed, 2011-02-16 at 11:47 +0000, Ian Campbell wrote:
> This makes a PV domain which failed to suspend (e.g. because the core
> Linux PM infrastructure within the guest didn't allow it) recover
> gracefully.

a PVHVM domain never suffered from this because libxl_domain_resume
bails due to a libxl__domain_is_hvm check. I'm not 100% clear whether
this is correct but I didn't change it. My test with a PVHVM guest which
acknowledges the suspend but doesn't go on to do anything seems to work.

Ian.
Ian Campbell
2011-Feb-16 11:53 UTC
Re: [Xen-devel] Re: [PATCH] libxl: do slow resume after failed migration attempt
On Wed, 2011-02-16 at 11:49 +0000, Ian Campbell wrote:
> a PVHVM domain never suffered from this because libxl_domain_resume
> bails due to a libxl__domain_is_hvm check. I'm not 100% clear whether
> this is correct but I didn't change it. My test with a PVHVM guest which
> acknowledges the suspend but doesn't go on to do anything seems to work.

Looking closer, even a PV guest which is hacked to not actually try to
suspend fails this new xc_domain_resume call and it's actually the
original domain which continues.

I'm inclined to suggest that this is OK and that trying to do a slow
xc_domain_resume will save guests which have suffered certain types of
failure and be harmless for other types of failures, but I wouldn't
argue strongly against a suggestion that the right thing to do in the
"failed to suspend" case is to simply unpause the original domain and
let it try and continue...

Ian.
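
The two recovery policies weighed in this message can be written down side by side. This is only a sketch of the decision, not libxl code: the enums and both `toy_policy_*` functions are hypothetical, and the choice of a slow resume for the destination-failure case in the alternative policy is an assumption (the thread says only that fast versus slow probably does not matter there).

```c
#include <assert.h>

/* The two failure cases named in the patch description. */
enum toy_failure {
    TOY_FAIL_SUSPEND_ON_SENDER,
    TOY_FAIL_START_ON_DESTINATION
};

/* Recovery actions discussed in the thread. */
enum toy_recovery {
    TOY_SLOW_RESUME,  /* slow/uncooperative resume */
    TOY_UNPAUSE       /* just unpause the original domain */
};

/* Policy as patched: always take the slow resume path. */
static enum toy_recovery toy_policy_as_patched(enum toy_failure why)
{
    assert(why == TOY_FAIL_SUSPEND_ON_SENDER ||
           why == TOY_FAIL_START_ON_DESTINATION);
    return TOY_SLOW_RESUME;
}

/* Alternative floated above: unpause when the guest never suspended. */
static enum toy_recovery toy_policy_alternative(enum toy_failure why)
{
    assert(why == TOY_FAIL_SUSPEND_ON_SENDER ||
           why == TOY_FAIL_START_ON_DESTINATION);
    return why == TOY_FAIL_SUSPEND_ON_SENDER ? TOY_UNPAUSE : TOY_SLOW_RESUME;
}
```

The policies differ only for the "failed to suspend" case, which is the case the message leaves open.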
Ian Jackson
2011-Feb-17 19:41 UTC
Re: [Xen-devel] Re: [PATCH] libxl: do slow resume after failed migration attempt
Ian Campbell writes ("Re: [Xen-devel] Re: [PATCH] libxl: do slow resume after failed migration attempt"):
> I'm inclined to suggest that this is OK and that trying to do a slow
> xc_domain_resume will save guests which have suffered certain types of
> failure and be harmless for other types of failures, but I wouldn't
> argue strongly against a suggestion that the right thing to do in the
> "failed to suspend" case is to simply unpause the original domain and
> let it try and continue...

I find this all quite convincing. I have applied the patch.

Thanks,
Ian.