We''re in the process of porting xenopsd[1] to libxl, rather than driving libxc manually. Using our libxc-based backend, we are able to cancel operations. For operations that are using XenStore watches we wrap these in a cancellable_watch[2], and for operations that make use of a subprocess we send SIGKILL when we wish to cancel the associated task. We would then instrument any necessary cleanup by hand on a best-effort basis. With the move to libxl, a lot of this control will be abstracted away and it is unclear how best to allow long-running tasks to be cancelled. It seems most of these operations could be executed asynchronously but we wonder if it is possible to cancel them, or how we could add cancellation functionality to these operations? Cheers, Si [1]: https://github.com/xapi-project/xenopsd [2]: https://github.com/xapi-project/xenopsd/blob/master/xc/cancel_utils.ml#L101
Konrad Rzeszutek Wilk
2013-Oct-23 17:23 UTC
Re: FW: Cancelling asynchronous operations in libxl
On Tue, Oct 15, 2013 at 04:08:31PM +0000, Simon Beaumont wrote:> We''re in the process of porting xenopsd[1] to libxl, rather than driving libxc manually. > > Using our libxc-based backend, we are able to cancel operations. For operations that are using XenStore watches we wrap these in a cancellable_watch[2], and for operations that make use of a subprocess we send SIGKILL when we wish to cancel the associated task. We would then instrument any necessary cleanup by hand on a best-effort basis. > > With the move to libxl, a lot of this control will be abstracted away and it is unclear how best to allow long-running tasks to be cancelled. It seems most of these operations could be executed asynchronously but we wonder if it is possible to cancel them, or how we could add cancellation functionality to these operations?Are there specific operations that you are focusing on to cancel? Or is this mostly of ''the guest is doing something and does not seem to have a vnc/console, lets kill it'' ?> > Cheers, > > Si > > [1]: https://github.com/xapi-project/xenopsd > [2]: https://github.com/xapi-project/xenopsd/blob/master/xc/cancel_utils.ml#L101 > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
On Wed, 2013-10-23 at 13:23 -0400, Konrad Rzeszutek Wilk wrote:> On Tue, Oct 15, 2013 at 04:08:31PM +0000, Simon Beaumont wrote: > > We''re in the process of porting xenopsd[1] to libxl, rather than > driving libxc manually. > > > > Using our libxc-based backend, we are able to cancel operations. For > operations that are using XenStore watches we wrap these in a > cancellable_watch[2], and for operations that make use of a subprocess > we send SIGKILL when we wish to cancel the associated task. We would > then instrument any necessary cleanup by hand on a best-effort basis. > > > > With the move to libxl, a lot of this control will be abstracted > away and it is unclear how best to allow long-running tasks to be > cancelled. It seems most of these operations could be executed > asynchronously but we wonder if it is possible to cancel them, or how > we could add cancellation functionality to these operations? > > Are there specific operations that you are focusing on to cancel?Yes, this would be good to know. Ian J would know better than I but AIUI there is currently no generic support for cancelling an asynchronous op in libxl. I''m sure we could find a way to add this, for example as an abort after the current "step" succeeds/times out/fails.> Or is this mostly of ''the guest is doing something and does not seem > to have a vnc/console, lets kill it'' ?For killing libxl_domain_destroy should be fine to call even if there is an async op pending on that domain, the async op will naturally fail, it''ll probably log a lot but should do so gracefully. If there are cases which do not we should fix them, since they would represent a bug irrespective of any killing of the domain from under the op''s feet. Ian.
On 26 Oct 2013, at 09:33, Ian Campbell <ian.campbell@citrix.com> wrote:> On Wed, 2013-10-23 at 13:23 -0400, Konrad Rzeszutek Wilk wrote: >> On Tue, Oct 15, 2013 at 04:08:31PM +0000, Simon Beaumont wrote: >> Are there specific operations that you are focusing on to cancel? > > Yes, this would be good to know. Ian J would know better than I but AIUI > there is currently no generic support for cancelling an asynchronous op > in libxl. I''m sure we could find a way to add this, for example as an > abort after the current "step" succeeds/times out/fails.I think this generic support is what we’re looking for. Ideally, we’d like to be able to cancel any potentially long-running function (i.e. those that can be called asynchronously). More specifically, we’d like to be able to cancel at least the following: * device_*_[add|remove|destroy] * device_pci_assignable_[add|remove] * libxl_domain_create_new * libxl_domain_create_resume * libxl_domain_shutdown * libxl_domain_wait_shutdown * libxl_domain_reboot * libxl_domain_destroy * libxl_domain_suspend * libxl_domain_pause * libxl_domain_unpause>> Or is this mostly of ''the guest is doing something and does not seem >> to have a vnc/console, lets kill it'' ? > > For killing libxl_domain_destroy should be fine to call even if there is > an async op pending on that domain, the async op will naturally fail, > it''ll probably log a lot but should do so gracefully. If there are cases > which do not we should fix them, since they would represent a bug > irrespective of any killing of the domain from under the op''s feet.Not so much interested in this use case. Si
Ian Campbell writes ("Re: [Xen-devel] FW: Cancelling asynchronous operations in libxl"):> On Wed, 2013-10-23 at 13:23 -0400, Konrad Rzeszutek Wilk wrote: > > Are there specific operations that you are focusing on to cancel? > > Yes, this would be good to know. Ian J would know better than I but AIUI > there is currently no generic support for cancelling an asynchronous op > in libxl. I''m sure we could find a way to add this, for example as an > abort after the current "step" succeeds/times out/fails.That''s right. I think the right approach would be to provide a new kind of ao_how value which returns a reference to the caller which can be used for cancellation. Cancellation would itself be asynchronous, in the sense that you sould request cancellation but then still have to wait for the operation to end. So initially you could implement cancellation for only a subset of operations and have the infrastructure disregard cancellation requests for operations which don''t support it. I think the most obvious operation that should be cancellable is domain migration. Ideally there would be some kind of automatic plumbing which would arrange (for example) that callers to the libxl xs watch functions, or timeout functions, would get a "cancellation" notice instead.> For killing libxl_domain_destroy should be fine to call even if there is > an async op pending on that domain, the async op will naturally fail, > it''ll probably log a lot but should do so gracefully. If there are cases > which do not we should fix them, since they would represent a bug > irrespective of any killing of the domain from under the op''s feet.Yes. Ian.
On Mon, 2013-10-28 at 15:52 +0000, Ian Jackson wrote:> Ian Campbell writes ("Re: [Xen-devel] FW: Cancelling asynchronous operations in libxl"): > > On Wed, 2013-10-23 at 13:23 -0400, Konrad Rzeszutek Wilk wrote: > > > Are there specific operations that you are focusing on to cancel? > > > > Yes, this would be good to know. Ian J would know better than I but AIUI > > there is currently no generic support for cancelling an asynchronous op > > in libxl. I''m sure we could find a way to add this, for example as an > > abort after the current "step" succeeds/times out/fails. > > That''s right. > > I think the right approach would be to provide a new kind of ao_how > value which returns a reference to the caller which can be used for > cancellation. > > Cancellation would itself be asynchronous, in the sense that you sould > request cancellation but then still have to wait for the operation to > end. > > So initially you could implement cancellation for only a subset of > operations and have the infrastructure disregard cancellation requests > for operations which don''t support it. I think the most obvious > operation that should be cancellable is domain migration. > > Ideally there would be some kind of automatic plumbing which would > arrange (for example) that callers to the libxl xs watch functions, or > timeout functions, would get a "cancellation" notice instead.The way in which sequential async subops are currently handled (by chaining callbacks) is a bit ugly -- it leads to weird error messages like "libxl__vtpm_foo: Failed" which actually means "couldn''t add a NIC" (because that was the previous step, and libxl__vtpm_foo is what picks up the result). Do you think it would be feasible to recast those as sequences of function pointers in an array with a central (common) dispatcher which steps through, handling and reporting errors? If things could be restructured along those lines then adding cancellation support to the dispatcher might get us reasonably good API coverage pretty easily Ian.
Ian Campbell writes ("Re: [Xen-devel] FW: Cancelling asynchronous operations in libxl"):> The way in which sequential async subops are currently handled (by > chaining callbacks) is a bit ugly -- it leads to weird error messages > like "libxl__vtpm_foo: Failed" which actually means "couldn''t add a > NIC" (because that was the previous step, and libxl__vtpm_foo is what > picks up the result). Do you think it would be feasible to recast those > as sequences of function pointers in an array with a central (common) > dispatcher which steps through, handling and reporting errors?Hmmm. It might be. There''s a problem in that they all tend to have different arguments. And the control flow would be split into two places. Maybe it would be better to improve the error reporting to not rely on the function name (which is TBH pretty pants anyway).> If things could be restructured along those lines then adding > cancellation support to the dispatcher might get us reasonably good API > coverage pretty easilyHmm. Writing the control flow of all of these in a declarative style would be difficult. There are a lot of ifs and some parallel loops, as well as simple sequential execution. I think we would be in danger of writing a full-on coroutine system. Maybe that wouldn''t be such a bad idea. Hmm. Ian.
On Thu, 2013-10-31 at 14:32 +0000, Ian Jackson wrote:> Ian Campbell writes ("Re: [Xen-devel] FW: Cancelling asynchronous operations in libxl"): > > The way in which sequential async subops are currently handled (by > > chaining callbacks) is a bit ugly -- it leads to weird error messages > > like "libxl__vtpm_foo: Failed" which actually means "couldn''t add a > > NIC" (because that was the previous step, and libxl__vtpm_foo is what > > picks up the result). Do you think it would be feasible to recast those > > as sequences of function pointers in an array with a central (common) > > dispatcher which steps through, handling and reporting errors? > > Hmmm. It might be. There''s a problem in that they all tend to have > different arguments. And the control flow would be split into two > places. > > Maybe it would be better to improve the error reporting to not rely on > the function name (which is TBH pretty pants anyway).Yes. Even just adding a foo_done step would be better. e.g.: do_nic_stuff -> nic_stuff_done (check nic errors) -> do_vtpm_stuff(vtpm stuff) rather than do_nic_stuff -> do_vtpm_stuf(check nic errors, vtpm stuff)> > > If things could be restructured along those lines then adding > > cancellation support to the dispatcher might get us reasonably good API > > coverage pretty easily > > Hmm. > > Writing the control flow of all of these in a declarative style would > be difficult. There are a lot of ifs and some parallel loops, as well > as simple sequential execution.ifs seems like the easier case, you call the function and it returns a code saying "done" without doing anything (if its that sort of if) instead of "async work started" or you handle the two cases in the function etc. Don''t the parallel loops coalesce into a single "everything done" callback at some point? That seems to obvious place to resync with the dispatcher.> I think we would be in danger of writing a full-on coroutine system. > Maybe that wouldn''t be such a bad idea.Indeed ;-)> > Hmm. > > Ian.
I''ve been thinking about this some more and looking at the code. I have the following sketch of an approach: * Somehow the ao_how API is changed to make it possible to return the ao to the caller (in the case of an asynchronous ao). (NB that there could be exciting races in the application between completing the ao and cancelling it; this means that the application can only use cancellation if it uses the callback mechanism and must make sure that the callback takes a lock and then makes changes to its data structurs which prevent the ao being cancelled. As an alternative we could invent a separate "cancellation handle" which can be detached from the ao but which must always be explicitly destroyed by the application.) Suggestions for the API welcome. The most obvious approach simply adds a new field to libxl_asyncop_how but that risks lots of existing code failing to initialise it. * Keep a list of cancellation hooks in the ao. Anything which is using this ao can add itself to that set of hooks. * Cancellation involves repeatedly taking the front of that list off, and calling the hook. (After the ao has been cancelled, its completion still needs to be awaited by the application, but it will hopefully complete earlier and return ERROR_CANCEL.) * The timeout registration facility is changed to take an ao and register a cancellation hook. It is changed to provide an rc value to its callback, which will be FAIL or CANCEL. * The fork machinery is changed to take an ao, and register a cancellation hook. A suitable-for-default-uses cancellation hook function is provided which sends SIGKILL to the child and makes a note that this has happened. The child death callback provides an rc value (0 for status==0, or FAIL or CANCEL) for the convenience of the next layer up. * A new version of the xswatch event registration machinery is provided which takes an ao, registers itself as cancellation hooks, and provides an rc value to its callbacks. This new facility could usefully do an xs_read on a predefined path. The rc value will be OK or CANCEL. (We need new versions of this because some xswatch callers are part of the infrastructure or libxl application event generation, not aos.) * Anything which uses the fd machinery directly needs to do cancellation itself (or ensure that it has a timeout, an xs watch, or a child). A tricky question arises regarding cleanup: for example, if libxl_domain_create_* were cancelled. It would end up in domcreate_complete with rc==CANCEL. Should it now run the domain destruction ? How would the caller say they wanted that cancelled, if that too was taking too long ? Perhaps there should be a progress callback to say "we have finished cancelling the first thing and are now cleaning up". Or perhaps cancelling the operation should simply skip the destruction and return the domid to the caller. (But also, fiddly edge case: consider what happens if a failed creation, which is already being destroyed, is cancelled.) Ian.
On Fri, 2013-11-08 at 18:38 +0000, Ian Jackson wrote:> I''ve been thinking about this some more and looking at the code. > > I have the following sketch of an approach: > > * Somehow the ao_how API is changed to make it possible to return the > ao to the caller (in the case of an asynchronous ao). > > (NB that there could be exciting races in the application between > completing the ao and cancelling it; this means that the > application can only use cancellation if it uses the callback > mechanism and must make sure that the callback takes a lock and > then makes changes to its data structurs which prevent the ao being > cancelled. As an alternative we could invent a separate > "cancellation handle" which can be detached from the ao but which > must always be explicitly destroyed by the application.)That sounds like the sort of cleanup operation which callers would be pretty good at forgetting to do ;-) Or maybe it would be handled by generic code in the callers and it would be fine.> Suggestions for the API welcome. The most obvious approach simply > adds a new field to libxl_asyncop_how but that risks lots of > existing code failing to initialise it.Should we have supplied libxl_asyncop_how_init functions? It''s perhaps not too late to do so. The following 6 things are all internal to libxl, right?> * Keep a list of cancellation hooks in the ao. Anything which is > using this ao can add itself to that set of hooks. > > * Cancellation involves repeatedly taking the front of that list off, > and calling the hook. (After the ao has been cancelled, its > completion still needs to be awaited by the application, but it > will hopefully complete earlier and return ERROR_CANCEL.) > > * The timeout registration facility is changed to take an ao and > register a cancellation hook. It is changed to provide an rc value > to its callback, which will be FAIL or CANCEL. > > * The fork machinery is changed to take an ao, and register a > cancellation hook. A suitable-for-default-uses cancellation hook > function is provided which sends SIGKILL to the child and makes a > note that this has happened. The child death callback provides an > rc value (0 for status==0, or FAIL or CANCEL) for the convenience > of the next layer up. > > * A new version of the xswatch event registration machinery is > provided which takes an ao, registers itself as cancellation hooks, > and provides an rc value to its callbacks. This new facility could > usefully do an xs_read on a predefined path. The rc value will be > OK or CANCEL. (We need new versions of this because some xswatch > callers are part of the infrastructure or libxl application event > generation, not aos.) > > * Anything which uses the fd machinery directly needs to do > cancellation itself (or ensure that it has a timeout, an xs watch, > or a child).All sounds plausible to me.> A tricky question arises regarding cleanup: for example, if > libxl_domain_create_* were cancelled. It would end up in > domcreate_complete with rc==CANCEL. > > Should it now run the domain destruction ? How would the caller say > they wanted that cancelled, if that too was taking too long ? Perhaps > there should be a progress callback to say "we have finished > cancelling the first thing and are now cleaning up". > > Or perhaps cancelling the operation should simply skip the destruction > and return the domid to the caller. (But also, fiddly edge case: > consider what happens if a failed creation, which is already being > destroyed, is cancelled.)I think either approach could be made to work, the question is how much complexity we are willing to put into libxl vs the applications for this. On first glance it seems that by putting a small amount of complexity into the apps (you must destroy on cancel) we avoid a potentially large amount in the library. It seems to also make sense to allow the callers to manage the destruction as a separate cancellable event which they can manage independently if they wish. Probably we should ask the existing potential users (xapi & libvirt?) what they would prefer.