Brendan Cully
2008-Sep-05 19:16 UTC
[Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
# HG changeset patch # User Brendan Cully <brendan@cs.ubc.ca> # Date 1220642148 25200 # Node ID 8b8456290da7eeb8b40d3d42b6e9e4adcd2f7077 # Parent 6e53036deb06fdbaa55489610ea8fc9c9b67ca64 xc_save: ignore the first suspend event channel notification I''ve noticed that the suspend event channel becomes pending as soon as it is bound. I''m not sure why or whether this is intentional, but it means that the suspend function will return before the domain has completed suspending unless the first notification is cleared. Without this patch, xc_domain_save may find that the guest has not suspended and sleep in 10ms chunks until it does. Typically this is several milliseconds of wasted time. diff --git a/tools/xcutils/xc_save.c b/tools/xcutils/xc_save.c --- a/tools/xcutils/xc_save.c +++ b/tools/xcutils/xc_save.c @@ -57,6 +57,25 @@ return 0; } +static int await_suspend(struct suspendinfo *si) +{ + int rc; + + do { + rc = xc_evtchn_pending(si->xce); + if (rc < 0) { + warnx("error polling suspend notification channel: %d", rc); + return -1; + } + } while (rc != si->suspend_evtchn); + + /* harmless for one-off suspend */ + if (xc_evtchn_unmask(si->xce, si->suspend_evtchn) < 0) + warnx("failed to unmask suspend notification channel: %d", rc); + + return 0; +} + static int suspend_evtchn_init(int xc, int domid) { struct xs_handle *xs; @@ -104,7 +123,8 @@ goto cleanup; } - return 0; + /* event channel is pending immediately after binding? */ + await_suspend(&si); cleanup: suspend_evtchn_release(xc, domid); @@ -125,17 +145,10 @@ return 0; } - do { - rc = xc_evtchn_pending(si.xce); - if (rc < 0) { - warnx("error polling suspend notification channel: %d", rc); + if (await_suspend(&si) < 0) { + warnx("suspend failed"); return 0; - } - } while (rc != si.suspend_evtchn); - - /* harmless for one-off suspend */ - if (xc_evtchn_unmask(si.xce, si.suspend_evtchn) < 0) - warnx("failed to unmask suspend notification channel: %d", rc); + } /* notify xend that it can do device migration */ printf("suspended\n"); _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Sep-06 08:19 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On 5/9/08 20:16, "Brendan Cully" <brendan@cs.ubc.ca> wrote:> I''ve noticed that the suspend event channel becomes pending as soon as > it is bound. I''m not sure why or whether this is intentional, but it > means that the suspend function will return before the domain has > completed suspending unless the first notification is cleared. Without > this patch, xc_domain_save may find that the guest has not suspended > and sleep in 10ms chunks until it does. Typically this is several > milliseconds of wasted time.This patch looks pretty dodgy to me. First of all, you removed the non-error-case return statement from suspend_evtchn_init(). Hence it appears you release the local port before returning, and you return -1 (which you''d never notice since the caller erroneously does not check the return code - please fix this). Hence I think actually you are ending up using the old slow shutdown method since suspend_evtchn will be -1 and hence you''ll end up in compat_suspend(). That should result in a significant slowdown rather than speedup, so were your measurements actually taken with this exact patch? Even with this fixed, I don''t think that ignoring an event wakeup is a great idea. You can easily make the not-suspended-yet retry loop always wait on the evtchn rather than sleeping 10ms. By which I mean -- there is really no reason for that to be a pure poll loop when you have an evtchn to sleep on. To do this, call (*suspend) from within the retry loop: the evtchn case can do what it always does (basically sleep on the evtchn device until its evtchn of interest appears); the compat case should change behaviour after its first invocation so that it sleeps 10ms (stash a static variable in the function or in suspendinfo for this purpose, to remember whether it was already invoked). -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Brendan Cully
2008-Sep-07 02:28 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On Saturday, 06 September 2008 at 09:19, Keir Fraser wrote:> On 5/9/08 20:16, "Brendan Cully" <brendan@cs.ubc.ca> wrote: > > > I''ve noticed that the suspend event channel becomes pending as soon as > > it is bound. I''m not sure why or whether this is intentional, but it > > means that the suspend function will return before the domain has > > completed suspending unless the first notification is cleared. Without > > this patch, xc_domain_save may find that the guest has not suspended > > and sleep in 10ms chunks until it does. Typically this is several > > milliseconds of wasted time. > > This patch looks pretty dodgy to me. > > First of all, you removed the non-error-case return statement from > suspend_evtchn_init(). Hence it appears you release the local port before > returning, and you return -1 (which you''d never notice since the caller > erroneously does not check the return code - please fix this).Very sorry. I''ve attached the simple fix below.> Hence I think actually you are ending up using the old slow shutdown method > since suspend_evtchn will be -1 and hence you''ll end up in compat_suspend(). > That should result in a significant slowdown rather than speedup, so were > your measurements actually taken with this exact patch?No, I measured in remus and misbackported, I''m afraid. But I''ve timed this version -- it tends to be 2-5ms to suspend and notify xend.> Even with this fixed, I don''t think that ignoring an event wakeup is a great > idea. You can easily make the not-suspended-yet retry loop always wait on > the evtchn rather than sleeping 10ms. By which I mean -- there is really no > reason for that to be a pure poll loop when you have an evtchn to sleep on.This code is only ignoring the event at setup time, before the suspend function has actually notified the domain to suspend itself. So I think it should be reasonably safe. Presumably the only reason the event channel has a pending notification at this point is because it is forcibly set in evtchn_bind_interdomain? The only other case I can think of is that somebody else has requested that the domain suspend itself, maybe by writing to the watch. That seems like a higher-level locking problem to me.> To do this, call (*suspend) from within the retry loop: the evtchn case can > do what it always does (basically sleep on the evtchn device until its > evtchn of interest appears); the compat case should change behaviour after > its first invocation so that it sleeps 10ms (stash a static variable in the > function or in suspendinfo for this purpose, to remember whether it was > already invoked).I could certainly code this up as well (it''d need a static flag in evtchn_suspend as well to avoid resignalling the domain, I think). But generally without clearing the event channel before signalling the guest, the first suspend attempt will always return early. I''m not really clear on the scenario that results in the domain not being suspended after *suspend has succesfully returned. Could you clarify? Thanks, Brendan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Sep-07 07:06 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On 7/9/08 03:28, "Brendan Cully" <brendan@cs.ubc.ca> wrote:>> To do this, call (*suspend) from within the retry loop: the evtchn case can >> do what it always does (basically sleep on the evtchn device until its >> evtchn of interest appears); the compat case should change behaviour after >> its first invocation so that it sleeps 10ms (stash a static variable in the >> function or in suspendinfo for this purpose, to remember whether it was >> already invoked). > > I could certainly code this up as well (it''d need a static flag in > evtchn_suspend as well to avoid resignalling the domain, I think). But > generally without clearing the event channel before signalling the > guest, the first suspend attempt will always return early. I''m not > really clear on the scenario that results in the domain not being > suspended after *suspend has succesfully returned. Could you clarify?I don''t like the forced initial consumption of a single evtchn signal. I''d like rid of that and change the retry loop to call (*suspend) instead. Add an extra boolean to the static suspend_info structure to detect first invocation of (*suspend) versus repeated invocations. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Sep-07 07:11 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On 7/9/08 08:06, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:>> I could certainly code this up as well (it''d need a static flag in >> evtchn_suspend as well to avoid resignalling the domain, I think). But >> generally without clearing the event channel before signalling the >> guest, the first suspend attempt will always return early. I''m not >> really clear on the scenario that results in the domain not being >> suspended after *suspend has succesfully returned. Could you clarify? > > I don''t like the forced initial consumption of a single evtchn signal. I''d > like rid of that and change the retry loop to call (*suspend) instead. Add > an extra boolean to the static suspend_info structure to detect first > invocation of (*suspend) versus repeated invocations.Actually I can knock these changes together myself, so I''ll do that, taking whatever bits of your patch are appropriate. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Brendan Cully
2008-Sep-07 07:15 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On Sunday, 07 September 2008 at 08:11, Keir Fraser wrote:> On 7/9/08 08:06, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote: > > >> I could certainly code this up as well (it''d need a static flag in > >> evtchn_suspend as well to avoid resignalling the domain, I think). But > >> generally without clearing the event channel before signalling the > >> guest, the first suspend attempt will always return early. I''m not > >> really clear on the scenario that results in the domain not being > >> suspended after *suspend has succesfully returned. Could you clarify? > > > > I don''t like the forced initial consumption of a single evtchn signal. I''d > > like rid of that and change the retry loop to call (*suspend) instead. Add > > an extra boolean to the static suspend_info structure to detect first > > invocation of (*suspend) versus repeated invocations. > > Actually I can knock these changes together myself, so I''ll do that, taking > whatever bits of your patch are appropriate.Ok, sure. I still think it''s a bit unfortunate that the suspend function will return without waiting for the domain to suspend, and not just because of the extra round trip. At the end of the suspend function, xend is under the impression that the guest has suspended and it can perform device migration stage 2. I wonder if it''s dangerous to do this before confirming that the guest has actually suspended. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Sep-07 07:43 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On 7/9/08 08:15, "Brendan Cully" <brendan@cs.ubc.ca> wrote:>>> I don''t like the forced initial consumption of a single evtchn signal. I''d >>> like rid of that and change the retry loop to call (*suspend) instead. Add >>> an extra boolean to the static suspend_info structure to detect first >>> invocation of (*suspend) versus repeated invocations. >> >> Actually I can knock these changes together myself, so I''ll do that, taking >> whatever bits of your patch are appropriate. > > Ok, sure. I still think it''s a bit unfortunate that the suspend > function will return without waiting for the domain to suspend, and > not just because of the extra round trip. At the end of the suspend > function, xend is under the impression that the guest has suspended > and it can perform device migration stage 2. I wonder if it''s > dangerous to do this before confirming that the guest has actually > suspended.Ah, I hadn''t spotted that bit. I''ll think about it - maybe your approach is simplest after all. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Sep-08 10:18 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On 7/9/08 03:28, "Brendan Cully" <brendan@cs.ubc.ca> wrote:> I could certainly code this up as well (it''d need a static flag in > evtchn_suspend as well to avoid resignalling the domain, I think). But > generally without clearing the event channel before signalling the > guest, the first suspend attempt will always return early. I''m not > really clear on the scenario that results in the domain not being > suspended after *suspend has succesfully returned. Could you clarify?I checked in your patch as is. One question: do we need the wait-one-second-for-shutdown loop in suspend_and_state() at all? My reading of (*suspend)() is that it should be sure the domain is suspended when it returns, and hence should suspend_and_state() not simply raise an error if it finds that domaininfo does not indicate the guest is shut down? The retry loop may simply be allowing bugs of the sort you''ve just fixed to linger. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Brendan Cully
2008-Sep-08 22:36 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On Monday, 08 September 2008 at 11:18, Keir Fraser wrote:> On 7/9/08 03:28, "Brendan Cully" <brendan@cs.ubc.ca> wrote: > > > I could certainly code this up as well (it''d need a static flag in > > evtchn_suspend as well to avoid resignalling the domain, I think). But > > generally without clearing the event channel before signalling the > > guest, the first suspend attempt will always return early. I''m not > > really clear on the scenario that results in the domain not being > > suspended after *suspend has succesfully returned. Could you clarify? > > I checked in your patch as is. One question: do we need the > wait-one-second-for-shutdown loop in suspend_and_state() at all? My reading > of (*suspend)() is that it should be sure the domain is suspended when it > returns, and hence should suspend_and_state() not simply raise an error if > it finds that domaininfo does not indicate the guest is shut down? The retry > loop may simply be allowing bugs of the sort you''ve just fixed to linger.I agree that that retry loop is a bit dubious. It appears to come from changeset 2147:949f21fc9e77 (Fix migrate to cope with domains that are paused.) This was long before device migration appeared in xend (changeset 9657:1fe63743a147), when the only thing that mattered was that the domain be suspended before the final round. Removing the poll in suspend_and_state undoes 2147. If we want to keep that logic, we could probably just hoist it up into *suspend. There''s still no protection that I know of between concurrent invocation of xc_save on the same domain, but that''s a whole other kettle of fish :) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2008-Sep-09 07:18 UTC
Re: [Xen-devel] [PATCH] xc_save: ignore the first suspend event channel notification
On 8/9/08 23:36, "Brendan Cully" <brendan@cs.ubc.ca> wrote:>> I checked in your patch as is. One question: do we need the >> wait-one-second-for-shutdown loop in suspend_and_state() at all? My reading >> of (*suspend)() is that it should be sure the domain is suspended when it >> returns, and hence should suspend_and_state() not simply raise an error if >> it finds that domaininfo does not indicate the guest is shut down? The retry >> loop may simply be allowing bugs of the sort you''ve just fixed to linger. > > I agree that that retry loop is a bit dubious. It appears to come from > changeset 2147:949f21fc9e77 (Fix migrate to cope with domains that are > paused.) This was long before device migration appeared in xend > (changeset 9657:1fe63743a147), when the only thing that mattered was > that the domain be suspended before the final round. > > Removing the poll in suspend_and_state undoes 2147. If we want to keep > that logic, we could probably just hoist it up into *suspend.If the domain is paused then we''re boned, since the evtchn will never fire / we won''t get shutdown acknowledgement from xend. The subsequent checks in suspend_and_state() are too late. So, if you revert 2147 does everything still work okay? If so we should do that. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel