Konrad Rzeszutek Wilk
2013-Nov-08 17:38 UTC
[PATCH 4/4] xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart.
The 'read_reply' works with 'process_msg' to read a reply in XenBus.
'process_msg' is running from within the 'xenbus' thread. Whenever
a message shows up in XenBus it is put on a xs_state.reply_list list
and 'read_reply' picks it up.

The problem is if the backend domain or the xenstored process is killed.
In which case 'xenbus' is still awaiting - and 'read_reply' if called -
stuck forever waiting for the reply_list to have some contents.

This is normally not a problem - as the backend domain can come back
or the xenstored process can be restarted. However if the domain
is in process of being powered off/restarted/halted - there is no
point of waiting on it coming back - as we are effectively being
terminated and should not impede the progress.

This patch solves this problem by checking the 'system_state' value
to see if we are heading towards death. We also make the wait
mechanism a bit more asynchronous.

Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/xen/xenbus/xenbus_xs.c | 24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index b6d5fff..4f22706 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
 
 	while (list_empty(&xs_state.reply_list)) {
 		spin_unlock(&xs_state.reply_lock);
-		/* XXX FIXME: Avoid synchronous wait for response here. */
-		wait_event(xs_state.reply_waitq,
-			   !list_empty(&xs_state.reply_list));
+		wait_event_timeout(xs_state.reply_waitq,
+				   !list_empty(&xs_state.reply_list),
+				   msecs_to_jiffies(500));
+
+		/*
+		 * If we are in the process of being shut-down there is
+		 * no point of trying to contact XenBus - it is either
+		 * killed (xenstored application) or the other domain
+		 * has been killed or is unreachable.
+		 */
+		switch (system_state) {
+		case SYSTEM_POWER_OFF:
+		case SYSTEM_RESTART:
+		case SYSTEM_HALT:
+			return ERR_PTR(-EIO);
+		default:
+			break;
+		}
 		spin_lock(&xs_state.reply_lock);
 	}
 
@@ -215,6 +230,9 @@ void *xenbus_dev_request_and_reply(struct xsd_sockmsg *msg)
 
 	mutex_unlock(&xs_state.request_mutex);
 
+	if (IS_ERR(ret))
+		return ret;
+
 	if ((msg->type == XS_TRANSACTION_END) ||
 	    ((req_msg.type == XS_TRANSACTION_START) &&
 	     (msg->type == XS_ERROR)))
-- 
1.8.3.1
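
[Editorial sketch, not part of the patch.] The pattern the patch introduces, a bounded sleep followed by a re-check of the shutdown state, can be modelled in user space with POSIX threads. This is only an approximation under stated assumptions: `sim_reply_queue`, `sim_system_state`, `sim_shutting_down` and `sim_read_reply` are hypothetical names invented for illustration, `pthread_cond_timedwait` stands in for the kernel's `wait_event_timeout`, and the state is read under the mutex here, whereas the patch reads `system_state` with the spinlock dropped.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for the kernel's system_state values. */
enum sim_system_state { SIM_RUNNING, SIM_HALT, SIM_POWER_OFF, SIM_RESTART };

/* Hypothetical user-space model of xs_state's reply list and wait queue. */
struct sim_reply_queue {
	pthread_mutex_t lock;
	pthread_cond_t waitq;
	int pending;                  /* replies queued, guarded by lock */
	enum sim_system_state state;  /* guarded by lock in this model */
};

static bool sim_shutting_down(enum sim_system_state s)
{
	return s == SIM_HALT || s == SIM_POWER_OFF || s == SIM_RESTART;
}

/*
 * Consume one reply, sleeping in slices of slice_ms while none is queued.
 * Returns 0 on success, or -EIO if a shutdown state is seen after a wait.
 */
static int sim_read_reply(struct sim_reply_queue *q, long slice_ms)
{
	int ret = 0;

	assert(slice_ms > 0);
	pthread_mutex_lock(&q->lock);
	while (q->pending == 0) {
		struct timespec deadline;

		/* Build an absolute deadline slice_ms from now. */
		clock_gettime(CLOCK_REALTIME, &deadline);
		deadline.tv_sec += slice_ms / 1000;
		deadline.tv_nsec += (slice_ms % 1000) * 1000000L;
		if (deadline.tv_nsec >= 1000000000L) {
			deadline.tv_sec++;
			deadline.tv_nsec -= 1000000000L;
		}

		/* A timeout, a signal and a spurious wakeup all end up here. */
		pthread_cond_timedwait(&q->waitq, &q->lock, &deadline);

		/* As in the patch: give up once shutdown has started. */
		if (sim_shutting_down(q->state)) {
			ret = -EIO;
			goto out;
		}
	}
	q->pending--;
out:
	pthread_mutex_unlock(&q->lock);
	return ret;
}
```

As in the patch, the caller still blocks; the wait is merely cut into slices so that the shutdown state is looked at every `slice_ms` milliseconds instead of never.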
David Vrabel
2013-Nov-21 17:52 UTC
Re: [PATCH 4/4] xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart.
On 08/11/13 17:38, Konrad Rzeszutek Wilk wrote:
> The 'read_reply' works with 'process_msg' to read a reply in XenBus.
> 'process_msg' is running from within the 'xenbus' thread. Whenever
> a message shows up in XenBus it is put on a xs_state.reply_list list
> and 'read_reply' picks it up.
>
> The problem is if the backend domain or the xenstored process is killed.
> In which case 'xenbus' is still awaiting - and 'read_reply' if called -
> stuck forever waiting for the reply_list to have some contents.
>
> This is normally not a problem - as the backend domain can come back
> or the xenstored process can be restarted. However if the domain
> is in process of being powered off/restarted/halted - there is no
> point of waiting on it coming back - as we are effectively being
> terminated and should not impede the progress.
>
> This patch solves this problem by checking the 'system_state' value
> to see if we are heading towards death. We also make the wait
> mechanism a bit more asynchronous.

This seems to be checking the wrong thing conceptually. We should abort
the wait if xenstored is dead not if our domain is dying.

I think you can consider xenstored as dead if:

a) it's local and we're dying.
b) it's remote and the remote domain is dead.

> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8

This bug link has no useful information in it.

> --- a/drivers/xen/xenbus/xenbus_xs.c
> +++ b/drivers/xen/xenbus/xenbus_xs.c
> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
>
>  	while (list_empty(&xs_state.reply_list)) {
>  		spin_unlock(&xs_state.reply_lock);
> -		/* XXX FIXME: Avoid synchronous wait for response here. */
> -		wait_event(xs_state.reply_waitq,
> -			   !list_empty(&xs_state.reply_list));
> +		wait_event_timeout(xs_state.reply_waitq,
> +				   !list_empty(&xs_state.reply_list),
> +				   msecs_to_jiffies(500));

This is still a synchronous wait. Is the removal of the FIXME comment
correct?

> +
> +		/*
> +		 * If we are in the process of being shut-down there is
> +		 * no point of trying to contact XenBus - it is either
> +		 * killed (xenstored application) or the other domain
> +		 * has been killed or is unreachable.

Not necessarily, xenstore could just be slow.

David
Ian Campbell
2013-Nov-22 09:30 UTC
Re: [PATCH 4/4] xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart.
graft 8 <20130528152156.GB3027@phenom.dumpdata.com>
prune 8 <20130528181149.GA27718@phenom.dumpdata.com>
thanks

On Thu, 2013-11-21 at 17:52 +0000, David Vrabel wrote:
> > Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
>
> This bug link has no useful information in it.

Looks like the intention was for it to reference this mail:
http://thread.gmane.org/gmane.comp.emulators.xen.devel/160720/focus=160828

This has the exact same contents as the control mail that created this
bug, which I dug out of the bug tracker's spool.

The problem was that in the original the commands were appended instead
of being placed at the front of the message, so they got ignored. Then,
when the commands were correctly sent, the mail in question used "!"
(meaning this mail) but didn't go to xen-devel, so it didn't actually
refer to a known thread. The correct thing to do in that resend would
have been to reference the relevant message id directly.

I think I've fixed it up with the above commands.

Ian.
Konrad Rzeszutek Wilk
2013-Nov-26 16:50 UTC
Re: [PATCH 4/4] xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart.
On Thu, Nov 21, 2013 at 05:52:28PM +0000, David Vrabel wrote:
> On 08/11/13 17:38, Konrad Rzeszutek Wilk wrote:
> > The 'read_reply' works with 'process_msg' to read a reply in XenBus.
> > 'process_msg' is running from within the 'xenbus' thread. Whenever
> > a message shows up in XenBus it is put on a xs_state.reply_list list
> > and 'read_reply' picks it up.
> >
> > The problem is if the backend domain or the xenstored process is killed.
> > In which case 'xenbus' is still awaiting - and 'read_reply' if called -
> > stuck forever waiting for the reply_list to have some contents.
> >
> > This is normally not a problem - as the backend domain can come back
> > or the xenstored process can be restarted. However if the domain
> > is in process of being powered off/restarted/halted - there is no
> > point of waiting on it coming back - as we are effectively being
> > terminated and should not impede the progress.
> >
> > This patch solves this problem by checking the 'system_state' value
> > to see if we are heading towards death. We also make the wait
> > mechanism a bit more asynchronous.
>
> This seems to be checking the wrong thing conceptually. We should abort
> the wait if xenstored is dead not if our domain is dying.
>
> I think you can consider xenstored as dead if:
>
> a) it's local and we're dying.

OK. Not sure exactly how to do that but that should be possible.

> b) it's remote and the remote domain is dead.

OK, any idea how to do that? As in check if a remote domain is dead?

> > Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
>
> This bug link has no useful information in it.
>
> > --- a/drivers/xen/xenbus/xenbus_xs.c
> > +++ b/drivers/xen/xenbus/xenbus_xs.c
> > @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
> >
> >  	while (list_empty(&xs_state.reply_list)) {
> >  		spin_unlock(&xs_state.reply_lock);
> > -		/* XXX FIXME: Avoid synchronous wait for response here. */
> > -		wait_event(xs_state.reply_waitq,
> > -			   !list_empty(&xs_state.reply_list));
> > +		wait_event_timeout(xs_state.reply_waitq,
> > +				   !list_empty(&xs_state.reply_list),
> > +				   msecs_to_jiffies(500));
>
> This is still a synchronous wait. Is the removal of the FIXME comment
> correct?

I thought that the comment was meant in terms of it blocking forever.
But perhaps that was not the intent of the comment?

> > +
> > +		/*
> > +		 * If we are in the process of being shut-down there is
> > +		 * no point of trying to contact XenBus - it is either
> > +		 * killed (xenstored application) or the other domain
> > +		 * has been killed or is unreachable.
>
> Not necessarily, xenstore could just be slow.

That is true. Your suggestion would help in evaluating when the XenBus
end point is kaput.

> David
David Vrabel
2013-Dec-02 11:41 UTC
Re: [PATCH 4/4] xen/xenbus: Avoid synchronous wait on XenBus stalling shutdown/restart.
On 26/11/13 16:50, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 21, 2013 at 05:52:28PM +0000, David Vrabel wrote:
>> On 08/11/13 17:38, Konrad Rzeszutek Wilk wrote:
>>> The 'read_reply' works with 'process_msg' to read a reply in XenBus.
>>> 'process_msg' is running from within the 'xenbus' thread. Whenever
>>> a message shows up in XenBus it is put on a xs_state.reply_list list
>>> and 'read_reply' picks it up.
>>>
>>> The problem is if the backend domain or the xenstored process is killed.
>>> In which case 'xenbus' is still awaiting - and 'read_reply' if called -
>>> stuck forever waiting for the reply_list to have some contents.
>>>
>>> This is normally not a problem - as the backend domain can come back
>>> or the xenstored process can be restarted. However if the domain
>>> is in process of being powered off/restarted/halted - there is no
>>> point of waiting on it coming back - as we are effectively being
>>> terminated and should not impede the progress.
>>>
>>> This patch solves this problem by checking the 'system_state' value
>>> to see if we are heading towards death. We also make the wait
>>> mechanism a bit more asynchronous.
>>
>> This seems to be checking the wrong thing conceptually. We should abort
>> the wait if xenstored is dead not if our domain is dying.
>>
>> I think you can consider xenstored as dead if:
>>
>> a) it's local and we're dying.
>
> OK. Not sure exactly how to do that but that should be possible.

xen_store_domain_type == XS_LOCAL and looking at system_state?

>> b) it's remote and the remote domain is dead.
>
> OK, any idea how to do that? As in check if a remote domain is dead?

Let someone who cares about xenstore domains fix this -- this is not the
most common use case.

I'd be happy to have something like:

bool xenbus_ok(void)
{
	switch (xen_store_domain_type) {
	case XS_LOCAL:
		return system_state != dying;
	case XS_PV:
	case XS_HVM:
		/* FIXME: could check remote domain is alive, but it's
		   normally dom0. */
		return true;
	// ...
	default:
		return true;
	}
}

>>> Fixes-Bug: http://bugs.xenproject.org/xen/bug/8
>>
>> This bug link has no useful information in it.

And it now does, thanks Ian.

>>> --- a/drivers/xen/xenbus/xenbus_xs.c
>>> +++ b/drivers/xen/xenbus/xenbus_xs.c
>>> @@ -148,9 +148,24 @@ static void *read_reply(enum xsd_sockmsg_type *type, unsigned int *len)
>>>
>>>  	while (list_empty(&xs_state.reply_list)) {
>>>  		spin_unlock(&xs_state.reply_lock);
>>> -		/* XXX FIXME: Avoid synchronous wait for response here. */
>>> -		wait_event(xs_state.reply_waitq,
>>> -			   !list_empty(&xs_state.reply_list));
>>> +		wait_event_timeout(xs_state.reply_waitq,
>>> +				   !list_empty(&xs_state.reply_list),
>>> +				   msecs_to_jiffies(500));
>>
>> This is still a synchronous wait. Is the removal of the FIXME comment
>> correct?
>
> I thought that the comment was meant in terms of it blocking forever.
> But perhaps that was not the intent of the comment?

Ok. I don't anticipate a fully async interface here being sensible anyway.

David
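
[Editorial sketch, not part of the thread.] The xenbus_ok() idea above leaves "system_state != dying" as pseudocode. One way it might be spelled out, assuming "dying" means the three states the original patch tests (power off, restart, halt), is shown below as a stand-alone function. The `demo_` enums and `demo_xenbus_ok` are hypothetical mirrors written for this example, not the kernel's definitions, and the two inputs are passed as parameters here rather than read from the kernel globals `xen_store_domain_type` and `system_state`.

```c
#include <stdbool.h>

/* Hypothetical mirror of the xenstore connection types named in the thread. */
enum demo_xenstore_type {
	DEMO_XS_UNKNOWN,
	DEMO_XS_PV,
	DEMO_XS_HVM,
	DEMO_XS_LOCAL
};

/* Hypothetical mirror of the system states tested by the original patch. */
enum demo_system_state {
	DEMO_SYSTEM_BOOTING,
	DEMO_SYSTEM_RUNNING,
	DEMO_SYSTEM_HALT,
	DEMO_SYSTEM_POWER_OFF,
	DEMO_SYSTEM_RESTART
};

/*
 * Returns false only when xenstored is local and this system is going
 * down; in every other case the caller should keep waiting for a reply.
 */
static bool demo_xenbus_ok(enum demo_xenstore_type type,
			   enum demo_system_state state)
{
	switch (type) {
	case DEMO_XS_LOCAL:
		switch (state) {
		case DEMO_SYSTEM_HALT:
		case DEMO_SYSTEM_POWER_OFF:
		case DEMO_SYSTEM_RESTART:
			return false;	/* local xenstored dies with us */
		default:
			return true;
		}
	case DEMO_XS_PV:
	case DEMO_XS_HVM:
		/* Remote xenstored: liveness is not checked in this sketch. */
		return true;
	default:
		return true;
	}
}
```

With a helper of this shape, the wait loop in read_reply would presumably bail out with an error only when the helper returns false, so a slow but live remote xenstore is still waited for.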