Dear xen-devel, There''s a race condition in xenfs (xenstore driver) that causes userspace utility xenstore-watch to crash. Normally, the userspace process gets an "OK" from xenfs and then the watch fires immediately after. Occasionally, this happens the other way around: the watch fires before the driver sends "OK", which confuses the xenstore-watch client. It seems to me that the client is within its rights to expect the "OK" first. Here''s what is happening: The userspace process xenstore-watch writes to /proc/xen/xenbus with msg_type==XS_WATCH. This is handled by xenbus_write_watch which calls register_xenbus_watch with watch_fired as a callback *before* acquiring the reply_mutex and sending the synthesised "OK" reply. This gives a fast xenstore the opportunity to cause the watch_fired to run (and briefly grab the reply_mutex for itself) before the fake "OK" message is sent. Below, I''ve included a putative patch for pre-3.3 xenfs that fixes this problem. (It looks like the patch would also apply cleanly to 3.3-onwards xenbus_dev_frontend.c, but I haven''t tried.) Any comments about whether this is a reasonable approach? A cursory glance at drivers/xen/xenbus/xenbus_dev.c suggests that it suffers from the same problem. Although I haven''t haven''t tested this, I''d expect that it requires a very similar solution. Jonathan Take the (non-reentrant) reply_mutex before calling register_xenbus_watch to prevent the watch_fired callback from writing anything until the "OK" has been sent. Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> diff -r 5ab1b4af1faf drivers/xen/xenfs/xenbus.c --- a/drivers/xen/xenfs/xenbus.c Thu Dec 03 06:00:06 2009 +0000 +++ b/drivers/xen/xenfs/xenbus.c Wed May 15 17:24:47 2013 +0100 @@ -359,6 +359,8 @@ static int xenbus_write_watch(unsigned m } token++; + mutex_lock(&u->reply_mutex); + if (msg_type == XS_WATCH) { watch = alloc_watch_adapter(path, token); if (watch == NULL) { @@ -401,12 +403,11 @@ static int xenbus_write_watch(unsigned m "OK" }; - mutex_lock(&u->reply_mutex); rc = queue_reply(&u->read_buffers, &reply, sizeof(reply)); - mutex_unlock(&u->reply_mutex); } out: + mutex_unlock(&u->reply_mutex); return rc; }
>>> On 15.05.13 at 18:51, Jonathan Davies <jonathan.davies@citrix.com> wrote: > The userspace process xenstore-watch writes to /proc/xen/xenbus with > msg_type==XS_WATCH. This is handled by xenbus_write_watch which calls > register_xenbus_watch with watch_fired as a callback *before* acquiring > the reply_mutex and sending the synthesised "OK" reply. > > This gives a fast xenstore the opportunity to cause the watch_fired to > run (and briefly grab the reply_mutex for itself) before the fake "OK" > message is sent. > > Below, I''ve included a putative patch for pre-3.3 xenfs that fixes this > problem. (It looks like the patch would also apply cleanly to > 3.3-onwards xenbus_dev_frontend.c, but I haven''t tried.) Any comments > about whether this is a reasonable approach?Yes, this looks reasonable at a first glance, pending confirmation by someone involved in the original xenbus/xenstore design that the expectation to see the watch firing only after the OK response is a valid one. There''s an implementation problem though:> --- a/drivers/xen/xenfs/xenbus.c Thu Dec 03 06:00:06 2009 +0000 > +++ b/drivers/xen/xenfs/xenbus.c Wed May 15 17:24:47 2013 +0100 > @@ -359,6 +359,8 @@ static int xenbus_write_watch(unsigned m > } > token++; > > + mutex_lock(&u->reply_mutex); > +Right above the initial patch context there is another "goto out", so ...> if (msg_type == XS_WATCH) { > watch = alloc_watch_adapter(path, token); > if (watch == NULL) { > @@ -401,12 +403,11 @@ static int xenbus_write_watch(unsigned m > "OK" > }; > > - mutex_lock(&u->reply_mutex); > rc = queue_reply(&u->read_buffers, &reply, sizeof(reply)); > - mutex_unlock(&u->reply_mutex); > } > > out: > + mutex_unlock(&u->reply_mutex);... this is not valid.> return rc; > } >Thus the unlock needs to become conditional here (either via tracking the state in a local variable, or via adding a second label, or via replacing the first goto with a straight return). I''d also recommend to shrink the protected region to the minimum possible (i.e. acquire the mutex right before the call to register_xenbus_watch(), and at the end of the "else" body). Since you then would end up with only a single error path needing to drop the lock, dropping it in that error path rather than moving it past the "out" label might be preferable. Jan
On Wed, May 15, 2013 at 05:51:06PM +0100, Jonathan Davies wrote:> Dear xen-devel, > > There''s a race condition in xenfs (xenstore driver) that causes > userspace utility xenstore-watch to crash. > > Normally, the userspace process gets an "OK" from xenfs and then the > watch fires immediately after. Occasionally, this happens the other way > around: the watch fires before the driver sends "OK", which confuses > the xenstore-watch client. It seems to me that the client is within its > rights to expect the "OK" first. > > Here''s what is happening: > > The userspace process xenstore-watch writes to /proc/xen/xenbus with > msg_type==XS_WATCH. This is handled by xenbus_write_watch which calls > register_xenbus_watch with watch_fired as a callback *before* acquiring > the reply_mutex and sending the synthesised "OK" reply. > > This gives a fast xenstore the opportunity to cause the watch_fired to > run (and briefly grab the reply_mutex for itself) before the fake "OK" > message is sent. > > Below, I''ve included a putative patch for pre-3.3 xenfs that fixes this > problem. (It looks like the patch would also apply cleanly to > 3.3-onwards xenbus_dev_frontend.c, but I haven''t tried.) Any comments > about whether this is a reasonable approach?It can''t apply cleanly as the file moved :-(> > A cursory glance at drivers/xen/xenbus/xenbus_dev.c suggests that it > suffers from the same problem. Although I haven''t haven''t tested this, > I''d expect that it requires a very similar solution. > > Jonathan > > > Take the (non-reentrant) reply_mutex before calling > register_xenbus_watch to prevent the watch_fired callback from writing > anything until the "OK" has been sent.Should that perhaps be done inside the msg_type == XS_WATCH code (with a bool that would determine whether an reply_mutex has been taken?) As in, is there no need to take this mutex if msg_type != XS_WATCH?> > Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> > > diff -r 5ab1b4af1faf drivers/xen/xenfs/xenbus.c > --- a/drivers/xen/xenfs/xenbus.c Thu Dec 03 06:00:06 2009 +0000 > +++ b/drivers/xen/xenfs/xenbus.c Wed May 15 17:24:47 2013 +0100 > @@ -359,6 +359,8 @@ static int xenbus_write_watch(unsigned m > } > token++; > > + mutex_lock(&u->reply_mutex); > +Have you tested this with the kernel compiled with DEBUG_MUTEX and DEBUG_PROVE_LOCKING to make sure there are no mutex/spinlock issues?> if (msg_type == XS_WATCH) { > watch = alloc_watch_adapter(path, token); > if (watch == NULL) { > @@ -401,12 +403,11 @@ static int xenbus_write_watch(unsigned m > "OK" > }; > > - mutex_lock(&u->reply_mutex); > rc = queue_reply(&u->read_buffers, &reply, sizeof(reply)); > - mutex_unlock(&u->reply_mutex); > } > > out: > + mutex_unlock(&u->reply_mutex); > return rc; > } > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel >
>>> On 28.05.13 at 14:55, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > On Wed, May 15, 2013 at 05:51:06PM +0100, Jonathan Davies wrote: >> Normally, the userspace process gets an "OK" from xenfs and then the >> watch fires immediately after. Occasionally, this happens the other way >> around: the watch fires before the driver sends "OK", which confuses >> the xenstore-watch client. It seems to me that the client is within its >> rights to expect the "OK" first.Now that I look at this another time, I wonder how this behavior can be guaranteed even with your patch: xenbus_write_watch() sends the reply by calling queue_reply() followed by wake_up(), so the reply getting delivered and the watch firing appear to be inherently asynchronous anyway, and your patch just reduces the race window. Am I overlooking something here?>> Take the (non-reentrant) reply_mutex before calling >> register_xenbus_watch to prevent the watch_fired callback from writing >> anything until the "OK" has been sent. > > Should that perhaps be done inside the msg_type == XS_WATCH code (with a > bool that would determine whether an reply_mutex has been taken?) > > As in, is there no need to take this mutex if msg_type != XS_WATCH?As said in an earlier reply, I also think that it would be preferable to shrink the locked region as much as possible. But there clearly is a need to also acquire the mutex for msg_type != XS_WATCH, just not before the loop in the else branch. Jan
On 28/05/13 13:55, Konrad Rzeszutek Wilk wrote:> On Wed, May 15, 2013 at 05:51:06PM +0100, Jonathan Davies wrote: >> Dear xen-devel, >> >> There''s a race condition in xenfs (xenstore driver) that causes >> userspace utility xenstore-watch to crash. >> >> Normally, the userspace process gets an "OK" from xenfs and then the >> watch fires immediately after. Occasionally, this happens the other way >> around: the watch fires before the driver sends "OK", which confuses >> the xenstore-watch client. It seems to me that the client is within its >> rights to expect the "OK" first. >> >> Here''s what is happening: >> >> The userspace process xenstore-watch writes to /proc/xen/xenbus with >> msg_type==XS_WATCH. This is handled by xenbus_write_watch which calls >> register_xenbus_watch with watch_fired as a callback *before* acquiring >> the reply_mutex and sending the synthesised "OK" reply. >> >> This gives a fast xenstore the opportunity to cause the watch_fired to >> run (and briefly grab the reply_mutex for itself) before the fake "OK" >> message is sent. >> >> Below, I''ve included a putative patch for pre-3.3 xenfs that fixes this >> problem. (It looks like the patch would also apply cleanly to >> 3.3-onwards xenbus_dev_frontend.c, but I haven''t tried.) Any comments >> about whether this is a reasonable approach? > > It can''t apply cleanly as the file moved :-(Well, it applies cleanly if you rename the affected file to drivers/xen/xenbus/xenbus_dev_frontend.c and can tolerate a 15-line offset... :-)>> A cursory glance at drivers/xen/xenbus/xenbus_dev.c suggests that it >> suffers from the same problem. Although I haven''t haven''t tested this, >> I''d expect that it requires a very similar solution. >> >> Jonathan >> >> >> Take the (non-reentrant) reply_mutex before calling >> register_xenbus_watch to prevent the watch_fired callback from writing >> anything until the "OK" has been sent. > > Should that perhaps be done inside the msg_type == XS_WATCH code (with a > bool that would determine whether an reply_mutex has been taken?) > > As in, is there no need to take this mutex if msg_type != XS_WATCH?When msg_type != XS_WATCH, we only need to take the mutex briefly when sending the "OK". There''s no issue with XS_UNWATCH because there''s never anything further to be written. v2 of the patch won''t hold the lock while doing the work for XS_UNWATCH.>> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> >> >> diff -r 5ab1b4af1faf drivers/xen/xenfs/xenbus.c >> --- a/drivers/xen/xenfs/xenbus.c Thu Dec 03 06:00:06 2009 +0000 >> +++ b/drivers/xen/xenfs/xenbus.c Wed May 15 17:24:47 2013 +0100 >> @@ -359,6 +359,8 @@ static int xenbus_write_watch(unsigned m >> } >> token++; >> >> + mutex_lock(&u->reply_mutex); >> + > > Have you tested this with the kernel compiled with DEBUG_MUTEX and DEBUG_PROVE_LOCKING > to make sure there are no mutex/spinlock issues?No. I will give that a try.>> if (msg_type == XS_WATCH) { >> watch = alloc_watch_adapter(path, token); >> if (watch == NULL) { >> @@ -401,12 +403,11 @@ static int xenbus_write_watch(unsigned m >> "OK" >> }; >> >> - mutex_lock(&u->reply_mutex); >> rc = queue_reply(&u->read_buffers, &reply, sizeof(reply)); >> - mutex_unlock(&u->reply_mutex); >> } >> >> out: >> + mutex_unlock(&u->reply_mutex); >> return rc; >> }Thanks for the feedback, Jonathan
On 31/05/13 14:45, Jan Beulich wrote:>>>> On 28.05.13 at 14:55, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: >> On Wed, May 15, 2013 at 05:51:06PM +0100, Jonathan Davies wrote: >>> Normally, the userspace process gets an "OK" from xenfs and then the >>> watch fires immediately after. Occasionally, this happens the other way >>> around: the watch fires before the driver sends "OK", which confuses >>> the xenstore-watch client. It seems to me that the client is within its >>> rights to expect the "OK" first. > > Now that I look at this another time, I wonder how this behavior > can be guaranteed even with your patch: xenbus_write_watch() > sends the reply by calling queue_reply() followed by wake_up(), > so the reply getting delivered and the watch firing appear to be > inherently asynchronous anyway, and your patch just reduces > the race window. Am I overlooking something here?There''s no call to wake_up() in xenbus_write_watch(), as far as I can see... The key thing is to guarantee that xenbus_write_watch()''s call to queue_reply() (for the "OK" message) executes before the watch_fired() callback -- registered by the call to register_xenbus_watch() -- could be called-back. Given that watch_fired() also takes the reply_mutex before its own calls to queue_reply() and wake_up(), this behaviour can be guaranteed by taking the reply_mutex before registering the callback.>>> Take the (non-reentrant) reply_mutex before calling >>> register_xenbus_watch to prevent the watch_fired callback from writing >>> anything until the "OK" has been sent. >> >> Should that perhaps be done inside the msg_type == XS_WATCH code (with a >> bool that would determine whether an reply_mutex has been taken?) >> >> As in, is there no need to take this mutex if msg_type != XS_WATCH? > > As said in an earlier reply, I also think that it would be preferable > to shrink the locked region as much as possible.v2 of the patch will have a much tighter locked region. I''ll post that once I''ve tested it. Thanks, Jonathan
>>> On 31.05.13 at 19:07, Jonathan Davies <jonathan.davies@citrix.com> wrote: > On 31/05/13 14:45, Jan Beulich wrote: >>>>> On 28.05.13 at 14:55, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: >>> On Wed, May 15, 2013 at 05:51:06PM +0100, Jonathan Davies wrote: >>>> Normally, the userspace process gets an "OK" from xenfs and then the >>>> watch fires immediately after. Occasionally, this happens the other way >>>> around: the watch fires before the driver sends "OK", which confuses >>>> the xenstore-watch client. It seems to me that the client is within its >>>> rights to expect the "OK" first. >> >> Now that I look at this another time, I wonder how this behavior >> can be guaranteed even with your patch: xenbus_write_watch() >> sends the reply by calling queue_reply() followed by wake_up(), >> so the reply getting delivered and the watch firing appear to be >> inherently asynchronous anyway, and your patch just reduces >> the race window. Am I overlooking something here? > > There''s no call to wake_up() in xenbus_write_watch(), as far as I can see...I''m not sure where you''re looking, but both xenfs/xenbus.c and the more recent xenbus/xenbus_dev_frontend.c have this mutex_lock(&u->reply_mutex); rc = queue_reply(&u->read_buffers, &reply, sizeof(reply)); wake_up(&u->read_waitq); mutex_unlock(&u->reply_mutex); near the end of xenbus_write_watch().> The key thing is to guarantee that xenbus_write_watch()''s call to > queue_reply() (for the "OK" message) executes before the watch_fired() > callback -- registered by the call to register_xenbus_watch() -- could be > called-back. Given that watch_fired() also takes the reply_mutex before its > own calls to queue_reply() and wake_up(), this behaviour can be guaranteed > by taking the reply_mutex before registering the callback.Okay, I agree after taking another closer look - the very different variable naming obfuscate this a little. And inn fact the mutex appears to get acquired too early in watch_fired - it could be confined to just the list_splice_tail() (and maybe the wake_up(), if the really requires external synchronization) invocation. Jan