Two further fixes for bugs in the FIFO-based event channel
implementation.

1. Correctly set READY bits to avoid races with guests emptying
   queues.

2. Handle problems with relinking events that were the tails of now
   empty queues. This fixes more cases than a previously posted patch.

I have now backported the FIFO-based event channel patches (including
the above fixes) to Xen 4.3 and Linux 3.10 and run them through some of
XenServer's system tests. Approximately 40-50 machine hours of testing
of dom0 have been done with a variety of guests and workloads. No
event channel failures were found.

Changes in v5:
- Only set READY bits for new heads.
- Rework old tail bug fix to cover all cases.

Changes in v4:
- const struct domain *
- Clear BUSY with existing cmpxchg() where possible.
- Fix BUSY bit debug output.

Changes in v3:
- Use a new BUSY bit to block guests from clearing UNMASKED; this is
  lower overhead than the previous solution (which required a
  hypercall).
- Fix another problem with moving events between queues.
- Add evtchn->last_vcpu_id and evtchn->last_priority instead of
  evtchn->q. This keeps the structure at 32 bytes long.

Changes in v2:
- Add MAINTAINERS patch
- Remove some unnecessary temporary pending state clears
- Add fix for DoS

David
From: David Vrabel <david.vrabel@citrix.com>

Setting a queue's READY bit for every event added to the queue
introduces a race. If an event is added to the tail of a queue, the
guest may consume the newly added event and leave an empty queue
before the READY is set. The guest may then see a stale HEAD value
and, if the event at the stale head became linked onto a different
queue, the guest would consume events from the wrong queue
(corrupting it).

As noted in section 4.1.2 of the design document, only set READY if a
new HEAD is set. This ensures that if the guest sees a READY bit set,
the corresponding HEAD is valid.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/event_fifo.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c
index 9106c55..6048784 100644
--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -161,8 +161,9 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn)
 
     spin_unlock_irqrestore(&q->lock, flags);
 
-    if ( !test_and_set_bit(q->priority,
-                           &v->evtchn_fifo->control_block->ready) )
+    if ( !linked
+         && !test_and_set_bit(q->priority,
+                              &v->evtchn_fifo->control_block->ready) )
        vcpu_mark_events_pending(v);
 }
-- 
1.7.2.5
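To see why READY must imply a freshly written HEAD, consider the
guest's consume path. The following is a minimal stand-alone sketch,
not Xen or Linux source; the names, the LINK mask, and the
single-queue layout are assumptions made for illustration. The guest
dereferences HEAD purely on the strength of the READY bit, so READY
set while HEAD is stale sends it walking another queue's links.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define LINK_MASK 0x1ffffu       /* assumed: low bits of a word hold LINK */

struct queue_head {
    _Atomic uint32_t head;       /* guest-visible HEAD of one queue */
};

/*
 * Guest consume step: called only after the guest saw this queue's
 * READY bit.  It trusts that HEAD was written before READY was set;
 * if READY were set for an append to a non-empty queue, HEAD could
 * be stale and this would drain events from the wrong queue.
 */
static uint32_t consume_one(struct queue_head *q, _Atomic uint32_t *words)
{
    uint32_t port = atomic_load(&q->head);
    uint32_t word = atomic_load(&words[port]);

    atomic_store(&q->head, word & LINK_MASK);  /* advance HEAD to LINK */
    return port;
}

int main(void)
{
    _Atomic uint32_t words[8] = {0};
    struct queue_head q = { .head = 3 };

    atomic_store(&words[3], 5);                /* event 3 links to event 5 */
    printf("consumed %u\n", (unsigned)consume_one(&q, words));
    return 0;
}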
David Vrabel
2013-Nov-19 18:17 UTC
[PATCH 2/2] evtchn/fifo: don't corrupt queues if an old tail is linked
From: David Vrabel <david.vrabel@citrix.com>

An event may still be the tail of a queue even if the queue is now
empty (an 'old tail' event). There is logic to handle the case when
this old tail event needs to be added to the now empty queue (by
checking for q->tail == port).

However, this does not cover all cases.

1. An old tail may be re-added simultaneously with another event.
   LINKED is set on the old tail, and the other CPU may misinterpret
   this as the old tail still being valid and set LINK instead of
   HEAD. All events on this queue will then be lost.

2. If the old tail event on queue A is moved to a different queue B
   (by changing its VCPU or priority), the event may then be linked
   onto queue B. When another event is linked onto queue A it will
   check the old tail, see that it is linked (but on queue B) and
   overwrite the LINK field, corrupting both queues.

When an event is linked, save the vcpu id and priority of the queue it
is being linked onto. Use this when linking an event to check if it is
an unlinked old tail event. If it is an old tail event, the old queue
is empty and old_q->tail is invalidated to ensure adding another event
to old_q will update HEAD. The tail is invalidated by setting it to 0
since event 0 is never linked.

The old_q->lock is held while setting LINKED to avoid the race with
the test of LINKED in evtchn_fifo_set_link().

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
---
 xen/common/event_fifo.c |   46 +++++++++++++++++++++++++++++++++++-----------
 xen/include/xen/sched.h |    2 ++
 2 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c
index 6048784..7a6e360 100644
--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -103,7 +103,6 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn)
     struct domain *d = v->domain;
     unsigned int port;
     event_word_t *word;
-    struct evtchn_fifo_queue *q;
     unsigned long flags;
     bool_t was_pending;
 
@@ -120,25 +119,50 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn)
         return;
     }
 
-    /*
-     * No locking around getting the queue. This may race with
-     * changing the priority but we are allowed to signal the event
-     * once on the old priority.
-     */
-    q = &v->evtchn_fifo->queue[evtchn->priority];
-
     was_pending = test_and_set_bit(EVTCHN_FIFO_PENDING, word);
 
     /*
      * Link the event if it unmasked and not already linked.
      */
     if ( !test_bit(EVTCHN_FIFO_MASKED, word)
-         && !test_and_set_bit(EVTCHN_FIFO_LINKED, word) )
+         && !test_bit(EVTCHN_FIFO_LINKED, word) )
     {
+        struct vcpu *old_v;
+        struct evtchn_fifo_queue *q, *old_q;
         event_word_t *tail_word;
         bool_t linked = 0;
 
-        spin_lock_irqsave(&q->lock, flags);
+        /*
+         * No locking around getting the queue. This may race with
+         * changing the priority but we are allowed to signal the
+         * event once on the old priority.
+         */
+        q = &v->evtchn_fifo->queue[evtchn->priority];
+
+        old_v = d->vcpu[evtchn->last_vcpu_id];
+        old_q = &old_v->evtchn_fifo->queue[evtchn->last_priority];
+
+        spin_lock_irqsave(&old_q->lock, flags);
+
+        /*
+         * If this event was a tail, the old queue is now empty and
+         * its tail must be invalidated to prevent adding an event to
+         * the old queue from corrupting the new queue.
+         */
+        if ( old_q->tail == port )
+            old_q->tail = 0;
+
+        set_bit(EVTCHN_FIFO_LINKED, word);
+
+        /* Moved to a different queue? */
+        if ( old_q != q)
+        {
+
+            evtchn->last_vcpu_id = evtchn->notify_vcpu_id;
+            evtchn->last_priority = evtchn->priority;
+
+            spin_unlock_irqrestore(&old_q->lock, flags);
+            spin_lock_irqsave(&q->lock, flags);
+        }
 
         /*
          * Atomically link the tail to port iff the tail is linked.
@@ -150,7 +174,7 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn)
          * If the queue is empty (i.e., we haven't linked to the new
          * event), head must be updated.
          */
-        if ( port != q->tail )
+        if ( q->tail )
         {
             tail_word = evtchn_fifo_word_from_port(d, q->tail);
             linked = evtchn_fifo_set_link(d, tail_word, port);
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index cbdf377..5ab92dd 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -98,6 +98,8 @@ struct evtchn
     } u;
     u8 priority;
     u8 pending:1;
+    u16 last_vcpu_id;
+    u8 last_priority;
 #ifdef FLASK_ENABLE
     void *ssid;
 #endif
-- 
1.7.2.5
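The effect of a stale tail is easy to reproduce in a toy model. The
sketch below is illustrative only, not Xen source: it keeps HEAD
hypervisor-side and uses a bare array for the LINK fields, which are
simplifications. It shows why "q->tail = 0" is a safe invalidation:
with the stale tail left in place the new event is linked behind an
already-consumed event and HEAD is never rewritten, so the event is
lost; with the tail zeroed the append takes the HEAD path.

#include <stdint.h>
#include <stdio.h>

static uint32_t link_of[16];   /* stand-in for the LINK fields of event words */

struct queue {
    uint32_t head;
    uint32_t tail;             /* 0 means "invalid": event 0 is never linked */
};

/* Hypervisor-side append using the patched check "if ( q->tail )". */
static void publish(struct queue *q, uint32_t port)
{
    if ( q->tail )
        link_of[q->tail] = port;   /* link behind the recorded tail */
    else
        q->head = port;            /* empty queue: set a new HEAD */
    q->tail = port;
}

int main(void)
{
    /* Guest drained the queue, but event 4 is still recorded as tail. */
    struct queue q = { .head = 0, .tail = 4 };

    publish(&q, 7);   /* stale tail: 7 hangs off consumed event 4, HEAD unset */
    printf("stale:       head=%u link_of[4]=%u\n",
           (unsigned)q.head, (unsigned)link_of[4]);

    q = (struct queue){ .head = 0, .tail = 0 };  /* old_q->tail invalidated */
    publish(&q, 7);   /* takes the HEAD path: event 7 is delivered */
    printf("invalidated: head=%u\n", (unsigned)q.head);
    return 0;
}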
David Vrabel
2013-Nov-20 17:21 UTC
Re: [PATCH 2/2] evtchn/fifo: don't corrupt queues if an old tail is linked
On 19/11/13 18:17, David Vrabel wrote:
> From: David Vrabel <david.vrabel@citrix.com>
>
> An event may still be the tail of a queue even if the queue is now
> empty (an 'old tail' event). There is logic to handle the case when
> this old tail event needs to be added to the now empty queue (by
> checking for q->tail == port).
>
> However, this does not cover all cases.
>
> 1. An old tail may be re-added simultaneously with another event.
>    LINKED is set on the old tail, and the other CPU may misinterpret
>    this as the old tail still being valid and set LINK instead of
>    HEAD. All events on this queue will then be lost.
>
> 2. If the old tail event on queue A is moved to a different queue B
>    (by changing its VCPU or priority), the event may then be linked
>    onto queue B. When another event is linked onto queue A it will
>    check the old tail, see that it is linked (but on queue B) and
>    overwrite the LINK field, corrupting both queues.
>
> When an event is linked, save the vcpu id and priority of the queue
> it is being linked onto. Use this when linking an event to check if
> it is an unlinked old tail event. If it is an old tail event, the
> old queue is empty and old_q->tail is invalidated to ensure adding
> another event to old_q will update HEAD. The tail is invalidated by
> setting it to 0 since event 0 is never linked.
>
> The old_q->lock is held while setting LINKED to avoid the race with
> the test of LINKED in evtchn_fifo_set_link().

Whilst this is considerably more reliable than before, it isn't
completely safe.

evtchn_fifo_set_pending() relies on being serialized for each event
channel. The serialization is required to protect evtchn->last_* and
the split test_bit(LINKED) and set_bit(LINKED).

In most cases the serialization is done by a suitable lock in
event_channel.c: interdomain events are serialized with the remote
domain's event lock, virqs by the local virq lock. However, if
evtchn_fifo_set_pending() is called from evtchn_fifo_unmask() or
evtchn_fifo_expand_array() then only the local event lock is held,
which is a different lock from the one used for interdomain and virq
events.

One race is:

CPU A                                    CPU B

EVTCHNOP_send()
  evtchn_fifo_set_pending()
                                         evtchn->last_vcpu = ...
  old_v = d->vcpu[evtchn->last_vcpu_id]
  old_q = old_v->queue[evtchn->last_pri]
                                         evtchn->last_pri = ...
  lock(old_q)  // wrong q
  set_bit(LINKED)
  q->tail = port
  ...
Guest: clear_bit(LINKED)
                                         set_bit(LINKED)
                                         // q->tail not invalidated
  unlock(old_q)
                                         lock(q)
                                         link port to itself
                                         unlock(q)

Linking an event to itself will lose any other events that were in
the queue (they're LINKED but not reachable by the guest).

Another race is:

CPU A                                    CPU B

EVTCHNOP_send()
  evtchn_fifo_set_pending()
                                         clear_bit(MASKED)
  set_bit(PENDING)
                                         test_bit(PENDING)
  test_bit(LINKED)
                                         EVTCHNOP_unmask()
                                           evtchn_fifo_set_pending()
                                             test_bit(LINKED)
  lock(q->lock)
  set_bit(LINKED)
  link to q->tail
  q->tail = port
  unlock(q->lock)
                                         lock(q->lock)
                                         q->tail = 0 because q->tail == port
                                         q->head = port
                                         unlock(q->lock)

The double link of the event loses any other events preceding it in
the queue and those events are lost (they have LINKED set but are no
longer reachable by the guest).

There are two ways to fix this:

1. Introduce a per-event lock and use this to serialize
   evtchn_fifo_set_pending(). After 4.4, this per-event lock can be
   used instead of the domain's event lock and virq lock when raising
   events, hopefully reducing lock contention and improving event
   channel scalability.

   This would halve the number of struct evtchn per page as the lock
   takes it over 32 bytes in size.

2. Check for the old_q changing after locking old_q->lock and use
   test_and_set_bit(LINKED) to bail early if another CPU linked it
   (see below patch).

Any opinions on either of these solutions?

--- a/xen/common/event_fifo.c  Tue Nov 19 11:06:54 2013 +0000
+++ b/xen/common/event_fifo.c  Wed Nov 20 16:41:32 2013 +0000
@@ -34,6 +34,30 @@ static inline event_word_t *evtchn_fifo_
     return d->evtchn_fifo->event_array[p] + w;
 }
 
+static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
+                                                struct evtchn *evtchn,
+                                                unsigned long *flags)
+{
+    struct vcpu *v;
+    struct evtchn_fifo_queue *q, *old_q;
+
+    for (;;)
+    {
+        v = d->vcpu[evtchn->last_vcpu_id];
+        old_q = &v->evtchn_fifo->queue[evtchn->last_priority];
+
+        spin_lock_irqsave(&old_q->lock, *flags);
+
+        v = d->vcpu[evtchn->last_vcpu_id];
+        q = &v->evtchn_fifo->queue[evtchn->last_priority];
+
+        if ( old_q == q )
+            return old_q;
+
+        spin_unlock_irqrestore(&old_q->lock, *flags);
+    }
+}
+
 static int try_set_link(event_word_t *word, event_word_t *w, uint32_t link)
 {
     event_word_t new, old;
@@ -127,7 +151,6 @@ static void evtchn_fifo_set_pending(stru
     if ( !test_bit(EVTCHN_FIFO_MASKED, word)
          && !test_bit(EVTCHN_FIFO_LINKED, word) )
     {
-        struct vcpu *old_v;
         struct evtchn_fifo_queue *q, *old_q;
         event_word_t *tail_word;
         bool_t linked = 0;
@@ -139,10 +162,13 @@ static void evtchn_fifo_set_pending(stru
          */
         q = &v->evtchn_fifo->queue[evtchn->priority];
 
-        old_v = d->vcpu[evtchn->last_vcpu_id];
-        old_q = &old_v->evtchn_fifo->queue[evtchn->last_priority];
+        old_q = lock_old_queue(d, evtchn, &flags);
 
-        spin_lock_irqsave(&old_q->lock, flags);
+        if ( test_and_set_bit(EVTCHN_FIFO_LINKED, word) )
+        {
+            spin_unlock_irqrestore(&old_q->lock, flags);
+            goto done;
+        }
 
         /*
          * If this event was a tail, the old queue is now empty and
@@ -152,12 +178,9 @@ static void evtchn_fifo_set_pending(stru
         if ( old_q->tail == port )
             old_q->tail = 0;
 
-        set_bit(EVTCHN_FIFO_LINKED, word);
-
         /* Moved to a different queue? */
-        if ( old_q != q)
+        if ( old_q != q )
         {
-
             evtchn->last_vcpu_id = evtchn->notify_vcpu_id;
             evtchn->last_priority = evtchn->priority;
 
@@ -191,7 +214,7 @@ static void evtchn_fifo_set_pending(stru
              &v->evtchn_fifo->control_block->ready) )
         vcpu_mark_events_pending(v);
     }
-
+done:
     if ( !was_pending )
         evtchn_check_pollers(d, port);
 }
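Option 2's lock_old_queue() is an instance of a general
lock-then-revalidate pattern: read the mutable routing field, lock the
object it names, then re-read and retry if the field moved underneath
you. A stand-alone sketch of the pattern using pthreads; all names
here are invented for illustration and none of this is Xen source:

#include <pthread.h>
#include <stdatomic.h>

#define NQUEUES 4

static pthread_mutex_t queue_lock[NQUEUES] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};
static _Atomic int current_queue;   /* routing field: which queue owns the event */

/* Lock the queue the event currently belongs to; returns its index. */
static int lock_current_queue(void)
{
    for (;;)
    {
        int q = atomic_load(&current_queue);

        pthread_mutex_lock(&queue_lock[q]);

        /* Revalidate: the event may have been migrated while we were
         * waiting on what is now the wrong lock. */
        if ( q == atomic_load(&current_queue) )
            return q;

        pthread_mutex_unlock(&queue_lock[q]);
    }
}

/* The pattern is sound only if every migration happens under the
 * current queue's lock, so a successful revalidation pins the value. */
static void migrate(int new_q)
{
    int q = lock_current_queue();

    atomic_store(&current_queue, new_q);
    pthread_mutex_unlock(&queue_lock[q]);
}

Note that nothing intrinsically bounds the retry loop: if the routing
field keeps changing between the load and the lock acquisition, the
caller can be made to spin, which is exactly the concern raised in the
reply below.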
Jan Beulich
2013-Nov-22 12:02 UTC
Re: [PATCH 2/2] evtchn/fifo: don't corrupt queues if an old tail is linked
>>> On 20.11.13 at 18:21, David Vrabel <david.vrabel@citrix.com> wrote:
> 2. Check for the old_q changing after locking old_q->lock and use
> test_and_set_bit(LINKED) to bail early if another CPU linked it (see
> below patch).
>
> Any opinions on either of these solutions?

I'd favor 2, but ...

> --- a/xen/common/event_fifo.c  Tue Nov 19 11:06:54 2013 +0000
> +++ b/xen/common/event_fifo.c  Wed Nov 20 16:41:32 2013 +0000
> @@ -34,6 +34,30 @@ static inline event_word_t *evtchn_fifo_
>      return d->evtchn_fifo->event_array[p] + w;
>  }
>
> +static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
> +                                                struct evtchn *evtchn,
> +                                                unsigned long *flags)
> +{
> +    struct vcpu *v;
> +    struct evtchn_fifo_queue *q, *old_q;
> +
> +    for (;;)
> +    {
> +        v = d->vcpu[evtchn->last_vcpu_id];
> +        old_q = &v->evtchn_fifo->queue[evtchn->last_priority];
> +
> +        spin_lock_irqsave(&old_q->lock, *flags);
> +
> +        v = d->vcpu[evtchn->last_vcpu_id];
> +        q = &v->evtchn_fifo->queue[evtchn->last_priority];
> +
> +        if ( old_q == q )
> +            return old_q;
> +
> +        spin_unlock_irqrestore(&old_q->lock, *flags);
> +    }

... is there a guaranteed upper bound to this loop?

> +}
> +
>  static int try_set_link(event_word_t *word, event_word_t *w, uint32_t link)
>  {
>      event_word_t new, old;
> @@ -127,7 +151,6 @@ static void evtchn_fifo_set_pending(stru
>      if ( !test_bit(EVTCHN_FIFO_MASKED, word)
>           && !test_bit(EVTCHN_FIFO_LINKED, word) )
>      {
> -        struct vcpu *old_v;
>          struct evtchn_fifo_queue *q, *old_q;
>          event_word_t *tail_word;
>          bool_t linked = 0;
> @@ -139,10 +162,13 @@ static void evtchn_fifo_set_pending(stru
>           */
>          q = &v->evtchn_fifo->queue[evtchn->priority];
>
> -        old_v = d->vcpu[evtchn->last_vcpu_id];
> -        old_q = &old_v->evtchn_fifo->queue[evtchn->last_priority];
> +        old_q = lock_old_queue(d, evtchn, &flags);
>
> -        spin_lock_irqsave(&old_q->lock, flags);
> +        if ( test_and_set_bit(EVTCHN_FIFO_LINKED, word) )
> +        {
> +            spin_unlock_irqrestore(&old_q->lock, flags);
> +            goto done;
> +        }
>
>          /*
>           * If this event was a tail, the old queue is now empty and
> @@ -152,12 +178,9 @@ static void evtchn_fifo_set_pending(stru
>          if ( old_q->tail == port )
>              old_q->tail = 0;
>
> -        set_bit(EVTCHN_FIFO_LINKED, word);
> -
>          /* Moved to a different queue? */
> -        if ( old_q != q)
> +        if ( old_q != q )
>          {
> -
>              evtchn->last_vcpu_id = evtchn->notify_vcpu_id;
>              evtchn->last_priority = evtchn->priority;
>
> @@ -191,7 +214,7 @@ static void evtchn_fifo_set_pending(stru
>               &v->evtchn_fifo->control_block->ready) )
>          vcpu_mark_events_pending(v);
>      }
> -
> +done:

Labels indented by at least one space please.

>      if ( !was_pending )
>          evtchn_check_pollers(d, port);
>  }

Apart from that - what does this mean for the 2/2 patch you reply
to here? Apply it or wait (I assume the latter)? If wait, is 1/2 still
fine to apply?

Jan
David Vrabel
2013-Nov-22 18:23 UTC
Re: [PATCH 2/2] evtchn/fifo: don't corrupt queues if an old tail is linked
On 22/11/13 12:02, Jan Beulich wrote:
>>>> On 20.11.13 at 18:21, David Vrabel <david.vrabel@citrix.com> wrote:
>> 2. Check for the old_q changing after locking old_q->lock and use
>> test_and_set_bit(LINKED) to bail early if another CPU linked it (see
>> below patch).
>>
>> Any opinions on either of these solutions?
>
> I'd favor 2, but ...
>
>> --- a/xen/common/event_fifo.c  Tue Nov 19 11:06:54 2013 +0000
>> +++ b/xen/common/event_fifo.c  Wed Nov 20 16:41:32 2013 +0000
>> @@ -34,6 +34,30 @@ static inline event_word_t *evtchn_fifo_
>>      return d->evtchn_fifo->event_array[p] + w;
>>  }
>>
>> +static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
>> +                                                struct evtchn *evtchn,
>> +                                                unsigned long *flags)
>> +{
>> +    struct vcpu *v;
>> +    struct evtchn_fifo_queue *q, *old_q;
>> +
>> +    for (;;)
>> +    {
>> +        v = d->vcpu[evtchn->last_vcpu_id];
>> +        old_q = &v->evtchn_fifo->queue[evtchn->last_priority];
>> +
>> +        spin_lock_irqsave(&old_q->lock, *flags);
>> +
>> +        v = d->vcpu[evtchn->last_vcpu_id];
>> +        q = &v->evtchn_fifo->queue[evtchn->last_priority];
>> +
>> +        if ( old_q == q )
>> +            return old_q;
>> +
>> +        spin_unlock_irqrestore(&old_q->lock, *flags);
>> +    }
>
> ... is there a guaranteed upper bound to this loop?

No :( But the only attack I could think of seems highly implausible.

A malicious guest A with two co-operating guests (B and C) can
ping-pong one of its queue locks (Q) between two VCPUs with repeated
EVTCHNOP_send calls on two interdomain event channels bound to A.
They need to be in different domains otherwise there is a window where
Q will not be locked. If the time spent while holding Q is less than
the time spent in the hypercall while not holding the lock, then the
guest will need more co-operating guests to keep Q constantly locked.

If Guest A then has another two co-operating guests (D and E), it can
arrange for them to ping-pong another queue lock (R) between two
VCPUs.

Guest A can also repeatedly change the priority of these four events.
With careful timing it will be able to change the priority such that
every send call moves the event between the two queues.

Guest A must also immediately clear any LINKED bit to prevent the
unmask calls from taking the 'already LINKED' fast path in
evtchn_fifo_set_pending(). This is trivial to do by just repeatedly
writing 0 to the relevant event words.

Guest V (the victim) then attempts to acquire the old queue lock Q.
If it manages to lock it, it will now be the wrong lock and it must
try and acquire R. If it manages to acquire R it will again be the
wrong lock. And so on.

There might be an easier attack but I couldn't see it.

Do you think this is a real problem that should be resolved?

> Apart from that - what does this mean for the 2/2 patch you reply
> to here? Apply it or wait (I assume the latter)? If wait, is 1/2 still
> fine to apply?

Please apply 1/2 and wait for a revised 2/2.

David
Jan Beulich
2013-Nov-25 09:10 UTC
Re: [PATCH 2/2] evtchn/fifo: don't corrupt queues if an old tail is linked
>>> On 22.11.13 at 19:23, David Vrabel <david.vrabel@citrix.com> wrote:
> On 22/11/13 12:02, Jan Beulich wrote:
>>>>> On 20.11.13 at 18:21, David Vrabel <david.vrabel@citrix.com> wrote:
>>> +static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
>>> +                                                struct evtchn *evtchn,
>>> +                                                unsigned long *flags)
>>> +{
>>> +    struct vcpu *v;
>>> +    struct evtchn_fifo_queue *q, *old_q;
>>> +
>>> +    for (;;)
>>> +    {
>>> +        v = d->vcpu[evtchn->last_vcpu_id];
>>> +        old_q = &v->evtchn_fifo->queue[evtchn->last_priority];
>>> +
>>> +        spin_lock_irqsave(&old_q->lock, *flags);
>>> +
>>> +        v = d->vcpu[evtchn->last_vcpu_id];
>>> +        q = &v->evtchn_fifo->queue[evtchn->last_priority];
>>> +
>>> +        if ( old_q == q )
>>> +            return old_q;
>>> +
>>> +        spin_unlock_irqrestore(&old_q->lock, *flags);
>>> +    }
>>
>> ... is there a guaranteed upper bound to this loop?
>
> No :( But the only attack I could think of seems highly implausible.
>
> A malicious guest A with two co-operating guests (B and C) can
> ping-pong one of its queue locks (Q) between two VCPUs with repeated
> EVTCHNOP_send calls on two interdomain event channels bound to A.
> They need to be in different domains otherwise there is a window
> where Q will not be locked. If the time spent while holding Q is
> less than the time spent in the hypercall while not holding the
> lock, then the guest will need more co-operating guests to keep Q
> constantly locked.
>
> If Guest A then has another two co-operating guests (D and E), it
> can arrange for them to ping-pong another queue lock (R) between two
> VCPUs.
>
> Guest A can also repeatedly change the priority of these four
> events. With careful timing it will be able to change the priority
> such that every send call moves the event between the two queues.
>
> Guest A must also immediately clear any LINKED bit to prevent the
> unmask calls from taking the 'already LINKED' fast path in
> evtchn_fifo_set_pending(). This is trivial to do by just repeatedly
> writing 0 to the relevant event words.
>
> Guest V (the victim) then attempts to acquire the old queue lock Q.
> If it manages to lock it, it will now be the wrong lock and it must
> try and acquire R. If it manages to acquire R it will again be the
> wrong lock. And so on.
>
> There might be an easier attack but I couldn't see it.
>
> Do you think this is a real problem that should be resolved?

I think at least some arbitrary upper bound on the number of
iterations should be put in, either failing the operation or at least
logging a message when exceeding the bound.

>> Apart from that - what does this mean for the 2/2 patch you reply
>> to here? Apply it or wait (I assume the latter)? If wait, is 1/2
>> still fine to apply?
>
> Please apply 1/2 and wait for a revised 2/2.

Will do.

Jan
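For reference, a bounded variant along the lines Jan asks for might
look like the sketch below. This is illustrative only: the revised
patch is not part of this thread, so the retry count, the message
text, and the NULL return (telling the caller to drop the event) are
all assumptions, not the final upstream code.

static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
                                                struct evtchn *evtchn,
                                                unsigned long *flags)
{
    struct vcpu *v;
    struct evtchn_fifo_queue *q, *old_q;
    unsigned int try;

    for ( try = 0; try < 3; try++ )
    {
        v = d->vcpu[evtchn->last_vcpu_id];
        old_q = &v->evtchn_fifo->queue[evtchn->last_priority];

        spin_lock_irqsave(&old_q->lock, *flags);

        v = d->vcpu[evtchn->last_vcpu_id];
        q = &v->evtchn_fifo->queue[evtchn->last_priority];

        if ( old_q == q )
            return old_q;

        spin_unlock_irqrestore(&old_q->lock, *flags);
    }

    /* Give up rather than livelock. */
    gdprintk(XENLOG_WARNING,
             "dom%d: too many queue changes while locking old queue\n",
             d->domain_id);
    return NULL;
}

Any caller would need a corresponding NULL check; whether silently
dropping the event or merely logging is acceptable is exactly the
policy question raised in the exchange above.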