thr3ads.net - Xen devel - Wait Queues [Nov 2012]

Andres Lagar-Cavilla

2012-Nov-07 20:54 UTC

Wait Queues

Hi all,
we currently have a problem in the (x86) mm layer. Callers may request the p2m
to perform a translation of a gfn to an mfn. Such translation may need to wait
for a third party to service it. This happens when:

- a page needs to be paged in
- a CoW breaking of a shared page fails due to lack of memory

Note that paging in may also fail due to lack of memory. In both ENOMEM cases,
all the plumbing for a toolstack to be notified and take some corrective action
to release some memory and retry is in place. We also have plumbing for pagers
in place.

Ideally we want the internals to be self-contained, so that callers need not be
concerned with any of this. A request for a p2m translation may or may not
sleep, but on exit from the p2m the caller either has an mfn with a page ref, or
an error code due to some other condition.

Wait queue support in (x86) Xen prevents sleeping on a wait queue if any locks
are held, including RCU read-side locks (i.e. BUG_ON(in_atomic())).

For this reason, we have not yet implemented sleeping on the p2m. Callers may
get errors telling them to retry. A lot of (imho) ugly code is peppered around
the hypervisor to deal with the consequences of this. More fundamentally, in
some cases there is no possible elegant handling, and guests are crashed (for
example, if a page table page is paged out and the hypervisor needs to translate
a guest virtual address to a gfn). This limits the applicability of memory
paging and sharing.

One way to solve this would be to ensure no code path liable to sleep in a wait
queue is holding any locks at wait queue sleep time. I believe this is doomed.
Not just because this is a herculean task. It also makes writing hypervisor code
*very* difficult. Anyone trying to throw a p2m translation into a code path
needs to think of all possible upstream call sequences. Not even RCU read locks
are allowed.

I'd like to propose an approach that ensures that as long as some
properties are met, arbitrary wait queue sleep is allowed. Here are the
properties:
1. Third parties servicing a wait queue sleep are indeed third parties. In other
words, dom0 does not do paging.
2. Vcpus of a wait queue servicing domain may never go to sleep on a wait queue
during a foreign map.
3. A guest vcpu may go to sleep on a wait queue holding any kinds of locks as
long as it does not hold the p2m lock.
4. "Kick" routines in the hypervisor by which service domains
effectively wake up a vcpu may only take the p2m lock to do a fix up of the p2m.
5. Wait queues can be awakened on a special domain destroy condition.

Property 1. is hopefully self-evident, and although not enforced in the code it
is reasonably simple to do so.

Property 2. is also self-evident and enforced in the code today.

Property 3. simplifies reasoning about p2m translations and wait queue sleeping.
It provides a clean model for reasoning about what might or might not happen. It
will require us to restructure some code paths (e.g. XENMEM_add_to_physmap) that
require multiple p2m translations in a single critical region to perform all
translations up front.
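
The restructuring that Property 3 implies might look roughly like the sketch
below. This is an illustration only, not Xen code: toy_p2m, toy_translate and
toy_two_page_op are hypothetical names, the p2m lock is modelled as a plain
flag, and the sketch assumes a translation stays valid once obtained (the real
code would need page refs for that).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_INVALID_MFN (~0UL)

/* Hypothetical stand-in for a p2m; not Xen's real structure. */
struct toy_p2m {
    bool locked;                /* models "the p2m lock is held" */
    const unsigned long *table; /* gfn -> mfn, indexed by gfn */
    size_t nr;                  /* number of entries in table */
};

/*
 * A translation that could sleep in the real hypervisor, so under
 * Property 3 it must never run while the p2m lock is held.
 */
static unsigned long toy_translate(const struct toy_p2m *p2m,
                                   unsigned long gfn)
{
    assert(!p2m->locked);
    return gfn < p2m->nr ? p2m->table[gfn] : TOY_INVALID_MFN;
}

/*
 * An operation needing two translations: both are done up front, and
 * only then is the critical region entered. Returns 0 on success and
 * -1 if either translation failed.
 */
static int toy_two_page_op(struct toy_p2m *p2m, unsigned long gfn_a,
                           unsigned long gfn_b, unsigned long out[2])
{
    unsigned long mfn_a = toy_translate(p2m, gfn_a);
    unsigned long mfn_b = toy_translate(p2m, gfn_b);

    if (mfn_a == TOY_INVALID_MFN || mfn_b == TOY_INVALID_MFN)
        return -1;

    p2m->locked = true;  /* critical region: no translation in here */
    out[0] = mfn_a;
    out[1] = mfn_b;
    p2m->locked = false;
    return 0;
}
```

The point of the ordering is that anything that may sleep happens before the
lock is taken, so the caller never sleeps while holding the p2m lock.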

Property 4. is already enforced in the code as is right now.

Property 5. needs adding some logic to the top of domain destruction: set a
flag, wake up all vcpus in wait queues. Extra logic on the wait queue side will
exit the wait if the destroy flag is set, with an error. Most if not all callers
can deal right now with a p2m translation error (other than paging), and unwind
and release all their locks.
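
The logic described for Property 5 can be sketched as a small single-threaded
model. This is only a sketch under stated assumptions: toy_waitqueue,
toy_sleep, toy_begin_destroy and toy_wake_check are hypothetical names, the
choice of -ESRCH and -EAGAIN as return values is an assumption, and no real
scheduling takes place.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a per-domain wait queue; not Xen's structure. */
struct toy_waitqueue {
    bool domain_dying;     /* assumed flag, set when destruction starts */
    unsigned int sleepers; /* vcpus currently parked on this queue */
};

/* A vcpu parks itself, waiting for a third party to service it. */
static void toy_sleep(struct toy_waitqueue *wq)
{
    assert(wq != NULL);
    wq->sleepers++;
}

/*
 * Top of domain destruction: set the flag first, then wake every
 * sleeper. Returns the number of vcpus that were woken.
 */
static unsigned int toy_begin_destroy(struct toy_waitqueue *wq)
{
    unsigned int woken;

    assert(wq != NULL);
    wq->domain_dying = true;
    woken = wq->sleepers;
    wq->sleepers = 0;
    return woken;
}

/*
 * Check run by a vcpu each time it is woken:
 *   -ESRCH  the domain is being destroyed, so exit the wait with an error
 *   0       the request was serviced, so the translation can complete
 *   -EAGAIN nothing changed, so go back to sleep
 */
static int toy_wake_check(const struct toy_waitqueue *wq, bool serviced)
{
    assert(wq != NULL);
    if (wq->domain_dying)
        return -ESRCH;
    return serviced ? 0 : -EAGAIN;
}
```

Checking the destroy flag before the serviced condition means a dying domain
always unwinds with an error, which matches the idea that callers release
their locks on a translation failure.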

I confess my understanding of RCU is not 100% there and I am not sure what will
happen to force_quiescent_state. I also understand there is an impedance mismatch
wrt "saving" and "restoring" the physical CPU preempt count.

Thanks,
Andres

Andres Lagar-Cavilla

2012-Nov-08 03:22 UTC

Re: Wait Queues

On Nov 7, 2012, at 3:54 PM, Andres Lagar-Cavilla <andres@gridcentric.ca>
wrote:
> Hi all,
> we currently have a problem in the (x86) mm layer. Callers may request the
> p2m to perform a translation of a gfn to an mfn. Such translation may need to
> wait for a third party to service it. This happens when:
>
> - a page needs to be paged in
> - a CoW breaking of a shared page fails due to lack of memory
>
> Note that paging in may also fail due to lack of memory. In both ENOMEM
> cases, all the plumbing for a toolstack to be notified and take some
> corrective action to release some memory and retry is in place. We also have
> plumbing for pagers in place.
>
> Ideally we want the internals to be self-contained, so that callers need
> not be concerned with any of this. A request for a p2m translation may or may
> not sleep, but on exit from the p2m the caller either has an mfn with a page
> ref, or an error code due to some other condition.
>
> Wait queue support in (x86) Xen prevents sleeping on a wait queue if any
> locks are held, including RCU read-side locks (i.e. BUG_ON(in_atomic())).
>
> For this reason, we have not yet implemented sleeping on the p2m. Callers
> may get errors telling them to retry. A lot of (imho) ugly code is peppered
> around the hypervisor to deal with the consequences of this. More
> fundamentally, in some cases there is no possible elegant handling, and guests
> are crashed (for example, if a page table page is paged out and the hypervisor
> needs to translate a guest virtual address to a gfn). This limits the
> applicability of memory paging and sharing.
>
> One way to solve this would be to ensure no code path liable to sleep in a
> wait queue is holding any locks at wait queue sleep time. I believe this is
> doomed. Not just because this is a herculean task. It also makes writing
> hypervisor code *very* difficult. Anyone trying to throw a p2m translation
> into a code path needs to think of all possible upstream call sequences. Not
> even RCU read locks are allowed.
>
> I'd like to propose an approach that ensures that as long as some
> properties are met, arbitrary wait queue sleep is allowed. Here are the
> properties:
> 1. Third parties servicing a wait queue sleep are indeed third parties. In
> other words, dom0 does not do paging.
> 2. Vcpus of a wait queue servicing domain may never go to sleep on a wait
> queue during a foreign map.
> 3. A guest vcpu may go to sleep on a wait queue holding any kinds of locks
> as long as it does not hold the p2m lock.
N.B: I understand (now) this may cause any other vcpu contending on a lock held
by the wait queue sleeper to not yield to the scheduler and pin its physical
cpu.

What I am struggling with is coming up with a solution that doesn't turn
hypervisor mm hacking into a locking minefield.

Linux fixes this with many kinds of sleeping synchronization primitives. A task
can, for example, hold the mmap semaphore and sleep on a wait queue. Is this the
only way out of this mess? Not if wait queues force the vcpu to wake up on the
same phys cpu it was using at the time of sleeping….

Andres
> 4. "Kick" routines in the hypervisor by which service domains
> effectively wake up a vcpu may only take the p2m lock to do a fix up of the
> p2m.
> 5. Wait queues can be awakened on a special domain destroy condition.
>
> Property 1. is hopefully self-evident, and although not enforced in the
> code it is reasonably simple to do so.
>
> Property 2. is also self-evident and enforced in the code today.
>
> Property 3. simplifies reasoning about p2m translations and wait queue
> sleeping. It provides a clean model for reasoning about what might or might
> not happen. It will require us to restructure some code paths (e.g.
> XENMEM_add_to_physmap) that require multiple p2m translations in a single
> critical region to perform all translations up front.
>
> Property 4. is already enforced in the code as is right now.
>
> Property 5. needs adding some logic to the top of domain destruction: set a
> flag, wake up all vcpus in wait queues. Extra logic on the wait queue side
> will exit the wait if the destroy flag is set, with an error. Most if not all
> callers can deal right now with a p2m translation error (other than paging),
> and unwind and release all their locks.
>
> I confess my understanding of RCU is not 100% there and I am not sure what
> will happen to force_quiescent_state. I also understand there is an impedance
> mismatch wrt "saving" and "restoring" the physical CPU preempt count.
>
> Thanks,
> Andres

Keir Fraser

2012-Nov-08 07:42 UTC

Re: Wait Queues

On 08/11/2012 03:22, "Andres Lagar-Cavilla"
<andreslc@gridcentric.ca> wrote:
>> I'd like to propose an approach that ensures that as long as some
>> properties are met, arbitrary wait queue sleep is allowed. Here are the
>> properties:
>> 1. Third parties servicing a wait queue sleep are indeed third parties. In
>> other words, dom0 does not do paging.
>> 2. Vcpus of a wait queue servicing domain may never go to sleep on a wait
>> queue during a foreign map.
>> 3. A guest vcpu may go to sleep on a wait queue holding any kinds of locks
>> as long as it does not hold the p2m lock.
>
> N.B: I understand (now) this may cause any other vcpu contending on a lock
> held by the wait queue sleeper to not yield to the scheduler and pin its
> physical cpu.
>
> What I am struggling with is coming up with a solution that doesn't turn
> hypervisor mm hacking into a locking minefield.
>
> Linux fixes this with many kinds of sleeping synchronization primitives. A
> task can, for example, hold the mmap semaphore and sleep on a wait queue. Is
> this the only way out of this mess? Not if wait queues force the vcpu to wake
> up on the same phys cpu it was using at the time of sleeping.

Well, the forcing to wake up on same phys cpu it slept on is going to be
fixed. But it's not clear to me how that current restriction makes the
problem harder? What if you were running on a single-phys-cpu system?

As you have realised, the fact that all locks in Xen are spinlocks makes the
potential for deadlock very obvious. Someone else gets scheduled and takes
out the phys cpu by spinning on a lock that someone else is holding while
they are descheduled.

Linux-style sleeping mutexes might help. We could add those. They don't help
as readily as in the Linux case however! In some ways they push the deadlock
up one level of abstraction, to the virt cpu (vcpu). Consider single-vcpu
dom0 running a pager -- even if you are careful that the pager itself does
not acquire any locks that one of its clients may hold-while-sleeping, if
*anything* running in dom0 can acquire such a lock, you have an obvious
deadlock, as that will take out the dom0 vcpu and leave it blocked forever
waiting for a lock that is held while its holder waits for service from the
dom0 vcpu....
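
The single-vcpu dom0 scenario above can be modelled as a cycle in a wait-for
chain. The following C sketch is only an illustration under simplifying
assumptions: each entity (vcpu) waits on at most one other entity, and the
names toy_blocked_forever and TOY_NOBODY are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NOBODY (-1)

/*
 * waits_on[i] is the index of the entity that entity i is blocked on,
 * or TOY_NOBODY if entity i is runnable. Follows the chain from
 * 'start'. If a runnable entity is reached, progress is possible and
 * the function returns false. With n entities, a chain that is still
 * going after n hops must have entered a cycle, so it returns true.
 */
static bool toy_blocked_forever(const int *waits_on, int n, int start)
{
    int cur = start;

    assert(start >= 0 && start < n);
    for (int hops = 0; hops < n; hops++) {
        if (waits_on[cur] == TOY_NOBODY)
            return false;
        cur = waits_on[cur];
    }
    return true;
}
```

For example, with entity 0 as the lone dom0 vcpu spinning or sleeping on a
lock held by guest vcpu 1, and vcpu 1 waiting for service from vcpu 0, the
array { 1, 0 } describes a cycle. If the pager instead ran on a separate,
runnable vcpu 2, the array { 1, 2, TOY_NOBODY } would have no cycle in this
simplified model.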

I don't think there is an easy solution here!

 -- Keir

Andres Lagar-Cavilla

2012-Nov-08 15:39 UTC

Re: Wait Queues

On Nov 8, 2012, at 2:42 AM, Keir Fraser <keir.xen@gmail.com> wrote:
> On 08/11/2012 03:22, "Andres Lagar-Cavilla" <andreslc@gridcentric.ca> wrote:
>
>>> I'd like to propose an approach that ensures that as long as some
>>> properties are met, arbitrary wait queue sleep is allowed. Here are the
>>> properties:
>>> 1. Third parties servicing a wait queue sleep are indeed third parties. In
>>> other words, dom0 does not do paging.
>>> 2. Vcpus of a wait queue servicing domain may never go to sleep on a wait
>>> queue during a foreign map.
>>> 3. A guest vcpu may go to sleep on a wait queue holding any kinds of locks
>>> as long as it does not hold the p2m lock.
>>
>> N.B: I understand (now) this may cause any other vcpu contending on a lock
>> held by the wait queue sleeper to not yield to the scheduler and pin its
>> physical cpu.
>>
>> What I am struggling with is coming up with a solution that doesn't turn
>> hypervisor mm hacking into a locking minefield.
>>
>> Linux fixes this with many kinds of sleeping synchronization primitives. A
>> task can, for example, hold the mmap semaphore and sleep on a wait queue. Is
>> this the only way out of this mess? Not if wait queues force the vcpu to wake
>> up on the same phys cpu it was using at the time of sleeping….
>
> Well, the forcing to wake up on same phys cpu it slept on is going to be
> fixed. But it's not clear to me how that current restriction makes the
> problem harder? What if you were running on a single-phys-cpu system?

It's not a hard blocker. It's giving up efficiency otherwise.
It's a "nice to have" precondition.
> 
> As you have realised, the fact that all locks in Xen are spinlocks makes the
> potential for deadlock very obvious. Someone else gets scheduled and takes
> out the phys cpu by spinning on a lock that someone else is holding while
> they are descheduled.
>
> Linux-style sleeping mutexes might help. We could add those. They don't help
> as readily as in the Linux case however! In some ways they push the deadlock
> up one level of abstraction, to the virt cpu (vcpu). Consider single-vcpu
> dom0 running a pager -- even if you are careful that the pager itself does
> not acquire any locks that one of its clients may hold-while-sleeping, if
> *anything* running in dom0 can acquire such a lock, you have an obvious
> deadlock, as that will take out the dom0 vcpu and leave it blocked forever
> waiting for a lock that is held while its holder waits for service from the
> dom0 vcpu....

Uhmm. But it seems there is _some_ method to the madness. Luckily mm locks are
all taken after the p2m lock (and enforced that way). dom0 can grab ... the big
domain lock? the grant table lock?

Perhaps we can categorize locks between reflexive or foreign (not that we have
abundant space in the spin lock struct to stash more flags) and perform some
sort of enforcement like what goes on in the mm layer. Xen insults via
BUG_ON's are a strong conditioning tool for developers. It is certainly
simpler to tease out the locks that might deadlock dom0 than all possible locks,
including RCU read-locks.

What I mean:

BUG_ON(current->domain != d && lock_is_reflexive)
An example of a reflexive lock is the per page sharing lock.

BUG_ON(prepare_to_wait && current->domain->holds_foreign_lock)
An example of a transitive lock is the grant table lock.

A third category would entail global locks like the domain list, which are
identical to a foreign lock wrt this analysis.

Another benefit of this is that only reflexive locks need to be made
sleep-capable, not everything under the sun. I.e. the possibility of livelock is
corralled to apply only to vcpus of the same domain, and then it's
avoided by making those lock holders re-schedulable.
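
The two checks proposed above could be modelled along these lines. This is a
hedged sketch, not Xen code: every toy_ name is hypothetical, the checks
return a bool instead of crashing via BUG_ON so they can be exercised, and
foreign locks are counted per vcpu rather than per domain as a simplification.
Global locks are treated like foreign ones, as suggested in the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock categories following the classification above. */
enum toy_lock_class { TOY_LOCK_REFLEXIVE, TOY_LOCK_FOREIGN, TOY_LOCK_GLOBAL };

struct toy_vcpu {
    int domain_id;
    unsigned int foreign_locks_held; /* counts FOREIGN and GLOBAL locks */
};

struct toy_lock {
    enum toy_lock_class cls;
    int owner_domain_id; /* domain the lock protects (reflexive check) */
};

/*
 * Models BUG_ON(current->domain != d && lock_is_reflexive): a reflexive
 * lock may only be taken by a vcpu of the domain that owns it.
 */
static bool toy_acquire(struct toy_vcpu *v, const struct toy_lock *l)
{
    if (l->cls == TOY_LOCK_REFLEXIVE && v->domain_id != l->owner_domain_id)
        return false;
    if (l->cls != TOY_LOCK_REFLEXIVE)
        v->foreign_locks_held++;
    return true;
}

static void toy_release(struct toy_vcpu *v, const struct toy_lock *l)
{
    if (l->cls != TOY_LOCK_REFLEXIVE) {
        assert(v->foreign_locks_held > 0);
        v->foreign_locks_held--;
    }
}

/*
 * Models BUG_ON(prepare_to_wait && ...holds_foreign_lock): sleeping on
 * a wait queue is only allowed when no foreign or global lock is held.
 */
static bool toy_may_sleep(const struct toy_vcpu *v)
{
    return v->foreign_locks_held == 0;
}
```

Under this model a guest vcpu holding only its own per-page sharing lock may
still sleep, while one holding a grant-table-style lock may not.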

Andres
> 
> I don't think there is an easy solution here!
> 
> -- Keir
> 
>

Keir Fraser

2012-Nov-08 16:48 UTC

Re: Wait Queues

On 08/11/2012 15:39, "Andres Lagar-Cavilla"
<andreslc@gridcentric.ca> wrote:
>> dom0 vcpu.
> Uhmm. But it seems there is _some_ method to the madness. Luckily mm locks
> are all taken after the p2m lock (and enforced that way). dom0 can grab ...
> the big domain lock? the grant table lock?
>
> Perhaps we can categorize locks between reflexive or foreign (not that we
> have abundant space in the spin lock struct to stash more flags) and perform
> some sort of enforcement like what goes on in the mm layer. Xen insults via
> BUG_ON's are a strong conditioning tool for developers. It is certainly
> simpler to tease out the locks that might deadlock dom0 than all possible
> locks, including RCU read-locks.
>
> What I mean:
>
> BUG_ON(current->domain != d && lock_is_reflexive)
> An example of a reflexive lock is the per page sharing lock.
>
> BUG_ON(prepare_to_wait && current->domain->holds_foreign_lock)
> An example of a transitive lock is the grant table lock.
>
> A third category would entail global locks like the domain list, which are
> identical to a foreign lock wrt this analysis.
>
> Another benefit of this is that only reflexive locks need to be made
> sleep-capable, not everything under the sun. I.e. the possibility of livelock
> is corralled to apply only to vcpus of the same domain, and then it's avoided
> by making those lock holders re-schedulable.

This sounds possible. RCU read locks will often count as global locks by the
way, as they are most often used as an alternative to taking a global
spinlock or multi-reader lock. So sleeping in RCU critical regions is
generally not going to be a good idea. Perhaps it will turn out that such
regions don't get in your way too often.
> Andres
