thr3ads.net - Linux Virtualization - [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker [Aug 2019]

Jason Gunthorpe

2019-Jul-31 19:30 UTC

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
> 
> On 2019/7/31 8:39 PM, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> > > We used to use RCU to synchronize MMU notifier with worker. This leads
> > > calling synchronize_rcu() in invalidate_range_start(). But on a busy
> > > system, there would be many factors that may slow down the
> > > synchronize_rcu() which makes it unsuitable to be called in MMU
> > > notifier.
> > > 
> > > A solution is SRCU but its overhead is obvious with the expensive full
> > > memory barrier. Another choice is to use seqlock, but it doesn't
> > > provide a synchronization method between readers and writers. The last
> > > choice is to use vq mutex, but it need to deal with the worst case
> > > that MMU notifier must be blocked and wait for the finish of swap in.
> > > 
> > > So this patch switches use a counter to track whether or not the map
> > > was used. The counter was increased when vq try to start or finish
> > > uses the map. This means, when it was even, we're sure there's no
> > > readers and MMU notifier is synchronized. When it was odd, it means
> > > there's a reader we need to wait it to be even again then we are
> > > synchronized.
> > You just described a seqlock.
> 
> 
> Kind of, see my explanation below.
> 
> 
> > 
> > We've been talking about providing this as some core service from mmu
> > notifiers because nearly every use of this API needs it.
> 
> 
> That would be very helpful.
> 
> 
> > 
> > IMHO this gets the whole thing backwards, the common pattern is to
> > protect the 'shadow pte' data with a seqlock (usually open coded),
> > such that the mmu notifier side has the write side of that lock and
> > the read side is consumed by the thread accessing or updating the SPTE.
> 
> 
> Yes, I've considered something like that. But the problem is, mmu notifier
> (writer) need to wait for the vhost worker to finish the read before it can
> do things like setting dirty pages and unmapping page. It looks to me
> seqlock doesn't provide things like this.

The seqlock is usually used to prevent a 2nd thread from accessing the
VA while it is being changed by the mm. ie you use something seqlocky
instead of the ugly mmu_notifier_unregister/register cycle.

You are supposed to use something simple like a spinlock or mutex
inside the invalidate_range_start to serialize tear down of the SPTEs
with their accessors.

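The pattern described here (a plain lock taken both by the accessor and by the invalidation path) can be illustrated in userspace. This is only a minimal sketch, assuming a pthread mutex stands in for the kernel spinlock/mutex; `shadow_map`, `install_map()`, `access_map()` and `invalidate_map()` are made-up names for illustration and are not vhost or mmu_notifier APIs.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the cached mapping ("shadow PTE") of one vring. */
struct shadow_map {
    char *addr;   /* directly mapped memory, or NULL once torn down */
    size_t len;
};

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;
static struct shadow_map map;   /* protected by map_lock */

/* Install a mapping (plays the role of the setup path). */
static void install_map(char *addr, size_t len)
{
    assert(addr != NULL);
    pthread_mutex_lock(&map_lock);
    map.addr = addr;
    map.len = len;
    pthread_mutex_unlock(&map_lock);
}

/* Accessor: copy n bytes out of the mapping if it is still valid.
 * Returns 1 on the fast path, 0 if the caller must use a slow path. */
static int access_map(char *dst, size_t n)
{
    int ok = 0;

    pthread_mutex_lock(&map_lock);
    if (map.addr != NULL && n <= map.len) {
        memcpy(dst, map.addr, n);
        ok = 1;
    }
    pthread_mutex_unlock(&map_lock);
    return ok;
}

/* Invalidation: taking the lock blocks until no accessor is inside its
 * critical section; only then is the mapping torn down. */
static void invalidate_map(void)
{
    pthread_mutex_lock(&map_lock);
    map.addr = NULL;
    map.len = 0;
    pthread_mutex_unlock(&map_lock);
}
```

Once `invalidate_map()` returns, no accessor can still be touching the old memory, which is the guarantee the invalidation side needs. The cost is a lock acquisition on every access.
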
> write_seqcount_begin()
> 
> map = vq->map[X]
> 
> write or read through map->addr directly
> 
> write_seqcount_end()
> 
> 
> There's no rmb() in write_seqcount_begin(), so map could be read before
> write_seqcount_begin(), but it looks to me now that this doesn't harm at
> all, maybe we can try this way.

That is because it is a write side lock, not a read lock. IIRC
seqlocks have weaker barriers because the write side needs to be
serialized in some other way.

The requirement I see is you need invalidate_range_start to block
until another thread exits its critical section (ie stops accessing
the SPTEs). 

That is a spinlock/mutex.

You just can't invent a faster spinlock by open coding something with
barriers, it doesn't work.

Jason

Jason Wang

2019-Aug-01 05:02 UTC

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

On 2019/8/1 3:30 AM, Jason Gunthorpe wrote:
> On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
>> On 2019/7/31 8:39 PM, Jason Gunthorpe wrote:
>>> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
>>>> We used to use RCU to synchronize MMU notifier with worker. This leads
>>>> calling synchronize_rcu() in invalidate_range_start(). But on a busy
>>>> system, there would be many factors that may slow down the
>>>> synchronize_rcu() which makes it unsuitable to be called in MMU
>>>> notifier.
>>>>
>>>> A solution is SRCU but its overhead is obvious with the expensive full
>>>> memory barrier. Another choice is to use seqlock, but it doesn't
>>>> provide a synchronization method between readers and writers. The last
>>>> choice is to use vq mutex, but it need to deal with the worst case
>>>> that MMU notifier must be blocked and wait for the finish of swap in.
>>>>
>>>> So this patch switches use a counter to track whether or not the map
>>>> was used. The counter was increased when vq try to start or finish
>>>> uses the map. This means, when it was even, we're sure there's no
>>>> readers and MMU notifier is synchronized. When it was odd, it means
>>>> there's a reader we need to wait it to be even again then we are
>>>> synchronized.
>>> You just described a seqlock.
>>
>> Kind of, see my explanation below.
>>
>>
>>> We've been talking about providing this as some core service from mmu
>>> notifiers because nearly every use of this API needs it.
>>
>> That would be very helpful.
>>
>>
>>> IMHO this gets the whole thing backwards, the common pattern is to
>>> protect the 'shadow pte' data with a seqlock (usually open coded),
>>> such that the mmu notifier side has the write side of that lock and
>>> the read side is consumed by the thread accessing or updating the SPTE.
>>
>> Yes, I've considered something like that. But the problem is, mmu notifier
>> (writer) need to wait for the vhost worker to finish the read before it can
>> do things like setting dirty pages and unmapping page. It looks to me
>> seqlock doesn't provide things like this.
> The seqlock is usually used to prevent a 2nd thread from accessing the
> VA while it is being changed by the mm. ie you use something seqlocky
> instead of the ugly mmu_notifier_unregister/register cycle.

Yes, so we have two mappings:

[1] vring address to VA
[2] VA to PA

And we have several readers and writers:

1) set_vring_num_addr(): writer of both [1] and [2]
2) MMU notifier: reader of [1], writer of [2]
3) GUP: reader of [1], writer of [2]
4) memory accessors: reader of [1] and [2]

Fortunately, 1), 3) and 4) are already synchronized through vq->mutex.
We only need to deal with synchronization between 2) and each of the rest:

Sync between 1) and 2): For mapping [1], I do
mmu_notifier_unregister/register. This helps to avoid holding any lock to
do the overlap check. Anyway, we only care about one or three pages, but the
whole guest memory could be several TBs. For mapping [2], both 1) and 2)
are writers, so use a spinlock (mmu_lock) to synchronize.

Sync between 2) and 3): For mapping [1], both are readers, so no
synchronization is needed. For mapping [2], both 2) and 3) are writers, so
synchronize through a spinlock (mmu_lock).

Sync between 2) and 4): For mapping [1], both are readers, so no
synchronization is needed. For mapping [2], synchronize through RCU (or
something similar to seqlock).

Your suggestion is about the synchronization of [1], which may make sense,
but it could be done on top as an optimization. What this patch tries to
do is to not use RCU for [2]. Of course, the simplest way is to use the vq
mutex in 2), but it means:

- we must hold the vq lock to check range overlap
- since the critical section is enlarged, the worst case is to wait for
  guest memory to be swapped in, which could be even slower than
  synchronize_rcu().

>
> You are supposed to use something simple like a spinlock or mutex
> inside the invalidate_range_start to serialize tear down of the SPTEs
> with their accessors.

Technically yes, but we probably can't afford that in the vhost fast path:
the atomics eliminate almost all of the performance improvement brought by
this patch on a machine without SMAP.

>
>> write_seqcount_begin()
>>
>> map = vq->map[X]
>>
>> write or read through map->addr directly
>>
>> write_seqcount_end()
>>
>>
>> There's no rmb() in write_seqcount_begin(), so map could be read before
>> write_seqcount_begin(), but it looks to me now that this doesn't harm at
>> all, maybe we can try this way.
> That is because it is a write side lock, not a read lock. IIRC
> seqlocks have weaker barriers because the write side needs to be
> serialized in some other way.

Yes. Having thought hard about the code, it looks to me that
write_seqcount_begin()/end() is sufficient here:

- The notifier will only assign NULL to map, so it does no harm to read map
  before seq; we will just fall back to the normal copy_from/to_user() slow
  path earlier.
- If we write through map->addr, it should be done before increasing the
  seqcount because of the smp_wmb() in write_seqcount_end().
- If we read through map->addr, which also contains a store to a pointer,
  we have a good data dependency, so smp_wmb() also works here.

>
> The requirement I see is you need invalidate_range_start to block
> until another thread exits its critical section (ie stops accessing
> the SPTEs).

Yes.

>
> That is a spinlock/mutex.

Or semantics similar to RCU.

>
> You just can't invent a faster spinlock by open coding something with
> barriers, it doesn't work.
>
> Jason

If write_seqlock() works here, we can simply wait for the seqcount to
advance in the MMU notifier. The original idea was to use RCU, which solves
this perfectly. But as pointed out, it could be slow.
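
The "wait for the seqcount to advance" idea can be sketched in userspace as follows. This is not the patch itself: it is a minimal illustration that assumes sequentially consistent C11 atomics, which are deliberately stronger than the smp_wmb() in write_seqcount_begin()/end(). The names `map_ptr`, `map_seq`, `install_map()`, `worker_read()` and `notifier_invalidate()` are hypothetical.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic(int *) map_ptr;   /* hypothetical cached mapping, NULL when invalid */
static atomic_uint map_seq;      /* odd while the worker is inside its critical section */

static void install_map(int *p)
{
    assert(p != NULL);
    atomic_store(&map_ptr, p);
}

/* Worker side: returns 1 and fills *out on the fast path,
 * 0 if the mapping is gone and a slow path must be used. */
static int worker_read(int *out)
{
    int ok = 0;
    int *p;

    atomic_fetch_add(&map_seq, 1);   /* now odd; seq_cst RMW orders it before the load below */
    p = atomic_load(&map_ptr);
    if (p != NULL) {
        *out = *p;
        ok = 1;
    }
    atomic_fetch_add(&map_seq, 1);   /* even again: critical section finished */
    return ok;
}

/* Notifier side: clear the mapping, then wait out any worker that may
 * have loaded the old pointer before the clear became visible. */
static void notifier_invalidate(void)
{
    unsigned int seq;

    atomic_store(&map_ptr, (int *)NULL);
    seq = atomic_load(&map_seq);
    if (seq & 1) {
        /* A worker is in flight; any later entry will observe NULL,
         * so waiting for the counter to move once is enough. */
        while (atomic_load(&map_seq) == seq)
            sched_yield();
    }
}
```

The correctness argument leans on the full ordering between the counter increment and the pointer load on the worker side, and between the pointer store and the counter load on the notifier side. With only a write barrier on the worker side that ordering is not guaranteed, so the sketch pays for a full barrier on every access.
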

Thanks

Jason Gunthorpe

2019-Aug-01 14:15 UTC

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

On Thu, Aug 01, 2019 at 01:02:18PM +0800, Jason Wang wrote:
> 
> On 2019/8/1 3:30 AM, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
> > > On 2019/7/31 8:39 PM, Jason Gunthorpe wrote:
> > > > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> > > > > We used to use RCU to synchronize MMU notifier with worker. This leads
> > > > > calling synchronize_rcu() in invalidate_range_start(). But on a busy
> > > > > system, there would be many factors that may slow down the
> > > > > synchronize_rcu() which makes it unsuitable to be called in MMU
> > > > > notifier.
> > > > > 
> > > > > A solution is SRCU but its overhead is obvious with the expensive full
> > > > > memory barrier. Another choice is to use seqlock, but it doesn't
> > > > > provide a synchronization method between readers and writers. The last
> > > > > choice is to use vq mutex, but it need to deal with the worst case
> > > > > that MMU notifier must be blocked and wait for the finish of swap in.
> > > > > 
> > > > > So this patch switches use a counter to track whether or not the map
> > > > > was used. The counter was increased when vq try to start or finish
> > > > > uses the map. This means, when it was even, we're sure there's no
> > > > > readers and MMU notifier is synchronized. When it was odd, it means
> > > > > there's a reader we need to wait it to be even again then we are
> > > > > synchronized.
> > > > You just described a seqlock.
> > > 
> > > Kind of, see my explanation below.
> > > 
> > > 
> > > > We've been talking about providing this as some core service from mmu
> > > > notifiers because nearly every use of this API needs it.
> > > 
> > > That would be very helpful.
> > > 
> > > 
> > > > IMHO this gets the whole thing backwards, the common pattern is to
> > > > protect the 'shadow pte' data with a seqlock (usually open coded),
> > > > such that the mmu notifier side has the write side of that lock and
> > > > the read side is consumed by the thread accessing or updating the SPTE.
> > > 
> > > Yes, I've considered something like that. But the problem is, mmu notifier
> > > (writer) need to wait for the vhost worker to finish the read before it can
> > > do things like setting dirty pages and unmapping page. It looks to me
> > > seqlock doesn't provide things like this.
> > The seqlock is usually used to prevent a 2nd thread from accessing the
> > VA while it is being changed by the mm. ie you use something seqlocky
> > instead of the ugly mmu_notifier_unregister/register cycle.
> 
> 
> Yes, so we have two mappings:
> 
> [1] vring address to VA
> [2] VA to PA
> 
> And have several readers and writers
> 
> 1) set_vring_num_addr(): writer of both [1] and [2]
> 2) MMU notifier: reader of [1] writer of [2]
> 3) GUP: reader of [1] writer of [2]
> 4) memory accessors: reader of [1] and [2]
> 
> Fortunately, 1) 3) and 4) have already synchronized through vq->mutex. We
> only need to deal with synchronization between 2) and each of the rest:
> Sync between 1) and 2): For mapping [1], I do
> mmu_notifier_unregister/register. This help to avoid holding any lock to do
> overlap check.

I suspect you could have done this with a RCU technique instead of
register/unregister.

> Sync between 2) and 4): For mapping [1], both are readers, no need any
> synchronization. For mapping [2], synchronize through RCU (or something
> similar to seqlock).

You can't really use a seqlock: seqlocks are collision-retry locks,
and the semantic here is that invalidate_range_start *MUST* not
continue until the thread doing #4 above is guaranteed to no longer be
touching the memory.

This must be a proper barrier, like a spinlock, mutex, or
synchronize_rcu.

And, again, you can't re-invent a spinlock with open coding and get
something better.
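
To make the "collision-retry" point concrete, here is a minimal userspace sketch of a sequence-counter reader and writer, assuming C11 atomics and a single, already serialized writer. The names (`seq`, `field_a`, `field_b`, `seq_write()`, `seq_try_read()`) are hypothetical and this is not the kernel's seqlock implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical two-field record guarded by a sequence counter. */
static atomic_uint seq;
static atomic_int field_a, field_b;

static void seq_write_begin(void)
{
    unsigned int s = atomic_load_explicit(&seq, memory_order_relaxed);

    assert((s & 1) == 0);   /* writers must be serialized by some other lock */
    atomic_store_explicit(&seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

static void seq_write_end(void)
{
    unsigned int s = atomic_load_explicit(&seq, memory_order_relaxed);

    atomic_store_explicit(&seq, s + 1, memory_order_release);
}

/* The writer never waits for readers: it just bumps the counter. */
static void seq_write(int a, int b)
{
    seq_write_begin();
    atomic_store_explicit(&field_a, a, memory_order_relaxed);
    atomic_store_explicit(&field_b, b, memory_order_relaxed);
    seq_write_end();
}

/* One read attempt. Returns false if it collided with a writer, in which
 * case the values in *a and *b must be discarded and the read retried. */
static bool seq_try_read(int *a, int *b)
{
    unsigned int s0, s1;

    s0 = atomic_load_explicit(&seq, memory_order_acquire);
    *a = atomic_load_explicit(&field_a, memory_order_relaxed);
    *b = atomic_load_explicit(&field_b, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    s1 = atomic_load_explicit(&seq, memory_order_relaxed);
    return s0 == s1 && (s0 & 1) == 0;
}
```

A reader would normally loop on `seq_try_read()` until it succeeds. Nothing here lets the writer learn that a reader has finished, which is why this shape alone cannot give invalidate_range_start the blocking guarantee described above.
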

Jason
