thr3ads.net - Linux Virtualization - [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker [Aug 2019]

If this information is useful, please help other people find it:
Share via:

Jason Gunthorpe

2019-Aug-01 14:15 UTC

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

On Thu, Aug 01, 2019 at 01:02:18PM +0800, Jason Wang
wrote:> 
> On 2019/8/1 ??3:30, Jason Gunthorpe wrote:
> > On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
> > > On 2019/7/31 ??8:39, Jason Gunthorpe wrote:
> > > > On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
> > > > > We used to use RCU to synchronize MMU notifier with
worker. This leads
> > > > > calling synchronize_rcu() in invalidate_range_start().
But on a busy
> > > > > system, there would be many factors that may slow down
the
> > > > > synchronize_rcu() which makes it unsuitable to be
called in MMU
> > > > > notifier.
> > > > > 
> > > > > A solution is SRCU but its overhead is obvious with the
expensive full
> > > > > memory barrier. Another choice is to use seqlock, but
it doesn't
> > > > > provide a synchronization method between readers and
writers. The last
> > > > > choice is to use vq mutex, but it need to deal with the
worst case
> > > > > that MMU notifier must be blocked and wait for the
finish of swap in.
> > > > > 
> > > > > So this patch switches use a counter to track whether
or not the map
> > > > > was used. The counter was increased when vq try to
start or finish
> > > > > uses the map. This means, when it was even, we're
sure there's no
> > > > > readers and MMU notifier is synchronized. When it was
odd, it means
> > > > > there's a reader we need to wait it to be even
again then we are
> > > > > synchronized.
> > > > You just described a seqlock.
> > > 
> > > Kind of, see my explanation below.
> > > 
> > > 
> > > > We've been talking about providing this as some core
service from mmu
> > > > notifiers because nearly every use of this API needs it.
> > > 
> > > That would be very helpful.
> > > 
> > > 
> > > > IMHO this gets the whole thing backwards, the common pattern
is to
> > > > protect the 'shadow pte' data with a seqlock
(usually open coded),
> > > > such that the mmu notififer side has the write side of that
lock and
> > > > the read side is consumed by the thread accessing or
updating the SPTE.
> > > 
> > > Yes, I've considered something like that. But the problem is,
mmu notifier
> > > (writer) need to wait for the vhost worker to finish the read
before it can
> > > do things like setting dirty pages and unmapping page.? It looks
to me
> > > seqlock doesn't provide things like this.
> > The seqlock is usually used to prevent a 2nd thread from accessing the
> > VA while it is being changed by the mm. ie you use something seqlocky
> > instead of the ugly mmu_notifier_unregister/register cycle.
> 
> 
> Yes, so we have two mappings:
> 
> [1] vring address to VA
> [2] VA to PA
> 
> And have several readers and writers
> 
> 1) set_vring_num_addr(): writer of both [1] and [2]
> 2) MMU notifier: reader of [1] writer of [2]
> 3) GUP: reader of [1] writer of [2]
> 4) memory accessors: reader of [1] and [2]
> 
> Fortunately, 1) 3) and 4) have already synchronized through vq->mutex.
We
> only need to deal with synchronization between 2) and each of the reset:
> Sync between 1) and 2): For mapping [1], I do
> mmu_notifier_unregister/register. This help to avoid holding any lock to do
> overlap check.
I suspect you could have done this with a RCU technique instead of
register/unregister.
> Sync between 2) and 4): For mapping [1], both are readers, no need any
> synchronization. For mapping [2], synchronize through RCU (or something
> simliar to seqlock).
You can't really use a seqlock, seqlocks are collision-retry locks,
and the semantic here is that invalidate_range_start *MUST* not
continue until thread doing #4 above is guarenteed no longer touching
the memory.

This must be a proper barrier, like a spinlock, mutex, or
synchronize_rcu.

And, again, you can't re-invent a spinlock with open coding and get
something better.

Jason

Jason Wang

2019-Aug-02 09:40 UTC

head link

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

On 2019/8/1 ??10:15, Jason Gunthorpe wrote:> On Thu, Aug 01, 2019 at 01:02:18PM +0800, Jason Wang wrote:
>> On 2019/8/1 ??3:30, Jason Gunthorpe wrote:
>>> On Wed, Jul 31, 2019 at 09:28:20PM +0800, Jason Wang wrote:
>>>> On 2019/7/31 ??8:39, Jason Gunthorpe wrote:
>>>>> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
>>>>>> We used to use RCU to synchronize MMU notifier with
worker. This leads
>>>>>> calling synchronize_rcu() in invalidate_range_start().
But on a busy
>>>>>> system, there would be many factors that may slow down
the
>>>>>> synchronize_rcu() which makes it unsuitable to be
called in MMU
>>>>>> notifier.
>>>>>>
>>>>>> A solution is SRCU but its overhead is obvious with the
expensive full
>>>>>> memory barrier. Another choice is to use seqlock, but
it doesn't
>>>>>> provide a synchronization method between readers and
writers. The last
>>>>>> choice is to use vq mutex, but it need to deal with the
worst case
>>>>>> that MMU notifier must be blocked and wait for the
finish of swap in.
>>>>>>
>>>>>> So this patch switches use a counter to track whether
or not the map
>>>>>> was used. The counter was increased when vq try to
start or finish
>>>>>> uses the map. This means, when it was even, we're
sure there's no
>>>>>> readers and MMU notifier is synchronized. When it was
odd, it means
>>>>>> there's a reader we need to wait it to be even
again then we are
>>>>>> synchronized.
>>>>> You just described a seqlock.
>>>> Kind of, see my explanation below.
>>>>
>>>>
>>>>> We've been talking about providing this as some core
service from mmu
>>>>> notifiers because nearly every use of this API needs it.
>>>> That would be very helpful.
>>>>
>>>>
>>>>> IMHO this gets the whole thing backwards, the common
pattern is to
>>>>> protect the 'shadow pte' data with a seqlock
(usually open coded),
>>>>> such that the mmu notififer side has the write side of that
lock and
>>>>> the read side is consumed by the thread accessing or
updating the SPTE.
>>>> Yes, I've considered something like that. But the problem
is, mmu notifier
>>>> (writer) need to wait for the vhost worker to finish the read
before it can
>>>> do things like setting dirty pages and unmapping page.? It
looks to me
>>>> seqlock doesn't provide things like this.
>>> The seqlock is usually used to prevent a 2nd thread from accessing
the
>>> VA while it is being changed by the mm. ie you use something
seqlocky
>>> instead of the ugly mmu_notifier_unregister/register cycle.
>>
>> Yes, so we have two mappings:
>>
>> [1] vring address to VA
>> [2] VA to PA
>>
>> And have several readers and writers
>>
>> 1) set_vring_num_addr(): writer of both [1] and [2]
>> 2) MMU notifier: reader of [1] writer of [2]
>> 3) GUP: reader of [1] writer of [2]
>> 4) memory accessors: reader of [1] and [2]
>>
>> Fortunately, 1) 3) and 4) have already synchronized through
vq->mutex. We
>> only need to deal with synchronization between 2) and each of the
reset:
>> Sync between 1) and 2): For mapping [1], I do
>> mmu_notifier_unregister/register. This help to avoid holding any lock
to do
>> overlap check.
> I suspect you could have done this with a RCU technique instead of
> register/unregister.

Probably. But the issue to be addressed by this patch is the 
synchronization between MMU notifier and vhost worker.

>
>> Sync between 2) and 4): For mapping [1], both are readers, no need any
>> synchronization. For mapping [2], synchronize through RCU (or something
>> simliar to seqlock).
> You can't really use a seqlock, seqlocks are collision-retry locks,
> and the semantic here is that invalidate_range_start *MUST* not
> continue until thread doing #4 above is guarenteed no longer touching
> the memory.

Yes, that's the tricky part. For hardware like CPU, kicking through IPI 
is sufficient for synchronization. But for vhost kthread, it requires a 
low overhead synchronization.

>
> This must be a proper barrier, like a spinlock, mutex, or
> synchronize_rcu.

I start with synchronize_rcu() but both you and Michael raise some 
concern. Then I try spinlock and mutex:

1) spinlock: add lots of overhead on datapath, this leads 0 performance 
improvement.

2) SRCU: full memory barrier requires on srcu_read_lock(), which still 
leads little performance improvement

3) mutex: a possible issue is need to wait for the page to be swapped in 
(is this unacceptable ?), another issue is that we need hold vq lock 
during range overlap check.

4) using vhost_flush_work() instead of synchronize_rcu(): still need to 
wait for swap. But can do overlap checking without the lock

>
> And, again, you can't re-invent a spinlock with open coding and get
> something better.

So the question is if waiting for swap is considered to be unsuitable 
for MMU notifiers. If not, it would simplify codes. If not, we still 
need to figure out a possible solution.

Btw, I come up another idea, that is to disable preemption when vhost 
thread need to access the memory. Then register preempt notifier and if 
vhost thread is preempted, we're sure no one will access the memory and 
can do the cleanup.

Thanks

>
> Jason

Jason Gunthorpe

2019-Aug-02 12:46 UTC

head link

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang
wrote:> > This must be a proper barrier, like a spinlock, mutex, or
> > synchronize_rcu.
> 
> 
> I start with synchronize_rcu() but both you and Michael raise some
> concern.
I've also idly wondered if calling synchronize_rcu() under the various
mm locks is a deadlock situation.
> Then I try spinlock and mutex:
> 
> 1) spinlock: add lots of overhead on datapath, this leads 0 performance
> improvement.
I think the topic here is correctness not performance improvement
> 2) SRCU: full memory barrier requires on srcu_read_lock(), which still
leads
> little performance improvement
 > 3) mutex: a possible issue is need to wait for the page to be swapped in
(is
> this unacceptable ?), another issue is that we need hold vq lock during
> range overlap check.
I have a feeling that mmu notififers cannot safely become dependent on
progress of swap without causing deadlock. You probably should avoid
this.
> > And, again, you can't re-invent a spinlock with open coding and
get
> > something better.
> 
> So the question is if waiting for swap is considered to be unsuitable for
> MMU notifiers. If not, it would simplify codes. If not, we still need to
> figure out a possible solution.
> 
> Btw, I come up another idea, that is to disable preemption when vhost
thread
> need to access the memory. Then register preempt notifier and if vhost
> thread is preempted, we're sure no one will access the memory and can
do the
> cleanup.
I think you should use the spinlock so at least the code is obviously
functionally correct and worry about designing some properly justified
performance change after.

Jason

Michael S. Tsirkin

2019-Aug-02 14:03 UTC

head link

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

On Fri, Aug 02, 2019 at 05:40:07PM +0800, Jason Wang
wrote:> Btw, I come up another idea, that is to disable preemption when vhost
thread
> need to access the memory. Then register preempt notifier and if vhost
> thread is preempted, we're sure no one will access the memory and can
do the
> cleanup.
Great, more notifiers :(

Maybe can live with
1- disable preemption while using the cached pointer
2- teach vhost to recover from memory access failures,
   by switching to regular from/to user path

So if you want to try that, fine since it's a step in
the right direction.

But I think fundamentally it's not what we want to do long term.

It's always been a fundamental problem with this patch series that only
metadata is accessed through a direct pointer.

The difference in ways you handle metadata and data is what is
now coming and messing everything up.

So if continuing the direct map approach,
what is needed is a cache of mapped VM memory, then on a cache miss
we'd queue work along the lines of 1-2 above.

That's one direction to take. Another one is to give up on that and
write our own version of uaccess macros.  Add a "high security" flag
to
the vhost module and if not active use these for userspace memory
access.

-- 
MST

Maybe Matching Threads

Search for more possibly parallel threads

Linux Virtualization - Aug 2019 - [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

[PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker

Maybe Matching Threads