thr3ads.net - Linux Virtualization - [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address [Mar 2019]

If this information is useful, please help other people find it:
Share via:

Jerome Glisse

2019-Mar-07 19:17 UTC

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin
wrote:> On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:
> > On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> > > +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> > > +	.invalidate_range = vhost_invalidate_range,
> > > +};
> > > +
> > >  void vhost_dev_init(struct vhost_dev *dev,
> > >  		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
> > >  {
> > 
> > I also wonder here: when page is write protected then
> > it does not look like .invalidate_range is invoked.
> > 
> > E.g. mm/ksm.c calls
> > 
> > mmu_notifier_invalidate_range_start and
> > mmu_notifier_invalidate_range_end but not
mmu_notifier_invalidate_range.
> > 
> > Similarly, rmap in page_mkclean_one will not call
> > mmu_notifier_invalidate_range.
> > 
> > If I'm right vhost won't get notified when page is
write-protected since you
> > didn't install start/end notifiers. Note that end notifier can be
called
> > with page locked, so it's not as straight-forward as just adding a
call.
> > Writing into a write-protected page isn't a good idea.
> > 
> > Note that documentation says:
> > 	it is fine to delay the mmu_notifier_invalidate_range
> > 	call to mmu_notifier_invalidate_range_end() outside the page table
lock.
> > implying it's called just later.
> 
> OK I missed the fact that _end actually calls
> mmu_notifier_invalidate_range internally. So that part is fine but the
> fact that you are trying to take page lock under VQ mutex and take same
> mutex within notifier probably means it's broken for ksm and rmap at
> least since these call invalidate with lock taken.
> 
> And generally, Andrea told me offline one can not take mutex under
> the notifier callback. I CC'd Andrea for why.
Correct, you _can not_ take mutex or any sleeping lock from within the
invalidate_range callback as those callback happens under the page table
spinlock. You can however do so under the invalidate_range_start call-
back only if it is a blocking allow callback (there is a flag passdown
with the invalidate_range_start callback if you are not allow to block
then return EBUSY and the invalidation will be aborted).

> 
> That's a separate issue from set_page_dirty when memory is file backed.
If you can access file back page then i suggest using set_page_dirty
from within a special version of vunmap() so that when you vunmap you
set the page dirty without taking page lock. It is safe to do so
always from within an mmu notifier callback if you had the page map
with write permission which means that the page had write permission
in the userspace pte too and thus it having dirty pte is expected
and calling set_page_dirty on the page is allowed without any lock.
Locking will happen once the userspace pte are tear down through the
page table lock.
> It's because of all these issues that I preferred just accessing
> userspace memory and handling faults. Unfortunately there does not
> appear to exist an API that whitelists a specific driver along the lines
> of "I checked this code for speculative info leaks, don't add
barriers
> on data path please".
Maybe it would be better to explore adding such helper then remapping
page into kernel address space ?

Cheers,
J?r?me

Michael S. Tsirkin

2019-Mar-08 02:21 UTC

head link

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Thu, Mar 07, 2019 at 02:17:20PM -0500, Jerome Glisse
wrote:> > It's because of all these issues that I preferred just accessing
> > userspace memory and handling faults. Unfortunately there does not
> > appear to exist an API that whitelists a specific driver along the
lines
> > of "I checked this code for speculative info leaks, don't add
barriers
> > on data path please".
> 
> Maybe it would be better to explore adding such helper then remapping
> page into kernel address space ?
I explored it a bit (see e.g. thread around: "__get_user slower than
get_user") and I can tell you it's not trivial given the issue is
around
security.  So in practice it does not seem fair to keep a significant
optimization out of kernel because *maybe* we can do it differently even
better :)

-- 
MST

Jerome Glisse

2019-Mar-08 02:55 UTC

head link

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Thu, Mar 07, 2019 at 09:21:03PM -0500, Michael S. Tsirkin
wrote:> On Thu, Mar 07, 2019 at 02:17:20PM -0500, Jerome Glisse wrote:
> > > It's because of all these issues that I preferred just
accessing
> > > userspace memory and handling faults. Unfortunately there does
not
> > > appear to exist an API that whitelists a specific driver along
the lines
> > > of "I checked this code for speculative info leaks,
don't add barriers
> > > on data path please".
> > 
> > Maybe it would be better to explore adding such helper then remapping
> > page into kernel address space ?
> 
> I explored it a bit (see e.g. thread around: "__get_user slower than
> get_user") and I can tell you it's not trivial given the issue is
around
> security.  So in practice it does not seem fair to keep a significant
> optimization out of kernel because *maybe* we can do it differently even
> better :)
Maybe a slightly different approach between this patchset and other
copy user API would work here. What you want really is something like
a temporary mlock on a range of memory so that it is safe for the
kernel to access range of userspace virtual address ie page are
present and with proper permission hence there can be no page fault
while you are accessing thing from kernel context.

So you can have like a range structure and mmu notifier. When you
lock the range you block mmu notifier to allow your code to work on
the userspace VA safely. Once you are done you unlock and let the
mmu notifier go on. It is pretty much exactly this patchset except
that you remove all the kernel vmap code. A nice thing about that
is that you do not need to worry about calling set page dirty it
will already be handle by the userspace VA pte. It also use less
memory than when you have kernel vmap.

This idea might be defeated by security feature where the kernel is
running in its own address space without the userspace address
space present.

Anyway just wanted to put the idea forward.

Cheers,
J?r?me

Jason Wang

2019-Mar-08 08:58 UTC

head link

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On 2019/3/8 ??3:17, Jerome Glisse wrote:> On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:
>> On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin wrote:
>>> On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
>>>> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops =
{
>>>> +	.invalidate_range = vhost_invalidate_range,
>>>> +};
>>>> +
>>>>   void vhost_dev_init(struct vhost_dev *dev,
>>>>   		    struct vhost_virtqueue **vqs, int nvqs, int iov_limit)
>>>>   {
>>> I also wonder here: when page is write protected then
>>> it does not look like .invalidate_range is invoked.
>>>
>>> E.g. mm/ksm.c calls
>>>
>>> mmu_notifier_invalidate_range_start and
>>> mmu_notifier_invalidate_range_end but not
mmu_notifier_invalidate_range.
>>>
>>> Similarly, rmap in page_mkclean_one will not call
>>> mmu_notifier_invalidate_range.
>>>
>>> If I'm right vhost won't get notified when page is
write-protected since you
>>> didn't install start/end notifiers. Note that end notifier can
be called
>>> with page locked, so it's not as straight-forward as just
adding a call.
>>> Writing into a write-protected page isn't a good idea.
>>>
>>> Note that documentation says:
>>> 	it is fine to delay the mmu_notifier_invalidate_range
>>> 	call to mmu_notifier_invalidate_range_end() outside the page table
lock.
>>> implying it's called just later.
>> OK I missed the fact that _end actually calls
>> mmu_notifier_invalidate_range internally. So that part is fine but the
>> fact that you are trying to take page lock under VQ mutex and take same
>> mutex within notifier probably means it's broken for ksm and rmap
at
>> least since these call invalidate with lock taken.
>>
>> And generally, Andrea told me offline one can not take mutex under
>> the notifier callback. I CC'd Andrea for why.
> Correct, you _can not_ take mutex or any sleeping lock from within the
> invalidate_range callback as those callback happens under the page table
> spinlock. You can however do so under the invalidate_range_start call-
> back only if it is a blocking allow callback (there is a flag passdown
> with the invalidate_range_start callback if you are not allow to block
> then return EBUSY and the invalidation will be aborted).
>
>
>> That's a separate issue from set_page_dirty when memory is file
backed.
> If you can access file back page then i suggest using set_page_dirty
> from within a special version of vunmap() so that when you vunmap you
> set the page dirty without taking page lock. It is safe to do so
> always from within an mmu notifier callback if you had the page map
> with write permission which means that the page had write permission
> in the userspace pte too and thus it having dirty pte is expected
> and calling set_page_dirty on the page is allowed without any lock.
> Locking will happen once the userspace pte are tear down through the
> page table lock.

Can I simply can set_page_dirty() before vunmap() in the mmu notifier 
callback, or is there any reason that it must be called within vumap()?

Thanks

>
>> It's because of all these issues that I preferred just accessing
>> userspace memory and handling faults. Unfortunately there does not
>> appear to exist an API that whitelists a specific driver along the
lines
>> of "I checked this code for speculative info leaks, don't add
barriers
>> on data path please".
> Maybe it would be better to explore adding such helper then remapping
> page into kernel address space ?
>
> Cheers,
> J?r?me

Michael S. Tsirkin

2019-Mar-08 12:56 UTC

head link

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Fri, Mar 08, 2019 at 04:58:44PM +0800, Jason Wang
wrote:> 
> On 2019/3/8 ??3:17, Jerome Glisse wrote:
> > On Thu, Mar 07, 2019 at 12:56:45PM -0500, Michael S. Tsirkin wrote:
> > > On Thu, Mar 07, 2019 at 10:47:22AM -0500, Michael S. Tsirkin
wrote:
> > > > On Wed, Mar 06, 2019 at 02:18:12AM -0500, Jason Wang wrote:
> > > > > +static const struct mmu_notifier_ops
vhost_mmu_notifier_ops = {
> > > > > +	.invalidate_range = vhost_invalidate_range,
> > > > > +};
> > > > > +
> > > > >   void vhost_dev_init(struct vhost_dev *dev,
> > > > >   		    struct vhost_virtqueue **vqs, int nvqs, int
iov_limit)
> > > > >   {
> > > > I also wonder here: when page is write protected then
> > > > it does not look like .invalidate_range is invoked.
> > > > 
> > > > E.g. mm/ksm.c calls
> > > > 
> > > > mmu_notifier_invalidate_range_start and
> > > > mmu_notifier_invalidate_range_end but not
mmu_notifier_invalidate_range.
> > > > 
> > > > Similarly, rmap in page_mkclean_one will not call
> > > > mmu_notifier_invalidate_range.
> > > > 
> > > > If I'm right vhost won't get notified when page is
write-protected since you
> > > > didn't install start/end notifiers. Note that end
notifier can be called
> > > > with page locked, so it's not as straight-forward as
just adding a call.
> > > > Writing into a write-protected page isn't a good idea.
> > > > 
> > > > Note that documentation says:
> > > > 	it is fine to delay the mmu_notifier_invalidate_range
> > > > 	call to mmu_notifier_invalidate_range_end() outside the
page table lock.
> > > > implying it's called just later.
> > > OK I missed the fact that _end actually calls
> > > mmu_notifier_invalidate_range internally. So that part is fine
but the
> > > fact that you are trying to take page lock under VQ mutex and
take same
> > > mutex within notifier probably means it's broken for ksm and
rmap at
> > > least since these call invalidate with lock taken.
> > > 
> > > And generally, Andrea told me offline one can not take mutex
under
> > > the notifier callback. I CC'd Andrea for why.
> > Correct, you _can not_ take mutex or any sleeping lock from within the
> > invalidate_range callback as those callback happens under the page
table
> > spinlock. You can however do so under the invalidate_range_start call-
> > back only if it is a blocking allow callback (there is a flag passdown
> > with the invalidate_range_start callback if you are not allow to block
> > then return EBUSY and the invalidation will be aborted).
> > 
> > 
> > > That's a separate issue from set_page_dirty when memory is
file backed.
> > If you can access file back page then i suggest using set_page_dirty
> > from within a special version of vunmap() so that when you vunmap you
> > set the page dirty without taking page lock. It is safe to do so
> > always from within an mmu notifier callback if you had the page map
> > with write permission which means that the page had write permission
> > in the userspace pte too and thus it having dirty pte is expected
> > and calling set_page_dirty on the page is allowed without any lock.
> > Locking will happen once the userspace pte are tear down through the
> > page table lock.
> 
> 
> Can I simply can set_page_dirty() before vunmap() in the mmu notifier
> callback, or is there any reason that it must be called within vumap()?
> 
> Thanks

I think this is what Jerome is saying, yes.
Maybe add a patch to mmu notifier doc file, documenting this?

> 
> > 
> > > It's because of all these issues that I preferred just
accessing
> > > userspace memory and handling faults. Unfortunately there does
not
> > > appear to exist an API that whitelists a specific driver along
the lines
> > > of "I checked this code for speculative info leaks,
don't add barriers
> > > on data path please".
> > Maybe it would be better to explore adding such helper then remapping
> > page into kernel address space ?
> > 
> > Cheers,
> > J?r?me

Andrea Arcangeli

2019-Mar-08 19:13 UTC

head link

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Fri, Mar 08, 2019 at 04:58:44PM +0800, Jason Wang
wrote:> Can I simply can set_page_dirty() before vunmap() in the mmu notifier 
> callback, or is there any reason that it must be called within vumap()?
I also don't see any problem in doing it before vunmap. As far as the
mmu notifier and set_page_dirty is concerned vunmap is just
put_page. It's just slower and potentially unnecessary.

Apparently Analagous Threads

Search for more possibly parallel threads

Linux Virtualization - Mar 2019 - [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

Apparently Analagous Threads