thr3ads.net - Linux Virtualization - [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address [Mar 2019]

If this information is useful, please help other people find it:
Share via:

Jerome Glisse

2019-Mar-07 20:17 UTC

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Thu, Mar 07, 2019 at 02:38:38PM -0500, Andrea Arcangeli
wrote:> On Thu, Mar 07, 2019 at 02:09:10PM -0500, Jerome Glisse wrote:
> > I thought this patch was only for anonymous memory ie not file back ?
> 
> Yes, the other common usages are on hugetlbfs/tmpfs that also don't
> need to implement writeback and are obviously safe too.
> 
> > If so then set dirty is mostly useless it would only be use for swap
> > but for this you can use an unlock version to set the page dirty.
> 
> It's not a practical issue but a security issue perhaps: you can
> change the KVM userland to run on VM_SHARED ext4 as guest physical
> memory, you could do that with the qemu command line that is used to
> place it on tmpfs or hugetlbfs for example and some proprietary KVM
> userland may do for other reasons. In general it shouldn't be possible
> to crash the kernel with this, and it wouldn't be nice to fail if
> somebody decides to put VM_SHARED ext4 (we could easily allow vhost
> ring only backed by anon or tmpfs or hugetlbfs to solve this of
> course).
> 
> It sounds like we should at least optimize away the _lock from
> set_page_dirty if it's anon/hugetlbfs/tmpfs, would be nice if there
> was a clean way to do that.
> 
> Now assuming we don't nak the use on ext4 VM_SHARED and we stick to
> set_page_dirty_lock for such case: could you recap how that
> __writepage ext4 crash was solved if try_to_free_buffers() run on a
> pinned GUP page (in our vhost case try_to_unmap would have gotten rid
> of the pins through the mmu notifier and the page would have been
> freed just fine).
So for the above the easiest thing is to call set_page_dirty() from
the mmu notifier callback. It is always safe to use the non locking
variant from such callback. Well it is safe only if the page was
map with write permission prior to the callback so here i assume
nothing stupid is going on and that you only vmap page with write
if they have a CPU pte with write and if not then you force a write
page fault.

Basicly from mmu notifier callback you have the same right as zap
pte has.> 
> The first two things that come to mind is that we can easily forbid
> the try_to_free_buffers() if the page might be pinned by GUP, it has
> false positives with the speculative pagecache lookups but it cannot
> give false negatives. We use those checks to know when a page is
> pinned by GUP, for example, where we cannot merge KSM pages with gup
> pins etc... However what if the elevated refcount wasn't there when
> try_to_free_buffers run and is there when __remove_mapping runs?
> 
> What I mean is that it sounds easy to forbid try_to_free_buffers for
> the long term pins, but that still won't prevent the same exact issue
> for a transient pin (except the window to trigger it will be much smaller).
I think here you do not want to go down the same path as what is being
plane for GUP. GUP is being fix for "broken" hardware. Myself i am
converting proper hardware to no longer use GUP but rely on mmu notifier.

So i would not do any dance with blocking try_to_free_buffer, just
do everything from mmu notifier callback and you are fine.
> 
> I basically don't see how long term GUP pins breaks stuff in ext4
> while transient short term GUP pins like O_DIRECT don't. The VM code
> isn't able to disambiguate if the pin is short or long term and it
> won't even be able to tell the difference between a GUP pin (long or
> short term) and a speculative get_page_unless_zero run by the
> pagecache speculative pagecache lookup. Even a random speculative
> pagecache lookup that runs just before __remove_mapping, can cause
> __remove_mapping to fail despite try_to_free_buffers() succeeded
> before it (like if there was a transient or long term GUP
> pin). speculative lookup that can happen across all page struct at all
> times and they will cause page_ref_freeze in __remove_mapping to
> fail.
> 
> I'm sure I'm missing details on the ext4 __writepage problem and
how
> set_page_dirty_lock broke stuff with long term GUP pins, so I'm
> asking...
O_DIRECT can suffer from the same issue but the race window for that
is small enough that it is unlikely it ever happened. But for device
driver that GUP page for hours/days/weeks/months ... obviously the
race window is big enough here. It affects many fs (ext4, xfs, ...)
in different ways. I think ext4 is the most obvious because of the
kernel log trace it leaves behind.

Bottom line is for set_page_dirty to be safe you need the following:
    lock_page()
    page_mkwrite()
    set_pte_with_write()
    unlock_page()

Now when loosing the write permission on the pte you will first get
a mmu notifier callback so anyone that abide by mmu notifier is fine
as long as they only write to the page if they found a pte with
write as it means the above sequence did happen and page is write-
able until the mmu notifier callback happens.

When you lookup a page into the page cache you still need to call
page_mkwrite() before installing a write-able pte.

Here for this vmap thing all you need is that the original user
pte had the write flag. If you only allow write in the vmap when
the original pte had write and you abide by mmu notifier then it
is ok to call set_page_dirty from the mmu notifier (but not after).

Hence why my suggestion is a special vunmap that call set_page_dirty
on the page from the mmu notifier.

Cheers,
J?r?me

Andrea Arcangeli

2019-Mar-07 21:27 UTC

head link

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

Hello Jerome,

On Thu, Mar 07, 2019 at 03:17:22PM -0500, Jerome Glisse
wrote:> So for the above the easiest thing is to call set_page_dirty() from
> the mmu notifier callback. It is always safe to use the non locking
> variant from such callback. Well it is safe only if the page was
> map with write permission prior to the callback so here i assume
> nothing stupid is going on and that you only vmap page with write
> if they have a CPU pte with write and if not then you force a write
> page fault.
So if the GUP doesn't set FOLL_WRITE, set_page_dirty simply shouldn't
be called in such case. It only ever makes sense if the pte is
writable.

On a side note, the reason the write bit on the pte enabled avoids the
need of the _lock suffix is because of the stable page writeback
guarantees?
> Basicly from mmu notifier callback you have the same right as zap
> pte has.
Good point.

Related to this I already was wondering why the set_page_dirty is not
done in the invalidate. Reading the patch it looks like the dirty is
marked dirty when the ring wraps around, not in the invalidate, Jeson
can tell if I misread something there.

For transient data passing through the ring, nobody should care if
it's lost. It's not user-journaled anyway so it could hit the disk in
any order. The only reason to flush it to do disk is if there's memory
pressure (to pageout like a swapout) and in such case it's enough to
mark it dirty only in the mmu notifier invalidate like you pointed out
(and only if GUP was called with FOLL_WRITE).
> O_DIRECT can suffer from the same issue but the race window for that
> is small enough that it is unlikely it ever happened. But for device
Ok that clarifies things.
> driver that GUP page for hours/days/weeks/months ... obviously the
> race window is big enough here. It affects many fs (ext4, xfs, ...)
> in different ways. I think ext4 is the most obvious because of the
> kernel log trace it leaves behind.
> 
> Bottom line is for set_page_dirty to be safe you need the following:
>     lock_page()
>     page_mkwrite()
>     set_pte_with_write()
>     unlock_page()
I also wondered why ext4 writepage doesn't recreate the bh if they got
dropped by the VM and page->private is 0. I mean, page->index and
page->mapping are still there, that's enough info for writepage itself
to take a slow path and calls page_mkwrite to find where to write the
page on disk.
> Now when loosing the write permission on the pte you will first get
> a mmu notifier callback so anyone that abide by mmu notifier is fine
> as long as they only write to the page if they found a pte with
> write as it means the above sequence did happen and page is write-
> able until the mmu notifier callback happens.
>
> When you lookup a page into the page cache you still need to call
> page_mkwrite() before installing a write-able pte.
> 
> Here for this vmap thing all you need is that the original user
> pte had the write flag. If you only allow write in the vmap when
> the original pte had write and you abide by mmu notifier then it
> is ok to call set_page_dirty from the mmu notifier (but not after).
> 
> Hence why my suggestion is a special vunmap that call set_page_dirty
> on the page from the mmu notifier.
Agreed, that will solve all issues in vhost context with regard to
set_page_dirty, including the case the memory is backed by VM_SHARED ext4.

Thanks!
Andrea

Jason Wang

2019-Mar-08 09:13 UTC

head link

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On 2019/3/8 ??5:27, Andrea Arcangeli wrote:> Hello Jerome,
>
> On Thu, Mar 07, 2019 at 03:17:22PM -0500, Jerome Glisse wrote:
>> So for the above the easiest thing is to call set_page_dirty() from
>> the mmu notifier callback. It is always safe to use the non locking
>> variant from such callback. Well it is safe only if the page was
>> map with write permission prior to the callback so here i assume
>> nothing stupid is going on and that you only vmap page with write
>> if they have a CPU pte with write and if not then you force a write
>> page fault.
> So if the GUP doesn't set FOLL_WRITE, set_page_dirty simply
shouldn't
> be called in such case. It only ever makes sense if the pte is
> writable.
>
> On a side note, the reason the write bit on the pte enabled avoids the
> need of the _lock suffix is because of the stable page writeback
> guarantees?
>
>> Basicly from mmu notifier callback you have the same right as zap
>> pte has.
> Good point.
>
> Related to this I already was wondering why the set_page_dirty is not
> done in the invalidate. Reading the patch it looks like the dirty is
> marked dirty when the ring wraps around, not in the invalidate, Jeson
> can tell if I misread something there.

Actually not wrapping around,? the pages for used ring was marked as 
dirty after a round of virtqueue processing when we're sure vhost wrote 
something there.

Thanks

>
> For transient data passing through the ring, nobody should care if
> it's lost. It's not user-journaled anyway so it could hit the disk
in
> any order. The only reason to flush it to do disk is if there's memory
> pressure (to pageout like a swapout) and in such case it's enough to
> mark it dirty only in the mmu notifier invalidate like you pointed out
> (and only if GUP was called with FOLL_WRITE).
>
>> O_DIRECT can suffer from the same issue but the race window for that
>> is small enough that it is unlikely it ever happened. But for device
> Ok that clarifies things.
>
>> driver that GUP page for hours/days/weeks/months ... obviously the
>> race window is big enough here. It affects many fs (ext4, xfs, ...)
>> in different ways. I think ext4 is the most obvious because of the
>> kernel log trace it leaves behind.
>>
>> Bottom line is for set_page_dirty to be safe you need the following:
>>      lock_page()
>>      page_mkwrite()
>>      set_pte_with_write()
>>      unlock_page()
> I also wondered why ext4 writepage doesn't recreate the bh if they got
> dropped by the VM and page->private is 0. I mean, page->index and
> page->mapping are still there, that's enough info for writepage
itself
> to take a slow path and calls page_mkwrite to find where to write the
> page on disk.
>
>> Now when loosing the write permission on the pte you will first get
>> a mmu notifier callback so anyone that abide by mmu notifier is fine
>> as long as they only write to the page if they found a pte with
>> write as it means the above sequence did happen and page is write-
>> able until the mmu notifier callback happens.
>>
>> When you lookup a page into the page cache you still need to call
>> page_mkwrite() before installing a write-able pte.
>>
>> Here for this vmap thing all you need is that the original user
>> pte had the write flag. If you only allow write in the vmap when
>> the original pte had write and you abide by mmu notifier then it
>> is ok to call set_page_dirty from the mmu notifier (but not after).
>>
>> Hence why my suggestion is a special vunmap that call set_page_dirty
>> on the page from the mmu notifier.
> Agreed, that will solve all issues in vhost context with regard to
> set_page_dirty, including the case the memory is backed by VM_SHARED ext4.
>
> Thanks!
> Andrea

Jan Kara

2019-Mar-11 14:45 UTC

head link

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

On Thu 07-03-19 16:27:17, Andrea Arcangeli wrote:> > driver that GUP page for hours/days/weeks/months ... obviously the
> > race window is big enough here. It affects many fs (ext4, xfs, ...)
> > in different ways. I think ext4 is the most obvious because of the
> > kernel log trace it leaves behind.
> > 
> > Bottom line is for set_page_dirty to be safe you need the following:
> >     lock_page()
> >     page_mkwrite()
> >     set_pte_with_write()
> >     unlock_page()
> 
> I also wondered why ext4 writepage doesn't recreate the bh if they got
> dropped by the VM and page->private is 0. I mean, page->index and
> page->mapping are still there, that's enough info for writepage
itself
> to take a slow path and calls page_mkwrite to find where to write the
> page on disk.
There are two problems:

1) What to do with errors that page_mkwrite() can generate (ENOMEM, ENOSPC,
EIO). On page fault you just propagate them to userspace, on set_page_dirty()
you have no chance so you just silently loose data.

2) We need various locks to protect page_mkwrite(), possibly do some IO.
set_page_dirty() is rather uncertain context to acquire locks or do IO...

									Honza
-- 
Jan Kara <jack at suse.com>
SUSE Labs, CR

Reasonably Related Threads

Search for more possibly parallel threads

Linux Virtualization - Mar 2019 - [RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address

Reasonably Related Threads