Jason Wang
2019-Mar-08 09:13 UTC
[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address
On 2019/3/8 5:27 AM, Andrea Arcangeli wrote:
> Hello Jerome,
>
> On Thu, Mar 07, 2019 at 03:17:22PM -0500, Jerome Glisse wrote:
>> So for the above the easiest thing is to call set_page_dirty() from
>> the mmu notifier callback. It is always safe to use the non-locking
>> variant from such a callback. Well, it is safe only if the page was
>> mapped with write permission prior to the callback, so here I assume
>> nothing stupid is going on and that you only vmap pages with write
>> if they have a CPU pte with write, and if not then you force a write
>> page fault.
> So if the GUP doesn't set FOLL_WRITE, set_page_dirty simply shouldn't
> be called in such a case. It only ever makes sense if the pte is
> writable.
>
> On a side note: the reason having the write bit enabled on the pte
> avoids the need for the _lock suffix is because of the stable page
> writeback guarantees?
>
>> Basically from the mmu notifier callback you have the same rights as
>> zap pte has.
> Good point.
>
> Related to this, I was already wondering why the set_page_dirty is not
> done in the invalidate. Reading the patch it looks like the pages are
> marked dirty when the ring wraps around, not in the invalidate; Jason
> can tell if I misread something there.

Actually not wrapping around: the pages for the used ring were marked
as dirty after a round of virtqueue processing, when we're sure vhost
wrote something there.

Thanks

> For transient data passing through the ring, nobody should care if
> it's lost. It's not user-journaled anyway, so it could hit the disk in
> any order. The only reason to flush it to disk is if there's memory
> pressure (to pageout like a swapout), and in such a case it's enough
> to mark it dirty only in the mmu notifier invalidate like you pointed
> out (and only if GUP was called with FOLL_WRITE).
>
>> O_DIRECT can suffer from the same issue, but the race window for that
>> is small enough that it is unlikely it ever happened. But for a device
> Ok, that clarifies things.
>
>> driver that GUPs pages for hours/days/weeks/months ... obviously the
>> race window is big enough here. It affects many fs (ext4, xfs, ...)
>> in different ways. I think ext4 is the most obvious because of the
>> kernel log trace it leaves behind.
>>
>> Bottom line is, for set_page_dirty to be safe you need the following:
>>     lock_page()
>>     page_mkwrite()
>>     set_pte_with_write()
>>     unlock_page()
> I also wondered why ext4 writepage doesn't recreate the bh if they got
> dropped by the VM and page->private is 0. I mean, page->index and
> page->mapping are still there; that's enough info for writepage itself
> to take a slow path and call page_mkwrite to find where to write the
> page on disk.
>
>> Now, when losing the write permission on the pte, you will first get
>> an mmu notifier callback, so anyone that abides by mmu notifiers is
>> fine as long as they only write to the page if they found a pte with
>> write, as it means the above sequence did happen and the page is
>> writable until the mmu notifier callback happens.
>>
>> When you look up a page in the page cache, you still need to call
>> page_mkwrite() before installing a writable pte.
>>
>> Here for this vmap thing all you need is that the original user
>> pte had the write flag. If you only allow write in the vmap when
>> the original pte had write, and you abide by mmu notifiers, then it
>> is ok to call set_page_dirty from the mmu notifier (but not after).
>>
>> Hence why my suggestion is a special vunmap that calls set_page_dirty
>> on the pages from the mmu notifier.
> Agreed, that will solve all issues in vhost context with regard to
> set_page_dirty, including the case where the memory is backed by
> VM_SHARED ext4.
>
> Thanks!
> Andrea
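[To make the scheme under discussion concrete, here is a minimal sketch
of a notifier-driven teardown that marks the pages dirty before dropping
the pins, per Jerome's suggestion. The struct, its fields, and the
helper name are hypothetical illustrations, not taken from the RFC
patch, and serialization against a vhost worker still using the mapping
is omitted.]

	#include <linux/mm.h>
	#include <linux/mmu_notifier.h>
	#include <linux/vmalloc.h>

	/* Hypothetical per-mapping state; not the layout of the RFC patch. */
	struct vhost_vmap {
		struct mmu_notifier mn;
		struct page **pages;
		int npages;
		bool write;		/* was gup_fast() called with write=1? */
		void *kaddr;		/* vmap() of pages, NULL once torn down */
		unsigned long uaddr;	/* userspace start of the mapping */
		unsigned long size;
	};

	static int vhost_vmap_invalidate_range_start(struct mmu_notifier *mn,
					const struct mmu_notifier_range *range)
	{
		struct vhost_vmap *map = container_of(mn, struct vhost_vmap, mn);
		int i;

		/* Ignore invalidations that don't overlap our mapping. */
		if (!map->kaddr || range->end <= map->uaddr ||
		    range->start >= map->uaddr + map->size)
			return 0;

		/* Stop kernel-side access before the pages can change. */
		vunmap(map->kaddr);
		map->kaddr = NULL;

		for (i = 0; i < map->npages; i++) {
			/*
			 * Safe here per Jerome: inside the notifier we hold
			 * the same rights as zap pte, and the original pte
			 * was writable, so the non-locking variant is enough.
			 * Never after the notifier has returned.
			 */
			if (map->write)
				set_page_dirty(map->pages[i]);
			put_page(map->pages[i]);
		}
		return 0;
	}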
Andrea Arcangeli
2019-Mar-08 19:11 UTC
[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address
On Fri, Mar 08, 2019 at 05:13:26PM +0800, Jason Wang wrote:
> Actually not wrapping around: the pages for the used ring were marked
> as dirty after a round of virtqueue processing, when we're sure vhost
> wrote something there.

Thanks for the clarification. So we need to convert it to
set_page_dirty and move it to the mmu notifier invalidate, but only in
those cases where gup_fast was called with write=1 (1 out of 3).

If using ->invalidate_range, the page pin also must be removed
immediately after get_user_pages returns (it is not ok to hold the pin
in vmap until ->invalidate_range is called) to avoid false positive gup
pin checks in things like KSM, or the pin must be released in
invalidate_range_start (which is called before the pin checks).

Here's why:

	/*
	 * Check that no O_DIRECT or similar I/O is in progress on the
	 * page
	 */
	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
		goto out_unlock;
	}
	[..]
	set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
	^^^^^^^ too late to release the pin here, the
		above already failed

->invalidate_range cannot be used with a mutex anyway, so you need to
go back to invalidate_range_start/end; just the pin must be released in
_start at the latest in such case.

My preference is generally to call gup_fast() followed by an immediate
put_page(), because I always want to drop FOLL_GET from gup_fast as a
whole to avoid 2 useless atomic ops per gup_fast.

I'll write more about vmap in my answer to the other email.

Thanks,
Andrea
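[A minimal sketch of that gup_fast()-then-put_page() calling
convention, for illustration. The helper name is made up; and since the
FOLL_GET-less gup_fast Andrea mentions does not exist yet, the pins are
still taken and then dropped right away, with an mmu notifier
(registered beforehand, as above) keeping the vmap coherent.]

	#include <linux/mm.h>
	#include <linux/vmalloc.h>

	/*
	 * Pin, vmap, then drop the pins immediately so KSM's
	 * page_count() vs page_mapcount() check never trips over a
	 * long-lived gup reference. The mmu notifier, not the pin, is
	 * what keeps the mapping coherent afterwards.
	 */
	static void *vmap_user_range(unsigned long uaddr, int npages,
				     int write, struct page **pages)
	{
		void *kaddr = NULL;
		int got, i;

		got = get_user_pages_fast(uaddr, npages, write, pages);
		if (got == npages)
			kaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);

		/* Release every pin right away, even on success. */
		for (i = 0; i < got; i++)
			put_page(pages[i]);

		return kaddr;	/* NULL on partial gup or vmap failure */
	}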
Jason Wang
2019-Mar-11 07:21 UTC
[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address
On 2019/3/9 3:11 AM, Andrea Arcangeli wrote:
> On Fri, Mar 08, 2019 at 05:13:26PM +0800, Jason Wang wrote:
>> Actually not wrapping around: the pages for the used ring were marked
>> as dirty after a round of virtqueue processing, when we're sure vhost
>> wrote something there.
> Thanks for the clarification. So we need to convert it to
> set_page_dirty and move it to the mmu notifier invalidate, but only in
> those cases where gup_fast was called with write=1 (1 out of 3).
>
> If using ->invalidate_range, the page pin also must be removed
> immediately after get_user_pages returns (it is not ok to hold the pin
> in vmap until ->invalidate_range is called) to avoid false positive
> gup pin checks in things like KSM, or the pin must be released in
> invalidate_range_start (which is called before the pin checks).
>
> Here's why:
>
> 	/*
> 	 * Check that no O_DIRECT or similar I/O is in progress on the
> 	 * page
> 	 */
> 	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
> 		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
> 		goto out_unlock;
> 	}
> 	[..]
> 	set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
> 	^^^^^^^ too late to release the pin here, the
> 		above already failed
>
> ->invalidate_range cannot be used with a mutex anyway, so you need to
> go back to invalidate_range_start/end; just the pin must be released
> in _start at the latest in such case.

Yes.

> My preference is generally to call gup_fast() followed by an immediate
> put_page(), because I always want to drop FOLL_GET from gup_fast as a
> whole to avoid 2 useless atomic ops per gup_fast.

Ok, will do this (if I still plan to use vmap() in the next version).

> I'll write more about vmap in my answer to the other email.
>
> Thanks,
> Andrea

Thanks