Jason Wang
2019-Mar-08 09:13 UTC
[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address
On 2019/3/8 5:27 AM, Andrea Arcangeli wrote:
> Hello Jerome,
>
> On Thu, Mar 07, 2019 at 03:17:22PM -0500, Jerome Glisse wrote:
>> So for the above the easiest thing is to call set_page_dirty() from
>> the mmu notifier callback. It is always safe to use the non-locking
>> variant from such a callback. Well, it is safe only if the page was
>> mapped with write permission prior to the callback, so here I assume
>> nothing stupid is going on and that you only vmap pages with write
>> if they have a CPU pte with write, and if not then you force a write
>> page fault.
> So if the GUP doesn't set FOLL_WRITE, set_page_dirty simply shouldn't
> be called in such a case. It only ever makes sense if the pte is
> writable.
>
> On a side note: the reason having the write bit enabled on the pte
> avoids the need for the _lock suffix is because of the stable page
> writeback guarantees?
>
>> Basically from the mmu notifier callback you have the same rights as
>> zap pte has.
> Good point.
>
> Related to this, I was already wondering why the set_page_dirty is not
> done in the invalidate. Reading the patch it looks like the pages are
> marked dirty when the ring wraps around, not in the invalidate; Jason
> can tell if I misread something there.

Actually not wrapping around: the pages for the used ring were marked
as dirty after a round of virtqueue processing, when we're sure vhost
wrote something there.

Thanks

> For transient data passing through the ring, nobody should care if
> it's lost. It's not user-journaled anyway, so it could hit the disk in
> any order. The only reason to flush it to disk is if there's memory
> pressure (to pageout like a swapout), and in such a case it's enough
> to mark it dirty only in the mmu notifier invalidate like you pointed
> out (and only if GUP was called with FOLL_WRITE).
>
>> O_DIRECT can suffer from the same issue, but the race window for that
>> is small enough that it is unlikely it ever happened. But for a device
> Ok, that clarifies things.
>
>> driver that GUPs pages for hours/days/weeks/months ... obviously the
>> race window is big enough here. It affects many fs (ext4, xfs, ...)
>> in different ways. I think ext4 is the most obvious because of the
>> kernel log trace it leaves behind.
>>
>> Bottom line is, for set_page_dirty to be safe you need the following:
>>     lock_page()
>>     page_mkwrite()
>>     set_pte_with_write()
>>     unlock_page()
> I also wondered why ext4 writepage doesn't recreate the bh if they got
> dropped by the VM and page->private is 0. I mean, page->index and
> page->mapping are still there; that's enough info for writepage itself
> to take a slow path and call page_mkwrite to find where to write the
> page on disk.
>
>> Now, when losing the write permission on the pte, you will first get
>> an mmu notifier callback, so anyone that abides by mmu notifiers is
>> fine as long as they only write to the page if they found a pte with
>> write, as it means the above sequence did happen and the page is
>> writable until the mmu notifier callback happens.
>>
>> When you look up a page in the page cache, you still need to call
>> page_mkwrite() before installing a writable pte.
>>
>> Here for this vmap thing all you need is that the original user
>> pte had the write flag. If you only allow write in the vmap when
>> the original pte had write, and you abide by mmu notifiers, then it
>> is ok to call set_page_dirty from the mmu notifier (but not after).
>>
>> Hence why my suggestion is a special vunmap that calls set_page_dirty
>> on the pages from the mmu notifier.
> Agreed, that will solve all issues in vhost context with regard to
> set_page_dirty, including the case where the memory is backed by
> VM_SHARED ext4.
>
> Thanks!
> Andrea
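[To make the scheme under discussion concrete, here is a minimal sketch
of a notifier-driven teardown that marks the pages dirty before dropping
the pins, per Jerome's suggestion. The struct, its fields, and the
helper name are hypothetical illustrations, not taken from the RFC
patch, and serialization against a vhost worker still using the mapping
is omitted.]

	#include <linux/mm.h>
	#include <linux/mmu_notifier.h>
	#include <linux/vmalloc.h>

	/* Hypothetical per-mapping state; not the layout of the RFC patch. */
	struct vhost_vmap {
		struct mmu_notifier mn;
		struct page **pages;
		int npages;
		bool write;		/* was gup_fast() called with write=1? */
		void *kaddr;		/* vmap() of pages, NULL once torn down */
		unsigned long uaddr;	/* userspace start of the mapping */
		unsigned long size;
	};

	static int vhost_vmap_invalidate_range_start(struct mmu_notifier *mn,
					const struct mmu_notifier_range *range)
	{
		struct vhost_vmap *map = container_of(mn, struct vhost_vmap, mn);
		int i;

		/* Ignore invalidations that don't overlap our mapping. */
		if (!map->kaddr || range->end <= map->uaddr ||
		    range->start >= map->uaddr + map->size)
			return 0;

		/* Stop kernel-side access before the pages can change. */
		vunmap(map->kaddr);
		map->kaddr = NULL;

		for (i = 0; i < map->npages; i++) {
			/*
			 * Safe here per Jerome: inside the notifier we hold
			 * the same rights as zap pte, and the original pte
			 * was writable, so the non-locking variant is enough.
			 * Never after the notifier has returned.
			 */
			if (map->write)
				set_page_dirty(map->pages[i]);
			put_page(map->pages[i]);
		}
		return 0;
	}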
Andrea Arcangeli
2019-Mar-08 19:11 UTC
[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address
On Fri, Mar 08, 2019 at 05:13:26PM +0800, Jason Wang wrote:
> Actually not wrapping around: the pages for the used ring were marked
> as dirty after a round of virtqueue processing, when we're sure vhost
> wrote something there.

Thanks for the clarification. So we need to convert it to
set_page_dirty and move it to the mmu notifier invalidate, but only in
those cases where gup_fast was called with write=1 (1 out of 3).

If using ->invalidate_range, the page pin also must be removed
immediately after get_user_pages returns (it is not ok to hold the pin
in vmap until ->invalidate_range is called) to avoid false positive gup
pin checks in things like KSM, or the pin must be released in
invalidate_range_start (which is called before the pin checks).

Here's why:

	/*
	 * Check that no O_DIRECT or similar I/O is in progress on the
	 * page
	 */
	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
		goto out_unlock;
	}
	[..]
	set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
	^^^^^^^ too late to release the pin here, the
		above already failed

->invalidate_range cannot be used with a mutex anyway, so you need to
go back to invalidate_range_start/end; just the pin must be released in
_start at the latest in such case.

My preference is generally to call gup_fast() followed by an immediate
put_page(), because I always want to drop FOLL_GET from gup_fast as a
whole to avoid 2 useless atomic ops per gup_fast.

I'll write more about vmap in my answer to the other email.

Thanks,
Andrea
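[A minimal sketch of that gup_fast()-then-put_page() calling
convention, for illustration. The helper name is made up; and since the
FOLL_GET-less gup_fast Andrea mentions does not exist yet, the pins are
still taken and then dropped right away, with an mmu notifier
(registered beforehand, as above) keeping the vmap coherent.]

	#include <linux/mm.h>
	#include <linux/vmalloc.h>

	/*
	 * Pin, vmap, then drop the pins immediately so KSM's
	 * page_count() vs page_mapcount() check never trips over a
	 * long-lived gup reference. The mmu notifier, not the pin, is
	 * what keeps the mapping coherent afterwards.
	 */
	static void *vmap_user_range(unsigned long uaddr, int npages,
				     int write, struct page **pages)
	{
		void *kaddr = NULL;
		int got, i;

		got = get_user_pages_fast(uaddr, npages, write, pages);
		if (got == npages)
			kaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);

		/* Release every pin right away, even on success. */
		for (i = 0; i < got; i++)
			put_page(pages[i]);

		return kaddr;	/* NULL on partial gup or vmap failure */
	}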
Jason Wang
2019-Mar-11 07:21 UTC
[RFC PATCH V2 5/5] vhost: access vq metadata through kernel virtual address
On 2019/3/9 3:11 AM, Andrea Arcangeli wrote:
> On Fri, Mar 08, 2019 at 05:13:26PM +0800, Jason Wang wrote:
>> Actually not wrapping around: the pages for the used ring were marked
>> as dirty after a round of virtqueue processing, when we're sure vhost
>> wrote something there.
> Thanks for the clarification. So we need to convert it to
> set_page_dirty and move it to the mmu notifier invalidate, but only in
> those cases where gup_fast was called with write=1 (1 out of 3).
>
> If using ->invalidate_range, the page pin also must be removed
> immediately after get_user_pages returns (it is not ok to hold the pin
> in vmap until ->invalidate_range is called) to avoid false positive
> gup pin checks in things like KSM, or the pin must be released in
> invalidate_range_start (which is called before the pin checks).
>
> Here's why:
>
> 	/*
> 	 * Check that no O_DIRECT or similar I/O is in progress on the
> 	 * page
> 	 */
> 	if (page_mapcount(page) + 1 + swapped != page_count(page)) {
> 		set_pte_at(mm, pvmw.address, pvmw.pte, entry);
> 		goto out_unlock;
> 	}
> 	[..]
> 	set_pte_at_notify(mm, pvmw.address, pvmw.pte, entry);
> 	^^^^^^^ too late to release the pin here, the
> 		above already failed
>
> ->invalidate_range cannot be used with a mutex anyway, so you need to
> go back to invalidate_range_start/end; just the pin must be released
> in _start at the latest in such case.

Yes.

> My preference is generally to call gup_fast() followed by an immediate
> put_page(), because I always want to drop FOLL_GET from gup_fast as a
> whole to avoid 2 useless atomic ops per gup_fast.

Ok, will do this (if I still plan to use vmap() in the next version).

> I'll write more about vmap in my answer to the other email.
>
> Thanks,
> Andrea

Thanks