Peter Xu
2021-May-18 17:27 UTC
[Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access
Alistair,

While I have one reply below to your previous email, I also looked at the
rest of the code (mainly the restore and fork sides) today and added a few
more comments.

On Tue, May 18, 2021 at 11:19:05PM +1000, Alistair Popple wrote:

[...]

> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 3a5705cfc891..556ff396f2e9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -700,6 +700,84 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >  }
> > >  #endif
> > >
> > > +static void restore_exclusive_pte(struct vm_area_struct *vma,
> > > +                                  struct page *page, unsigned long address,
> > > +                                  pte_t *ptep)
> > > +{
> > > +        pte_t pte;
> > > +        swp_entry_t entry;
> > > +
> > > +        pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
> > > +        if (pte_swp_soft_dirty(*ptep))
> > > +                pte = pte_mksoft_dirty(pte);
> > > +
> > > +        entry = pte_to_swp_entry(*ptep);
> > > +        if (pte_swp_uffd_wp(*ptep))
> > > +                pte = pte_mkuffd_wp(pte);
> > > +        else if (is_writable_device_exclusive_entry(entry))
> > > +                pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> > > +
> > > +        set_pte_at(vma->vm_mm, address, ptep, pte);
> > > +
> > > +        /*
> > > +         * No need to take a page reference as one was already
> > > +         * created when the swap entry was made.
> > > +         */
> > > +        if (PageAnon(page))
> > > +                page_add_anon_rmap(page, vma, address, false);
> > > +        else
> > > +                page_add_file_rmap(page, false);

This seems to be another leftover; maybe WARN_ON_ONCE(!PageAnon(page))?
(Sketch below the hunk.)

> > > +
> > > +        if (vma->vm_flags & VM_LOCKED)
> > > +                mlock_vma_page(page);
> > > +
> > > +        /*
> > > +         * No need to invalidate - it was non-present before. However
> > > +         * secondary CPUs may have mappings that need invalidating.
> > > +         */
> > > +        update_mmu_cache(vma, address, ptep);
> > > +}
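Something like this is what I mean - a tiny, untested sketch, assuming
anonymous pages really are the only case this path needs to handle for now:

        /*
         * Device exclusive entries are currently only installed for
         * anonymous pages, so anything else reaching here is unexpected.
         */
        WARN_ON_ONCE(!PageAnon(page));
        page_add_anon_rmap(page, vma, address, false);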
[...]

> > >  /*
> > >   * copy one vm_area from one task to the other. Assumes the page tables
> > >   * already present in the new task to be cleared in the whole range
> > >
> > > @@ -781,6 +859,12 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >                          pte = pte_swp_mkuffd_wp(pte);
> > >                          set_pte_at(src_mm, addr, src_pte, pte);
> > >                  }
> > > +        } else if (is_device_exclusive_entry(entry)) {
> > > +                /* COW mappings should be dealt with by removing the entry */

Here the comment says "removing the entry", however I don't think it removes
the pte; instead it keeps it (there's no "return", so set_pte_at() will still
be called below), so I got a bit confused.

> > > +                VM_BUG_ON(is_cow_mapping(vm_flags));

Also here: if PageAnon() is the only case supported so far, doesn't that
easily satisfy is_cow_mapping()?  Maybe I missed something..

I also have a pure and high-level question regarding a process fork() when
there are device exclusive ptes: would the two processes then own the device
together?  Is this a real use case?

Indeed it'll be odd for a COW page, since for a COW page it means that after
the parent/child writes to it the page will be cloned into two, and then it's
a mystery which one will be the one "exclusively owned" by the device..

> > > +                page = pfn_swap_entry_to_page(entry);
> > > +                get_page(page);
> > > +                rss[mm_counter(page)]++;
> > >          }
> > >          set_pte_at(dst_mm, addr, dst_pte, pte);
> > >          return 0;
> > >
> > > @@ -947,6 +1031,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> > >          int rss[NR_MM_COUNTERS];
> > >          swp_entry_t entry = (swp_entry_t){0};
> > >          struct page *prealloc = NULL;
> > > +        struct page *locked_page = NULL;
> > >
> > >  again:
> > >          progress = 0;
> > >
> > > @@ -980,13 +1065,36 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> > >                          continue;
> > >                  }
> > >                  if (unlikely(!pte_present(*src_pte))) {
> > > -                        entry.val = copy_nonpresent_pte(dst_mm, src_mm,
> > > -                                                        dst_pte, src_pte,
> > > -                                                        src_vma, addr, rss);
> > > -                        if (entry.val)
> > > -                                break;
> > > -                        progress += 8;
> > > -                        continue;
> > > +                        swp_entry_t swp_entry = pte_to_swp_entry(*src_pte);

(Just a side note to all of us: this will be one more place that I'll need to
look after in my uffd-wp series if this series lands first, as after that
series we can only call pte_to_swp_entry() after a pte_has_swap_entry() check,
since sometimes a non-present pte won't contain a swap entry at all.)

> > > +
> > > +                        if (unlikely(is_cow_mapping(src_vma->vm_flags) &&
> > > +                                     is_device_exclusive_entry(swp_entry))) {
> > > +                                /*
> > > +                                 * Normally this would require sending mmu
> > > +                                 * notifiers, but copy_page_range() has already
> > > +                                 * done that for COW mappings.
> > > +                                 */
> > > +                                ret = try_restore_exclusive_pte(src_mm, src_pte,
> > > +                                                                src_vma, addr,
> > > +                                                                &locked_page);
> > > +                                if (ret == -EBUSY)
> > > +                                        break;

Would it be possible to put all the handling of device exclusive ptes into
copy_nonpresent_pte()?  IMHO device exclusive ptes are still one kind of
non-present pte, and splitting the cases really makes it (at least to me...)
even harder to read.  Maybe you wanted to avoid reworking copy_nonpresent_pte()
as it currently returns an entry.val, which is indeed not straightforward
already..  I wanted to clean that up but haven't yet.

An easier option is perhaps to fail the fork() directly when trylock_page()
fails while restoring the pte?  Then userspace could retry the whole fork().
However that'll also depend on my previous question of whether this is a valid
scenario at all.  If "maintaining fork correctness" is the only thing we're
pursuing, maybe it's still worth considering?
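Completely untested - just to show the shape of what I mean.  It assumes
copy_nonpresent_pte() is reworked to return an errno, and it borrows a
made-up -EAGAIN convention to tell the caller the pte has become present and
should be redone via copy_present_pte():

        } else if (is_device_exclusive_entry(entry)) {
                page = pfn_swap_entry_to_page(entry);

                if (is_cow_mapping(vm_flags)) {
                        /*
                         * Restore the original pte in the parent instead of
                         * copying the exclusive entry.  If the page can't be
                         * locked, fail the fork() and let userspace retry.
                         */
                        if (!trylock_page(page))
                                return -EBUSY;
                        restore_exclusive_pte(vma, page, addr, src_pte);
                        unlock_page(page);
                        /* the pte is present now; redo it as a present pte */
                        return -EAGAIN;
                }

                /* non-COW: copy the exclusive entry to the child as-is */
                get_page(page);
                rss[mm_counter(page)]++;
        }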
If "maintaining fork correctness" is the only thing we persue for, maybe still worth to consider?> > > + } else { > > > + entry.val = copy_nonpresent_pte(dst_mm, > > > src_mm, + > > > dst_pte, src_pte, + > > > src_vma, addr, + > > > rss); + if (entry.val) > > > + break; > > > + progress += 8; > > > + continue; > > > + } > > > + } > > > + /* a non-present pte became present after dropping the ptl > > > */ > > > + if (unlikely(locked_page)) { > > > + unlock_page(locked_page); > > > + put_page(locked_page); > > > + locked_page = NULL; > > > > > > } > > > /* copy_present_pte() will clear `*prealloc' if consumed */ > > > ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, > > >[...]> > > +static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf) > > > +{ > > > + struct page *page = vmf->page; > > > + struct vm_area_struct *vma = vmf->vma; > > > + struct page_vma_mapped_walk pvmw = { > > > + .page = page, > > > + .vma = vma, > > > + .address = vmf->address, > > > + .flags = PVMW_SYNC, > > > + }; > > > + vm_fault_t ret = 0; > > > + struct mmu_notifier_range range; > > > + > > > + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) > > > + return VM_FAULT_RETRY; > > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, > > > vma->vm_mm, > > > + vmf->address & PAGE_MASK, > > > + (vmf->address & PAGE_MASK) + PAGE_SIZE); > > > + mmu_notifier_invalidate_range_start(&range); > > > > I looked at MMU_NOTIFIER_CLEAR document and it tells me: > > > > * @MMU_NOTIFY_CLEAR: clear page table entry (many reasons for this like > > * madvise() or replacing a page by another one, ...). > > > > Does MMU_NOTIFIER_CLEAR suite for this case? Normally I think for such a > > case (existing pte is invalid) we don't need to notify at all. However > > from what I read from the whole series, this seems to be a critical point > > where we'd like to kick the owner/driver to synchronously stop doing atomic > > operations from the device. Not sure whether we'd like a new notifier > > type, or maybe at least comment on why to use CLEAR? > > Right, notifying the owner/driver when it no longer has exclusive access to > the page and allowing it to stop atomic operations is the critical point and > why it notifies when we ordinarily wouldn't (ie. invalid -> valid). > > I did consider adding a new type, but in the driver implementation it ends up > being treated the same as a CLEAR notification anyway so didn't think it was > necessary. But I suppose adding a different type would allow other listening > notifiers to filter these which might be worthwhile.Sounds good to me. [...]> > > + /* > > > + * Check that our target page is still mapped at the > > > expected > > > + * address. > > > + */ > > > + if (ttp->mm == mm && ttp->address == address && > > > + pte_write(pteval)) > > > + ttp->valid = true; > > > > I think I get the point of doing this (as after GUP the pte could have been > > changed to point to another page), however it smells a bit odd to me (or > > it's also possible that I'm not familiar enough with the code base..). > > IIUC this is the _only_ reason that we passed in "address" into > > try_to_protect() too, and further into the ttp_args. > > Yes, this is why "address" is passed up to ttp_args. 
[...]

> > > +                /*
> > > +                 * Check that our target page is still mapped at the expected
> > > +                 * address.
> > > +                 */
> > > +                if (ttp->mm == mm && ttp->address == address &&
> > > +                    pte_write(pteval))
> > > +                        ttp->valid = true;
> >
> > I think I get the point of doing this (as after GUP the pte could have been
> > changed to point to another page), however it smells a bit odd to me (or
> > it's also possible that I'm not familiar enough with the code base..).
> > IIUC this is the _only_ reason that we passed in "address" into
> > try_to_protect() too, and further into the ttp_args.
>
> Yes, this is why "address" is passed up to ttp_args.
>
> > The odd part is the remote GUP should have walked the page table already, so
> > since the target here is the vaddr to replace, the 1st page table walk
> > should be able to both trylock/lock the page, then modify the pte with
> > pgtable lock held, return the locked page, then walk the rmap again to
> > remove all the rest of the ptes that are mapping to this page.  In that
> > case before we call the rmap_walk() we know this must be the page we want
> > to take care of, and no one will be able to restore the original mm pte
> > either (as we're with the page lock).  Then we don't need this check,
> > neither do we need ttp->address.
>
> If I am understanding you correctly I think this would be similar to the
> approach that was taken in v2.  However it pretty much ended up being just
> an open-coded version of gup which is useful anyway to fault the page in.

I see.  For easier reference this is v2 patch 1:

https://lore.kernel.org/lkml/20210219020750.16444-2-apopple at nvidia.com/

Indeed that looks like it, it's just that instead of grabbing the page only
in hmm_exclusive_pmd() we can do the pte modification along the way to seal
the whole thing (address/pte & page).  I saw Christoph and Jason commented
in that patch, but not regarding this approach.  So is there a reason that
you switched?  Do you think it'll work?

I have no strong opinion either, it's just not crystal clear why we'd need
that ttp->address at all for a rmap walk along with that "valid" field.
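To be clear, the flow I was picturing is roughly the below - pure pseudo-code
just to show the idea, ignoring error handling, mmu notifiers, TLB flushing
and THP:

        /* walk to the pte of the target vaddr, taking the pgtable lock */
        ptep = get_locked_pte(mm, addr, &ptl);
        page = vm_normal_page(vma, addr, *ptep);
        if (!page || !trylock_page(page)) {
                pte_unmap_unlock(ptep, ptl);
                goto fault_and_retry;   /* fault the page in, then retry */
        }
        get_page(page);

        /* install the device exclusive entry while holding the ptl */
        entry = make_writable_device_exclusive_entry(page_to_pfn(page));
        set_pte_at(mm, addr, ptep, swp_entry_to_pte(entry));
        pte_unmap_unlock(ptep, ptl);

        /*
         * With the page lock held no one can restore the original pte, so
         * the rmap walk only needs to clear the other mappings and there's
         * no need to re-check address/pte afterwards.
         */
        rmap_walk(page, &rwc);

        unlock_page(page);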
Meanwhile it should be slightly less efficient too to go with the current
approach, especially when the page array gets huge, I think: since there'll
be a longer time between the GUP and the rmap walk, there's a higher
possibility that the GUPed pages get replaced for whatever reason.  Then the
call to make_device_exclusive_range() will fail as a whole just for a single
page replacement within the range.

Thanks,

-- 
Peter Xu

Alistair Popple
2021-May-19 11:35 UTC
[Nouveau] [PATCH v8 5/8] mm: Device exclusive memory access
On Wednesday, 19 May 2021 3:27:42 AM AEST Peter Xu wrote:

> > > The odd part is the remote GUP should have walked the page table
> > > already, so since the target here is the vaddr to replace, the 1st page
> > > table walk should be able to both trylock/lock the page, then modify
> > > the pte with pgtable lock held, return the locked page, then walk the
> > > rmap again to remove all the rest of the ptes that are mapping to this
> > > page.  In that case before we call the rmap_walk() we know this must be
> > > the page we want to take care of, and no one will be able to restore
> > > the original mm pte either (as we're with the page lock).  Then we
> > > don't need this check, neither do we need ttp->address.
> >
> > If I am understanding you correctly I think this would be similar to the
> > approach that was taken in v2.  However it pretty much ended up being just
> > an open-coded version of gup which is useful anyway to fault the page in.
>
> I see.  For easier reference this is v2 patch 1:
>
> https://lore.kernel.org/lkml/20210219020750.16444-2-apopple at nvidia.com/

Sorry, I should have been clearer and just included that reference for you.

> Indeed that looks like it, it's just that instead of grabbing the page only
> in hmm_exclusive_pmd() we can do the pte modification along the way to seal
> the whole thing (address/pte & page).  I saw Christoph and Jason commented
> in that patch, but not regarding this approach.  So is there a reason that
> you switched?  Do you think it'll work?

I think the approach you are describing is similar to what
migrate_vma_collect()/migrate_vma_unmap() does now, and I think it could be
made to work.  I ended up going with the GUP+unmap approach in part because
Christoph suggested it, but primarily because it required less code,
especially given we needed to call something to fault the page in/break COW
anyway (or extend what was there to call handle_mm_fault(), etc.).

> I have no strong opinion either, it's just not crystal clear why we'd need
> that ttp->address at all for a rmap walk along with that "valid" field.

It's purely to ensure the PTE pointing to the GUP page was replaced with an
exclusive swap entry and that the mapping didn't change between calls.

> Meanwhile it should be slightly less efficient too to go with the current
> approach, especially when the page array gets huge, I think: since there'll
> be a longer time between the GUP and the rmap walk, there's a higher
> possibility that the GUPed pages get replaced for whatever reason.  Then the
> call to make_device_exclusive_range() will fail as a whole just for a single
> page replacement within the range.