David Hildenbrand
2025-Jan-30 09:01 UTC
[PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
On 30.01.25 07:11, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>> We require a writable PTE and only support anonymous folios: we can only
>> have exactly one PTE pointing at that page, which we can just look up
>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>
>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>
>> We now effectively work on a single PTE instead of multiple PTEs of
>> a large folio, allowing for conversion of individual PTEs from
>> non-exclusive to device-exclusive -- note that the other way always
>> worked on single PTEs.
>>
>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>> that is not required: GUP will already take care of the
>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>> entry) when not finding a present PTE and having to trigger a fault and
>> ending up in remove_device_exclusive_entry().
>
> I will have to look at this a bit more closely tomorrow but this doesn't seem
> right to me. We may be transitioning from a present PTE (ie. a writable
> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> update their copies of the PTE. So I think the notifier call is needed.

Then it is all very confusing:

"MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
longer have exclusive access to the page."

That's simply not true in the scenario you describe, because nobody had
exclusive access.

But what you are saying is that we need to inform others (e.g., KVM)
that we are converting it to a device-exclusive entry, such that they
stop accessing it.

That makes sense to me (and the cleanup patch in the cleanup series
would have to go as well, to prevent the livelock).

So we would have to update the documentation of MMU_NOTIFY_EXCLUSIVE
to say that it is also triggered on conversion from non-exclusive to
exclusive.

Does that make sense?

Thanks!

--
Cheers,

David / dhildenb
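[To illustrate what this notification buys the other side: a secondary-MMU
consumer tears down its mapping on MMU_NOTIFY_EXCLUSIVE, while the driver
that initiated make_device_exclusive() recognises its own event via the
owner cookie and skips it; acting on its own event would tear down the very
entry it is installing (the livelock mentioned above). A minimal sketch,
assuming an interval-notifier-based driver; struct my_dev_notifier and
my_dev_invalidate() are hypothetical names, not part of the patch:

	#include <linux/mmu_notifier.h>

	struct my_dev_notifier {
		struct mmu_interval_notifier notifier;
		void *owner;	/* cookie passed to make_device_exclusive() */
	};

	static bool my_dev_invalidate(struct mmu_interval_notifier *mni,
				      const struct mmu_notifier_range *range,
				      unsigned long cur_seq)
	{
		struct my_dev_notifier *dn =
			container_of(mni, struct my_dev_notifier, notifier);

		/*
		 * Skip the notification we triggered ourselves via
		 * make_device_exclusive(): the owner cookie matches.
		 */
		if (range->event == MMU_NOTIFY_EXCLUSIVE &&
		    range->owner == dn->owner)
			return true;

		/*
		 * Any other event: invalidate the device's copy of the range.
		 * Real code must hold a driver lock around this to serialize
		 * with its page-table updates.
		 */
		mmu_interval_set_seq(mni, cur_seq);
		/* ... unmap the range on the device here ... */
		return true;
	}

	static const struct mmu_interval_notifier_ops my_dev_notifier_ops = {
		.invalidate = my_dev_invalidate,
	};
]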
David Hildenbrand
2025-Jan-30 09:12 UTC
[PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
On 30.01.25 10:01, David Hildenbrand wrote:
> On 30.01.25 07:11, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folios: we can only
>>> have exactly one PTE pointing at that page, which we can just look up
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
>
> Then it is all very confusing:
>
> "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> longer have exclusive access to the page."
>
> That's simply not true in the scenario you describe, because nobody had
> exclusive access.
>
> But what you are saying is that we need to inform others (e.g., KVM)
> that we are converting it to a device-exclusive entry, such that they
> stop accessing it.
>
> That makes sense to me (and the cleanup patch in the cleanup series
> would have to go as well, to prevent the livelock).
>
> So we would have to update the documentation of MMU_NOTIFY_EXCLUSIVE
> to say that it is also triggered on conversion from non-exclusive to
> exclusive.
>
> Does that make sense?

Something like that on top:

diff --git a/mm/rmap.c b/mm/rmap.c
index 49ffac6d27f8..fd6dfe67ce7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2405,6 +2405,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		void *owner, struct folio **foliop)
 {
+	struct mmu_notifier_range range;
 	struct folio *folio, *fw_folio;
 	struct vm_area_struct *vma;
 	struct folio_walk fw;
@@ -2413,6 +2414,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	pte_t swp_pte;
 
 	mmap_assert_locked(mm);
+	addr = PAGE_ALIGN_DOWN(addr);
 
 	/*
 	 * Fault in the page writable and try to lock it; note that if the
@@ -2440,6 +2442,14 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		return ERR_PTR(-EBUSY);
 	}
 
+	/*
+	 * Inform secondary MMUs that we are going to convert this PTE to
+	 * device-exclusive, such that they unmap it now.
+	 */
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
+				      mm, addr, addr + PAGE_SIZE, owner);
+	mmu_notifier_invalidate_range_start(&range);
+
 	/*
 	 * Let's do a second walk and make sure we still find the same page
 	 * mapped writable. If we don't find what we expect, we will trigger
@@ -2452,6 +2462,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	    fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
 		if (fw_folio)
 			folio_walk_end(&fw, vma);
+		mmu_notifier_invalidate_range_end(&range);
 		folio_unlock(folio);
 		folio_put(folio);
 		return ERR_PTR(-EBUSY);
@@ -2485,6 +2496,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	folio_remove_rmap_pte(folio, page, vma);
 
 	folio_walk_end(&fw, vma);
+	mmu_notifier_invalidate_range_end(&range);
 	*foliop = folio;
 	return page;
 }

--
Cheers,

David / dhildenb
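[For reference, a caller of make_device_exclusive() with the signature used
above would look roughly as follows. my_dev_atomic_map() and its owner
cookie are made-up names; the error handling and the locked-folio contract
follow from the function body above, which returns the folio locked and
with a reference held on success:

	static int my_dev_atomic_map(struct mm_struct *mm, unsigned long addr,
				     void *owner)
	{
		struct folio *folio;
		struct page *page;

		/* make_device_exclusive() asserts that the mmap lock is held. */
		mmap_read_lock(mm);
		page = make_device_exclusive(mm, addr, owner, &folio);
		mmap_read_unlock(mm);
		if (IS_ERR(page))
			return PTR_ERR(page);

		/* ... program the device's exclusive mapping of "page" ... */

		/* On success the folio came back locked and referenced. */
		folio_unlock(folio);
		folio_put(folio);
		return 0;
	}
]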
David Hildenbrand
2025-Jan-30 09:24 UTC
[PATCH v1 04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk
On 30.01.25 10:01, David Hildenbrand wrote:
> On 30.01.25 07:11, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folios: we can only
>>> have exactly one PTE pointing at that page, which we can just look up
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
>
> Then it is all very confusing:
>
> "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> longer have exclusive access to the page."

So the second sentence actually describes the other condition. Likely we
should make that clearer:

--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -43,10 +43,11 @@ struct mmu_interval_notifier;
  * a device driver to possibly ignore the invalidation if the
  * owner field matches the driver's device private pgmap owner.
  *
- * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
- * longer have exclusive access to the page. When sent during creation of an
- * exclusive range the owner will be initialised to the value provided by the
- * caller of make_device_exclusive(), otherwise the owner will be NULL.
+ * @MMU_NOTIFY_EXCLUSIVE: (1) to signal a device driver that the device will no
+ * longer have exclusive access to the page; and (2) to signal that a page will
+ * be made exclusive to a device. During (1), the owner will be NULL, during
+ * (2), the owner will be initialised to the value provided by the caller of
+ * make_device_exclusive().
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,

--
Cheers,

David / dhildenb
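[With the reworded comment, a notifier callback can distinguish the two
cases purely by the owner field. A rough, hypothetical sketch
(my_dev_is_exclusive_event() is a made-up helper):

	static bool my_dev_is_exclusive_event(const struct mmu_notifier_range *range)
	{
		if (range->event != MMU_NOTIFY_EXCLUSIVE)
			return false;

		if (range->owner) {
			/*
			 * Case (2): a PTE is being made device-exclusive; the
			 * owner is whatever the caller passed to
			 * make_device_exclusive().
			 */
		} else {
			/*
			 * Case (1): exclusive access is being revoked again,
			 * e.g., from remove_device_exclusive_entry().
			 */
		}
		return true;
	}
]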