David Hildenbrand
2025-Jan-29  11:57 UTC
[PATCH v1 0/4] mm: cleanups for device-exclusive entries (hmm)
This is a follow-up to [1], performing some related cleanups. There are
more cleanups to be had, but I'll have to focus on some other stuff next.
I might come back to that once I'm stuck on (annoyed by :) ) other things.

Cc: Andrew Morton <akpm at linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse at redhat.com>
Cc: Jonathan Corbet <corbet at lwn.net>
Cc: Alex Shi <alexs at kernel.org>
Cc: Yanteng Si <si.yanteng at linux.dev>
Cc: Karol Herbst <kherbst at redhat.com>
Cc: Lyude Paul <lyude at redhat.com>
Cc: Danilo Krummrich <dakr at kernel.org>
Cc: David Airlie <airlied at gmail.com>
Cc: Simona Vetter <simona at ffwll.ch>
Cc: "Liam R. Howlett" <Liam.Howlett at oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes at oracle.com>
Cc: Vlastimil Babka <vbabka at suse.cz>
Cc: Jann Horn <jannh at google.com>
Cc: Pasha Tatashin <pasha.tatashin at soleen.com>
Cc: Peter Xu <peterx at redhat.com>
Cc: Alistair Popple <apopple at nvidia.com>
Cc: Jason Gunthorpe <jgg at nvidia.com>

[1] https://lkml.kernel.org/r/20250129115411.2077152-1-david at redhat.com

David Hildenbrand (4):
  lib/test_hmm: make dmirror_atomic_map() consume a single page
  mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE
  mm/memory: pass folio and pte to restore_exclusive_pte()
  mm/memory: document restore_exclusive_pte()

 drivers/gpu/drm/nouveau/nouveau_svm.c |  6 +--
 include/linux/mmu_notifier.h          |  4 +-
 include/linux/rmap.h                  |  2 +-
 lib/test_hmm.c                        | 35 ++++++-----------
 mm/memory.c                           | 54 +++++++++++++++++++--------
 mm/rmap.c                             |  3 +-
 6 files changed, 54 insertions(+), 50 deletions(-)

-- 
2.48.1
David Hildenbrand
2025-Jan-29  11:57 UTC
[PATCH v1 1/4] lib/test_hmm: make dmirror_atomic_map() consume a single page
The caller now always passes a single page; let's simplify, and return
"0" on success.
Signed-off-by: David Hildenbrand <david at redhat.com>
---
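Note for readers (illustration only, not part of the patch): the entries
that dmirror_atomic_map() stores are tagged with DPT_XA_TAG_ATOMIC so that
dmirror_check_atomic() can later verify the device still holds the page
"atomically". From my reading of lib/test_hmm.c, the consumer side boils
down to roughly the following per-pfn check (sketch only):

	void *entry = xa_load(&dmirror->pt, pfn);

	/* Only entries tagged as atomic count as device-exclusive here. */
	if (xa_pointer_tag(entry) != DPT_XA_TAG_ATOMIC)
		return -EPERM;
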
 lib/test_hmm.c | 33 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 23 deletions(-)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 9e1b07a227a3..1c0a58279db9 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -706,34 +706,23 @@ static int dmirror_check_atomic(struct dmirror *dmirror, unsigned long start,
 	return 0;
 }
 
-static int dmirror_atomic_map(unsigned long start, unsigned long end,
-			      struct page **pages, struct dmirror *dmirror)
+static int dmirror_atomic_map(unsigned long addr, struct page *page,
+		struct dmirror *dmirror)
 {
-	unsigned long pfn, mapped = 0;
-	int i;
+	void *entry;
 
 	/* Map the migrated pages into the device's page tables. */
 	mutex_lock(&dmirror->mutex);
 
-	for (i = 0, pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++, i++) {
-		void *entry;
-
-		if (!pages[i])
-			continue;
-
-		entry = pages[i];
-		entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC);
-		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
-		if (xa_is_err(entry)) {
-			mutex_unlock(&dmirror->mutex);
-			return xa_err(entry);
-		}
-
-		mapped++;
+	entry = xa_tag_pointer(page, DPT_XA_TAG_ATOMIC);
+	entry = xa_store(&dmirror->pt, addr >> PAGE_SHIFT, entry, GFP_ATOMIC);
+	if (xa_is_err(entry)) {
+		mutex_unlock(&dmirror->mutex);
+		return xa_err(entry);
 	}
 
 	mutex_unlock(&dmirror->mutex);
-	return mapped;
+	return 0;
 }
 
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
@@ -803,9 +792,7 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 			break;
 		}
 
-		ret = dmirror_atomic_map(addr, addr + PAGE_SIZE, &page, dmirror);
-		if (!ret)
-			ret = -EBUSY;
+		ret = dmirror_atomic_map(addr, page, dmirror);
 		folio_unlock(folio);
 		folio_put(folio);
 
-- 
2.48.1
David Hildenbrand
2025-Jan-29  11:58 UTC
[PATCH v1 2/4] mm/mmu_notifier: drop owner from MMU_NOTIFY_EXCLUSIVE
We no longer get an MMU_NOTIFY_EXCLUSIVE on conversion with the owner set
that one has to filter out: if there already *is* a device-exclusive
entry (e.g., from another device; we don't have that information), GUP
will convert it back to an ordinary PTE and notify via
remove_device_exclusive_entry().
Signed-off-by: David Hildenbrand <david at redhat.com>
---
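Illustration only (not part of the patch): with the owner gone, a driver's
interval-notifier callback no longer has an MMU_NOTIFY_EXCLUSIVE event with
its own owner to filter out, so it can treat all events uniformly. A
minimal sketch, with my_notifier and my_device_zap_range() as made-up
placeholders:

	struct my_notifier {
		struct mmu_interval_notifier mni;
		struct mutex mutex;
		/* ... device state ... */
	};

	static bool my_range_invalidate(struct mmu_interval_notifier *mni,
					const struct mmu_notifier_range *range,
					unsigned long cur_seq)
	{
		struct my_notifier *mn =
			container_of(mni, struct my_notifier, mni);

		if (!mmu_notifier_range_blockable(range))
			return false;

		/* Serialize the sequence update against the fault path. */
		mutex_lock(&mn->mutex);
		mmu_interval_set_seq(mni, cur_seq);
		my_device_zap_range(mn, range->start, range->end);
		mutex_unlock(&mn->mutex);
		return true;
	}
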
 drivers/gpu/drm/nouveau/nouveau_svm.c | 6 +-----
 include/linux/mmu_notifier.h          | 4 +---
 include/linux/rmap.h                  | 2 +-
 lib/test_hmm.c                        | 2 +-
 mm/rmap.c                             | 3 +--
 5 files changed, 5 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 39e3740980bb..4758fee182b4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -510,10 +510,6 @@ static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
 	struct svm_notifier *sn =
 		container_of(mni, struct svm_notifier, notifier);
 
-	if (range->event == MMU_NOTIFY_EXCLUSIVE &&
-	    range->owner == sn->svmm->vmm->cli->drm->dev)
-		return true;
-
 	/*
 	 * serializes the update to mni->invalidate_seq done by caller and
 	 * prevents invalidation of the PTE from progressing while HW is being
@@ -609,7 +605,7 @@ static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm,
 
 		notifier_seq = mmu_interval_read_begin(&notifier->notifier);
 		mmap_read_lock(mm);
-		page = make_device_exclusive(mm, start, drm->dev, &folio);
+		page = make_device_exclusive(mm, start, &folio);
 		mmap_read_unlock(mm);
 		if (IS_ERR(page)) {
 			ret = -EINVAL;
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d4e714661826..bac2385099dd 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -44,9 +44,7 @@ struct mmu_interval_notifier;
  * owner field matches the driver's device private pgmap owner.
  *
  * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
- * longer have exclusive access to the page. When sent during creation of an
- * exclusive range the owner will be initialised to the value provided by the
- * caller of make_device_exclusive(), otherwise the owner will be NULL.
+ * longer have exclusive access to the page.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 86425d42c1a9..3b216b91d2e5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -664,7 +664,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags);
 void try_to_unmap(struct folio *, enum ttu_flags flags);
 
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
-		void *owner, struct folio **foliop);
+		struct folio **foliop);
 
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 1c0a58279db9..8520c1d1b21b 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -786,7 +786,7 @@ static int dmirror_exclusive(struct dmirror *dmirror,
 		struct folio *folio;
 		struct page *page;
 
-		page = make_device_exclusive(mm, addr, &folio, NULL);
+		page = make_device_exclusive(mm, addr, &folio);
 		if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
 			break;
diff --git a/mm/rmap.c b/mm/rmap.c
index 4acc9f6d743a..d99dbf59adc6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2397,7 +2397,6 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
  * make_device_exclusive() - Mark an address for exclusive use by a device
  * @mm: mm_struct of associated target process
  * @addr: the virtual address to mark for exclusive device access
- * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
  * @foliop: folio pointer will be stored here on success.
  *
  * This function looks up the page mapped at the given address, grabs a
@@ -2421,7 +2420,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
  * Returns: pointer to mapped page on success, otherwise a negative error.
  */
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
-		void *owner, struct folio **foliop)
+		struct folio **foliop)
 {
 	struct folio *folio, *fw_folio;
 	struct vm_area_struct *vma;
-- 
2.48.1
David Hildenbrand
2025-Jan-29  11:58 UTC
[PATCH v1 3/4] mm/memory: pass folio and pte to restore_exclusive_pte()
Let's pass the folio and the pte to restore_exclusive_pte(), so we
can avoid repeated page_folio() and ptep_get(). To do that,
pass the pte to try_restore_exclusive_pte() and use a folio in there
already.
While at it, just avoid the "swp_entry_t entry" variable in
try_restore_exclusive_pte() and add a folio-locked check to
restore_exclusive_pte().
Signed-off-by: David Hildenbrand <david at redhat.com>
---
 mm/memory.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index cd689cd8a7c8..46956994aaff 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -719,14 +719,13 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 #endif
 
 static void restore_exclusive_pte(struct vm_area_struct *vma,
-				  struct page *page, unsigned long address,
-				  pte_t *ptep)
+		struct folio *folio, struct page *page, unsigned long address,
+		pte_t *ptep, pte_t orig_pte)
 {
-	struct folio *folio = page_folio(page);
-	pte_t orig_pte;
 	pte_t pte;
 
-	orig_pte = ptep_get(ptep);
+	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+
 	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
@@ -756,16 +755,15 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
  * Tries to restore an exclusive pte if the page lock can be acquired without
  * sleeping.
  */
-static int
-try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma,
-			unsigned long addr)
+static int try_restore_exclusive_pte(struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, pte_t orig_pte)
 {
-	swp_entry_t entry = pte_to_swp_entry(ptep_get(src_pte));
-	struct page *page = pfn_swap_entry_to_page(entry);
+	struct page *page = pfn_swap_entry_to_page(pte_to_swp_entry(orig_pte));
+	struct folio *folio = page_folio(page);
 
-	if (trylock_page(page)) {
-		restore_exclusive_pte(vma, page, addr, src_pte);
-		unlock_page(page);
+	if (folio_trylock(folio)) {
+		restore_exclusive_pte(vma, folio, page, addr, ptep, orig_pte);
+		folio_unlock(folio);
 		return 0;
 	}
 
@@ -871,7 +869,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		 * (ie. COW) mappings.
 		 */
 		VM_BUG_ON(!is_cow_mapping(src_vma->vm_flags));
-		if (try_restore_exclusive_pte(src_pte, src_vma, addr))
+		if (try_restore_exclusive_pte(src_vma, addr, src_pte, orig_pte))
 			return -EBUSY;
 		return -ENOENT;
 	} else if (is_pte_marker_entry(entry)) {
@@ -3979,7 +3977,8 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
 				&vmf->ptl);
 	if (likely(vmf->pte && pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
-		restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);
+		restore_exclusive_pte(vma, folio, vmf->page, vmf->address,
+				      vmf->pte, vmf->orig_pte);
 
 	if (vmf->pte)
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.48.1
David Hildenbrand
2025-Jan-29  11:58 UTC
[PATCH v1 4/4] mm/memory: document restore_exclusive_pte()
Let's document how this function is to be used, and why the requirement
for the folio lock might be dropped in the future.

Signed-off-by: David Hildenbrand <david at redhat.com>
---
 mm/memory.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 46956994aaff..caaae8df11a9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -718,6 +718,31 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 }
 #endif
 
+/**
+ * restore_exclusive_pte - Restore a device-exclusive entry
+ * @vma: VMA covering @address
+ * @folio: the mapped folio
+ * @page: the mapped folio page
+ * @address: the virtual address
+ * @ptep: PTE pointer into the locked page table mapping the folio page
+ * @orig_pte: PTE value at @ptep
+ *
+ * Restore a device-exclusive non-swap entry to an ordinary present PTE.
+ *
+ * The folio and the page table must be locked, and MMU notifiers must have
+ * been called to invalidate any (exclusive) device mappings. In case of
+ * fork(), MMU_NOTIFY_PROTECTION_PAGE is triggered, and in case of a page
+ * fault MMU_NOTIFY_EXCLUSIVE is triggered.
+ *
+ * Locking the folio makes sure that anybody who just converted the PTE to
+ * a device-exclusive entry can map it into the device, before unlocking it; so
+ * the folio lock prevents concurrent conversion to device-exclusive.
+ *
+ * TODO: the folio lock does not protect against all cases of concurrent
+ * page table modifications (e.g., MADV_DONTNEED, mprotect), so device drivers
+ * must already use MMU notifiers to sync against any concurrent changes.
+ * Maybe the requirement for the folio lock can be dropped in the future.
+ */
 static void restore_exclusive_pte(struct vm_area_struct *vma,
 		struct folio *folio, struct page *page, unsigned long address,
 		pte_t *ptep, pte_t orig_pte)
-- 
2.48.1
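
For context (illustration only, not part of the patch): a caller of
restore_exclusive_pte() is expected to follow roughly the sequence below,
mirroring what remove_device_exclusive_entry() does on the fault path;
retry and error handling are omitted and the wrapper name is made up:

	/* Sketch only; @address is assumed to be page aligned. */
	static void restore_exclusive_entry_sketch(struct vm_area_struct *vma,
			struct folio *folio, struct page *page,
			unsigned long address, pmd_t *pmd, pte_t orig_pte)
	{
		struct mmu_notifier_range range;
		spinlock_t *ptl;
		pte_t *ptep;

		/* The folio lock prevents concurrent conversion attempts. */
		folio_lock(folio);

		/* Tell the driver it is about to lose exclusive access. */
		mmu_notifier_range_init(&range, MMU_NOTIFY_EXCLUSIVE, 0,
					vma->vm_mm, address,
					address + PAGE_SIZE);
		mmu_notifier_invalidate_range_start(&range);

		/* Restore the PTE under the page table lock, if unchanged. */
		ptep = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
		if (ptep && pte_same(ptep_get(ptep), orig_pte))
			restore_exclusive_pte(vma, folio, page, address, ptep,
					      orig_pte);
		if (ptep)
			pte_unmap_unlock(ptep, ptl);

		mmu_notifier_invalidate_range_end(&range);
		folio_unlock(folio);
	}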