Ralph Campbell
2020-Oct-01 18:17 UTC
[Nouveau] [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
This is still an RFC because after looking at the pmem/dax code some
more, I realized that the ZONE_DEVICE struct pages are being inserted
into the process' page tables with vmf_insert_mixed() and a zero
refcount on the ZONE_DEVICE struct page. This is sort of OK because
insert_pfn() increments the reference count on the pgmap which is what
prevents memunmap_pages() from freeing the struct pages and it doesn't
check for a non-zero struct page reference count.
But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
checks for a reference count == 0.

// mmap() an ext4 file that is mounted -o dax.
ext4_dax_fault()
  ext4_dax_huge_fault()
    dax_iomap_fault(&ext4_iomap_ops)
      dax_iomap_pte_fault()
        ops->iomap_begin()	// ext4_iomap_begin()
          ext4_map_blocks()
          ext4_set_iomap()
        dax_iomap_pfn()
        dax_insert_entry()
        vmf_insert_mixed(pfn)
          __vm_insert_mixed()
            if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
                !pfn_t_devmap(pfn) && pfn_t_valid(pfn))
              insert_page()
                get_page(page)	// XXX would trigger VM_BUG_ON_PAGE()
                page_add_file_rmap()
                set_pte_at()
            else
              insert_pfn()
                pte_mkdevmap()
                set_pte_at()

Should pmem set the page reference count to one before inserting the
pfn into the page tables (and decrement when removing devmap PTEs)?
What about the MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use
cases? Where should they increment/decrement the page reference count?
I don't know enough about how these are used to really know what to
do at this point. If people want me to continue to work on this series,
I will need some guidance.

---

Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
struct page reference counting is ugly because the pages are "free" when
the reference count is one instead of zero. This leads to explicit checks
for ZONE_DEVICE pages in places like put_page(), GUP, THP splitting, and
page migration which have to adjust the expected reference count when
determining if the page is isolated or idle. This is my attempt to make
ZONE_DEVICE pages be free when the reference count is zero and to remove
the special cases.

I'm only sending this out as an RFC since I'm not that familiar with the
DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
with devm_memremap_pages() or memremap_pages(), but my best reading of
the code looks like it might be OK. I could use help testing these
configurations.
I have been able to successfully run xfstests on ext4 with the memmap
kernel boot option to simulate pmem.

Changes in v3:
Rebase to linux-mm 5.9.0-rc7-mm1.
Added a check for page_free() as suggested by Christoph Hellwig.
Added a helper for dax_wait_page() as suggested by Christoph Hellwig.

Changes in v2:
One of the big changes in v2 is that devm_memremap_pages() and
memremap_pages() now return the struct pages' reference count set to
zero instead of one. Normally, get_page() will VM_BUG_ON_PAGE() if
page->_refcount is zero. I didn't see any such warnings running the
xfstests with dax/pmem, but I'm not clear how the zero to one reference
count is handled.

Other changes in v2:
Rebased to Linux-5.9.0-rc6 to include pmem fixes.
I added patch 1 to introduce a page refcount helper for ext4 and xfs as
suggested by Christoph Hellwig.
I also applied Christoph Hellwig's other suggested changes for removing
the devmap_managed_key, etc.
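For reference, the check that would fire is in get_page() itself; as of
v5.9 it reads roughly like this (paraphrased from include/linux/mm.h,
not part of this series):

/* Paraphrased from include/linux/mm.h (v5.9). */
static inline void get_page(struct page *page)
{
	page = compound_head(page);
	/*
	 * Getting a normal page or the head of a compound page
	 * requires to already have an elevated page->_refcount.
	 */
	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
	page_ref_inc(page);
}

So any fault path that reaches insert_page() rather than insert_pfn()
would trip the assertion on a zero-refcount ZONE_DEVICE page.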
Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount

 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  8 +--
 fs/ext4/inode.c                        |  5 +-
 fs/xfs/xfs_file.c                      |  4 +-
 include/linux/dax.h                    | 10 +++
 include/linux/memremap.h               |  7 ++-
 include/linux/mm.h                     | 44 --------------
 lib/test_hmm.c                         |  2 +-
 mm/gup.c                               | 44 --------------
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 84 +++++++------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 +
 mm/swap.c                              | 44 ++------------

 15 files changed, 63 insertions(+), 209 deletions(-)

-- 
2.20.1
Ralph Campbell
2020-Oct-01 18:17 UTC
[Nouveau] [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper
There are several places where ZONE_DEVICE struct pages assume a reference
count == 1 means the page is idle and free. Instead of open coding this,
add a helper function to hide this detail.

Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
---
 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1b..85c63f735909 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+		WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
 		WARN_ON_ONCE(page->mapping && page->mapping != mapping);
 		page->mapping = NULL;
 		page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
 	for_each_mapped_pfn(entry, pfn) {
 		struct page *page = pfn_to_page(pfn);
 
-		if (page_ref_count(page) > 1)
+		if (!dax_layout_is_idle_page(page))
 			return page;
 	}
 	return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..4c3b80e68121 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3926,10 +3926,7 @@ int ext4_break_layouts(struct inode *inode)
 		if (!page)
 			return 0;
 
-		error = ___wait_var_event(&page->_refcount,
-				atomic_read(&page->_refcount) == 1,
-				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei));
+		error = dax_wait_page(ei, page, ext4_wait_dax_page);
 	} while (error == 0);
 
 	return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3d1b95124744..a5304aaeaa3a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -749,9 +749,7 @@ xfs_break_dax_layouts(
 		return 0;
 
 	*retry = true;
-	return ___wait_var_event(&page->_refcount,
-			atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
-			0, 0, xfs_wait_dax_page(inode));
+	return dax_wait_page(inode, page, xfs_wait_dax_page);
 }
 
 int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b52f084aa643..f906cf4db1cc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping)
 	return mapping->host && IS_DAX(mapping->host);
 }
 
+static inline bool dax_layout_is_idle_page(struct page *page)
+{
+	return page_ref_count(page) == 1;
+}
+
+#define dax_wait_page(_inode, _page, _wait_cb)				\
+	___wait_var_event(&(_page)->_refcount,				\
+		dax_layout_is_idle_page(_page),				\
+		TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
+
 #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
 void hmem_register_device(int target_nid, struct resource *r);
 #else
-- 
2.20.1
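A sketch of how the waiter and waker sides of the new helper pair up,
assuming the helpers added by this patch; ext4_wait_dax_page() already
exists in fs/ext4/inode.c, the example_* functions are illustrative
only:

/* Waiter: the filesystem blocks until the DAX page goes idle. */
static int example_break_layouts(struct ext4_inode_info *ei,
				 struct page *page)
{
	/*
	 * dax_wait_page() sleeps on &page->_refcount until
	 * dax_layout_is_idle_page(page) becomes true, calling
	 * ext4_wait_dax_page() to drop and retake locks between checks.
	 */
	return dax_wait_page(ei, page, ext4_wait_dax_page);
}

/* Waker: whoever drops the last pin notifies through the same word. */
static void example_page_idle(struct page *page)
{
	wake_up_var(&page->_refcount);
}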
Ralph Campbell
2020-Oct-01 18:17 UTC
[Nouveau] [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
ZONE_DEVICE struct pages have an extra reference count that complicates the
code for put_page() and several places in the kernel that need to check the
reference count to see that a page is not being used (gup, compaction,
migration, etc.). Clean up the code so the reference count doesn't need to
be treated specially for ZONE_DEVICE.

Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 ++-
 include/linux/mm.h                     | 44 --------------
 lib/test_hmm.c                         |  2 +-
 mm/gup.c                               | 44 --------------
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 84 +++++++------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 +
 mm/swap.c                              | 44 ++------------
 13 files changed, 50 insertions(+), 201 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 84e5a2dc8be5..acee67710620 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
 	dpage = pfn_to_page(uvmem_pfn);
 	dpage->zone_device_data = pvt;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..8bc7120e1216 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 			return NULL;
 	}
 
-	get_page(page);
+	init_page_count(page);
 	lock_page(page);
 	return page;
 }
diff --git a/fs/dax.c b/fs/dax.c
index 85c63f735909..4804348f62e6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
 
 /**
  * dax_layout_busy_page_range - find first pinned page in @mapping
- * @mapping: address space to scan for a page with ref count > 1
+ * @mapping: address space to scan for a page with ref count > 0
 * @start: Starting offset. Page containing 'start' is included.
 * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX,
 *       pages from 'start' till the end of file are included.
 *
 * DAX requires ZONE_DEVICE mapped pages. These pages are never
 * 'onlined' to the page allocator so they are considered idle when
- * page->count == 1. A filesystem uses this interface to determine if
+ * page->count == 0. A filesystem uses this interface to determine if
 * any page in the mapping is busy, i.e. for DMA, or other
 * get_user_pages() usages.
 *
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f906cf4db1cc..e4920ea6abd3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)
 
 static inline bool dax_layout_is_idle_page(struct page *page)
 {
-	return page_ref_count(page) == 1;
+	return page_ref_count(page) == 0;
 }
 
 #define dax_wait_page(_inode, _page, _wait_cb)				\
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..2f63747caf56 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -66,9 +66,10 @@ enum memory_type {
 
 struct dev_pagemap_ops {
 	/*
-	 * Called once the page refcount reaches 1. (ZONE_DEVICE pages never
-	 * reach 0 refcount unless there is a refcount bug. This allows the
-	 * device driver to implement its own memory management.)
+	 * Called once the page refcount reaches 0. The reference count
+	 * should be reset to one with init_page_count(page) before reusing
+	 * the page. This allows the device driver to implement its own
+	 * memory management.
 	 */
 	void (*page_free)(struct page *page);
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27c64d0d7520..3d71a820ae38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1100,39 +1100,6 @@ static inline bool is_zone_device_page(const struct page *page)
 }
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-static inline bool page_is_devmap_managed(struct page *page)
-{
-	if (!static_branch_unlikely(&devmap_managed_key))
-		return false;
-	if (!is_zone_device_page(page))
-		return false;
-	switch (page->pgmap->type) {
-	case MEMORY_DEVICE_PRIVATE:
-	case MEMORY_DEVICE_FS_DAX:
-		return true;
-	default:
-		break;
-	}
-	return false;
-}
-
-void put_devmap_managed_page(struct page *page);
-
-#else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
-{
-	return false;
-}
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static inline bool is_device_private_page(const struct page *page)
 {
 	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
@@ -1179,17 +1146,6 @@ static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
 
-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * 2 to 1, when refcount reach one it means the page is free and we
-	 * need to inform the device driver through callback. See
-	 * include/linux/memremap.h and HMM for details.
-	 */
-	if (page_is_devmap_managed(page)) {
-		put_devmap_managed_page(page);
-		return;
-	}
-
 	if (put_page_testzero(page))
 		__put_page(page);
 }
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e151a7f10519..37a2dea5e805 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	}
 
 	dpage->zone_device_data = rpage;
-	get_page(dpage);
+	init_page_count(dpage);
 	lock_page(dpage);
 	return dpage;
 
diff --git a/mm/gup.c b/mm/gup.c
index 102877ed77a4..ed117e225c86 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -177,41 +177,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 	return true;
 }
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-	int count, refs = 1;
-
-	if (!page_is_devmap_managed(page))
-		return false;
-
-	if (hpage_pincount_available(page))
-		hpage_pincount_sub(page, 1);
-	else
-		refs = GUP_PIN_COUNTING_BIAS;
-
-	count = page_ref_sub_return(page, refs);
-
-	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-
-	return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-	return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 /**
 * unpin_user_page() - release a dma-pinned page
 * @page:            pointer to page to be released
@@ -227,15 +192,6 @@ void unpin_user_page(struct page *page)
 
 	page = compound_head(page);
 
-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
-	 * page is free and we need to inform the device driver through
-	 * callback. See include/linux/memremap.h and HMM for details.
-	 */
-	if (__unpin_devmap_managed_user_page(page))
-		return;
-
 	if (hpage_pincount_available(page))
 		hpage_pincount_sub(page, 1);
 	else
diff --git a/mm/internal.h b/mm/internal.h
index 6345b08ce86c..629959a6f26d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -618,4 +618,12 @@ struct migration_target_control {
 	gfp_t gfp_mask;
 };
 
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void free_zone_device_page(struct page *page);
+#else
+static inline void free_zone_device_page(struct page *page)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memremap.c b/mm/memremap.c
index 2bb276680837..336f088c92a1 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -12,6 +12,7 @@
 #include <linux/types.h>
 #include <linux/wait_bit.h>
 #include <linux/xarray.h>
+#include "internal.h"
 
 static DEFINE_XARRAY(pgmap_array);
 
@@ -37,36 +38,6 @@ unsigned long memremap_compat_align(void)
 EXPORT_SYMBOL_GPL(memremap_compat_align);
 #endif
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
-EXPORT_SYMBOL(devmap_managed_key);
-
-static void devmap_managed_enable_put(void)
-{
-	static_branch_dec(&devmap_managed_key);
-}
-
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
-	    (!pgmap->ops || !pgmap->ops->page_free)) {
-		WARN(1, "Missing page_free method\n");
-		return -EINVAL;
-	}
-
-	static_branch_inc(&devmap_managed_key);
-	return 0;
-}
-#else
-static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
-{
-	return -EINVAL;
-}
-static void devmap_managed_enable_put(void)
-{
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 static void pgmap_array_delete(struct range *range)
 {
 	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
@@ -91,13 +62,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
 
-static unsigned long pfn_next(unsigned long pfn)
-{
-	if (pfn % 1024 == 0)
-		cond_resched();
-	return pfn + 1;
-}
-
 /*
 * This returns true if the page is reserved by ZONE_DEVICE driver.
 */
@@ -118,9 +82,6 @@ bool pfn_zone_device_reserved(unsigned long pfn)
 	return ret;
 }
 
-#define for_each_device_pfn(pfn, map, i) \
-	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
-
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
 	if (pgmap->ops && pgmap->ops->kill)
@@ -176,20 +137,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
-	unsigned long pfn;
 	int i;
 
 	dev_pagemap_kill(pgmap);
 	for (i = 0; i < pgmap->nr_range; i++)
-		for_each_device_pfn(pfn, pgmap, i)
-			put_page(pfn_to_page(pfn));
+		percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) -
+						pfn_first(pgmap, i));
 	dev_pagemap_cleanup(pgmap);
 
 	for (i = 0; i < pgmap->nr_range; i++)
 		pageunmap_range(pgmap, i);
 
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
-	devmap_managed_enable_put();
 }
 EXPORT_SYMBOL_GPL(memunmap_pages);
 
@@ -327,7 +286,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
-	bool need_devmap_managed = true;
 	int error, i;
 
 	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
@@ -343,6 +301,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 			WARN(1, "Missing migrate_to_ram method\n");
 			return ERR_PTR(-EINVAL);
 		}
+		if (!pgmap->ops->page_free) {
+			WARN(1, "Missing page_free method\n");
+			return ERR_PTR(-EINVAL);
+		}
 		if (!pgmap->owner) {
 			WARN(1, "Missing owner\n");
 			return ERR_PTR(-EINVAL);
@@ -356,11 +318,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 		break;
 	case MEMORY_DEVICE_GENERIC:
-		need_devmap_managed = false;
 		break;
 	case MEMORY_DEVICE_PCI_P2PDMA:
 		params.pgprot = pgprot_noncached(params.pgprot);
-		need_devmap_managed = false;
 		break;
 	default:
 		WARN(1, "Invalid pgmap type %d\n", pgmap->type);
@@ -384,12 +344,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 		}
 	}
 
-	if (need_devmap_managed) {
-		error = devmap_managed_enable_get(pgmap);
-		if (error)
-			return ERR_PTR(error);
-	}
-
 	/*
 	 * Clear the pgmap nr_range as it will be incremented for each
 	 * successfully processed range. This communicates how many
@@ -504,16 +458,9 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page)
+static void free_device_private_page(struct page *page)
 {
-	/* notify page idle for dax */
-	if (!is_device_private_page(page)) {
-		wake_up_var(&page->_refcount);
-		return;
-	}
-
 	__ClearPageWaiters(page);
-
 	mem_cgroup_uncharge(page);
 
 	/*
@@ -540,4 +487,19 @@ void free_devmap_managed_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
+
+void free_zone_device_page(struct page *page)
+{
+	switch (page->pgmap->type) {
+	case MEMORY_DEVICE_FS_DAX:
+		/* notify page idle */
+		wake_up_var(&page->_refcount);
+		return;
+	case MEMORY_DEVICE_PRIVATE:
+		free_device_private_page(page);
+		return;
+	default:
+		return;
+	}
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/migrate.c b/mm/migrate.c
index 5ca5842df5db..ee09334b46d8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -380,11 +380,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page)
 {
 	int expected_count = 1;
 
-	/*
-	 * Device private pages have an extra refcount as they are
-	 * ZONE_DEVICE pages.
-	 */
-	expected_count += is_device_private_page(page);
 	if (mapping)
 		expected_count += thp_nr_pages(page) + page_has_private(page);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb0962976a0e..039b88bb3978 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6154,6 +6154,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
 
 		__init_single_page(page, pfn, zone_idx, nid);
 
+		/* ZONE_DEVICE pages start with a zero reference count. */
+		set_page_count(page, 0);
+
 		/*
 		 * Mark page reserved as it will need to wait for onlining
 		 * phase for it to be fully associated with a zone.
diff --git a/mm/swap.c b/mm/swap.c
index 0eb057141a04..82cbe069dcf3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -116,12 +116,11 @@ static void __put_compound_page(struct page *page)
 void __put_page(struct page *page)
 {
 	if (is_zone_device_page(page)) {
-		put_dev_pagemap(page->pgmap);
-
 		/*
 		 * The page belongs to the device that created pgmap. Do
 		 * not return it to page allocator.
 		 */
+		free_zone_device_page(page);
 		return;
 	}
 
@@ -891,26 +890,18 @@ void release_pages(struct page **pages, int nr)
 		if (is_huge_zero_page(page))
 			continue;
 
+		if (!put_page_testzero(page))
+			continue;
+
 		if (is_zone_device_page(page)) {
 			if (locked_pgdat) {
 				spin_unlock_irqrestore(&locked_pgdat->lru_lock,
 						       flags);
 				locked_pgdat = NULL;
 			}
-			/*
-			 * ZONE_DEVICE pages that return 'false' from
-			 * page_is_devmap_managed() do not require special
-			 * processing, and instead, expect a call to
-			 * put_page_testzero().
-			 */
-			if (page_is_devmap_managed(page)) {
-				put_devmap_managed_page(page);
-				continue;
-			}
-		}
-
-		if (!put_page_testzero(page))
+			free_zone_device_page(page);
 			continue;
+		}
 
 		if (PageCompound(page)) {
 			if (locked_pgdat) {
@@ -1188,26 +1179,3 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-	int count;
-
-	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-		return;
-
-	count = page_ref_dec_return(page);
-
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.20.1
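A sketch of the ZONE_DEVICE page lifecycle this patch expects, modeled
on the lib/test_hmm.c hunk above; the example_* names and the use of
zone_device_data to find the driver are illustrative, not part of the
series:

static struct page *example_devmem_alloc_page(struct example_device *mdev)
{
	struct page *dpage;

	/* Pop a page from the driver's own free list. */
	dpage = example_get_free_page(mdev);
	if (!dpage)
		return NULL;

	/* The refcount is 0 here; get_page() would VM_BUG_ON_PAGE(). */
	init_page_count(dpage);
	lock_page(dpage);
	return dpage;
}

/*
 * The dev_pagemap_ops.page_free() callback, now required for
 * MEMORY_DEVICE_PRIVATE: called from free_zone_device_page() once
 * put_page() drops the refcount back to zero.
 */
static void example_devmem_page_free(struct page *page)
{
	example_put_free_page(page->zone_device_data, page);
}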
Christoph Hellwig
2020-Oct-02 05:56 UTC
[Nouveau] [RFC PATCH v3 1/2] ext4/xfs: add page refcount helper
On Thu, Oct 01, 2020 at 11:17:14AM -0700, Ralph Campbell wrote:
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>

Looks good:

Reviewed-by: Christoph Hellwig <hch at lst.de>
Christoph Hellwig
2020-Oct-02 05:59 UTC
[Nouveau] [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.
> 
> Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch at lst.de>
Matthew Wilcox
2020-Oct-05 16:56 UTC
[Nouveau] [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Oct 01, 2020 at 11:17:13AM -0700, Ralph Campbell wrote:
> This is still an RFC because after looking at the pmem/dax code some
> more, I realized that the ZONE_DEVICE struct pages are being inserted
> into the process' page tables with vmf_insert_mixed() and a zero
> refcount on the ZONE_DEVICE struct page. This is sort of OK because
> insert_pfn() increments the reference count on the pgmap which is what
> prevents memunmap_pages() from freeing the struct pages and it doesn't
> check for a non-zero struct page reference count.
> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
> checks for a reference count == 0.

And presumably get_user_pages() would call get_page()?  Or is that
gated off somewhere?

Looking at this from a design standpoint instead of getting into the
code, I would say that a struct page which cannot be referenced from
anywhere should have a reference count of 0.  As soon as it can be
found by something, it should have a positive refcount.

For DAX, I think that means it should have its refcount raised to 1
immediately before a DAX entry for it is put in the page cache.  It
should have its refcount decreased when its DAX entry is removed from
the page cache.  That should ensure that we can't put a 0-ref DAX page
in the page tables, and thus it can't be found by get_user_pages().

> // mmap() an ext4 file that is mounted -o dax.
> ext4_dax_fault()
>   ext4_dax_huge_fault()
>     dax_iomap_fault(&ext4_iomap_ops)
>       dax_iomap_pte_fault()
>         ops->iomap_begin()	// ext4_iomap_begin()
>           ext4_map_blocks()
>           ext4_set_iomap()
>         dax_iomap_pfn()
>         dax_insert_entry()
>         vmf_insert_mixed(pfn)
>           __vm_insert_mixed()
>             if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
>                 !pfn_t_devmap(pfn) && pfn_t_valid(pfn))
>               insert_page()
>                 get_page(page)	// XXX would trigger VM_BUG_ON_PAGE()
>                 page_add_file_rmap()
>                 set_pte_at()
>             else
>               insert_pfn()
>                 pte_mkdevmap()
>                 set_pte_at()
> 
> Should pmem set the page reference count to one before inserting the
> pfn into the page tables (and decrement when removing devmap PTEs)?
> What about the MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use
> cases? Where should they increment/decrement the page reference count?
> I don't know enough about how these are used to really know what to
> do at this point. If people want me to continue to work on this series,
> I will need some guidance.
> 
> ---
> 
> Matthew Wilcox, Ira Weiny, and others have complained that ZONE_DEVICE
> struct page reference counting is ugly because the pages are "free" when
> the reference count is one instead of zero. This leads to explicit checks
> for ZONE_DEVICE pages in places like put_page(), GUP, THP splitting, and
> page migration which have to adjust the expected reference count when
> determining if the page is isolated or idle. This is my attempt to make
> ZONE_DEVICE pages be free when the reference count is zero and to remove
> the special cases.
> 
> I'm only sending this out as an RFC since I'm not that familiar with the
> DAX, PMEM, XEN, and other uses of ZONE_DEVICE struct pages allocated
> with devm_memremap_pages() or memremap_pages(), but my best reading of
> the code looks like it might be OK. I could use help testing these
> configurations.
> I have been able to successfully run xfstests on ext4 with the memmap
> kernel boot option to simulate pmem.
> 
> Changes in v3:
> Rebase to linux-mm 5.9.0-rc7-mm1.
> Added a check for page_free() as suggested by Christoph Hellwig.
> Added a helper for dax_wait_page() as suggested by Christoph Hellwig.
> 
> Changes in v2:
> One of the big changes in v2 is that devm_memremap_pages() and
> memremap_pages() now return the struct pages' reference count set to
> zero instead of one. Normally, get_page() will VM_BUG_ON_PAGE() if
> page->_refcount is zero. I didn't see any such warnings running the
> xfstests with dax/pmem, but I'm not clear how the zero to one reference
> count is handled.
> 
> Other changes in v2:
> Rebased to Linux-5.9.0-rc6 to include pmem fixes.
> I added patch 1 to introduce a page refcount helper for ext4 and xfs as
> suggested by Christoph Hellwig.
> I also applied Christoph Hellwig's other suggested changes for removing
> the devmap_managed_key, etc.
> 
> Ralph Campbell (2):
>   ext4/xfs: add page refcount helper
>   mm: remove extra ZONE_DEVICE struct page refcount
> 
>  arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
>  fs/dax.c                               |  8 +--
>  fs/ext4/inode.c                        |  5 +-
>  fs/xfs/xfs_file.c                      |  4 +-
>  include/linux/dax.h                    | 10 +++
>  include/linux/memremap.h               |  7 ++-
>  include/linux/mm.h                     | 44 --------------
>  lib/test_hmm.c                         |  2 +-
>  mm/gup.c                               | 44 --------------
>  mm/internal.h                          |  8 +++
>  mm/memremap.c                          | 84 +++++++------------------
>  mm/migrate.c                           |  5 --
>  mm/page_alloc.c                        |  3 +
>  mm/swap.c                              | 44 ++------------
> 
>  15 files changed, 63 insertions(+), 209 deletions(-)
> 
> -- 
> 2.20.1
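A sketch of that proposal, reusing the existing for_each_mapped_pfn()
iterator from fs/dax.c; the two helpers are hypothetical and would be
called from dax_insert_entry() and dax_disassociate_entry()
respectively:

/* The DAX page-cache entry owns one reference per mapped pfn. */
static void dax_entry_take_refs(void *entry)
{
	unsigned long pfn;

	for_each_mapped_pfn(entry, pfn)
		page_ref_inc(pfn_to_page(pfn));	/* entry owns a reference */
}

static void dax_entry_drop_refs(void *entry)
{
	unsigned long pfn;

	for_each_mapped_pfn(entry, pfn)
		put_page(pfn_to_page(pfn));	/* back to 0 when truly idle */
}

With the entry holding a reference, a page reachable from the page
tables always has a positive refcount, so get_page() from the
get_user_pages() path is safe.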
Dan Williams
2020-Oct-05 18:24 UTC
[Nouveau] [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Oct 1, 2020 at 11:17 AM Ralph Campbell <rcampbell at nvidia.com> wrote:
>
> This is still an RFC because after looking at the pmem/dax code some
> more, I realized that the ZONE_DEVICE struct pages are being inserted
> into the process' page tables with vmf_insert_mixed() and a zero
> refcount on the ZONE_DEVICE struct page. This is sort of OK because
> insert_pfn() increments the reference count on the pgmap which is what
> prevents memunmap_pages() from freeing the struct pages and it doesn't
> check for a non-zero struct page reference count.
> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
> checks for a reference count == 0.
>
> // mmap() an ext4 file that is mounted -o dax.
> ext4_dax_fault()
>   ext4_dax_huge_fault()
>     dax_iomap_fault(&ext4_iomap_ops)
>       dax_iomap_pte_fault()
>         ops->iomap_begin()	// ext4_iomap_begin()
>           ext4_map_blocks()
>           ext4_set_iomap()
>         dax_iomap_pfn()
>         dax_insert_entry()
>         vmf_insert_mixed(pfn)
>           __vm_insert_mixed()
>             if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
>                 !pfn_t_devmap(pfn) && pfn_t_valid(pfn))
>               insert_page()
>                 get_page(page)	// XXX would trigger VM_BUG_ON_PAGE()
>                 page_add_file_rmap()
>                 set_pte_at()
>             else
>               insert_pfn()
>                 pte_mkdevmap()
>                 set_pte_at()
>
> Should pmem set the page reference count to one before inserting the
> pfn into the page tables (and decrement when removing devmap PTEs)?
> What about the MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_PCI_P2PDMA use
> cases? Where should they increment/decrement the page reference count?
> I don't know enough about how these are used to really know what to
> do at this point. If people want me to continue to work on this series,
> I will need some guidance.

fs/dax could take the reference when inserting, but that would mean
that ext4 and xfs would need to go back to checking for 1 to be page
idle. I think that's ok because the filesystem is actually not checking
for page-idle, it's checking for "get_user_pages()" idle.
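Concretely, under that scheme the helper from patch 1 would flip back
to treating one reference as idle; a sketch:

static inline bool dax_layout_is_idle_page(struct page *page)
{
	/*
	 * The DAX entry holds the single remaining reference, so
	 * "idle" means no get_user_pages() pins are outstanding.
	 */
	return page_ref_count(page) == 1;
}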
Ram Pai
2020-Oct-08 05:17 UTC
[Nouveau] [RFC PATCH v3 2/2] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Oct 01, 2020 at 11:17:15AM -0700, Ralph Campbell wrote:
> ZONE_DEVICE struct pages have an extra reference count that complicates the
> code for put_page() and several places in the kernel that need to check the
> reference count to see that a page is not being used (gup, compaction,
> migration, etc.). Clean up the code so the reference count doesn't need to
> be treated specially for ZONE_DEVICE.

I was hoping this patch would resolve a page-reference issue that we
run into at random times while migrating a page to a device page backed
by secure memory. Unfortunately, I continue to see the problem: a
reference is held on the page, which causes the migration to fail.

FYI
RP
Matthew Wilcox
2021-May-13 13:15 UTC
[Nouveau] [RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount
On Thu, Oct 01, 2020 at 11:17:13AM -0700, Ralph Campbell wrote:
> This is still an RFC because after looking at the pmem/dax code some
> more, I realized that the ZONE_DEVICE struct pages are being inserted
> into the process' page tables with vmf_insert_mixed() and a zero
> refcount on the ZONE_DEVICE struct page. This is sort of OK because
> insert_pfn() increments the reference count on the pgmap which is what
> prevents memunmap_pages() from freeing the struct pages and it doesn't
> check for a non-zero struct page reference count.
> But, any calls to get_page() will hit the VM_BUG_ON_PAGE() that
> checks for a reference count == 0.

This seems to have gone quiet. What needs to happen to resurrect this?