Alistair Popple
2025-Feb-13 11:03 UTC
[PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote:> Against mm-hotfixes-stable for now. > > Discussing the PageTail() call in make_device_exclusive_range() with > Willy, I recently discovered [1] that device-exclusive handling does > not properly work with THP, making the hmm-tests selftests fail if THPs > are enabled on the system. > > Looking into more details, I found that hugetlb is not properly fenced, > and I realized that something that was bugging me for longer -- how > device-exclusive entries interact with mapcounts -- completely breaks > migration/swapout/split/hwpoison handling of these folios while they have > device-exclusive PTEs. > > The program below can be used to allocate 1 GiB worth of pages and > making them device-exclusive on a kernel with CONFIG_TEST_HMM. > > Once they are device-exclusive, these folios cannot get swapped out > (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how > much one forces memory reclaim), and when having a memory block onlined > to ZONE_MOVABLE, trying to offline it will loop forever and complain about > failed migration of a page that should be movable. > > # echo offline > /sys/devices/system/memory/memory136/state > # echo online_movable > /sys/devices/system/memory/memory136/state > # ./hmm-swap & > ... wait until everything is device-exclusive > # echo offline > /sys/devices/system/memory/memory136/state > [ 285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000 > index:0x7f20671f7 pfn:0x442b6a > [ 285.196618][T14882] memcg:ffff888179298000 > [ 285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate| > dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff) > [ 285.201734][T14882] raw: ... > [ 285.204464][T14882] raw: ... > [ 285.207196][T14882] page dumped because: migration failure > [ 285.209072][T14882] page_owner tracks the page as allocated > [ 285.210915][T14882] page last allocated via order 0, migratetype > Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), > id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774 > [ 285.216765][T14882] post_alloc_hook+0x197/0x1b0 > [ 285.218874][T14882] get_page_from_freelist+0x76e/0x3280 > [ 285.220864][T14882] __alloc_frozen_pages_noprof+0x38e/0x2740 > [ 285.223302][T14882] alloc_pages_mpol+0x1fc/0x540 > [ 285.225130][T14882] folio_alloc_mpol_noprof+0x36/0x340 > [ 285.227222][T14882] vma_alloc_folio_noprof+0xee/0x1a0 > [ 285.229074][T14882] __handle_mm_fault+0x2b38/0x56a0 > [ 285.230822][T14882] handle_mm_fault+0x368/0x9f0 > ... > > This series fixes all issues I found so far. There is no easy way to fix > without a bigger rework/cleanup. I have a bunch of cleanups on top (some > previous sent, some the result of the discussion in v1) that I will send > out separately once this landed and I get to it. > I wish we could just use some special present PROT_NONE PTEs instead ofFirst off David thanks for finding and fixing these issues. If you have further clean-ups in mind that you need help with please let me know as I'd be happy to help.> these (non-present, non-none) fake-swap entries; but that just results in > the same problem we keep having (lack of spare PTE bits), and staring at > other similar fake-swap entries, that ship has sailed. > > With this series, make_device_exclusive() doesn't actually belong into > mm/rmap.c anymore, but I'll leave moving that for another day. > > I only tested this series with the hmm-tests selftests due to lack of HW, > so I'd appreciate some testing, especially if the interaction between > two GPUs wanting a device-exclusive entry works as expected.I'm still reviewing the series but so far testing on my single GPU system appears to be working as expected. I will try and fire up a dual GPU system tomorrow and test it there as well. - Alistair> <program> > #include <stdio.h> > #include <fcntl.h> > #include <stdint.h> > #include <unistd.h> > #include <stdlib.h> > #include <string.h> > #include <sys/mman.h> > #include <sys/ioctl.h> > #include <linux/types.h> > #include <linux/ioctl.h> > > #define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd) > > struct hmm_dmirror_cmd { > __u64 addr; > __u64 ptr; > __u64 npages; > __u64 cpages; > __u64 faults; > }; > > const size_t size = 1 * 1024 * 1024 * 1024ul; > const size_t chunk_size = 2 * 1024 * 1024ul; > > int main(void) > { > struct hmm_dmirror_cmd cmd; > size_t cur_size; > int fd, ret; > char *addr, *mirror; > > fd = open("/dev/hmm_dmirror1", O_RDWR, 0); > if (fd < 0) { > perror("open failed\n"); > exit(1); > } > > addr = mmap(NULL, size, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > if (addr == MAP_FAILED) { > perror("mmap failed\n"); > exit(1); > } > madvise(addr, size, MADV_NOHUGEPAGE); > memset(addr, 1, size); > > mirror = malloc(chunk_size); > > for (cur_size = 0; cur_size < size; cur_size += chunk_size) { > cmd.addr = (uintptr_t)addr + cur_size; > cmd.ptr = (uintptr_t)mirror; > cmd.npages = chunk_size / getpagesize(); > ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd); > if (ret) { > perror("ioctl failed\n"); > exit(1); > } > } > pause(); > return 0; > } > </program> > > [1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2 at redhat.com > > Cc: Andrew Morton <akpm at linux-foundation.org> > Cc: "J?r?me Glisse" <jglisse at redhat.com> > Cc: Jonathan Corbet <corbet at lwn.net> > Cc: Alex Shi <alexs at kernel.org> > Cc: Yanteng Si <si.yanteng at linux.dev> > Cc: Karol Herbst <kherbst at redhat.com> > Cc: Lyude Paul <lyude at redhat.com> > Cc: Danilo Krummrich <dakr at kernel.org> > Cc: David Airlie <airlied at gmail.com> > Cc: Simona Vetter <simona at ffwll.ch> > Cc: Masami Hiramatsu <mhiramat at kernel.org> > Cc: Oleg Nesterov <oleg at redhat.com> > Cc: Peter Zijlstra <peterz at infradead.org> > Cc: SeongJae Park <sj at kernel.org> > Cc: "Liam R. Howlett" <Liam.Howlett at oracle.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes at oracle.com> > Cc: Vlastimil Babka <vbabka at suse.cz> > Cc: Jann Horn <jannh at google.com> > Cc: Pasha Tatashin <pasha.tatashin at soleen.com> > Cc: Peter Xu <peterx at redhat.com> > Cc: Alistair Popple <apopple at nvidia.com> > Cc: Jason Gunthorpe <jgg at nvidia.com> > > v1 -> v2: > * "mm/rmap: convert make_device_exclusive_range() to make_device_exclusive()" > -> Fix and simplify return value handling when calling dmirror_atomic_map() > -> Fix parameter order when calling make_device_exclusive() > [both things were fixed by the separate cleanups I previously sent, realized > it when re-testing the fixes here only] > -> Heavily extend documentation of make_device_exclusive() > * "mm/rmap: implement make_device_exclusive() using folio_walk instead of > rmap walk" > -> Keep MMU_NOTIFY_EXCLUSIVE, and update comments/description > * "mm/rmap: handle device-exclusive entries correctly in try_to_migrate_one()" > -> Handle PageHWPoison with device-private pages differently > * Added a bunch of "handle device-exclusive entries correctly" fixes, > now handling all page_vma_mapped_walk() callers correctly > * Added "mm/rmap: avoid -EBUSY from make_device_exclusive()" to fix some > hmm selftest failures I saw while testing under memory pressure > * Plenty of comment/description updates and improvements > > David Hildenbrand (17): > mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAs > mm/rmap: reject hugetlb folios in folio_make_device_exclusive() > mm/rmap: convert make_device_exclusive_range() to > make_device_exclusive() > mm/rmap: implement make_device_exclusive() using folio_walk instead of > rmap walk > mm/memory: detect writability in restore_exclusive_pte() through > can_change_pte_writable() > mm: use single SWP_DEVICE_EXCLUSIVE entry type > mm/page_vma_mapped: device-exclusive entries are not migration entries > kernel/events/uprobes: handle device-exclusive entries correctly in > __replace_page() > mm/ksm: handle device-exclusive entries correctly in > write_protect_page() > mm/rmap: handle device-exclusive entries correctly in > try_to_unmap_one() > mm/rmap: handle device-exclusive entries correctly in > try_to_migrate_one() > mm/rmap: handle device-exclusive entries correctly in > page_vma_mkclean_one() > mm/page_idle: handle device-exclusive entries correctly in > page_idle_clear_pte_refs_one() > mm/damon: handle device-exclusive entries correctly in > damon_folio_young_one() > mm/damon: handle device-exclusive entries correctly in > damon_folio_mkold_one() > mm/rmap: keep mapcount untouched for device-exclusive entries > mm/rmap: avoid -EBUSY from make_device_exclusive() > > Documentation/mm/hmm.rst | 2 +- > Documentation/translations/zh_CN/mm/hmm.rst | 2 +- > drivers/gpu/drm/nouveau/nouveau_svm.c | 5 +- > include/linux/mmu_notifier.h | 2 +- > include/linux/rmap.h | 5 +- > include/linux/swap.h | 7 +- > include/linux/swapops.h | 27 +- > kernel/events/uprobes.c | 13 +- > lib/test_hmm.c | 41 +- > mm/damon/ops-common.c | 23 +- > mm/damon/paddr.c | 10 +- > mm/gup.c | 3 + > mm/ksm.c | 9 +- > mm/memory.c | 28 +- > mm/mprotect.c | 8 - > mm/page_idle.c | 9 +- > mm/page_table_check.c | 5 +- > mm/page_vma_mapped.c | 3 +- > mm/rmap.c | 469 +++++++++----------- > 19 files changed, 315 insertions(+), 356 deletions(-) > > > base-commit: e5b2a356dc8a88708d97bd47cca3b8f7ed7af6cb > -- > 2.48.1 >
David Hildenbrand
2025-Feb-13 11:15 UTC
[PATCH v2 00/17] mm: fixes for device-exclusive entries (hmm)
On 13.02.25 12:03, Alistair Popple wrote:> On Mon, Feb 10, 2025 at 08:37:42PM +0100, David Hildenbrand wrote: >> Against mm-hotfixes-stable for now. >> >> Discussing the PageTail() call in make_device_exclusive_range() with >> Willy, I recently discovered [1] that device-exclusive handling does >> not properly work with THP, making the hmm-tests selftests fail if THPs >> are enabled on the system. >> >> Looking into more details, I found that hugetlb is not properly fenced, >> and I realized that something that was bugging me for longer -- how >> device-exclusive entries interact with mapcounts -- completely breaks >> migration/swapout/split/hwpoison handling of these folios while they have >> device-exclusive PTEs. >> >> The program below can be used to allocate 1 GiB worth of pages and >> making them device-exclusive on a kernel with CONFIG_TEST_HMM. >> >> Once they are device-exclusive, these folios cannot get swapped out >> (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how >> much one forces memory reclaim), and when having a memory block onlined >> to ZONE_MOVABLE, trying to offline it will loop forever and complain about >> failed migration of a page that should be movable. >> >> # echo offline > /sys/devices/system/memory/memory136/state >> # echo online_movable > /sys/devices/system/memory/memory136/state >> # ./hmm-swap & >> ... wait until everything is device-exclusive >> # echo offline > /sys/devices/system/memory/memory136/state >> [ 285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000 >> index:0x7f20671f7 pfn:0x442b6a >> [ 285.196618][T14882] memcg:ffff888179298000 >> [ 285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate| >> dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff) >> [ 285.201734][T14882] raw: ... >> [ 285.204464][T14882] raw: ... >> [ 285.207196][T14882] page dumped because: migration failure >> [ 285.209072][T14882] page_owner tracks the page as allocated >> [ 285.210915][T14882] page last allocated via order 0, migratetype >> Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), >> id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774 >> [ 285.216765][T14882] post_alloc_hook+0x197/0x1b0 >> [ 285.218874][T14882] get_page_from_freelist+0x76e/0x3280 >> [ 285.220864][T14882] __alloc_frozen_pages_noprof+0x38e/0x2740 >> [ 285.223302][T14882] alloc_pages_mpol+0x1fc/0x540 >> [ 285.225130][T14882] folio_alloc_mpol_noprof+0x36/0x340 >> [ 285.227222][T14882] vma_alloc_folio_noprof+0xee/0x1a0 >> [ 285.229074][T14882] __handle_mm_fault+0x2b38/0x56a0 >> [ 285.230822][T14882] handle_mm_fault+0x368/0x9f0 >> ... >> >> This series fixes all issues I found so far. There is no easy way to fix >> without a bigger rework/cleanup. I have a bunch of cleanups on top (some >> previous sent, some the result of the discussion in v1) that I will send >> out separately once this landed and I get to it. >> I wish we could just use some special present PROT_NONE PTEs instead of > > First off David thanks for finding and fixing these issues. If you have further > clean-ups in mind that you need help with please let me know as I'd be happy > to help.Sure! I have some cleanups TBD as result of the previous discussion, but nothing bigger so far. (removing the folio lock could be considered bigger, if we want to go down that path)> >> these (non-present, non-none) fake-swap entries; but that just results in >> the same problem we keep having (lack of spare PTE bits), and staring at >> other similar fake-swap entries, that ship has sailed. >> >> With this series, make_device_exclusive() doesn't actually belong into >> mm/rmap.c anymore, but I'll leave moving that for another day. >> >> I only tested this series with the hmm-tests selftests due to lack of HW, >> so I'd appreciate some testing, especially if the interaction between >> two GPUs wanting a device-exclusive entry works as expected. > > I'm still reviewing the series but so far testing on my single GPU system > appears to be working as expected. I will try and fire up a dual GPU system > tomorrow and test it there as well.Great, thanks a bunch for testing! Out of interest: does the nvidia driver make use of this interface as well, and are you testing with that or with the nouveau driver? I saw some reports that nvidia at least checks for it [1] when building the module: CONFTEST: make_device_exclusive_range [1] https://www.googlecloudcommunity.com/gc/AI-ML/Can-t-Install-Nvidia-Drivers-on-6-1-0-18-Kernel/m-p/722596 -- Cheers, David / dhildenb