Michal Hocko
2019-Oct-22  12:23 UTC
[PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
On Fri 18-10-19 14:35:06, David Hildenbrand wrote:> On 18.10.19 13:20, Michal Hocko wrote: > > On Fri 18-10-19 10:50:24, David Hildenbrand wrote: > > > On 18.10.19 10:15, Michal Hocko wrote:[...]> > > > for that - MEM_GOING_OFFLINE notification. This sounds like a good place > > > > for the driver to decide whether it is safe to let the page go or not. > > > > > > As I explained, this is too late and fragile. I post again what I posted > > > before with some further explanations > > > > > > __offline_pages() works like this: > > > > > > 1) start_isolate_page_range() > > > -> offline pages with a reference count of one will be detected as > > > unmovable -> offlining aborted. (see below on the memory isolation notifier) > > > > I am assuming that has_unmovable_pages would skip over those pages. Your > > patch already does that, no? > > Yes, this works IFF the reference count is 0 (IOW, this patch). Not with a > reference count of 1 (unless the pages are movable, like with balloon > compaction).I am pretty sure that has_unmovable_pages can special these pages regardless of the reference count for the memory hotplug. We already do that for HWPoison pages.> Please note that we have other users that use PG_offline + refcount >= 1 > (HyperV balloon, XEN). We should not affect these users (IOW, > has_unmovable_pages() has to stop right there if we see one of these pages).OK, this is exactly what I was worried about. I can see why you might want to go an easier way and rule those users out but wouldn't be it actually more reasonable to explicitly request PageOffline users to implement MEM_GOING_OFFLINE and prepare their offlined pages for the offlining operation or fail right there if that is not possible. If you fail right there during the isolation phase then there is no way to allow the offlining to proceed from that context.> > > 2) memory_notify(MEM_GOING_OFFLINE, &arg); > > > -> Here, we could release all pages to the buddy, clearing PG_offline > > > -> PF_offline must not be cleared so dumping tools will not touch > > > these pages. There is a time where pages are !PageBuddy() and > > > !PageOffline(). > > > > Well, this is fully under control of the driver, no? Reference count > > shouldn't play any role here AFAIU. > > Yes, this is more a PG_offline issue. The reference count is an issue of > reaching this call :) If we want to go via the buddy: > > 1. Clear PG_offline > 2. Free page (gets set PG_buddy) > > Between 1 and 2, a dumping tool could not exclude these pages and therefore > try to read from these pages. That is an issue IFF we want to return the > pages back to the buddy instead of doing what I propose here.If the driver is going to free page to the allocator then it would have to claim the page back and so it is usable again. If it cannot free it then it would simply set the reference count to 0. It can even keep the PG_offline if necessary although I have to admit I am not really sure it is necessary.> > > 3) scan_movable_pages() ... > > Please note that when we don't put the pages back to the buddy and don't > implement something like I have in this patch, we'll loop/fail here. > Especially if we have pages with PG_offline + refcount >= 1 .You should have your reference count 0 at this stage as it is after MEM_GOING_OFFLINE.> > MEM_CANCEL_OFFLINE could gain the reference back to balance the > > MEM_GOING_OFFLINE step. > > The pages are already unisolated and could be used by the buddy. But again, > I think you have an idea that tries to avoid putting pages to the buddy.Yeah, set_page_count(page, 0) if you do not want to release that page from the notifier context to reflect that the page is ok to be offlined with the rest. [...]> > explicit control via the reference count which is the standard way to > > control the struct page life cycle. > > > > Anyway hooking into __put_page (which tends to be a hot path with > > something that is barely used on most systems) doesn't sound nice to me. > > This is the whole point which made me think about the whole reference > > count approach in the first place. > > Again, the race I think that is possible > > somebody: get_page_unless_zero(page) > virtio_mem: page_ref_dec(pfn_to_page(pfn) > somebody: put_page() -> straight to the buddyWho is that somebody? I thought that it is only the owner/driver to have a control over the page. Also the above is not possible as long as the owner/driver keeps a reference to the PageOffline page throughout the time it is marked that way. [...]> > > > If you can let the page go then just drop the reference count. The page > > > > is isolated already by that time. If you cannot let it go for whatever > > > > reason you can fail the offlining. > > > > > > We do have one hack in current MM code, which is the memory isolation > > > notifier only used by CMM on PPC. It allows to "cheat" has_unmovable_pages() > > > to skip over unmovable pages. But quite frankly, I rather want to get rid of > > > that crap (something I am working on right now) than introduce new users. > > > This stuff is racy as hell and for CMM, if memory offlining fails, the > > > ballooned pages are suddenly part of the buddy. Fragile. > > > > Could you be more specific please? > > Let's take a look at how arch/powerpc/platforms/pseries/cmm.c handles it: > > cmm_memory_isolate_cb() -> cmm_count_pages(arg): > - Memory Isolation notifier callback > - Count how many pages in the range to be isolated are in the ballooon > - This makes has_unmovable_pages() succeed. Pages can be isolated. > > cmm_memory_cb -> cmm_mem_going_offline(arg): > - Memory notifier (online/offline) > - Release all pages in the range to the buddy > > If offlining fails, the pages are now in the buddy, no longer in the > balloon. MEM_CANCEL_ONLINE is too late, because the range is already > unisolated again and the pages might be in use. > > For CMM it might not be that bad, because it can actually "reloan" any > pages. In contrast, virtio-mem cannot simply go ahead and reuse random > memory in unplugged. Any access to these pages would be evil. Giving them > back to the buddy is dangerous.Thanks, I was not aware of that code. But from what I understood this is an outright bug in this code because cmm_mem_going_offline releases pages to the buddy allocator which is something that is not recoverable on a later failure. -- Michal Hocko SUSE Labs
David Hildenbrand
2019-Oct-22  14:02 UTC
[PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
>> Please note that we have other users that use PG_offline + refcount >= 1 >> (HyperV balloon, XEN). We should not affect these users (IOW, >> has_unmovable_pages() has to stop right there if we see one of these pages). > > OK, this is exactly what I was worried about. I can see why you might > want to go an easier way and rule those users out but wouldn't be it > actually more reasonable to explicitly request PageOffline users to > implement MEM_GOING_OFFLINE and prepare their offlined pages for the > offlining operation or fail right there if that is not possible. > If you fail right there during the isolation phase then there is no way > to allow the offlining to proceed from that context.I am not sure I agree. But let's discuss the details. See below.> >>>> 2) memory_notify(MEM_GOING_OFFLINE, &arg); >>>> -> Here, we could release all pages to the buddy, clearing PG_offline >>>> -> PF_offline must not be cleared so dumping tools will not touch >>>> these pages. There is a time where pages are !PageBuddy() and >>>> !PageOffline(). >>> >>> Well, this is fully under control of the driver, no? Reference count >>> shouldn't play any role here AFAIU. >> >> Yes, this is more a PG_offline issue. The reference count is an issue of >> reaching this call :) If we want to go via the buddy: >> >> 1. Clear PG_offline >> 2. Free page (gets set PG_buddy) >> >> Between 1 and 2, a dumping tool could not exclude these pages and therefore >> try to read from these pages. That is an issue IFF we want to return the >> pages back to the buddy instead of doing what I propose here. > > If the driver is going to free page to the allocator then it would have > to claim the page back and so it is usable again. If it cannot free it > then it would simply set the reference count to 0. It can even keep the > PG_offline if necessary although I have to admit I am not really sure it > is necessary.Yes it is necessary to keep PG_offline set to avoid anybody touching the page until the section is offline. (especially, dumping tools) But that's another discussion. The important part is to not go via the buddy.> >>>> 3) scan_movable_pages() ... >> >> Please note that when we don't put the pages back to the buddy and don't >> implement something like I have in this patch, we'll loop/fail here. >> Especially if we have pages with PG_offline + refcount >= 1 . > > You should have your reference count 0 at this stage as it is after > MEM_GOING_OFFLINE. > >>> MEM_CANCEL_OFFLINE could gain the reference back to balance the >>> MEM_GOING_OFFLINE step. >> >> The pages are already unisolated and could be used by the buddy. But again, >> I think you have an idea that tries to avoid putting pages to the buddy. > > Yeah, set_page_count(page, 0) if you do not want to release that page > from the notifier context to reflect that the page is ok to be offlined > with the rest. >I neither see how you deal with __test_page_isolated_in_pageblock() nor with __offline_isolated_pages(). Sorry, but what I read is incomplete and you probably have a full proposal in your head. Please read below how I think you want to solve it.> >>> explicit control via the reference count which is the standard way to >>> control the struct page life cycle. >>> >>> Anyway hooking into __put_page (which tends to be a hot path with >>> something that is barely used on most systems) doesn't sound nice to me. >>> This is the whole point which made me think about the whole reference >>> count approach in the first place. >> >> Again, the race I think that is possible >> >> somebody: get_page_unless_zero(page) >> virtio_mem: page_ref_dec(pfn_to_page(pfn) >> somebody: put_page() -> straight to the buddy > > Who is that somebody? I thought that it is only the owner/driver to have > a control over the page. Also the above is not possible as long as the > owner/driver keeps a reference to the PageOffline page throughout the > time it is marked that way. >I was reading include/linux/mm_types.h: "If you want to use the refcount field, it must be used in such a way that other CPUs temporarily incrementing and then decrementing the refcount does not cause problems" And that made me think "anybody can go ahead and try get_page_unless_zero()". If I am missing something here and this can indeed not happen (e.g., because PageOffline() pages are never mapped to user space), then I'll happily remove this code.> >>>>> If you can let the page go then just drop the reference count. The page >>>>> is isolated already by that time. If you cannot let it go for whatever >>>>> reason you can fail the offlining. >>>> >>>> We do have one hack in current MM code, which is the memory isolation >>>> notifier only used by CMM on PPC. It allows to "cheat" has_unmovable_pages() >>>> to skip over unmovable pages. But quite frankly, I rather want to get rid of >>>> that crap (something I am working on right now) than introduce new users. >>>> This stuff is racy as hell and for CMM, if memory offlining fails, the >>>> ballooned pages are suddenly part of the buddy. Fragile. >>> >>> Could you be more specific please? >> >> Let's take a look at how arch/powerpc/platforms/pseries/cmm.c handles it: >> >> cmm_memory_isolate_cb() -> cmm_count_pages(arg): >> - Memory Isolation notifier callback >> - Count how many pages in the range to be isolated are in the ballooon >> - This makes has_unmovable_pages() succeed. Pages can be isolated. >> >> cmm_memory_cb -> cmm_mem_going_offline(arg): >> - Memory notifier (online/offline) >> - Release all pages in the range to the buddy >> >> If offlining fails, the pages are now in the buddy, no longer in the >> balloon. MEM_CANCEL_ONLINE is too late, because the range is already >> unisolated again and the pages might be in use. >> >> For CMM it might not be that bad, because it can actually "reloan" any >> pages. In contrast, virtio-mem cannot simply go ahead and reuse random >> memory in unplugged. Any access to these pages would be evil. Giving them >> back to the buddy is dangerous. > > Thanks, I was not aware of that code. But from what I understood this is > an outright bug in this code because cmm_mem_going_offline releases > pages to the buddy allocator which is something that is not recoverable > on a later failure. >Yes, and that should be gone if we switch to balloon compaction. Let's recap what I suggest: "PageOffline() pages that have a reference count of 0 will be treated like free pages when offlining pages, allowing the containing memory block to get offlined. In case a driver wants to revive such a page, it has to synchronize against memory onlining/offlining (e.g., using memory notifiers) while incrementing the reference count. Also, a driver that relies in this feature is aware that re-onlining the memory will require to re-set the pages PageOffline() - e.g., via the online_page_callback_t." a) has_unmovable_pages() already skips over pages with a refcount of zero. The code I add to not skip over these pages when !MEMORY_OFFLINE is a pure optimization to fail early when trying to allocate from that range. b) __test_page_isolated_in_pageblock() is modified to skip over PageOffline() pages with a refcount of zero c) __offline_isolated_pages() is modified to skip over PageOffline() pages with a refcount of zero d) __put_page() is modified to not return pages to the buddy in any case as a safety net. We might be able to get rid of that. What I think you suggest: a) has_unmovable_pages() skips over all PageOffline() pages. This results in a lot of false negatives when trying to offline. Might be ok. b) The driver decrements the reference count of the PageOffline pages in MEM_GOING_OFFLINE. c) The driver increments the reference count of the PageOffline pages in MEM_CANCEL_OFFLINE. One issue might be that the pages are no longer isolated once we get that call. Might be ok. d) How to make __test_page_isolated_in_pageblock() succeed? Like I propose in this patch (PageOffline() + refcount == 0)? e) How to make __offline_isolated_pages() succeed? Like I propose in this patch (PageOffline() + refcount == 0)? In summary, is what you suggest simply delaying setting the reference count to 0 in MEM_GOING_OFFLINE instead of right away when the driver unpluggs the pages? What's the big benefit you see and I fail to see? -- Thanks, David / dhildenb
Michal Hocko
2019-Oct-23  09:43 UTC
[PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
On Tue 22-10-19 16:02:09, David Hildenbrand wrote: [...]> >>> MEM_CANCEL_OFFLINE could gain the reference back to balance the > >>> MEM_GOING_OFFLINE step. > >> > >> The pages are already unisolated and could be used by the buddy. But again, > >> I think you have an idea that tries to avoid putting pages to the buddy. > > > > Yeah, set_page_count(page, 0) if you do not want to release that page > > from the notifier context to reflect that the page is ok to be offlined > > with the rest. > > > > I neither see how you deal with __test_page_isolated_in_pageblock() nor with > __offline_isolated_pages(). Sorry, but what I read is incomplete and you > probably have a full proposal in your head. Please read below how I think > you want to solve it.Yeah, sorry that I am throwing incomplete ideas at you. I am just trying to really nail down how to deal with reference counting here because it is an important aspect.> >>> explicit control via the reference count which is the standard way to > >>> control the struct page life cycle. > >>> > >>> Anyway hooking into __put_page (which tends to be a hot path with > >>> something that is barely used on most systems) doesn't sound nice to me. > >>> This is the whole point which made me think about the whole reference > >>> count approach in the first place. > >> > >> Again, the race I think that is possible > >> > >> somebody: get_page_unless_zero(page) > >> virtio_mem: page_ref_dec(pfn_to_page(pfn) > >> somebody: put_page() -> straight to the buddy > > > > Who is that somebody? I thought that it is only the owner/driver to have > > a control over the page. Also the above is not possible as long as the > > owner/driver keeps a reference to the PageOffline page throughout the > > time it is marked that way. > > > > I was reading > > include/linux/mm_types.h: > > "If you want to use the refcount field, it must be used in such a way > that other CPUs temporarily incrementing and then decrementing the > refcount does not cause problems" > > And that made me think "anybody can go ahead and try get_page_unless_zero()". > > If I am missing something here and this can indeed not happen (e.g., > because PageOffline() pages are never mapped to user space), then I'll > happily remove this code.The point is that if the owner of the page is holding the only reference to the page then it is clear that nothing like that's happened. [...]> Let's recap what I suggest: > > "PageOffline() pages that have a reference count of 0 will be treated > like free pages when offlining pages, allowing the containing memory > block to get offlined. In case a driver wants to revive such a page, it > has to synchronize against memory onlining/offlining (e.g., using memory > notifiers) while incrementing the reference count. Also, a driver that > relies in this feature is aware that re-onlining the memory will require > to re-set the pages PageOffline() - e.g., via the online_page_callback_t."OK [...]> d) __put_page() is modified to not return pages to the buddy in any > case as a safety net. We might be able to get rid of that.I do not like exactly this part> What I think you suggest: > > a) has_unmovable_pages() skips over all PageOffline() pages. > This results in a lot of false negatives when trying to offline. Might be ok. > > b) The driver decrements the reference count of the PageOffline pages > in MEM_GOING_OFFLINE.Well, driver should make the page unreferenced or fail. What is done really depends on the specific driver> c) The driver increments the reference count of the PageOffline pages > in MEM_CANCEL_OFFLINE. One issue might be that the pages are no longer > isolated once we get that call. Might be ok.Only previous PageBuddy pages are returned to the allocator IIRC. Mostly because of MovablePage()> d) How to make __test_page_isolated_in_pageblock() succeed? > Like I propose in this patch (PageOffline() + refcount == 0)?Yep> e) How to make __offline_isolated_pages() succeed? > Like I propose in this patch (PageOffline() + refcount == 0)?Simply skip over PageOffline pages. Reference count should never be != 0 at this stage.> In summary, is what you suggest simply delaying setting the reference count to 0 > in MEM_GOING_OFFLINE instead of right away when the driver unpluggs the pages?Yes> What's the big benefit you see and I fail to see?Aparat from no hooks into __put_page it is also an explicit control over the page via reference counting. Do you see any downsides? -- Michal Hocko SUSE Labs
Reasonably Related Threads
- [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
- [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
- [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
- [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0
- [PATCH RFC v3 6/9] mm: Allow to offline PageOffline() pages with a reference count of 0