thr3ads.net - Linux Virtualization - [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals [Jun 2023]

John Hubbard

2023-Jun-27 21:34 UTC

[PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals

On 6/27/23 08:14, Michal Hocko wrote:
> On Tue 27-06-23 16:57:53, David Hildenbrand wrote:
...
>>>> IIUC (John can correct me if I am wrong):
>>>>
>>>> 1) The process holds the device node open
>>>> 2) The process gets killed or quits
>>>> 3) As the process gets torn down, it closes the device node
>>>> 4) Closing the device node results in the driver removing the device and
>>>>      calling offline_and_remove_memory()
>>>>
>>>> So it's not a "tear down process" that triggers that offlining_removal
>>>> somehow explicitly, it's just a side-product of it letting go of the device
>>>> node as the process gets torn down.
>>>
>>> Isn't that just fragile? The operation might fail for other reasons. Why
>>> cannot there be a hold on the resource to control the tear down
>>> explicitly?
>>
>> I'll let John comment on that. But from what I understood, in most setups
>> where ZONE_MOVABLE gets used for hotplugged memory
>> offline_and_remove_memory() succeeds and allows for reusing the device later
>> without a reboot.
>>
>> For the cases where it doesn't work, a reboot is required.

That is exactly correct. That's what we ran into.

And there are workarounds (for example: kthreads don't have any signals
pending...), but I did want to follow through here and make -mm aware of the
problem. And see if there is a better way.
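
To make the scenario and the kthread workaround concrete, here is a minimal userspace model in C. It is not kernel code: `struct model_task`, `model_offline_and_remove()`, `model_release_direct()` and `model_release_via_worker()` are hypothetical names, and the only behavior assumed is the one discussed in this thread, namely that offlining gives up when the calling task has a fatal signal pending.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a task; this is not the kernel's task_struct. */
struct model_task {
    bool fatal_signal_pending;
};

/*
 * Hypothetical model of offlining: each pass first checks whether the
 * calling task is being killed and, if so, gives up with -EINTR.
 */
static int model_offline_and_remove(const struct model_task *caller, int passes)
{
    for (int pass = 0; pass < passes; pass++) {
        if (caller->fatal_signal_pending)
            return -EINTR;
        /* A real implementation would migrate pages here. */
    }
    return 0;
}

/* Release path run directly in the context of the task closing the node. */
static int model_release_direct(const struct model_task *closing_task)
{
    return model_offline_and_remove(closing_task, 3);
}

/*
 * Workaround model: hand the work to a worker context (think kthread)
 * that has no signals pending, regardless of the closing task's state.
 */
static int model_release_via_worker(const struct model_task *closing_task)
{
    const struct model_task worker = { .fatal_signal_pending = false };

    (void)closing_task; /* the closing task's signals no longer matter */
    return model_offline_and_remove(&worker, 3);
}

/* Walks through both paths for a process that was killed. */
static void demo_release_paths(void)
{
    const struct model_task killed = { .fatal_signal_pending = true };

    assert(model_release_direct(&killed) == -EINTR);
    assert(model_release_via_worker(&killed) == 0);
}
```

In this model the direct path fails exactly in the "process gets killed" case, which is when the device most needs to be torn down cleanly; the worker path sidesteps that, but only by moving the call into a different context.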

...
>>> It seems that offline_and_remove_memory is using a wrong operation then.
>>> If it wants an opportunistic offlining with some sort of policy. Timeout
>>> might be just one policy to use but failure mode or a retry count might
>>> be a better fit for some users. So rather than (ab)using offline_pages,
>>> would it make more sense to extract basic offlining steps and allow
>>> drivers like virtio-mem to reuse them and define their own policy?

...like this, perhaps. Sounds promising!
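
One way to picture the suggestion of separating the basic offlining step from the policy is the following userspace sketch in C. Everything here is an assumption for illustration: `offline_attempt_fn`, `struct offline_policy`, `offline_with_policy()` and the fake memory block are hypothetical names, not an existing or proposed kernel interface, and time is modeled as one abstract tick per failed attempt.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical basic step: one offlining attempt. Returns 0 on success,
 * -EBUSY if the memory cannot be offlined yet, or another negative errno
 * for a hard failure.
 */
typedef int (*offline_attempt_fn)(void *ctx);

/* Hypothetical driver-defined policy; a value <= 0 disables that limit. */
struct offline_policy {
    int max_attempts;   /* retry-count policy */
    long budget_ticks;  /* timeout policy, one tick per failed attempt */
};

/*
 * Repeats the basic step until it succeeds, fails hard, or the policy
 * says stop. With both limits disabled this retries for as long as the
 * step keeps returning -EBUSY.
 */
static int offline_with_policy(offline_attempt_fn attempt, void *ctx,
                               const struct offline_policy *policy)
{
    long ticks = 0;

    for (int tries = 1; ; tries++) {
        int ret = attempt(ctx);

        if (ret != -EBUSY)
            return ret; /* success or hard failure */
        ticks++;
        if (policy->max_attempts > 0 && tries >= policy->max_attempts)
            return -EBUSY;
        if (policy->budget_ticks > 0 && ticks >= policy->budget_ticks)
            return -ETIMEDOUT;
    }
}

/* Fake memory block that stays busy for a fixed number of attempts. */
struct fake_memory {
    int busy_rounds;
};

static int fake_attempt(void *ctx)
{
    struct fake_memory *mem = ctx;

    if (mem->busy_rounds > 0) {
        mem->busy_rounds--;
        return -EBUSY;
    }
    return 0;
}

/* Shows a retry-count policy and a timeout policy on the same basic step. */
static void demo_policies(void)
{
    struct fake_memory quick = { .busy_rounds = 2 };
    struct fake_memory stuck = { .busy_rounds = 10 };
    const struct offline_policy retry3 = { .max_attempts = 3, .budget_ticks = 0 };
    const struct offline_policy timeout4 = { .max_attempts = 0, .budget_ticks = 4 };

    assert(offline_with_policy(fake_attempt, &quick, &retry3) == 0);
    assert(offline_with_policy(fake_attempt, &stuck, &timeout4) == -ETIMEDOUT);
}
```

The point of the split is that the loop around the basic step, rather than the step itself, decides when to stop, so a timeout, a retry count, or a fail-fast mode become choices of the caller.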


thanks,
-- 
John Hubbard
NVIDIA
