thr3ads.net - Nouveau - [Nouveau] [PATCH 6/7] nouveau/dmem: Evict device private memory during release [Sep 2022]

John Hubbard

2022-Sep-26 22:14 UTC

[Nouveau] [PATCH 6/7] nouveau/dmem: Evict device private memory during release

On 9/26/22 14:35, Lyude Paul wrote:
>> +	for (i = 0; i < npages; i++) {
>> +		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
>> +			struct page *dpage;
>> +
>> +			/*
>> +			 * _GFP_NOFAIL because the GPU is going away and there
>> +			 * is nothing sensible we can do if we can't copy the
>> +			 * data back.
>> +			 */
> 
> You'll have to excuse me for a moment since this area of nouveau isn't one of
> my strongpoints, but are we sure about this? IIRC __GFP_NOFAIL means infinite
> retry, in the case of a GPU hotplug event I would assume we would rather just
> stop trying to migrate things to the GPU and just drop the data instead of
> hanging on infinite retries.
>

Hi Lyude!

Actually, I really think it's better in this case to keep trying
(presumably not necessarily infinitely, but only until memory becomes
available), rather than failing out and corrupting data.

That's because I'm not sure it's completely clear that this memory is
discardable. And at some point, we're going to make this all work with
file-backed memory, which will *definitely* not be discardable--I
realize that we're not there yet, of course.

But here, it's reasonable to commit to just retrying indefinitely,
really. Memory should eventually show up. And if it doesn't, then
restarting the machine is better than corrupting data, generally.

thanks,

-- 
John Hubbard
NVIDIA

Alistair Popple

2022-Sep-26 23:45 UTC

[Nouveau] [PATCH 6/7] nouveau/dmem: Evict device private memory during release

John Hubbard <jhubbard at nvidia.com> writes:
> On 9/26/22 14:35, Lyude Paul wrote:
>>> +	for (i = 0; i < npages; i++) {
>>> +		if (src_pfns[i] & MIGRATE_PFN_MIGRATE) {
>>> +			struct page *dpage;
>>> +
>>> +			/*
>>> +			 * _GFP_NOFAIL because the GPU is going away and there
>>> +			 * is nothing sensible we can do if we can't copy the
>>> +			 * data back.
>>> +			 */
>>
>> You'll have to excuse me for a moment since this area of nouveau isn't one of
>> my strongpoints, but are we sure about this? IIRC __GFP_NOFAIL means infinite
>> retry, in the case of a GPU hotplug event I would assume we would rather just
>> stop trying to migrate things to the GPU and just drop the data instead of
>> hanging on infinite retries.

No problem, thanks for taking a look!

> Hi Lyude!
>
> Actually, I really think it's better in this case to keep trying
> (presumably not necessarily infinitely, but only until memory becomes
> available), rather than failing out and corrupting data.
>
> That's because I'm not sure it's completely clear that this memory is
> discardable. And at some point, we're going to make this all work with
> file-backed memory, which will *definitely* not be discardable--I
> realize that we're not there yet, of course.
>
> But here, it's reasonable to commit to just retrying indefinitely,
> really. Memory should eventually show up. And if it doesn't, then
> restarting the machine is better than corrupting data, generally.

The memory is definitely not discardable here if the migration failed
because that implies it is still mapped into some userspace process.

We could avoid restarting the machine by doing something similar to what
happens during memory failure and killing every process that maps the
page(s). But overall I think it's better to retry until memory is
available, because that allows things like reclaim to work and in the
worst case allows the OOM killer to select an appropriate task to kill.
It also won't cause data corruption if/when we have file-backed memory.

> thanks,
