Tvrtko Ursulin
2025-Jun-12 14:17 UTC
[RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback
On 03/06/2025 10:31, Philipp Stanner wrote:
> Since its inception, the GPU scheduler can leak memory if the driver
> calls drm_sched_fini() while there are still jobs in flight.
>
> The simplest way to solve this in a backwards compatible manner is by
> adding a new callback, drm_sched_backend_ops.cancel_job(), which
> instructs the driver to signal the hardware fence associated with the
> job. Afterwards, the scheduler can safely use the established free_job()
> callback for freeing the job.
>
> Implement the new backend_ops callback cancel_job().
>
> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>

Please just add the link to the patch here (it is only in the cover
letter):

Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin at igalia.com/

And you probably want to take the unit test modifications from the same
patch too. You could put them in the same patch or separate.

Regards,

Tvrtko

> Signed-off-by: Philipp Stanner <phasta at kernel.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
>  include/drm/gpu_scheduler.h            |  9 +++++++
>  2 files changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index d20726d7adf0..3f14f1e151fa 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1352,6 +1352,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>  }
>  EXPORT_SYMBOL(drm_sched_init);
>
> +static void drm_sched_kill_remaining_jobs(struct drm_gpu_scheduler *sched)
> +{
> +	struct drm_sched_job *job, *tmp;
> +
> +	/* All other accessors are stopped. No locking necessary. */
> +	list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
> +		sched->ops->cancel_job(job);
> +		list_del(&job->list);
> +		sched->ops->free_job(job);
> +	}
> +}
> +
>  /**
>   * drm_sched_fini - Destroy a gpu scheduler
>   *
> @@ -1359,19 +1371,11 @@ EXPORT_SYMBOL(drm_sched_init);
>   *
>   * Tears down and cleans up the scheduler.
>   *
> - * This stops submission of new jobs to the hardware through
> - * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
> - * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
> - * There is no solution for this currently. Thus, it is up to the driver to make
> - * sure that:
> - *
> - *  a) drm_sched_fini() is only called after for all submitted jobs
> - *     drm_sched_backend_ops.free_job() has been called or that
> - *  b) the jobs for which drm_sched_backend_ops.free_job() has not been called
> - *     after drm_sched_fini() ran are freed manually.
> - *
> - * FIXME: Take care of the above problem and prevent this function from leaking
> - * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
> + * This stops submission of new jobs to the hardware through &struct
> + * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
> + * is implemented, all jobs will be canceled through it and afterwards cleaned
> + * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
> + * implemented, memory could leak.
>   */
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
> @@ -1401,6 +1405,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  	/* Confirm no work left behind accessing device structures */
>  	cancel_delayed_work_sync(&sched->work_tdr);
>
> +	/* Avoid memory leaks if supported by the driver. */
> +	if (sched->ops->cancel_job)
> +		drm_sched_kill_remaining_jobs(sched);
> +
>  	if (sched->own_submit_wq)
>  		destroy_workqueue(sched->submit_wq);
>  	sched->ready = false;
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index e62a7214e052..81dcbfc8c223 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -512,6 +512,15 @@ struct drm_sched_backend_ops {
> 	 * and it's time to clean it up.
> 	 */
> 	void (*free_job)(struct drm_sched_job *sched_job);
> +
> +	/**
> +	 * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
> +	 * get signaled in drm_sched_fini().
> +	 *
> +	 * Drivers need to signal the passed job's hardware fence with
> +	 * -ECANCELED in this callback. They must not free the job.
> +	 */
> +	void (*cancel_job)(struct drm_sched_job *sched_job);
>  };
>
>  /**
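As an illustration of the contract described in the @cancel_job kerneldoc above, a driver-side implementation could look roughly like the sketch below. The example_job structure, its embedded hw_fence member, the to_example_job() helper and all example_sched_* functions are hypothetical and only stand in for driver-specific code; the usual <linux/dma-fence.h> and <drm/gpu_scheduler.h> includes are assumed. Only dma_fence_set_error() and dma_fence_signal() are existing DMA fence helpers.

	/* Sketch of a driver-side cancel_job() implementation (hypothetical driver). */
	static void example_sched_cancel_job(struct drm_sched_job *sched_job)
	{
		/* to_example_job() and hw_fence are made-up, driver-specific. */
		struct example_job *job = to_example_job(sched_job);

		/*
		 * Flag the hardware fence as canceled and signal it, so that
		 * drm_sched_fini() can afterwards clean the job up through
		 * free_job(). The job itself must not be freed here.
		 */
		dma_fence_set_error(&job->hw_fence, -ECANCELED);
		dma_fence_signal(&job->hw_fence);
	}

Wiring it up would then only require adding the callback to the driver's ops table, for example:

	static const struct drm_sched_backend_ops example_sched_ops = {
		.run_job	= example_sched_run_job,
		.timedout_job	= example_sched_timedout_job,
		.free_job	= example_sched_free_job,
		.cancel_job	= example_sched_cancel_job,
	};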
Philipp Stanner
2025-Jun-12 14:21 UTC
[RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback
On Thu, 2025-06-12 at 15:17 +0100, Tvrtko Ursulin wrote:
> 
> On 03/06/2025 10:31, Philipp Stanner wrote:
> > Since its inception, the GPU scheduler can leak memory if the driver
> > calls drm_sched_fini() while there are still jobs in flight.
> > 
> > The simplest way to solve this in a backwards compatible manner is by
> > adding a new callback, drm_sched_backend_ops.cancel_job(), which
> > instructs the driver to signal the hardware fence associated with the
> > job. Afterwards, the scheduler can safely use the established free_job()
> > callback for freeing the job.
> > 
> > Implement the new backend_ops callback cancel_job().
> > 
> > Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
> 
> Please just add the link to the patch here (it is only in the cover
> letter):
> 
> Link:
> https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin at igalia.com/

That I can do, sure

> 
> And you probably want to take the unit test modifications from the same
> patch too. You could put them in the same patch or separate.

Necessary adjustments for the unit tests are already implemented and
are waiting for review separately, since this can be done independently
from this entire series:

https://lore.kernel.org/dri-devel/20250605134154.191764-2-phasta at kernel.org/

Thx
P.

> 
> Regards,
> 
> Tvrtko
Tvrtko Ursulin
2025-Jun-16 09:27 UTC
[RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback
On 12/06/2025 15:20, Philipp Stanner wrote:
> On Thu, 2025-06-12 at 15:17 +0100, Tvrtko Ursulin wrote:
>>
>> On 03/06/2025 10:31, Philipp Stanner wrote:
>>> Since its inception, the GPU scheduler can leak memory if the driver
>>> calls drm_sched_fini() while there are still jobs in flight.
>>>
>>> The simplest way to solve this in a backwards compatible manner is by
>>> adding a new callback, drm_sched_backend_ops.cancel_job(), which
>>> instructs the driver to signal the hardware fence associated with the
>>> job. Afterwards, the scheduler can safely use the established free_job()
>>> callback for freeing the job.
>>>
>>> Implement the new backend_ops callback cancel_job().
>>>
>>> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
>>
>> Please just add the link to the patch here (it is only in the cover
>> letter):
>>
>> Link:
>> https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin at igalia.com/
> 
> That I can do, sure

Cool, with that, for this patch:

Acked-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>

>> And you probably want to take the unit test modifications from the same
>> patch too. You could put them in the same patch or separate.
> 
> Necessary adjustments for the unit tests are already implemented and
> are waiting for review separately, since this can be done independently
> from this entire series:
> 
> https://lore.kernel.org/dri-devel/20250605134154.191764-2-phasta at kernel.org/

For me it would make most sense to fold that into 2/6 from this series.
I don't see it making sense as standalone. So if you could repost the
series with it integrated I will give it a spin and can review that
patch at least.

Regards,

Tvrtko