Philipp Stanner
2025-Jun-03 09:31 UTC
[RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job
An alternative version to [1], based on Tvrtko's suggestion from [2].

I tested this for Nouveau. Works.

I'm having, however, bigger problems properly porting the unit tests
and have seen various explosions. In the process I noticed that some
things in the unit tests aren't right and that a bit of a larger rework
will be necessary (for example, the timedout job callback must signal
the timedout fence, remove it from the list and so on).

Anyways. Please comment on the general idea.

@Tvrtko: As briefly brainstormed about on IRC, if you'd be willing to
take care of the unit tests patch, I could remove that one (and,
maaaaybe, the warning print patch) from the series, and we could merge
this RFC's successor version %N once it's ready. What do you think?

P.

[1] https://lore.kernel.org/dri-devel/20250522082742.148191-2-phasta at kernel.org/
[2] https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin at igalia.com/

Philipp Stanner (6):
  drm/sched: Avoid memory leaks with cancel_job() callback
  drm/sched/tests: Implement cancel_job()
  drm/sched: Warn if pending list is not empty
  drm/nouveau: Make fence container helper usable driver-wide
  drm/nouveau: Add new callback for scheduler teardown
  drm/nouveau: Remove waitque for sched teardown

 drivers/gpu/drm/nouveau/nouveau_fence.c       | 35 +++++----
 drivers/gpu/drm/nouveau/nouveau_fence.h       |  7 ++
 drivers/gpu/drm/nouveau/nouveau_sched.c       | 35 +++++----
 drivers/gpu/drm/nouveau/nouveau_sched.h       |  9 +--
 drivers/gpu/drm/nouveau/nouveau_uvmm.c        |  8 +--
 drivers/gpu/drm/scheduler/sched_main.c        | 37 ++++++----
 .../gpu/drm/scheduler/tests/mock_scheduler.c  | 71 +++++++------------
 drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
 include/drm/gpu_scheduler.h                   |  9 +++
 9 files changed, 115 insertions(+), 100 deletions(-)

-- 
2.49.0
Philipp Stanner
2025-Jun-03 09:31 UTC
[RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback
Since its inception, the GPU scheduler can leak memory if the driver
calls drm_sched_fini() while there are still jobs in flight.

The simplest way to solve this in a backwards-compatible manner is by
adding a new callback, drm_sched_backend_ops.cancel_job(), which
instructs the driver to signal the hardware fence associated with the
job. Afterwards, the scheduler can safely use the established
free_job() callback for freeing the job.

Implement the new backend_ops callback cancel_job().

Suggested-by: Tvrtko Ursulin <tvrtko.ursulin at igalia.com>
Signed-off-by: Philipp Stanner <phasta at kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
 include/drm/gpu_scheduler.h            |  9 +++++++
 2 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index d20726d7adf0..3f14f1e151fa 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1352,6 +1352,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 }
 EXPORT_SYMBOL(drm_sched_init);
 
+static void drm_sched_kill_remaining_jobs(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_job *job, *tmp;
+
+	/* All other accessors are stopped. No locking necessary. */
+	list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
+		sched->ops->cancel_job(job);
+		list_del(&job->list);
+		sched->ops->free_job(job);
+	}
+}
+
 /**
  * drm_sched_fini - Destroy a gpu scheduler
  *
@@ -1359,19 +1371,11 @@ EXPORT_SYMBOL(drm_sched_init);
  *
  * Tears down and cleans up the scheduler.
  *
- * This stops submission of new jobs to the hardware through
- * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
- * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
- * There is no solution for this currently. Thus, it is up to the driver to make
- * sure that:
- *
- * a) drm_sched_fini() is only called after for all submitted jobs
- *    drm_sched_backend_ops.free_job() has been called or that
- * b) the jobs for which drm_sched_backend_ops.free_job() has not been called
- *    after drm_sched_fini() ran are freed manually.
- *
- * FIXME: Take care of the above problem and prevent this function from leaking
- * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
+ * This stops submission of new jobs to the hardware through &struct
+ * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
+ * is implemented, all jobs will be canceled through it and afterwards cleaned
+ * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
+ * implemented, memory could leak.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
@@ -1401,6 +1405,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	/* Confirm no work left behind accessing device structures */
 	cancel_delayed_work_sync(&sched->work_tdr);
 
+	/* Avoid memory leaks if supported by the driver. */
+	if (sched->ops->cancel_job)
+		drm_sched_kill_remaining_jobs(sched);
+
 	if (sched->own_submit_wq)
 		destroy_workqueue(sched->submit_wq);
 	sched->ready = false;
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e62a7214e052..81dcbfc8c223 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -512,6 +512,15 @@ struct drm_sched_backend_ops {
 	 * and it's time to clean it up.
 	 */
 	void (*free_job)(struct drm_sched_job *sched_job);
+
+	/**
+	 * @cancel_job: Used by the scheduler to guarantee remaining jobs'
+	 * fences get signaled in drm_sched_fini().
+	 *
+	 * Drivers need to signal the passed job's hardware fence with
+	 * -ECANCELED in this callback. They must not free the job.
+	 */
+	void (*cancel_job)(struct drm_sched_job *sched_job);
 };
 
 /**
-- 
2.49.0
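A note for driver authors reading along: the kerneldoc above reduces
the driver-side work to a small pattern. Signal the job's hardware
fence with -ECANCELED and leave the freeing to free_job(). The
following is a minimal sketch, not part of the series; struct foo_job,
its embedded hw_fence and its fence_lock are made-up names standing in
for whatever the driver actually uses:

/*
 * Hypothetical driver-side cancel_job() implementation. struct foo_job,
 * its embedded hw_fence and its fence_lock are illustrative only.
 */
static void foo_sched_cancel_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = container_of(sched_job, struct foo_job, base);
	unsigned long flags;

	/* Signal the hardware fence with -ECANCELED; never free the job here. */
	spin_lock_irqsave(&job->fence_lock, flags);
	if (!dma_fence_is_signaled(&job->hw_fence)) {
		dma_fence_set_error(&job->hw_fence, -ECANCELED);
		dma_fence_signal(&job->hw_fence);
	}
	spin_unlock_irqrestore(&job->fence_lock, flags);
}

The function would then be hooked up next to run_job(), timedout_job()
and free_job() via the new .cancel_job member of struct
drm_sched_backend_ops.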
Philipp Stanner
2025-Jun-03 09:31 UTC
[RFC PATCH 2/6] drm/sched/tests: Implement cancel_job()
The GPU scheduler now provides a new callback to prevent memory leaks
on scheduler teardown. The callback is optional, but should be
implemented since it simplifies the cleanup code path.

Moreover, the unit tests serve as a resource for understanding the
canonical usage of the scheduler API and should therefore support the
callback.

Provide the backend_ops callback cancel_job() in the unit tests.

This code is WIP and still buggy. Take it more as an RFC. It seems
that it interferes negatively with timeout handling, which is broken
in the sense of the timeout handler not signaling the hardware fence.
That should be repaired and cleaned up, but it's probably better to do
that in a separate series.

Signed-off-by: Philipp Stanner <phasta at kernel.org>
---
 .../gpu/drm/scheduler/tests/mock_scheduler.c  | 71 +++++++------------
 drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
 2 files changed, 25 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
index 7f947ab9d322..33864b179704 100644
--- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
+++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
@@ -55,7 +55,7 @@ void drm_mock_sched_entity_free(struct drm_mock_sched_entity *entity)
 	drm_sched_entity_destroy(&entity->base);
 }
 
-static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
+static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job, int err)
 {
 	struct drm_mock_scheduler *sched =
 		drm_sched_to_mock_sched(job->base.sched);
@@ -63,8 +63,11 @@ static void drm_mock_sched_job_complete(struct drm_mock_sched_job *job)
 	lockdep_assert_held(&sched->lock);
 
 	job->flags |= DRM_MOCK_SCHED_JOB_DONE;
-	list_move_tail(&job->link, &sched->done_list);
-	dma_fence_signal_locked(&job->hw_fence);
+	list_del(&job->link);
+	if (!dma_fence_is_signaled(&job->hw_fence)) {
+		dma_fence_set_error(&job->hw_fence, err);
+		dma_fence_signal(&job->hw_fence);
+	}
 	complete(&job->done);
 }
 
@@ -89,7 +92,7 @@ drm_mock_sched_job_signal_timer(struct hrtimer *hrtimer)
 			break;
 
 		sched->hw_timeline.cur_seqno = job->hw_fence.seqno;
-		drm_mock_sched_job_complete(job);
+		drm_mock_sched_job_complete(job, 0);
 	}
 
 	spin_unlock_irqrestore(&sched->lock, flags);
@@ -212,26 +215,33 @@ mock_sched_timedout_job(struct drm_sched_job *sched_job)
 
 static void mock_sched_free_job(struct drm_sched_job *sched_job)
 {
-	struct drm_mock_scheduler *sched =
-		drm_sched_to_mock_sched(sched_job->sched);
 	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
-	unsigned long flags;
 
-	/* Remove from the scheduler done list. */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_del(&job->link);
-	spin_unlock_irqrestore(&sched->lock, flags);
 	dma_fence_put(&job->hw_fence);
-
 	drm_sched_job_cleanup(sched_job);
 
 	/* Mock job itself is freed by the kunit framework. */
 }
 
+static void mock_sched_cancel_job(struct drm_sched_job *sched_job)
+{
+	struct drm_mock_scheduler *sched =
+		drm_sched_to_mock_sched(sched_job->sched);
+	struct drm_mock_sched_job *job = drm_sched_job_to_mock_job(sched_job);
+
+	hrtimer_cancel(&job->timer);
+
+	spin_lock_irq(&sched->lock);
+	if (!dma_fence_is_signaled(&job->hw_fence))
+		drm_mock_sched_job_complete(job, -ECANCELED);
+	spin_unlock_irq(&sched->lock);
+}
+
 static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
 	.run_job = mock_sched_run_job,
 	.timedout_job = mock_sched_timedout_job,
-	.free_job = mock_sched_free_job
+	.free_job = mock_sched_free_job,
+	.cancel_job = mock_sched_cancel_job,
 };
 
 /**
@@ -265,7 +275,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
 	sched->hw_timeline.context = dma_fence_context_alloc(1);
 	atomic_set(&sched->hw_timeline.next_seqno, 0);
 	INIT_LIST_HEAD(&sched->job_list);
-	INIT_LIST_HEAD(&sched->done_list);
 	spin_lock_init(&sched->lock);
 
 	return sched;
@@ -280,38 +289,6 @@ struct drm_mock_scheduler *drm_mock_sched_new(struct kunit *test, long timeout)
  */
 void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
 {
-	struct drm_mock_sched_job *job, *next;
-	unsigned long flags;
-	LIST_HEAD(list);
-
-	drm_sched_wqueue_stop(&sched->base);
-
-	/* Force complete all unfinished jobs. */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &sched->job_list, link)
-		list_move_tail(&job->link, &list);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	list_for_each_entry(job, &list, link)
-		hrtimer_cancel(&job->timer);
-
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &list, link)
-		drm_mock_sched_job_complete(job);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	/*
-	 * Free completed jobs and jobs not yet processed by the DRM scheduler
-	 * free worker.
-	 */
-	spin_lock_irqsave(&sched->lock, flags);
-	list_for_each_entry_safe(job, next, &sched->done_list, link)
-		list_move_tail(&job->link, &list);
-	spin_unlock_irqrestore(&sched->lock, flags);
-
-	list_for_each_entry_safe(job, next, &list, link)
-		mock_sched_free_job(&job->base);
-
 	drm_sched_fini(&sched->base);
 }
 
@@ -346,7 +323,7 @@ unsigned int drm_mock_sched_advance(struct drm_mock_scheduler *sched,
 		if (sched->hw_timeline.cur_seqno < job->hw_fence.seqno)
 			break;
 
-		drm_mock_sched_job_complete(job);
+		drm_mock_sched_job_complete(job, 0);
 		found++;
 	}
 unlock:
diff --git a/drivers/gpu/drm/scheduler/tests/sched_tests.h b/drivers/gpu/drm/scheduler/tests/sched_tests.h
index fbba38137f0c..a905db835ccc 100644
--- a/drivers/gpu/drm/scheduler/tests/sched_tests.h
+++ b/drivers/gpu/drm/scheduler/tests/sched_tests.h
@@ -32,9 +32,8 @@
  *
  * @base: DRM scheduler base class
  * @test: Backpointer to owning the kunit test case
- * @lock: Lock to protect the simulated @hw_timeline, @job_list and @done_list
+ * @lock: Lock to protect the simulated @hw_timeline and @job_list
  * @job_list: List of jobs submitted to the mock GPU
- * @done_list: List of jobs completed by the mock GPU
 * @hw_timeline: Simulated hardware timeline has a @context, @next_seqno and
 *               @cur_seqno for implementing a struct dma_fence signaling the
 *               simulated job completion.
@@ -49,7 +48,6 @@ struct drm_mock_scheduler {
	spinlock_t		lock;

	struct list_head	job_list;
-	struct list_head	done_list;

	struct {
		u64		context;
-- 
2.49.0
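As a sanity check for where this is heading, the kind of test the new
callback enables might look roughly like the sketch below. It is
untested and assumes the mock helpers visible above, plus the existing
drm_mock_sched_entity_new(), drm_mock_sched_job_new(),
drm_mock_sched_job_submit() and drm_mock_sched_job_wait_scheduled()
keeping their current signatures:

static void drm_sched_basic_cancel(struct kunit *test)
{
	struct drm_mock_sched_entity *entity;
	struct drm_mock_scheduler *sched;
	struct drm_mock_sched_job *job;

	/* No job duration set, so the job can only finish via cancel_job(). */
	sched = drm_mock_sched_new(test, MAX_SCHEDULE_TIMEOUT);
	entity = drm_mock_sched_entity_new(test, DRM_SCHED_PRIORITY_NORMAL,
					   sched);
	job = drm_mock_sched_job_new(test, entity);

	drm_mock_sched_job_submit(job);
	KUNIT_ASSERT_TRUE(test, drm_mock_sched_job_wait_scheduled(job, HZ));

	/* Tear down with the job still pending... */
	drm_mock_sched_entity_free(entity);
	drm_mock_sched_fini(sched);

	/* ...and expect cancel_job() to have signaled it with -ECANCELED. */
	KUNIT_ASSERT_EQ(test, job->hw_fence.error, -ECANCELED);
}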
Philipp Stanner
2025-Jun-03 09:31 UTC
[RFC PATCH 3/6] drm/sched: Warn if pending list is not empty
drm_sched_fini() can still leak jobs if a driver does not provide the
cancel_job() callback. Warn if the pending list is not empty when the
scheduler gets torn down.

Signed-off-by: Philipp Stanner <phasta at kernel.org>
---
 drivers/gpu/drm/scheduler/sched_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 3f14f1e151fa..df12d5aaa1af 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1414,6 +1414,9 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	sched->ready = false;
 	kfree(sched->sched_rq);
 	sched->sched_rq = NULL;
+
+	if (!list_empty(&sched->pending_list))
+		dev_err(sched->dev, "Tearing down scheduler while jobs are pending!\n");
 }
 EXPORT_SYMBOL(drm_sched_fini);
-- 
2.49.0
Philipp Stanner
2025-Jun-03 09:31 UTC
[RFC PATCH 4/6] drm/nouveau: Make fence container helper usable driver-wide
In order to implement a new DRM GPU scheduler callback in Nouveau, a
helper for obtaining a nouveau_fence from a dma_fence is necessary.
Such a helper already exists inside nouveau_fence.c, called
from_fence().

Make that helper available to other C files with a more precise name.

Signed-off-by: Philipp Stanner <phasta at kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_fence.c | 20 +++++++-------------
 drivers/gpu/drm/nouveau/nouveau_fence.h |  6 ++++++
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index d5654e26d5bc..869d4335c0f4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -38,12 +38,6 @@
 static const struct dma_fence_ops nouveau_fence_ops_uevent;
 static const struct dma_fence_ops nouveau_fence_ops_legacy;
 
-static inline struct nouveau_fence *
-from_fence(struct dma_fence *fence)
-{
-	return container_of(fence, struct nouveau_fence, base);
-}
-
 static inline struct nouveau_fence_chan *
 nouveau_fctx(struct nouveau_fence *fence)
 {
@@ -77,7 +71,7 @@ nouveau_local_fence(struct dma_fence *fence, struct nouveau_drm *drm)
 	    fence->ops != &nouveau_fence_ops_uevent)
 		return NULL;
 
-	return from_fence(fence);
+	return to_nouveau_fence(fence);
 }
 
 void
@@ -268,7 +262,7 @@ nouveau_fence_done(struct nouveau_fence *fence)
 static long
 nouveau_fence_wait_legacy(struct dma_fence *f, bool intr, long wait)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	unsigned long sleep_time = NSEC_PER_MSEC / 1000;
 	unsigned long t = jiffies, timeout = t + wait;
 
@@ -448,7 +442,7 @@ static const char *nouveau_fence_get_get_driver_name(struct dma_fence *fence)
 
 static const char *nouveau_fence_get_timeline_name(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 
 	return !fctx->dead ? fctx->name : "dead channel";
@@ -462,7 +456,7 @@ static const char *nouveau_fence_get_timeline_name(struct dma_fence *f)
  */
 static bool nouveau_fence_is_signaled(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 	struct nouveau_channel *chan;
 	bool ret = false;
@@ -478,7 +472,7 @@ static bool nouveau_fence_is_signaled(struct dma_fence *f)
 
 static bool nouveau_fence_no_signaling(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 
 	/*
 	 * caller should have a reference on the fence,
@@ -503,7 +497,7 @@ static bool nouveau_fence_no_signaling(struct dma_fence *f)
 
 static void nouveau_fence_release(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 
 	kref_put(&fctx->fence_ref, nouveau_fence_context_put);
@@ -521,7 +515,7 @@ static const struct dma_fence_ops nouveau_fence_ops_legacy = {
 
 static bool nouveau_fence_enable_signaling(struct dma_fence *f)
 {
-	struct nouveau_fence *fence = from_fence(f);
+	struct nouveau_fence *fence = to_nouveau_fence(f);
 	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
 	bool ret;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index 8bc065acfe35..c3595c2197b5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -17,6 +17,12 @@ struct nouveau_fence {
 	unsigned long timeout;
 };
 
+static inline struct nouveau_fence *
+to_nouveau_fence(struct dma_fence *fence)
+{
+	return container_of(fence, struct nouveau_fence, base);
+}
+
 int  nouveau_fence_create(struct nouveau_fence **, struct nouveau_channel *);
 int  nouveau_fence_new(struct nouveau_fence **, struct nouveau_channel *);
 void nouveau_fence_unref(struct nouveau_fence **);
-- 
2.49.0
Philipp Stanner
2025-Jun-03 09:31 UTC
[RFC PATCH 5/6] drm/nouveau: Add new callback for scheduler teardown
There is a new callback for always tearing the scheduler down in a
leak-free, deadlock-free manner.

Port Nouveau as its first user by providing the scheduler with a
callback that ensures the fence context gets killed in
drm_sched_fini().

Signed-off-by: Philipp Stanner <phasta at kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_fence.c | 15 +++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_fence.h |  1 +
 drivers/gpu/drm/nouveau/nouveau_sched.c | 15 ++++++++++++++-
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index 869d4335c0f4..1c30ce686c6a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -240,6 +240,21 @@ nouveau_fence_emit(struct nouveau_fence *fence)
 	return ret;
 }
 
+void
+nouveau_fence_cancel(struct nouveau_fence *fence)
+{
+	struct nouveau_fence_chan *fctx = nouveau_fctx(fence);
+	unsigned long flags;
+
+	spin_lock_irqsave(&fctx->lock, flags);
+	if (!dma_fence_is_signaled(&fence->base)) {
+		dma_fence_set_error(&fence->base, -ECANCELED);
+		if (nouveau_fence_signal(fence))
+			nvif_event_block(&fctx->event);
+	}
+	spin_unlock_irqrestore(&fctx->lock, flags);
+}
+
 bool
 nouveau_fence_done(struct nouveau_fence *fence)
 {
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index c3595c2197b5..4d8f78cd6ebc 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -29,6 +29,7 @@ void nouveau_fence_unref(struct nouveau_fence **);
 
 int  nouveau_fence_emit(struct nouveau_fence *);
 bool nouveau_fence_done(struct nouveau_fence *);
+void nouveau_fence_cancel(struct nouveau_fence *fence);
 int  nouveau_fence_wait(struct nouveau_fence *, bool lazy, bool intr);
 int  nouveau_fence_sync(struct nouveau_bo *, struct nouveau_channel *, bool exclusive, bool intr);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 460a5fb02412..2ec62059c351 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -11,6 +11,7 @@
 #include "nouveau_exec.h"
 #include "nouveau_abi16.h"
 #include "nouveau_sched.h"
+#include "nouveau_chan.h"
 
 #define NOUVEAU_SCHED_JOB_TIMEOUT_MS		10000
 
@@ -393,10 +394,23 @@ nouveau_sched_free_job(struct drm_sched_job *sched_job)
 	nouveau_job_fini(job);
 }
 
+static void
+nouveau_sched_cancel_job(struct drm_sched_job *sched_job)
+{
+	struct nouveau_fence *fence;
+	struct nouveau_job *job;
+
+	job = to_nouveau_job(sched_job);
+	fence = to_nouveau_fence(job->done_fence);
+
+	nouveau_fence_cancel(fence);
+}
+
 static const struct drm_sched_backend_ops nouveau_sched_ops = {
 	.run_job = nouveau_sched_run_job,
 	.timedout_job = nouveau_sched_timedout_job,
 	.free_job = nouveau_sched_free_job,
+	.cancel_job = nouveau_sched_cancel_job,
 };
 
 static int
@@ -482,7 +496,6 @@ nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
 	return 0;
 }
 
-
 static void
 nouveau_sched_fini(struct nouveau_sched *sched)
 {
-- 
2.49.0
Philipp Stanner
2025-Jun-03 09:31 UTC
[RFC PATCH 6/6] drm/nouveau: Remove waitque for sched teardown
struct nouveau_sched contains a waitque needed to prevent
drm_sched_fini() from being called while there are still jobs pending.
Doing so, so far, would have caused memory leaks.

With the new memleak-free mode of operation switched on in
drm_sched_fini() by providing the callback nouveau_sched_cancel_job(),
the waitque is not necessary anymore.

Remove the waitque.

Signed-off-by: Philipp Stanner <phasta at kernel.org>
---
 drivers/gpu/drm/nouveau/nouveau_sched.c | 20 +++++++-------------
 drivers/gpu/drm/nouveau/nouveau_sched.h |  9 +++------
 drivers/gpu/drm/nouveau/nouveau_uvmm.c  |  8 ++++----
 3 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 2ec62059c351..7d9c3418e76b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -122,11 +122,9 @@ nouveau_job_done(struct nouveau_job *job)
 {
 	struct nouveau_sched *sched = job->sched;
 
-	spin_lock(&sched->job.list.lock);
+	spin_lock(&sched->job_list.lock);
 	list_del(&job->entry);
-	spin_unlock(&sched->job.list.lock);
-
-	wake_up(&sched->job.wq);
+	spin_unlock(&sched->job_list.lock);
 }
 
 void
@@ -307,9 +305,9 @@ nouveau_job_submit(struct nouveau_job *job)
 	}
 
 	/* Submit was successful; add the job to the schedulers job list. */
-	spin_lock(&sched->job.list.lock);
-	list_add(&job->entry, &sched->job.list.head);
-	spin_unlock(&sched->job.list.lock);
+	spin_lock(&sched->job_list.lock);
+	list_add(&job->entry, &sched->job_list.head);
+	spin_unlock(&sched->job_list.lock);
 
 	drm_sched_job_arm(&job->base);
 	job->done_fence = dma_fence_get(&job->base.s_fence->finished);
@@ -460,9 +458,8 @@ nouveau_sched_init(struct nouveau_sched *sched, struct nouveau_drm *drm,
 		goto fail_sched;
 
 	mutex_init(&sched->mutex);
-	spin_lock_init(&sched->job.list.lock);
-	INIT_LIST_HEAD(&sched->job.list.head);
-	init_waitqueue_head(&sched->job.wq);
+	spin_lock_init(&sched->job_list.lock);
+	INIT_LIST_HEAD(&sched->job_list.head);
 
 	return 0;
 
@@ -502,9 +499,6 @@ nouveau_sched_fini(struct nouveau_sched *sched)
 	struct drm_gpu_scheduler *drm_sched = &sched->base;
 	struct drm_sched_entity *entity = &sched->entity;
 
-	rmb(); /* for list_empty to work without lock */
-	wait_event(sched->job.wq, list_empty(&sched->job.list.head));
-
 	drm_sched_entity_fini(entity);
 	drm_sched_fini(drm_sched);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.h b/drivers/gpu/drm/nouveau/nouveau_sched.h
index 20cd1da8db73..b98c3f0bef30 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.h
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.h
@@ -103,12 +103,9 @@ struct nouveau_sched {
 	struct mutex mutex;
 
 	struct {
-		struct {
-			struct list_head head;
-			spinlock_t lock;
-		} list;
-		struct wait_queue_head wq;
-	} job;
+		struct list_head head;
+		spinlock_t lock;
+	} job_list;
 };
 
 int nouveau_sched_create(struct nouveau_sched **psched, struct nouveau_drm *drm,
diff --git a/drivers/gpu/drm/nouveau/nouveau_uvmm.c b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
index 48f105239f42..ddfc46bc1b3e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_uvmm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_uvmm.c
@@ -1019,8 +1019,8 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 	u64 end = addr + range;
 
 again:
-	spin_lock(&sched->job.list.lock);
-	list_for_each_entry(__job, &sched->job.list.head, entry) {
+	spin_lock(&sched->job_list.lock);
+	list_for_each_entry(__job, &sched->job_list.head, entry) {
 		struct nouveau_uvmm_bind_job *bind_job = to_uvmm_bind_job(__job);
 
 		list_for_each_op(op, &bind_job->ops) {
@@ -1030,7 +1030,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 
 			if (!(end <= op_addr || addr >= op_end)) {
 				nouveau_uvmm_bind_job_get(bind_job);
-				spin_unlock(&sched->job.list.lock);
+				spin_unlock(&sched->job_list.lock);
 				wait_for_completion(&bind_job->complete);
 				nouveau_uvmm_bind_job_put(bind_job);
 				goto again;
@@ -1038,7 +1038,7 @@ bind_validate_map_sparse(struct nouveau_job *job, u64 addr, u64 range)
 			}
 		}
 	}
-	spin_unlock(&sched->job.list.lock);
+	spin_unlock(&sched->job_list.lock);
 }
 
 static int
-- 
2.49.0
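For readers wondering why dropping the waitque is safe: the ordering
now comes entirely from drm_sched_fini(). Below is a summarizing
sketch of the resulting teardown flow. The comments paraphrase patch 1;
example_teardown() is an illustrative name, not code from the series:

static void example_teardown(struct nouveau_sched *sched)
{
	drm_sched_entity_fini(&sched->entity);	/* no new jobs can be pushed */

	/*
	 * drm_sched_fini() stops the submission work, syncs the timeout
	 * handler, and then, because cancel_job() is implemented, walks
	 * pending_list calling nouveau_sched_cancel_job() and
	 * nouveau_sched_free_job() on every remaining job. Nothing is
	 * left to wait for, which is what the removed waitque did.
	 */
	drm_sched_fini(&sched->base);
}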
Tvrtko Ursulin
2025-Jun-03 12:27 UTC
[RFC PATCH 0/6] drm/sched: Avoid memory leaks by canceling job-by-job
On 03/06/2025 10:31, Philipp Stanner wrote:
> An alternative version to [1], based on Tvrtko's suggestion from [2].
> 
> I tested this for Nouveau. Works.
> 
> I'm having, however, bigger problems properly porting the unit tests
> and have seen various explosions. In the process I noticed that some
> things in the unit tests aren't right and that a bit of a larger rework
> will be necessary (for example, the timedout job callback must signal
> the timedout fence, remove it from the list and so on).

The general approach I follow when implementing any mock component is
to implement only as much as is needed for a test to pass. I only add
more and rework when a test/functionality is added which requires it.

Specifically for the timedout callback signaling, I see that I had
exactly that added in the patch you linked as [2].

> Anyways. Please comment on the general idea.

I am obviously okay with it. :) Especially now that you verified it
works well for nouveau.

What I am not that ecstatic about is only getting the Suggested-by
credit in 1/6, given it is basically my patch with some cosmetic
changes like the kernel doc and the cancel loop extracted into a
helper.

> @Tvrtko: As briefly brainstormed about on IRC, if you'd be willing to
> take care of the unit tests patch, I could remove that one (and,
> maaaaybe, the warning print patch) from the series, and we could merge
> this RFC's successor version %N once it's ready. What do you think?

Okay in principle, but the first thing I would suggest you try is to
take my unit test adaptations from [2] verbatim. The benefit of keeping
everything in one series is more confidence that we are merging a solid
thing. But I can take it on myself as a follow-up too, if you want.

Regards,

Tvrtko

> 
> P.
> 
> [1] https://lore.kernel.org/dri-devel/20250522082742.148191-2-phasta at kernel.org/
> [2] https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin at igalia.com/
> 
> Philipp Stanner (6):
>   drm/sched: Avoid memory leaks with cancel_job() callback
>   drm/sched/tests: Implement cancel_job()
>   drm/sched: Warn if pending list is not empty
>   drm/nouveau: Make fence container helper usable driver-wide
>   drm/nouveau: Add new callback for scheduler teardown
>   drm/nouveau: Remove waitque for sched teardown
> 
>  drivers/gpu/drm/nouveau/nouveau_fence.c       | 35 +++++----
>  drivers/gpu/drm/nouveau/nouveau_fence.h       |  7 ++
>  drivers/gpu/drm/nouveau/nouveau_sched.c       | 35 +++++----
>  drivers/gpu/drm/nouveau/nouveau_sched.h       |  9 +--
>  drivers/gpu/drm/nouveau/nouveau_uvmm.c        |  8 +--
>  drivers/gpu/drm/scheduler/sched_main.c        | 37 ++++++----
>  .../gpu/drm/scheduler/tests/mock_scheduler.c  | 71 +++++++------------
>  drivers/gpu/drm/scheduler/tests/sched_tests.h |  4 +-
>  include/drm/gpu_scheduler.h                   |  9 +++
>  9 files changed, 115 insertions(+), 100 deletions(-)
> 