Danilo Krummrich
2025-Apr-22 14:52 UTC
[PATCH 3/5] drm/sched: Warn if pending list is not empty
On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote:> On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote: > > On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote: > > > > Sorry I don't see the argument for the claim it is relying on the > > > internals > > > with the re-positioned drm_sched_fini call. In that case it is > > > fully > > > compliant with: > > > > > > /** > > > ?* drm_sched_fini - Destroy a gpu scheduler > > > ?* > > > ?* @sched: scheduler instance > > > ?* > > > ?* Tears down and cleans up the scheduler. > > > ?* > > > ?* This stops submission of new jobs to the hardware through > > > ?* drm_sched_backend_ops.run_job(). Consequently, > > > drm_sched_backend_ops.free_job() > > > ?* will not be called for all jobs still in > > > drm_gpu_scheduler.pending_list. > > > ?* There is no solution for this currently. Thus, it is up to the > > > driver to > > > make > > > ?* sure that: > > > ?* > > > ?*? a) drm_sched_fini() is only called after for all submitted jobs > > > ?*???? drm_sched_backend_ops.free_job() has been called or that > > > ?*? b) the jobs for which drm_sched_backend_ops.free_job() has not > > > been > > > called > > > ?* > > > ?* FIXME: Take care of the above problem and prevent this function > > > from > > > leaking > > > ?* the jobs in drm_gpu_scheduler.pending_list under any > > > circumstances. > > > > > > ^^^ recommended solution b). > > > > This has been introduced recently with commit baf4afc58314 > > ("drm/sched: Improve > > teardown documentation") and I do not agree with this. The scheduler > > should > > *not* make any promises about implementation details to enable > > drivers to abuse > > their knowledge about component internals. > > > > This makes the problem *worse* as it encourages drivers to rely on > > implementation details, making maintainability of the scheduler even > > worse. > > > > For instance, what if I change the scheduler implementation, such > > that for every > > entry in the pending_list the scheduler allocates another internal > > object for > > ${something}? Then drivers would already fall apart leaking those > > internal > > objects. > > > > Now, obviously that's pretty unlikely, but I assume you get the idea. > > > > The b) paragraph in drm_sched_fini() should be removed for the given > > reasons. > > > > AFAICS, since the introduction of this commit, driver implementations > > haven't > > changed in this regard, hence we should be good. > > > > So, for me this doesn't change the fact that every driver > > implementation that > > just stops the scheduler at an arbitrary point of time and tries to > > clean things > > up manually relying on knowledge about component internals is broken. > > To elaborate on that, this documentation has been written so that we at > least have *some* documentation about the problem, instead of just > letting new drivers run into the knife. > > The commit explicitly introduced the FIXME, marking those two hacky > workarounds as undesirable. > > But back then we couldn't fix the problem quickly, so it was either > document the issue at least a bit, or leave it completely undocumented.Agreed, but b) really sounds like an invitation (or even justification) for doing the wrong thing, let's removed it.
Tvrtko Ursulin
2025-Apr-23 07:34 UTC
[PATCH 3/5] drm/sched: Warn if pending list is not empty
On 22/04/2025 15:52, Danilo Krummrich wrote:> On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote: >> On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote: >>> On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote: >> >>>> Sorry I don't see the argument for the claim it is relying on the >>>> internals >>>> with the re-positioned drm_sched_fini call. In that case it is >>>> fully >>>> compliant with: >>>> >>>> /** >>>> ?* drm_sched_fini - Destroy a gpu scheduler >>>> ?* >>>> ?* @sched: scheduler instance >>>> ?* >>>> ?* Tears down and cleans up the scheduler. >>>> ?* >>>> ?* This stops submission of new jobs to the hardware through >>>> ?* drm_sched_backend_ops.run_job(). Consequently, >>>> drm_sched_backend_ops.free_job() >>>> ?* will not be called for all jobs still in >>>> drm_gpu_scheduler.pending_list. >>>> ?* There is no solution for this currently. Thus, it is up to the >>>> driver to >>>> make >>>> ?* sure that: >>>> ?* >>>> ?*? a) drm_sched_fini() is only called after for all submitted jobs >>>> ?*???? drm_sched_backend_ops.free_job() has been called or that >>>> ?*? b) the jobs for which drm_sched_backend_ops.free_job() has not >>>> been >>>> called >>>> ?* >>>> ?* FIXME: Take care of the above problem and prevent this function >>>> from >>>> leaking >>>> ?* the jobs in drm_gpu_scheduler.pending_list under any >>>> circumstances. >>>> >>>> ^^^ recommended solution b). >>> >>> This has been introduced recently with commit baf4afc58314 >>> ("drm/sched: Improve >>> teardown documentation") and I do not agree with this. The scheduler >>> should >>> *not* make any promises about implementation details to enable >>> drivers to abuse >>> their knowledge about component internals. >>> >>> This makes the problem *worse* as it encourages drivers to rely on >>> implementation details, making maintainability of the scheduler even >>> worse. >>> >>> For instance, what if I change the scheduler implementation, such >>> that for every >>> entry in the pending_list the scheduler allocates another internal >>> object for >>> ${something}? Then drivers would already fall apart leaking those >>> internal >>> objects. >>> >>> Now, obviously that's pretty unlikely, but I assume you get the idea. >>> >>> The b) paragraph in drm_sched_fini() should be removed for the given >>> reasons. >>> >>> AFAICS, since the introduction of this commit, driver implementations >>> haven't >>> changed in this regard, hence we should be good. >>> >>> So, for me this doesn't change the fact that every driver >>> implementation that >>> just stops the scheduler at an arbitrary point of time and tries to >>> clean things >>> up manually relying on knowledge about component internals is broken. >> >> To elaborate on that, this documentation has been written so that we at >> least have *some* documentation about the problem, instead of just >> letting new drivers run into the knife. >> >> The commit explicitly introduced the FIXME, marking those two hacky >> workarounds as undesirable. >> >> But back then we couldn't fix the problem quickly, so it was either >> document the issue at least a bit, or leave it completely undocumented. > > Agreed, but b) really sounds like an invitation (or even justification) for > doing the wrong thing, let's removed it.IMO it is better to leave it. Regardless of whether it was added because some driver is actually operating like that, it does describe a _currently_ workable option to avoid memory leaks. Once a better method is there, ie. FIXME is addressed, then it can be removed or replaced. Regards, Tvrtko