Tvrtko Ursulin
2025-May-16 10:19 UTC
[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
On 16/05/2025 10:53, Danilo Krummrich wrote:> On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote: >> On 24/04/2025 10:55, Philipp Stanner wrote: >>> + * @kill_fence_context: kill the fence context belonging to this scheduler >> >> Which fence context would that be? ;) > > There's one one per ring and a scheduler instance represents a single ring. So, > what should be specified here?I was pointing out the fact not all drivers are 1:1 sched:entity. So plural at least. Thought it would be obvious from the ";)".>> Also, "fence context" would be a new terminology in gpu_scheduler.h API >> level. You could call it ->sched_fini() or similar to signify at which point >> in the API it gets called and then the fact it takes sched as parameter >> would be natural. > > The driver should tear down the fence context in this callback, not the while > scheduler. ->sched_fini() would hence be misleading.Not the while what? Not while drm_sched_fini()? Could call it sched_kill() or anything. My point is that we dont' have "fence context" in the API but entities so adding a new term sounds sub-optimal.>> We also probably want some commentary on the topic of indefinite (or very >> long at least) blocking a thread exit / SIGINT/TERM/KILL time. > > You mean in case the driver does implement the callback, but does *not* properly > tear down the fence context? So, you ask for describing potential consequences > of drivers having bugs in the implementation of the callback? Or something else?I was proposing the kerneldoc for the vfunc should document the callback must not block, or if blocking is unavoidable, either document a guideline on how long is acceptable. Maybe even enforce a limit in the scheduler core itself. Regards, Tvrtko
Danilo Krummrich
2025-May-16 10:54 UTC
[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:> > On 16/05/2025 10:53, Danilo Krummrich wrote: > > On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote: > > > On 24/04/2025 10:55, Philipp Stanner wrote: > > > > + * @kill_fence_context: kill the fence context belonging to this scheduler > > > > > > Which fence context would that be? ;) > > > > There's one one per ring and a scheduler instance represents a single ring. So, > > what should be specified here? > > I was pointing out the fact not all drivers are 1:1 sched:entity.I'm well aware, but how is that relevant? Entities don't have an associated fence context, but a GPU Ring (either hardware or software) has, which a scheduler instance represents.> Thought it would be obvious from the ";)".I should read from ";)" that you refer to a 1:N-sched:entity relationship (which doesn't seem to be related)?> > > Also, "fence context" would be a new terminology in gpu_scheduler.h API > > > level. You could call it ->sched_fini() or similar to signify at which point > > > in the API it gets called and then the fact it takes sched as parameter > > > would be natural. > > > > The driver should tear down the fence context in this callback, not the while > > scheduler. ->sched_fini() would hence be misleading. > > Not the while what? Not while drm_sched_fini()?*whole> Could call it sched_kill() > or anything. My point is that we dont' have "fence context" in the API but > entities so adding a new term sounds sub-optimal.In the callback the driver should neither tear down an entity, nor the whole scheduler, hence we shouldn't call it like that. sched_kill() is therefore misleading as well. It should be named after what it actually does (or should do). Feel free to propose a different name that conforms with that.> > > We also probably want some commentary on the topic of indefinite (or very > > > long at least) blocking a thread exit / SIGINT/TERM/KILL time. > > > > You mean in case the driver does implement the callback, but does *not* properly > > tear down the fence context? So, you ask for describing potential consequences > > of drivers having bugs in the implementation of the callback? Or something else? > > I was proposing the kerneldoc for the vfunc should document the callback > must not block, or if blocking is unavoidable, either document a guideline > on how long is acceptable. Maybe even enforce a limit in the scheduler core > itself.Killing the fence context shouldn't block.