Danilo Krummrich
2025-May-16 10:54 UTC
[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:
>
> On 16/05/2025 10:53, Danilo Krummrich wrote:
> > On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
> > > On 24/04/2025 10:55, Philipp Stanner wrote:
> > > > + * @kill_fence_context: kill the fence context belonging to this scheduler
> > >
> > > Which fence context would that be? ;)
> >
> > There's one per ring and a scheduler instance represents a single ring. So,
> > what should be specified here?
>
> I was pointing out the fact that not all drivers are 1:1 sched:entity.

I'm well aware, but how is that relevant? Entities don't have an associated
fence context, but a GPU ring (either hardware or software) does, and that's
what a scheduler instance represents.

> Thought it would be obvious from the ";)".

Am I supposed to read from the ";)" that you're referring to a 1:N sched:entity
relationship (which doesn't seem related)?

> > > Also, "fence context" would be new terminology at the gpu_scheduler.h API
> > > level. You could call it ->sched_fini() or similar to signify at which point
> > > in the API it gets called and then the fact it takes sched as parameter
> > > would be natural.
> >
> > The driver should tear down the fence context in this callback, not the while
> > scheduler. ->sched_fini() would hence be misleading.
>
> Not the while what? Not while drm_sched_fini()?

*whole

> Could call it sched_kill()
> or anything. My point is that we don't have "fence context" in the API,
> only entities, so adding a new term sounds sub-optimal.

In this callback the driver should tear down neither an entity nor the whole
scheduler, hence we shouldn't name it after either of those. sched_kill() would
be misleading as well.

It should be named after what it actually does (or should do). Feel free to
propose a different name that conforms with that.

> > > We also probably want some commentary on the topic of indefinite (or very
> > > long at least) blocking a thread exit / SIGINT/TERM/KILL time.
> >
> > You mean in case the driver does implement the callback, but does *not* properly
> > tear down the fence context? So, you're asking me to describe the potential
> > consequences of drivers having bugs in their implementation of the callback?
> > Or something else?
>
> I was proposing that the kerneldoc for the vfunc should document that the
> callback must not block, or, if blocking is unavoidable, document a guideline
> on how long is acceptable. Maybe even enforce a limit in the scheduler core
> itself.

Killing the fence context shouldn't block.
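For readers following the thread, here is a minimal sketch of the vfunc under
discussion and of a non-blocking driver-side implementation. Only the
@kill_fence_context name and its kerneldoc line come from the quoted series;
the callback signature, struct my_ring, struct my_fence, the pending_fences
list, and the locking scheme are hypothetical stand-ins, not code from the
patch.

	/*
	 * Sketch of the callback in struct drm_sched_backend_ops; the
	 * actual signature in the series may differ.
	 *
	 * @kill_fence_context: kill the fence context belonging to this
	 * scheduler, i.e. force-signal every hardware fence returned
	 * from ->run_job() that has not signaled yet. Must not block.
	 */
	void (*kill_fence_context)(struct drm_gpu_scheduler *sched);

	/*
	 * Hypothetical driver implementation: fail all pending fences at
	 * once. Assumes ring->fence_lock protects the list and is *not*
	 * also used as the fences' own dma_fence lock, since
	 * dma_fence_signal() takes that one internally.
	 */
	static void my_ring_kill_fence_context(struct drm_gpu_scheduler *sched)
	{
		struct my_ring *ring = container_of(sched, struct my_ring, sched);
		struct my_fence *f;
		unsigned long flags;

		spin_lock_irqsave(&ring->fence_lock, flags);
		list_for_each_entry(f, &ring->pending_fences, node) {
			/* Mark the fence as failed, then signal it. */
			dma_fence_set_error(&f->base, -ECANCELED);
			dma_fence_signal(&f->base);
		}
		spin_unlock_irqrestore(&ring->fence_lock, flags);
	}

Both dma_fence_set_error() and dma_fence_signal() are non-blocking, which is
what lets such a callback satisfy the "must not block" expectation discussed
below.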
Tvrtko Ursulin
2025-May-16 11:35 UTC
[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long
On 16/05/2025 11:54, Danilo Krummrich wrote:
> On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:
>>
>> On 16/05/2025 10:53, Danilo Krummrich wrote:
>>> On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
>>>> On 24/04/2025 10:55, Philipp Stanner wrote:
>>>>> + * @kill_fence_context: kill the fence context belonging to this scheduler
>>>>
>>>> Which fence context would that be? ;)
>>>
>>> There's one per ring and a scheduler instance represents a single ring. So,
>>> what should be specified here?
>>
>> I was pointing out the fact that not all drivers are 1:1 sched:entity.
>
> I'm well aware, but how is that relevant? Entities don't have an associated
> fence context, but a GPU ring (either hardware or software) does, and that's
> what a scheduler instance represents.

Aha! Well... how is it relevant, and do entities not have an associated fence
context? Well, entity->fence_context... that was my first association this
whole time. It never crossed my mind that this is about the hardware fence
context. Proof in the pudding that the naming should be improved.

But I also don't think there is a requirement for the fences returned from
->run_job() to have a single context, which again makes it not the best name.

>> Thought it would be obvious from the ";)".
>
> Am I supposed to read from the ";)" that you're referring to a 1:N sched:entity
> relationship (which doesn't seem related)?
>
>>>> Also, "fence context" would be new terminology at the gpu_scheduler.h API
>>>> level. You could call it ->sched_fini() or similar to signify at which point
>>>> in the API it gets called and then the fact it takes sched as parameter
>>>> would be natural.
>>>
>>> The driver should tear down the fence context in this callback, not the while
>>> scheduler. ->sched_fini() would hence be misleading.
>>
>> Not the while what? Not while drm_sched_fini()?
>
> *whole
>
>> Could call it sched_kill()
>> or anything. My point is that we don't have "fence context" in the API,
>> only entities, so adding a new term sounds sub-optimal.
>
> In this callback the driver should tear down neither an entity nor the whole
> scheduler, hence we shouldn't name it after either of those. sched_kill() would
> be misleading as well.

->sched_exit()? ->sched_stop()? ->sched_cleanup()?

> It should be named after what it actually does (or should do). Feel free to
> propose a different name that conforms with that.
>
>>>> We also probably want some commentary on the topic of indefinite (or very
>>>> long at least) blocking a thread exit / SIGINT/TERM/KILL time.
>>>
>>> You mean in case the driver does implement the callback, but does *not* properly
>>> tear down the fence context? So, you're asking me to describe the potential
>>> consequences of drivers having bugs in their implementation of the callback?
>>> Or something else?
>>
>> I was proposing that the kerneldoc for the vfunc should document that the
>> callback must not block, or, if blocking is unavoidable, document a guideline
>> on how long is acceptable. Maybe even enforce a limit in the scheduler core
>> itself.
>
> Killing the fence context shouldn't block.

Cool. And maybe convert the wait_event to wait_event_timeout with a warning, to
be robust.

Mind you, I still think a solution that does not add new state machinery
should be preferred.

Regards,

Tvrtko
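To make the wait_event_timeout suggestion concrete, here is a minimal sketch of
the conversion, assuming the teardown path waits on some waitqueue until the
scheduler's pending-job list drains. The teardown_wq name and the ten-second
budget are hypothetical; wait_event_timeout(), which returns 0 on timeout, and
WARN() are stock kernel primitives.

	/* Before: can block forever if a driver leaks an unsignaled fence. */
	wait_event(sched->teardown_wq, list_empty(&sched->pending_list));

	/* After: bounded wait plus a loud warning on timeout. */
	if (!wait_event_timeout(sched->teardown_wq,
				list_empty(&sched->pending_list),
				10 * HZ /* example budget, not from the series */))
		WARN(1, "drm_sched: teardown timed out; unsignaled fences leaked?\n");

On the naming point raised above: fence context numbers are handed out by
dma_fence_context_alloc(), and nothing in the dma_fence API forces all fences
produced for one ring to share a single context number, which is what the
"not the best name" remark is getting at.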