thr3ads.net - Nouveau - [PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long [May 2025]

Tvrtko Ursulin

2025-May-16 10:19 UTC

[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long

On 16/05/2025 10:53, Danilo Krummrich wrote:
> On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
>> On 24/04/2025 10:55, Philipp Stanner wrote:
>>> +	 * @kill_fence_context: kill the fence context belonging to this scheduler
>>
>> Which fence context would that be? ;)
> 
> There's one per ring and a scheduler instance represents a single ring. So,
> what should be specified here?

I was pointing out the fact not all drivers are 1:1 sched:entity. So
plural at least. Thought it would be obvious from the ";)".

>> Also, "fence context" would be a new terminology in gpu_scheduler.h API
>> level. You could call it ->sched_fini() or similar to signify at which point
>> in the API it gets called and then the fact it takes sched as parameter
>> would be natural.
> 
> The driver should tear down the fence context in this callback, not the while
> scheduler. ->sched_fini() would hence be misleading.

Not the while what? Not while drm_sched_fini()? Could call it
sched_kill() or anything. My point is that we don't have "fence context"
in the API but entities, so adding a new term sounds sub-optimal.

>> We also probably want some commentary on the topic of indefinite (or very
>> long at least) blocking a thread exit / SIGINT/TERM/KILL time.
> 
> You mean in case the driver does implement the callback, but does *not* properly
> tear down the fence context? So, you ask for describing potential consequences
> of drivers having bugs in the implementation of the callback? Or something else?

I was proposing the kerneldoc for the vfunc should document that the
callback must not block or, if blocking is unavoidable, document a
guideline on how long is acceptable. Maybe even enforce a limit in the
scheduler core itself.

Regards,

Tvrtko
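
As a rough illustration of the "enforce a limit in the scheduler core" idea above, the following userspace sketch times a callback and reports whether it stayed within a budget. Everything except the callback name `kill_fence_context` is hypothetical: `struct toy_sched`, `toy_sched_kill_checked()` and the budget parameter are stand-ins for illustration and are not part of the DRM scheduler API. A check like this can only detect an overrun after the fact; it cannot interrupt a callback that blocks.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a scheduler; NOT the real struct drm_gpu_scheduler. */
struct toy_sched {
	/* Callback name taken from the patch under discussion. */
	void (*kill_fence_context)(struct toy_sched *sched);
	int pending_fences;
};

/* Monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Invoke the callback and report whether it returned within budget_ns.
 * The elapsed time is stored in *elapsed_ns so a caller could log it.
 */
static bool toy_sched_kill_checked(struct toy_sched *sched, int64_t budget_ns,
				   int64_t *elapsed_ns)
{
	int64_t start = now_ns();

	sched->kill_fence_context(sched);
	*elapsed_ns = now_ns() - start;
	return *elapsed_ns <= budget_ns;
}

/* Example driver callback: drops all pending fences without sleeping. */
static void toy_kill_nonblocking(struct toy_sched *sched)
{
	sched->pending_fences = 0;
}
```

A caller would set `kill_fence_context` to `toy_kill_nonblocking`, call `toy_sched_kill_checked()` with its chosen budget, and warn when the function returns false.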

Danilo Krummrich

2025-May-16 10:54 UTC

[PATCH v2 2/6] drm/sched: Prevent teardown waitque from blocking too long

On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:
> 
> On 16/05/2025 10:53, Danilo Krummrich wrote:
> > On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
> > > On 24/04/2025 10:55, Philipp Stanner wrote:
> > > > +	 * @kill_fence_context: kill the fence context belonging to this scheduler
> > > 
> > > Which fence context would that be? ;)
> > 
> > There's one per ring and a scheduler instance represents a single ring. So,
> > what should be specified here?
> 
> I was pointing out the fact not all drivers are 1:1 sched:entity.

I'm well aware, but how is that relevant? Entities don't have an associated
fence context, but a GPU Ring (either hardware or software) has, which a
scheduler instance represents.

> Thought it would be obvious from the ";)".

I should read from ";)" that you refer to a 1:N-sched:entity relationship (which
doesn't seem to be related)?

> > > Also, "fence context" would be a new terminology in gpu_scheduler.h API
> > > level. You could call it ->sched_fini() or similar to signify at which point
> > > in the API it gets called and then the fact it takes sched as parameter
> > > would be natural.
> > 
> > The driver should tear down the fence context in this callback, not the while
> > scheduler. ->sched_fini() would hence be misleading.
> 
> Not the while what? Not while drm_sched_fini()?

*whole

> Could call it sched_kill()
> or anything. My point is that we don't have "fence context" in the API but
> entities, so adding a new term sounds sub-optimal.

In the callback the driver should neither tear down an entity, nor the whole
scheduler, hence we shouldn't call it like that. sched_kill() is therefore
misleading as well.

It should be named after what it actually does (or should do). Feel free to
propose a different name that conforms with that.

> > > We also probably want some commentary on the topic of indefinite (or very
> > > long at least) blocking a thread exit / SIGINT/TERM/KILL time.
> > 
> > You mean in case the driver does implement the callback, but does *not* properly
> > tear down the fence context? So, you ask for describing potential consequences
> > of drivers having bugs in the implementation of the callback? Or something else?
> 
> I was proposing the kerneldoc for the vfunc should document that the callback
> must not block or, if blocking is unavoidable, document a guideline on how
> long is acceptable. Maybe even enforce a limit in the scheduler core itself.

Killing the fence context shouldn't block.
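
To make the "shouldn't block" point concrete, here is a minimal userspace sketch of what killing a fence context could look like: every fence that has not yet signaled is marked with an error and signaled, in a loop bounded by the number of fences, with no waiting. The types `struct toy_fence` and `struct toy_fence_context` and the function `toy_fence_context_kill()` are hypothetical illustrations, not the kernel's `struct dma_fence` or any driver's actual implementation, and the choice of `-ECANCELED` as the error is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy fence; NOT the kernel's struct dma_fence. */
struct toy_fence {
	bool signaled;
	int error;
};

/* Hypothetical per-ring fence context: a fixed array of fences. */
struct toy_fence_context {
	struct toy_fence *fences;
	size_t count;
};

/*
 * Mark every unsignaled fence as failed and signal it.
 * The loop is bounded by ctx->count and never sleeps or waits.
 * Returns the number of fences that were killed.
 */
static size_t toy_fence_context_kill(struct toy_fence_context *ctx, int error)
{
	size_t killed = 0;

	for (size_t i = 0; i < ctx->count; i++) {
		struct toy_fence *f = &ctx->fences[i];

		if (f->signaled)
			continue; /* already completed, leave untouched */
		f->error = error;
		f->signaled = true;
		killed++;
	}
	return killed;
}
```

Calling `toy_fence_context_kill(&ctx, -ECANCELED)` a second time finds nothing left to signal and returns 0, so in this model the operation is idempotent.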
