thr3ads.net - Nouveau - [PATCH] nouveau: offload fence uevents work to workqueue [Feb 2024]

If this information is useful, please help other people find it:
Share via:

Danilo Krummrich

2024-Feb-05 22:00 UTC

[PATCH] nouveau: offload fence uevents work to workqueue

On 2/5/24 22:08, Dave Airlie wrote:> On Tue, 6 Feb 2024 at 02:22, Danilo Krummrich <dakr at redhat.com>
wrote:
>>
>> On 1/29/24 02:50, Dave Airlie wrote:
>>> From: Dave Airlie <airlied at redhat.com>
>>>
>>> This should break the deadlock between the fctx lock and the irq
lock.
>>>
>>> This offloads the processing off the work from the irq into a
workqueue.
>>>
>>> Signed-off-by: Dave Airlie <airlied at redhat.com>
>>
>> Nouveau's scheduler uses a dedicated wq, hence from this
perspective it's
>> safe deferring fence signalling to the kernel global wq. However, I
wonder
>> if we could create deadlocks by building dependency chains into other
>> drivers / kernel code that, by chance, makes use of the kernel global
wq as
>> well.
>>
>> Admittedly, even if, it's gonna be extremely unlikely given that
>> WQ_MAX_ACTIVE == 512. But maybe it'd be safer to use a dedicated
wq.
>>
>> Also, do we need to CC stable?
> 
> I pushed this to Linus at the end of last week, since the hangs in 6.7
> take out the complete system and I wanted it in stable.
> 
> It might be safer to use a dedicated wq, is the concern someone is
> waiting on a fence in a workqueue somewhere else so we will never
> signal it?
Yes, if some other work is waiting for this fence (or something else in the same
dependency chain) to signal it can prevent executing the work signaling this
fence,
in case both are scheduled on the same wq. As mentioned, with the kernel global
wq
this would be rather unlikely to lead to an actual stall with WQ_MAX_ACTIVE ==
512,
but formally the race condition exists. I guess a malicious attacker could try
to
intentionally push jobs directly or indirectly depending on this fence to a
driver
which queues them up on a scheduler using the kernel global wq.
> 
> Dave.
>

Daniel Vetter

2024-Feb-06 14:03 UTC

head link

[PATCH] nouveau: offload fence uevents work to workqueue

On Mon, Feb 05, 2024 at 11:00:04PM +0100, Danilo Krummrich
wrote:> On 2/5/24 22:08, Dave Airlie wrote:
> > On Tue, 6 Feb 2024 at 02:22, Danilo Krummrich <dakr at
redhat.com> wrote:
> > > 
> > > On 1/29/24 02:50, Dave Airlie wrote:
> > > > From: Dave Airlie <airlied at redhat.com>
> > > > 
> > > > This should break the deadlock between the fctx lock and the
irq lock.
> > > > 
> > > > This offloads the processing off the work from the irq into
a workqueue.
> > > > 
> > > > Signed-off-by: Dave Airlie <airlied at redhat.com>
> > > 
> > > Nouveau's scheduler uses a dedicated wq, hence from this
perspective it's
> > > safe deferring fence signalling to the kernel global wq. However,
I wonder
> > > if we could create deadlocks by building dependency chains into
other
> > > drivers / kernel code that, by chance, makes use of the kernel
global wq as
> > > well.
> > > 
> > > Admittedly, even if, it's gonna be extremely unlikely given
that
> > > WQ_MAX_ACTIVE == 512. But maybe it'd be safer to use a
dedicated wq.
> > > 
> > > Also, do we need to CC stable?
> > 
> > I pushed this to Linus at the end of last week, since the hangs in 6.7
> > take out the complete system and I wanted it in stable.
> > 
> > It might be safer to use a dedicated wq, is the concern someone is
> > waiting on a fence in a workqueue somewhere else so we will never
> > signal it?
> 
> Yes, if some other work is waiting for this fence (or something else in the
same
> dependency chain) to signal it can prevent executing the work signaling
this fence,
> in case both are scheduled on the same wq. As mentioned, with the kernel
global wq
> this would be rather unlikely to lead to an actual stall with WQ_MAX_ACTIVE
== 512,
> but formally the race condition exists. I guess a malicious attacker could
try to
> intentionally push jobs directly or indirectly depending on this fence to a
driver
> which queues them up on a scheduler using the kernel global wq.
I think if you add dma_fence_signalling annotations (aside, there's some
patch from iirc Thomas Hellstrom to improve them and cut down on some
false positives, but I lost track) then I think you won't get any splats
because the wq subsystem assumes that WC_MAX_ACTIVE is close enough to
infinity to not matter.

I'm not sure we should care differently, but I guess it'd be good to
annotate it all in case the wq subsystem's idea of how much such deadlocks
are real changes.

Also Teo is on a mission to get rid of all the global wq flushes, so there
should also be no source of deadlocks from that kind of cross-driver
dependency. Or at least shouldn't be in the future, I'm not sure it all
landed.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

Nouveau - Feb 2024 - [PATCH] nouveau: offload fence uevents work to workqueue

[PATCH] nouveau: offload fence uevents work to workqueue

[PATCH] nouveau: offload fence uevents work to workqueue