Danilo Krummrich
2025-Jan-07 16:02 UTC
[PATCH] nouveau/fence: handle cross device fences properly.
On Tue, Jan 07, 2025 at 03:58:46PM +1000, Dave Airlie wrote:> From: Dave Airlie <airlied at redhat.com> > > If we have two nouveau controlled devices and one passes a dma-fence > to the other, when we hit the sync path it can cause the second device > to try and put a sync wait in it's pushbuf for the seqno of the context > on the first device. > > Since fence contexts are vmm bound, check the if vmm's match between > both users, this should ensure that fence seqnos don't get used wrongly > on incorrect channels.The fence sequence number is global, i.e. per device, hence checking the vmm context seems too restrictive. Wouldn't it be better to ensure that `prev->cli->drm == chan->cli->drm`? This way we can still optimize where dependencies are between different applications, but on the same device.> > This seems to happen fairly spuriously and I found it tracking down > a multi-card regression report, that seems to work by luck before this. > > Signed-off-by: Dave Airlie <airlied at redhat.com> > Cc: stable at vger.kernel.org > --- > drivers/gpu/drm/nouveau/nouveau_fence.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c > index ee5e9d40c166f..5743c82f4094b 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c > @@ -370,7 +370,8 @@ nouveau_fence_sync(struct nouveau_bo *nvbo, struct nouveau_channel *chan, > > rcu_read_lock(); > prev = rcu_dereference(f->channel); > - if (prev && (prev == chan || > + if (prev && (prev->vmm == chan->vmm) && > + (prev == chan ||Maybe better break it down a bit, e.g. bool local = prev && (prev->... == chan->...); if (local && ...) { ... }> fctx->sync(f, prev, chan) == 0)) > must_wait = false; > rcu_read_unlock(); > -- > 2.43.0 >
Dave Airlie
2025-Jan-08 01:04 UTC
[PATCH] nouveau/fence: handle cross device fences properly.
On Wed, 8 Jan 2025 at 02:02, Danilo Krummrich <dakr at kernel.org> wrote:> > On Tue, Jan 07, 2025 at 03:58:46PM +1000, Dave Airlie wrote: > > From: Dave Airlie <airlied at redhat.com> > > > > If we have two nouveau controlled devices and one passes a dma-fence > > to the other, when we hit the sync path it can cause the second device > > to try and put a sync wait in it's pushbuf for the seqno of the context > > on the first device. > > > > Since fence contexts are vmm bound, check the if vmm's match between > > both users, this should ensure that fence seqnos don't get used wrongly > > on incorrect channels. > > The fence sequence number is global, i.e. per device, hence checking the vmm > context seems too restrictive. > > Wouldn't it be better to ensure that `prev->cli->drm == chan->cli->drm`?Can you prove that? I thought the same and I've gone around a few times yesterday/today and convinced myself what I wrote is right. dma_fence_init gets passed the seqno which comes from fctx->sequence, which is nouveau_fence_chan, which gets allocated for each channel. So we should hit this path if we have 2 userspace submits, one with say graphics, the one with copy engine contexts, otherwise we should wait on the CPU.> > drivers/gpu/drm/nouveau/nouveau_fence.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c > > index ee5e9d40c166f..5743c82f4094b 100644 > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c > > @@ -370,7 +370,8 @@ nouveau_fence_sync(struct nouveau_bo *nvbo, struct nouveau_channel *chan, > > > > rcu_read_lock(); > > prev = rcu_dereference(f->channel); > > - if (prev && (prev == chan || > > + if (prev && (prev->vmm == chan->vmm) && > > + (prev == chan || > > Maybe better break it down a bit, e.g. > > bool local = prev && (prev->... == chan->...); > > if (local && ...) { > ... > }I'll update that once we resolve the above. Dave.