Dave Airlie
2025-Jul-02 23:27 UTC
[PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
From: Dave Airlie <airlied at redhat.com>

This fixes a bunch of command hangs after runtime suspend/resume.

This fixes a regression caused by code movement in the commit below,
the commit seems to just change timings enough to cause this to happen
now, and adding the sleep seems to avoid it.

I've spent some time trying to root cause it to no great avail,
it seems like a bug on the firmware side, but it could be a bug
in our rpc handling that I can't find.

Either way, we should land the workaround to fix the problem,
while we continue to work out the root cause.

Signed-off-by: Dave Airlie <airlied at redhat.com>
Cc: Ben Skeggs <bskeggs at nvidia.com>
Cc: Danilo Krummrich <dakr at kernel.org>
Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
---
 drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
index baf42339f93e..ff362a6d9f5c 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
@@ -1744,6 +1744,9 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
 			nvkm_gsp_sg_free(gsp->subdev.device, &gsp->sr.sgt);
 			return ret;
 		}
+
+		/* without this Turing ends up resetting all channels after resume. */
+		msleep(50);
 	}
 
 	ret = r535_gsp_rpc_unloading_guest_driver(gsp, suspend);
-- 
2.49.0

Danilo Krummrich
2025-Jul-03 21:46 UTC
[PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
On 7/3/25 1:27 AM, Dave Airlie wrote:
> From: Dave Airlie <airlied at redhat.com>
>
> This fixes a bunch of command hangs after runtime suspend/resume.
>
> This fixes a regression caused by code movement in the commit below,
> the commit seems to just change timings enough to cause this to happen
> now, and adding the sleep seems to avoid it.
>
> I've spent some time trying to root cause it to no great avail,
> it seems like a bug on the firmware side, but it could be a bug
> in our rpc handling that I can't find.
>
> Either way, we should land the workaround to fix the problem,
> while we continue to work out the root cause.

I think we should add a TODO above the msleep(); what do you think would be a
good comment here? I can add it when applying the patch if you want.

> Signed-off-by: Dave Airlie <airlied at redhat.com>
> Cc: Ben Skeggs <bskeggs at nvidia.com>
> Cc: Danilo Krummrich <dakr at kernel.org>
> Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
> ---
>  drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> index baf42339f93e..ff362a6d9f5c 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
> @@ -1744,6 +1744,9 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
>  			nvkm_gsp_sg_free(gsp->subdev.device, &gsp->sr.sgt);
>  			return ret;
>  		}
> +
> +		/* without this Turing ends up resetting all channels after resume. */
> +		msleep(50);
>  	}
>
>  	ret = r535_gsp_rpc_unloading_guest_driver(gsp, suspend);

Danilo Krummrich
2025-Jul-03 22:22 UTC
[PATCH] nouveau/gsp: add a 50ms delay between fbsr and driver unload rpcs
On Thu, Jul 03, 2025 at 09:27:07AM +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied at redhat.com>
>
> This fixes a bunch of command hangs after runtime suspend/resume.
>
> This fixes a regression caused by code movement in the commit below,
> the commit seems to just change timings enough to cause this to happen
> now, and adding the sleep seems to avoid it.
>
> I've spent some time trying to root cause it to no great avail,
> it seems like a bug on the firmware side, but it could be a bug
> in our rpc handling that I can't find.
>
> Either way, we should land the workaround to fix the problem,
> while we continue to work out the root cause.
>
> Signed-off-by: Dave Airlie <airlied at redhat.com>
> Cc: Ben Skeggs <bskeggs at nvidia.com>
> Cc: Danilo Krummrich <dakr at kernel.org>
> Fixes: 21b039715ce9 ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")

Applied to drm-misc-fixes with the following diff.

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
index ff362a6d9f5c..23f80e167705 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/rm/r535/gsp.c
@@ -1745,7 +1745,11 @@ r535_gsp_fini(struct nvkm_gsp *gsp, bool suspend)
 			return ret;
 		}
 
-		/* without this Turing ends up resetting all channels after resume. */
+		/*
+		 * TODO: Debug the GSP firmware / RPC handling to find out why
+		 * without this Turing (but none of the other architectures)
+		 * ends up resetting all channels after resume.
+		 */
 		msleep(50);
 	}
 

I also changed the 'Fixes' tag to:

  Fixes: c21b039715ce ("drm/nouveau/gsp: add hals for fbsr.suspend/resume()")
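
For readers following the thread, the ordering that the applied patch establishes can be sketched in userspace C. This is only an illustration under stated assumptions: fake_fbsr_suspend(), fake_unloading_guest_driver(), sleep_ms(), diff_ms() and sketch_gsp_fini() are hypothetical names, both RPCs are stubbed out to always succeed, and nanosleep() stands in for the kernel's msleep(). It does not reproduce the hang or the scatter-gather cleanup; it only shows the sequence on the suspend path: FBSR suspend, a 50 ms pause, then the unload RPC.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for the two GSP RPCs; both always succeed here. */
static int fake_fbsr_suspend(void)
{
	return 0;
}

static int fake_unloading_guest_driver(bool suspend)
{
	(void)suspend;
	return 0;
}

/* Userspace analogue of msleep(): sleep for at least ms milliseconds. */
static void sleep_ms(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	assert(ms >= 0);

	/* Resume with the remaining time if a signal interrupts the sleep. */
	while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
		;
}

static double diff_ms(const struct timespec *start, const struct timespec *end)
{
	return (end->tv_sec - start->tv_sec) * 1000.0 +
	       (end->tv_nsec - start->tv_nsec) / 1e6;
}

/*
 * Same order of operations as the patched r535_gsp_fini(): the delay is
 * applied only on the suspend path, after FBSR suspend has succeeded and
 * before the unload RPC. *gap_ms reports the time that elapsed before
 * the unload RPC was issued.
 */
int sketch_gsp_fini(bool suspend, double *gap_ms)
{
	struct timespec start, before_unload;
	int ret;

	clock_gettime(CLOCK_MONOTONIC, &start);

	if (suspend) {
		ret = fake_fbsr_suspend();
		if (ret)
			return ret;

		sleep_ms(50);
	}

	clock_gettime(CLOCK_MONOTONIC, &before_unload);
	*gap_ms = diff_ms(&start, &before_unload);

	return fake_unloading_guest_driver(suspend);
}
```

Calling sketch_gsp_fini(true, &gap) should report a gap of roughly 50 ms or more, while sketch_gsp_fini(false, &gap) skips the delay, matching the placement of msleep(50) inside the suspend branch.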