Danilo Krummrich
2025-Mar-19 14:18 UTC
[PATCH v2] drm/nouveau: prime: fix ttm_bo_delayed_delete oops
On Wed, Mar 19, 2025 at 03:06:52PM +0100, Christian König wrote:
> On 19.03.25 at 14:04, Danilo Krummrich wrote:
> >
> >> Signed-off-by: Chris Bainbridge <chris.bainbridge at gmail.com>
> >> Co-Developed-by: Christian König <christian.koenig at amd.com>
> > Then also Christian's SoB is required.
>
> I only pointed out which two lines in nouveau need to move to fix this.
>
> All the credit for figuring out what's wrong goes to Chris, but feel free to add my SoB if required.

Then maybe Suggested-by: is the tag that fits best. :)

>
> >
> >> Fixes: https://gitlab.freedesktop.org/drm/amd/-/issues/3937
> > This is a bug report from amdgpu, but I understand that the same issue applies
> > for nouveau.
>
> The crash in amdgpu was caused by nouveau incorrectly dropping a DMA-buf reference while it was still needed.

Oh, I see.

>
> Took us a while to figure that out, we could update the tags in the bug report but I think at this point it's unnecessary.

Agreed.

>
> >
> > If at all, this needs to be
> >
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3937
> >
> > Maybe you can add a brief comment that this report applies for nouveau as well.
> >
> > Please also add a Fixes: tag that indicates the commit in nouveau that
> > introduced the problem and Cc stable.
>
> As far as I can see it was always there and was added >10 years ago with the very first DMA-buf support.
>
> But adding CC stable is a really good idea.

Sounds good.
Chris Bainbridge
2025-Mar-26 12:52 UTC
[PATCH v3] drm/nouveau: prime: fix ttm_bo_delayed_delete oops
Fix an oops in ttm_bo_delayed_delete which results from dereferencing a
dangling pointer:
Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b7b: 0000 [#1] PREEMPT SMP
CPU: 4 UID: 0 PID: 1082 Comm: kworker/u65:2 Not tainted 6.14.0-rc4-00267-g505460b44513-dirty #216
Hardware name: LENOVO 82N6/LNVNB161216, BIOS GKCN65WW 01/16/2024
Workqueue: ttm ttm_bo_delayed_delete [ttm]
RIP: 0010:dma_resv_iter_first_unlocked+0x55/0x290
Code: 31 f6 48 c7 c7 00 2b fa aa e8 97 bd 52 ff e8 a2 c1 53 00 5a 85 c0 74 48 e9 88 01 00 00 4c 89 63 20 4d 85 e4 0f 84 30 01 00 00 <41> 8b 44 24 10 c6 43 2c 01 48 89 df 89 43 28 e8 97 fd ff ff 4c 8b
RSP: 0018:ffffbf9383473d60 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffbf9383473d88 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffbf9383473d78 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
R13: ffffa003bbf78580 R14: ffffa003a6728040 R15: 00000000000383cc
FS: 0000000000000000(0000) GS:ffffa00991c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000758348024dd0 CR3: 000000012c259000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
<TASK>
? __die_body.cold+0x19/0x26
? die_addr+0x3d/0x70
? exc_general_protection+0x159/0x460
? asm_exc_general_protection+0x27/0x30
? dma_resv_iter_first_unlocked+0x55/0x290
dma_resv_wait_timeout+0x56/0x100
ttm_bo_delayed_delete+0x69/0xb0 [ttm]
process_one_work+0x217/0x5c0
worker_thread+0x1c8/0x3d0
? apply_wqattrs_cleanup.part.0+0xc0/0xc0
kthread+0x10b/0x240
? kthreads_online_cpu+0x140/0x140
ret_from_fork+0x40/0x70
? kthreads_online_cpu+0x140/0x140
ret_from_fork_asm+0x11/0x20
</TASK>
The cause of this is:
- drm_prime_gem_destroy calls dma_buf_put(dma_buf) which releases the
reference to the shared dma_buf. The reference count is 0, so the
dma_buf is destroyed, which in turn decrements the corresponding
amdgpu_bo reference count to 0, and the amdgpu_bo is destroyed -
calling drm_gem_object_release then dma_resv_fini (which destroys the
reservation object), then finally freeing the amdgpu_bo.
- nouveau_bo obj->bo.base.resv is now a dangling pointer to the memory
formerly allocated to the amdgpu_bo.
- nouveau_gem_object_del calls ttm_bo_put(&nvbo->bo) which calls
ttm_bo_release, which schedules ttm_bo_delayed_delete.
- ttm_bo_delayed_delete runs and dereferences the dangling resv pointer,
resulting in a general protection fault.
Fix this by moving the drm_prime_gem_destroy call from
nouveau_gem_object_del to nouveau_bo_del_ttm. This ensures that it will
be run after ttm_bo_delayed_delete.
Signed-off-by: Chris Bainbridge <chris.bainbridge at gmail.com>
Suggested-by: Christian König <christian.koenig at amd.com>
Fixes: 22b33e8ed0e3 ("22b33e8ed0e3nouveau: add PRIME support")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3937
Cc: <Stable at vger.kernel.org>
---
drivers/gpu/drm/nouveau/nouveau_bo.c | 3 +++
drivers/gpu/drm/nouveau/nouveau_gem.c | 3 ---
2 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index db961eade225..2016c1e7242f 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -144,6 +144,9 @@ nouveau_bo_del_ttm(struct ttm_buffer_object *bo)
 	nouveau_bo_del_io_reserve_lru(bo);
 	nv10_bo_put_tile_region(dev, nvbo->tile, NULL);
 
+	if (bo->base.import_attach)
+		drm_prime_gem_destroy(&bo->base, bo->sg);
+
 	/*
 	 * If nouveau_bo_new() allocated this buffer, the GEM object was never
 	 * initialized, so don't attempt to release it.
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index 9ae2cee1c7c5..67e3c99de73a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -87,9 +87,6 @@ nouveau_gem_object_del(struct drm_gem_object *gem)
 		return;
 	}
 
-	if (gem->import_attach)
-		drm_prime_gem_destroy(gem, nvbo->bo.sg);
-
 	ttm_bo_put(&nvbo->bo);
 
 	pm_runtime_mark_last_busy(dev);
--
2.47.2
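The ordering problem described in the patch above can be modelled in user space. The following is only a rough sketch under stated assumptions, not kernel code: every model_* name is hypothetical, the exporter plays the amdgpu_bo role, the importer plays the nouveau_bo role, and "destroying" the reservation object just clears a flag instead of freeing memory, so the model never performs a real use-after-free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reservation object (the dma_resv role). */
struct model_resv {
	bool alive;	/* false once the owner has destroyed it */
};

/* Hypothetical stand-in for the exporter's buffer (the amdgpu_bo role). */
struct model_exporter {
	int refcount;
	struct model_resv resv;
};

/* Hypothetical stand-in for the importer's buffer (the nouveau_bo role). */
struct model_importer {
	struct model_exporter *attach;	/* holds one reference on the exporter */
	struct model_resv *resv;	/* borrowed pointer into the exporter */
};

/* Drop one reference; the last put destroys the reservation object. */
static void model_exporter_put(struct model_exporter *exp)
{
	assert(exp->refcount > 0);
	if (--exp->refcount == 0)
		exp->resv.alive = false;	/* the real code frees the memory here */
}

/* Plays the role of drm_prime_gem_destroy(): release the import reference. */
static void model_prime_destroy(struct model_importer *imp)
{
	model_exporter_put(imp->attach);
	imp->attach = NULL;
}

/* Plays the role of ttm_bo_delayed_delete(): it has to read the resv. */
static bool model_delayed_delete(const struct model_importer *imp)
{
	return imp->resv->alive;	/* false models the dangling pointer */
}

enum model_order {
	MODEL_DROP_BEFORE_DELETE,	/* old placement: GEM object free path */
	MODEL_DROP_IN_DESTROY_HOOK,	/* new placement: TTM destroy hook */
};

/* Returns true if the delayed delete step saw a live reservation object. */
static bool model_teardown(enum model_order order)
{
	struct model_exporter exp = { .refcount = 1, .resv = { .alive = true } };
	struct model_importer imp = { .attach = &exp, .resv = &exp.resv };
	bool ok;

	if (order == MODEL_DROP_BEFORE_DELETE)
		model_prime_destroy(&imp);

	ok = model_delayed_delete(&imp);

	if (order == MODEL_DROP_IN_DESTROY_HOOK)
		model_prime_destroy(&imp);

	return ok;
}
```

Calling model_teardown(MODEL_DROP_BEFORE_DELETE) returns false, which corresponds to the oops, while model_teardown(MODEL_DROP_IN_DESTROY_HOOK) returns true because the import reference is still held when the delayed delete step reads the reservation object.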
Chris Bainbridge
2025-Mar-26 12:53 UTC
[PATCH] drm/nouveau: prime: drm_prime_gem_destroy comment
Edit the comments on correct usage of drm_prime_gem_destroy to note
that, if using TTM, drm_prime_gem_destroy must be called in the
ttm_buffer_object.destroy hook, to avoid the dma_buf being freed leaving
a dangling pointer which will be later dereferenced by
ttm_bo_delayed_delete.

Signed-off-by: Chris Bainbridge <chris.bainbridge at gmail.com>
Suggested-by: Christian König <christian.koenig at amd.com>
---
drivers/gpu/drm/drm_prime.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 32a8781cfd67..452d5c7cd292 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -929,7 +929,9 @@ EXPORT_SYMBOL(drm_gem_prime_export);
  * &drm_driver.gem_prime_import_sg_table internally.
  *
  * Drivers must arrange to call drm_prime_gem_destroy() from their
- * &drm_gem_object_funcs.free hook when using this function.
+ * &drm_gem_object_funcs.free hook or &ttm_buffer_object.destroy
+ * hook when using this function, to avoid the dma_buf being freed while the
+ * ttm_buffer_object can still dereference it.
  */
 struct drm_gem_object *drm_gem_prime_import_dev(struct drm_device *dev,
 					    struct dma_buf *dma_buf,
@@ -999,7 +1001,9 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
  * implementation in drm_gem_prime_fd_to_handle().
  *
  * Drivers must arrange to call drm_prime_gem_destroy() from their
- * &drm_gem_object_funcs.free hook when using this function.
+ * &drm_gem_object_funcs.free hook or &ttm_buffer_object.destroy
+ * hook when using this function, to avoid the dma_buf being freed while the
+ * ttm_buffer_object can still dereference it.
  */
 struct drm_gem_object *drm_gem_prime_import(struct drm_device *dev,
 					    struct dma_buf *dma_buf)
--
2.47.2
Christian König
2025-Mar-26 13:05 UTC
[PATCH] drm/nouveau: prime: drm_prime_gem_destroy comment
On 26.03.25 at 13:53, Chris Bainbridge wrote:
> Edit the comments on correct usage of drm_prime_gem_destroy to note
> that, if using TTM, drm_prime_gem_destroy must be called in the
> ttm_buffer_object.destroy hook, to avoid the dma_buf being freed leaving
> a dangling pointer which will be later dereferenced by
> ttm_bo_delayed_delete.
>
> Signed-off-by: Chris Bainbridge <chris.bainbridge at gmail.com>
> Suggested-by: Christian König <christian.koenig at amd.com>

The subject line of the patch should probably read "drm/prime: fix drm_prime_gem_destroy comment" since this isn't nouveau specific at all.

It's just that all other TTM drivers except for nouveau got that right.

Regards,
Christian.

> ---
> drivers/gpu/drm/drm_prime.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> index 32a8781cfd67..452d5c7cd292 100644
> --- a/drivers/gpu/drm/drm_prime.c
> +++ b/drivers/gpu/drm/drm_prime.c
> @@ -929,7 +929,9 @@ EXPORT_SYMBOL(drm_gem_prime_export);
>   * &drm_driver.gem_prime_import_sg_table internally.
>   *
>   * Drivers must arrange to call drm_prime_gem_destroy() from their
> - * &drm_gem_object_funcs.free hook when using this function.
> + * &drm_gem_object_funcs.free hook or &ttm_buffer_object.destroy
> + * hook when using this function, to avoid the dma_buf being freed while the
> + * ttm_buffer_object can still dereference it.
>   */
>  struct drm_gem_object *drm_gem_prime_import_dev(struct drm_device *dev,
>  					    struct dma_buf *dma_buf,
> @@ -999,7 +1001,9 @@ EXPORT_SYMBOL(drm_gem_prime_import_dev);
>   * implementation in drm_gem_prime_fd_to_handle().
>   *
>   * Drivers must arrange to call drm_prime_gem_destroy() from their
> - * &drm_gem_object_funcs.free hook when using this function.
> + * &drm_gem_object_funcs.free hook or &ttm_buffer_object.destroy
> + * hook when using this function, to avoid the dma_buf being freed while the
> + * ttm_buffer_object can still dereference it.
>   */
>  struct drm_gem_object *drm_gem_prime_import(struct drm_device *dev,
>  					    struct dma_buf *dma_buf)
Danilo Krummrich
2025-Mar-28 10:59 UTC
[PATCH v3] drm/nouveau: prime: fix ttm_bo_delayed_delete oops
On Wed, Mar 26, 2025 at 12:52:10PM +0000, Chris Bainbridge wrote:
> Fix an oops in ttm_bo_delayed_delete which results from dereferencing a
> dangling pointer:
>
> Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b7b: 0000 [#1] PREEMPT SMP
> CPU: 4 UID: 0 PID: 1082 Comm: kworker/u65:2 Not tainted 6.14.0-rc4-00267-g505460b44513-dirty #216
> Hardware name: LENOVO 82N6/LNVNB161216, BIOS GKCN65WW 01/16/2024
> Workqueue: ttm ttm_bo_delayed_delete [ttm]
> RIP: 0010:dma_resv_iter_first_unlocked+0x55/0x290
> Code: 31 f6 48 c7 c7 00 2b fa aa e8 97 bd 52 ff e8 a2 c1 53 00 5a 85 c0 74 48 e9 88 01 00 00 4c 89 63 20 4d 85 e4 0f 84 30 01 00 00 <41> 8b 44 24 10 c6 43 2c 01 48 89 df 89 43 28 e8 97 fd ff ff 4c 8b
> RSP: 0018:ffffbf9383473d60 EFLAGS: 00010202
> RAX: 0000000000000001 RBX: ffffbf9383473d88 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffffbf9383473d78 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
> R13: ffffa003bbf78580 R14: ffffa003a6728040 R15: 00000000000383cc
> FS: 0000000000000000(0000) GS:ffffa00991c00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000758348024dd0 CR3: 000000012c259000 CR4: 0000000000f50ef0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ? __die_body.cold+0x19/0x26
> ? die_addr+0x3d/0x70
> ? exc_general_protection+0x159/0x460
> ? asm_exc_general_protection+0x27/0x30
> ? dma_resv_iter_first_unlocked+0x55/0x290
> dma_resv_wait_timeout+0x56/0x100
> ttm_bo_delayed_delete+0x69/0xb0 [ttm]
> process_one_work+0x217/0x5c0
> worker_thread+0x1c8/0x3d0
> ? apply_wqattrs_cleanup.part.0+0xc0/0xc0
> kthread+0x10b/0x240
> ? kthreads_online_cpu+0x140/0x140
> ret_from_fork+0x40/0x70
> ? kthreads_online_cpu+0x140/0x140
> ret_from_fork_asm+0x11/0x20
> </TASK>
>
> The cause of this is:
>
> - drm_prime_gem_destroy calls dma_buf_put(dma_buf) which releases the
> reference to the shared dma_buf. The reference count is 0, so the
> dma_buf is destroyed, which in turn decrements the corresponding
> amdgpu_bo reference count to 0, and the amdgpu_bo is destroyed -
> calling drm_gem_object_release then dma_resv_fini (which destroys the
> reservation object), then finally freeing the amdgpu_bo.
>
> - nouveau_bo obj->bo.base.resv is now a dangling pointer to the memory
> formerly allocated to the amdgpu_bo.
>
> - nouveau_gem_object_del calls ttm_bo_put(&nvbo->bo) which calls
> ttm_bo_release, which schedules ttm_bo_delayed_delete.
>
> - ttm_bo_delayed_delete runs and dereferences the dangling resv pointer,
> resulting in a general protection fault.
>
> Fix this by moving the drm_prime_gem_destroy call from
> nouveau_gem_object_del to nouveau_bo_del_ttm. This ensures that it will
> be run after ttm_bo_delayed_delete.
>
> Signed-off-by: Chris Bainbridge <chris.bainbridge at gmail.com>
> Suggested-by: Christian König <christian.koenig at amd.com>
> Fixes: 22b33e8ed0e3 ("22b33e8ed0e3nouveau: add PRIME support")
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3937
> Cc: <Stable at vger.kernel.org>

Applied to drm-misc-fixes, thanks!

[ Fixed up the Fixes: tag, where the commit hash is repeated in the commit subject. ]