Christian König
2021-Jun-10 17:59 UTC
[Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
Am 10.06.21 um 19:50 schrieb Ondrej Zary:> On Thursday 10 June 2021 08:43:06 Christian K?nig wrote: >> Am 09.06.21 um 22:00 schrieb Ondrej Zary: >>> On Wednesday 09 June 2021 11:21:05 Christian K?nig wrote: >>>> Am 09.06.21 um 09:10 schrieb Ondrej Zary: >>>>> On Wednesday 09 June 2021, Christian K?nig wrote: >>>>>> Am 09.06.21 um 08:57 schrieb Ondrej Zary: >>>>>>> [SNIP] >>>>>>>> Thanks for the heads up. So the problem with my patch is already fixed, >>>>>>>> isn't it? >>>>>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in >>>>>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by >>>>>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d. >>>>>>> >>>>>>> That's the bug I hit when bisecting the original problem: >>>>>>> NULL pointer dereference in nouveau_bo_sync_for_device >>>>>>> It's caused by: >>>>>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt >>>>>> Good that I've asked :) >>>>>> >>>>>> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was >>>>>> created mostly automated. >>>>>> >>>>>> Do you have the original backtrace of that NULL pointer deref once more? >>>>> The original backtrace is here: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2021%2F6%2F5%2F350&data=04%7C01%7Cchristian.koenig%40amd.com%7C657222345e3242e7a6a608d92c383f66%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637589442963348551%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=ZkJs%2FR8MeQKUxwhJUC%2FG4Hi3T%2FMIftt%2FWRh%2B1%2BU5rUE%3D&reserved=0 >>>> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I >>>> don't see how that can happen since nouveau is using ttm_sg_tt_init(). >>>> >>>> Apart from that what nouveau does here is rather questionable since you >>>> need a coherent architecture for most things anyway, but that's not what >>>> we are trying to fix here. >>>> >>>> Can you try to narrow down if ttm_sg_tt_init is called before calling >>>> this function for the tt object in question? >>> ttm_sg_tt_init is not called: >>> [ 12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB >>> [ 12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB >>> [ 12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6 >>> [ 12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS >>> [ 12.151362] ttm_tt_init >>> [ 12.151370] ttm_tt_init_fields >>> [ 12.151374] ttm_tt_alloc_page_directory >>> [ 12.151615] BUG: kernel NULL pointer dereference, address: 00000000 >> Please add dump_stack(); to ttm_tt_init() and report back with the >> backtrace. >> >> I can't see how this is called from the nouveau code, only possibility I >> see is that it is maybe called through the AGP code somehow. > Yes, you're right: > [ 13.192663] Call Trace: > [ 13.192678] dump_stack+0x54/0x68 > [ 13.192690] ttm_tt_init+0x11/0x8a [ttm] > [ 13.192699] ttm_agp_tt_create+0x39/0x51 [ttm] > [ 13.192840] nouveau_ttm_tt_create+0x17/0x22 [nouveau] > [ 13.192856] ttm_tt_create+0x78/0x8c [ttm] > [ 13.192864] ttm_bo_handle_move_mem+0x7d/0xca [ttm] > [ 13.192873] ttm_bo_validate+0x92/0xc8 [ttm] > [ 13.192883] ttm_bo_init_reserved+0x216/0x243 [ttm] > [ 13.192892] ttm_bo_init+0x45/0x65 [ttm] > [ 13.193018] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau] > [ 13.193150] nouveau_bo_init+0x8c/0x94 [nouveau] > [ 13.193273] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau] > [ 13.193407] nouveau_bo_new+0x44/0x57 [nouveau] > [ 13.193537] nouveau_channel_prep+0xa3/0x269 [nouveau] > [ 13.193665] nouveau_channel_new+0x3c/0x5f7 [nouveau] > [ 13.193679] ? slab_free_freelist_hook+0x3b/0xa7 > [ 13.193686] ? kfree+0x9e/0x11a > [ 13.193781] ? nvif_object_sclass_put+0xd/0x16 [nouveau] > [ 13.193908] nouveau_drm_device_init+0x2e2/0x646 [nouveau] > [ 13.193924] ? pci_enable_device_flags+0x1e/0xac > [ 13.194052] nouveau_drm_probe+0xeb/0x188 [nouveau] > [ 13.194182] ? nouveau_drm_device_init+0x646/0x646 [nouveau] > [ 13.194195] pci_device_probe+0x89/0xe9 > [ 13.194205] really_probe+0x127/0x2a7 > [ 13.194212] driver_probe_device+0x5b/0x87 > [ 13.194219] device_driver_attach+0x2e/0x41 > [ 13.194226] __driver_attach+0x7c/0x83 > [ 13.194232] bus_for_each_dev+0x4c/0x66 > [ 13.194238] driver_attach+0x14/0x16 > [ 13.194244] ? device_driver_attach+0x41/0x41 > [ 13.194251] bus_add_driver+0xc5/0x16c > [ 13.194258] driver_register+0x87/0xb9 > [ 13.194265] __pci_register_driver+0x38/0x3b > [ 13.194271] ? 0xf0c0d000 > [ 13.194362] nouveau_drm_init+0x14c/0x1000 [nouveau] > > How is ttm_dma_tt->dma_address allocated?Mhm, I need to double check how AGP is supposed to work. Since barely anybody is using it these days it is something which breaks from time to time. Thanks for the backtrace, Christian.> I cannot find any assignment > executed (in the working code): > > $ git grep dma_address\ = drivers/gpu/ > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: sg->sgl->dma_address = addr; > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = &dma->dma_address[offset >> PAGE_SHIFT]; > drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = (mm_node->start << PAGE_SHIFT) + offset; > drivers/gpu/drm/i915/gvt/scheduler.c: sg->dma_address = addr; > drivers/gpu/drm/i915/i915_gpu_error.c: sg->dma_address = it; > drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = (void *) (ttm->ttm.pages + ttm->ttm.num_pages); > drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = kvmalloc_array(ttm->ttm.num_pages, > drivers/gpu/drm/ttm/ttm_tt.c: ttm_dma->dma_address = NULL; > drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_phys_addr; > drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_dma_addr; > drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_sg_addr; > > The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and > ttm_sg_tt_alloc_page_directory(). > Confirmed by adding printk()s that they're NOT called. > >
Christian König
2021-Jun-11 12:38 UTC
[Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device
Am 10.06.21 um 19:59 schrieb Christian K?nig:> Am 10.06.21 um 19:50 schrieb Ondrej Zary: >> [SNIP] >>> I can't see how this is called from the nouveau code, only >>> possibility I >>> see is that it is maybe called through the AGP code somehow. >> Yes, you're right: >> [?? 13.192663] Call Trace: >> [?? 13.192678]? dump_stack+0x54/0x68 >> [?? 13.192690]? ttm_tt_init+0x11/0x8a [ttm] >> [?? 13.192699]? ttm_agp_tt_create+0x39/0x51 [ttm] >> [?? 13.192840]? nouveau_ttm_tt_create+0x17/0x22 [nouveau] >> [?? 13.192856]? ttm_tt_create+0x78/0x8c [ttm] >> [?? 13.192864]? ttm_bo_handle_move_mem+0x7d/0xca [ttm] >> [?? 13.192873]? ttm_bo_validate+0x92/0xc8 [ttm] >> [?? 13.192883]? ttm_bo_init_reserved+0x216/0x243 [ttm] >> [?? 13.192892]? ttm_bo_init+0x45/0x65 [ttm] >> [?? 13.193018]? ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau] >> [?? 13.193150]? nouveau_bo_init+0x8c/0x94 [nouveau] >> [?? 13.193273]? ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau] >> [?? 13.193407]? nouveau_bo_new+0x44/0x57 [nouveau] >> [?? 13.193537]? nouveau_channel_prep+0xa3/0x269 [nouveau] >> [?? 13.193665]? nouveau_channel_new+0x3c/0x5f7 [nouveau] >> [?? 13.193679]? ? slab_free_freelist_hook+0x3b/0xa7 >> [?? 13.193686]? ? kfree+0x9e/0x11a >> [?? 13.193781]? ? nvif_object_sclass_put+0xd/0x16 [nouveau] >> [?? 13.193908]? nouveau_drm_device_init+0x2e2/0x646 [nouveau] >> [?? 13.193924]? ? pci_enable_device_flags+0x1e/0xac >> [?? 13.194052]? nouveau_drm_probe+0xeb/0x188 [nouveau] >> [?? 13.194182]? ? nouveau_drm_device_init+0x646/0x646 [nouveau] >> [?? 13.194195]? pci_device_probe+0x89/0xe9 >> [?? 13.194205]? really_probe+0x127/0x2a7 >> [?? 13.194212]? driver_probe_device+0x5b/0x87 >> [?? 13.194219]? device_driver_attach+0x2e/0x41 >> [?? 13.194226]? __driver_attach+0x7c/0x83 >> [?? 13.194232]? bus_for_each_dev+0x4c/0x66 >> [?? 13.194238]? driver_attach+0x14/0x16 >> [?? 13.194244]? ? device_driver_attach+0x41/0x41 >> [?? 13.194251]? bus_add_driver+0xc5/0x16c >> [?? 13.194258]? driver_register+0x87/0xb9 >> [?? 13.194265]? __pci_register_driver+0x38/0x3b >> [?? 13.194271]? ? 0xf0c0d000 >> [?? 13.194362]? nouveau_drm_init+0x14c/0x1000 [nouveau] >> >> How is ttm_dma_tt->dma_address allocated? > > Mhm, I need to double check how AGP is supposed to work. > > Since barely anybody is using it these days it is something which > breaks from time to time.I have no idea how that ever worked in the first place since AGP isn't supposed to sync between CPU/GPU. Everything is coherent for that case. Anyway here is a patch which adds a check to those functions if the dma_address array is allocated in the first place. Please test it. Thanks, Christian.> > Thanks for the backtrace, > Christian. > >> ? I cannot find any assignment >> executed (in the working code): >> >> $ git grep dma_address\ = drivers/gpu/ >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: >> sg->sgl->dma_address = addr; >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = >> &dma->dma_address[offset >> PAGE_SHIFT]; >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = >> (mm_node->start << PAGE_SHIFT) + offset; >> drivers/gpu/drm/i915/gvt/scheduler.c:?? sg->dma_address = addr; >> drivers/gpu/drm/i915/i915_gpu_error.c:? sg->dma_address = it; >> drivers/gpu/drm/ttm/ttm_tt.c:?? ttm->dma_address = (void *) >> (ttm->ttm.pages + ttm->ttm.num_pages); >> drivers/gpu/drm/ttm/ttm_tt.c:?? ttm->dma_address = >> kvmalloc_array(ttm->ttm.num_pages, >> drivers/gpu/drm/ttm/ttm_tt.c:?? ttm_dma->dma_address = NULL; >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = >> &__vmw_piter_phys_addr; >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = >> &__vmw_piter_dma_addr; >> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = >> &__vmw_piter_sg_addr; >> >> The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and >> ttm_sg_tt_alloc_page_directory(). >> Confirmed by adding printk()s that they're NOT called. >> >> >-------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-drm-nouveau-check-dma_address-array-for-CPU-GPU-sync.patch Type: text/x-patch Size: 1363 bytes Desc: not available URL: <https://lists.freedesktop.org/archives/nouveau/attachments/20210611/f0a982a7/attachment.bin>