Zhi Wang
2024-Sep-22 13:07 UTC
[PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer
r535_gsp_cmdq_push() waits for the available page in the GSP cmdq buffer when handling a large RPC request. When it sees at least one available page in the cmdq, it quits the waiting with the amount of free buffer pages in the queue. Unfortunately, it always takes the [write pointer, buf_size) as available buffer pages before rolling back and wrongly calculates the size of the data should be copied. Thus, it can overwrite the RPC request that GSP is currently reading, which causes GSP hang due to corrupted RPC request: [ 549.209389] ------------[ cut here ]------------ [ 549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm] [ 549.225678] Modules linked in: nvkm(E+) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) mlx5_ib(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) mxm_wmi(E) ipmi_devintf(E) rapl(E) i2c_piix4(E) wmi_bmof(E) joydev(E) ptdma(E) acpi_cpufreq(E) k10temp(E) pcspkr(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) crct10dif_pclmul(E) drm_shmem_helper(E) nvme_tcp(E) crc32_pclmul(E) ahci(E) drm_kms_helper(E) libahci(E) nvme_fabrics(E) crc32c_intel(E) nvme(E) cdc_ether(E) mlx5_core(E) nvme_core(E) usbnet(E) drm(E) libata(E) ccp(E) ghash_clmulni_intel(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E) [ 549.225752] iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)] [ 549.326293] CPU: 8 PID: 6314 Comm: insmod Tainted: G E 6.9.0-rc6+ #1 [ 549.334039] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022 [ 549.341781] RIP: 0010:r535_gsp_msgq_wait+0xd0/0x190 [nvkm] [ 549.347343] Code: 08 00 00 89 da c1 e2 0c 48 8d ac 11 00 10 00 00 48 8b 0c 24 48 85 c9 74 1f c1 e0 0c 4c 8d 6d 30 83 e8 30 89 01 e9 68 ff ff ff <0f> 0b 49 c7 c5 92 ff ff ff e9 5a ff ff ff ba ff ff ff ff be c0 0c [ 549.366090] RSP: 0018:ffffacbccaaeb7d0 EFLAGS: 00010246 [ 549.371315] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000923e28 [ 549.378451] RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffacbccaaeb730 [ 549.385590] RBP: 0000000000000001 R08: ffff8bd14d235f70 R09: ffff8bd14d235f70 [ 549.392721] R10: 0000000000000002 R11: ffff8bd14d233864 R12: 0000000000000020 [ 549.399854] R13: ffffacbccaaeb818 R14: 0000000000000020 R15: ffff8bb298c67000 [ 549.406988] FS: 00007f5179244740(0000) GS:ffff8bd14d200000(0000) knlGS:0000000000000000 [ 549.415076] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 549.420829] CR2: 00007fa844000010 CR3: 00000001567dc005 CR4: 0000000000770ef0 [ 549.427963] PKRU: 55555554 [ 549.430672] Call Trace: [ 549.433126] <TASK> [ 549.435233] ? __warn+0x7f/0x130 [ 549.438473] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm] [ 549.443426] ? report_bug+0x18a/0x1a0 [ 549.447098] ? handle_bug+0x3c/0x70 [ 549.450589] ? exc_invalid_op+0x14/0x70 [ 549.454430] ? asm_exc_invalid_op+0x16/0x20 [ 549.458619] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm] [ 549.463565] r535_gsp_msg_recv+0x46/0x230 [nvkm] [ 549.468257] r535_gsp_rpc_push+0x106/0x160 [nvkm] [ 549.473033] r535_gsp_rpc_rm_ctrl_push+0x40/0x130 [nvkm] [ 549.478422] nvidia_grid_init_vgpu_types+0xbc/0xe0 [nvkm] [ 549.483899] nvidia_grid_init+0xb1/0xd0 [nvkm] [ 549.488420] ? srso_alias_return_thunk+0x5/0xfbef5 [ 549.493213] nvkm_device_pci_probe+0x305/0x420 [nvkm] [ 549.498338] local_pci_probe+0x46/0xa0 [ 549.502096] pci_call_probe+0x56/0x170 [ 549.505851] pci_device_probe+0x79/0xf0 [ 549.509690] ? driver_sysfs_add+0x59/0xc0 [ 549.513702] really_probe+0xd9/0x380 [ 549.517282] __driver_probe_device+0x78/0x150 [ 549.521640] driver_probe_device+0x1e/0x90 [ 549.525746] __driver_attach+0xd2/0x1c0 [ 549.529594] ? __pfx___driver_attach+0x10/0x10 [ 549.534045] bus_for_each_dev+0x78/0xd0 [ 549.537893] bus_add_driver+0x112/0x210 [ 549.541750] driver_register+0x5c/0x120 [ 549.545596] ? __pfx_nvkm_init+0x10/0x10 [nvkm] [ 549.550224] do_one_initcall+0x44/0x300 [ 549.554063] ? do_init_module+0x23/0x240 [ 549.557989] do_init_module+0x64/0x240 Calculate the available buffer page before rolling back based on the result from the waiting. Fixes: 176fdcbddfd28 ("drm/nouveau/gsp/r535: add support for booting GSP-RM") Cc: Ben Skeggs <bskeggs at nvidia.com> Cc: Karol Herbst <kherbst at redhat.com> Cc: Lyude Paul <lyude at redhat.com> Cc: Danilo Krummrich <dakr at redhat.com> Cc: David Airlie <airlied at gmail.com> Signed-off-by: Zhi Wang <zhiw at nvidia.com> --- drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c index 736cde1987d0..49721935013b 100644 --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c @@ -161,7 +161,7 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv) u64 *end; u64 csum = 0; int free, time = 1000000; - u32 wptr, size; + u32 wptr, size, step; u32 off = 0; argc = ALIGN(GSP_MSG_HDR_SIZE + argc, GSP_PAGE_SIZE); @@ -178,11 +178,13 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv) cmd->checksum = upper_32_bits(csum) ^ lower_32_bits(csum); wptr = *gsp->cmdq.wptr; + do { do { free = *gsp->cmdq.rptr + gsp->cmdq.cnt - wptr - 1; if (free >= gsp->cmdq.cnt) free -= gsp->cmdq.cnt; + if (free >= 1) break; @@ -195,7 +197,9 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv) } cqe = (void *)((u8 *)gsp->shm.cmdq.ptr + 0x1000 + wptr * 0x1000); - size = min_t(u32, argc, (gsp->cmdq.cnt - wptr) * GSP_PAGE_SIZE); + step = min_t(u32, free, (gsp->cmdq.cnt - wptr)); + size = min_t(u32, argc, step * GSP_PAGE_SIZE); + memcpy(cqe, (u8 *)cmd + off, size); wptr += DIV_ROUND_UP(size, 0x1000); -- 2.34.1
Danilo Krummrich
2024-Oct-04 17:16 UTC
[PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer
On Sun, Sep 22, 2024 at 06:07:08AM -0700, Zhi Wang wrote:> r535_gsp_cmdq_push() waits for the available page in the GSP cmdq > buffer when handling a large RPC request. When it sees at least one > available page in the cmdq, it quits the waiting with the amount of > free buffer pages in the queue. > > Unfortunately, it always takes the [write pointer, buf_size) as > available buffer pages before rolling back and wrongly calculates the > size of the data should be copied. Thus, it can overwrite the RPC > request that GSP is currently reading, which causes GSP hang due > to corrupted RPC request: > > [ 549.209389] ------------[ cut here ]------------ > [ 549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm] > [ 549.225678] Modules linked in: nvkm(E+) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) mlx5_ib(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) mxm_wmi(E) ipmi_devintf(E) rapl(E) i2c_piix4(E) wmi_bmof(E) joydev(E) ptdma(E) acpi_cpufreq(E) k10temp(E) pcspkr(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) crct10dif_pclmul(E) drm_shmem_helper(E) nvme_tcp(E) crc32_pclmul(E) ahci(E) drm_kms_helper(E) libahci(E) nvme_fabrics(E) crc32c_intel(E) nvme(E) cdc_ether(E) mlx5_core(E) nvme_core(E) usbnet(E) drm(E) libata(E) ccp(E) ghash_clmulni_intel(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E) > [ 549.225752] iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)] > [ 549.326293] CPU: 8 PID: 6314 Comm: insmod Tainted: G E 6.9.0-rc6+ #1 > [ 549.334039] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022 > [ 549.341781] RIP: 0010:r535_gsp_msgq_wait+0xd0/0x190 [nvkm] > [ 549.347343] Code: 08 00 00 89 da c1 e2 0c 48 8d ac 11 00 10 00 00 48 8b 0c 24 48 85 c9 74 1f c1 e0 0c 4c 8d 6d 30 83 e8 30 89 01 e9 68 ff ff ff <0f> 0b 49 c7 c5 92 ff ff ff e9 5a ff ff ff ba ff ff ff ff be c0 0c > [ 549.366090] RSP: 0018:ffffacbccaaeb7d0 EFLAGS: 00010246 > [ 549.371315] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000923e28 > [ 549.378451] RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffacbccaaeb730 > [ 549.385590] RBP: 0000000000000001 R08: ffff8bd14d235f70 R09: ffff8bd14d235f70 > [ 549.392721] R10: 0000000000000002 R11: ffff8bd14d233864 R12: 0000000000000020 > [ 549.399854] R13: ffffacbccaaeb818 R14: 0000000000000020 R15: ffff8bb298c67000 > [ 549.406988] FS: 00007f5179244740(0000) GS:ffff8bd14d200000(0000) knlGS:0000000000000000 > [ 549.415076] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 549.420829] CR2: 00007fa844000010 CR3: 00000001567dc005 CR4: 0000000000770ef0 > [ 549.427963] PKRU: 55555554 > [ 549.430672] Call Trace: > [ 549.433126] <TASK> > [ 549.435233] ? __warn+0x7f/0x130 > [ 549.438473] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm] > [ 549.443426] ? report_bug+0x18a/0x1a0 > [ 549.447098] ? handle_bug+0x3c/0x70 > [ 549.450589] ? exc_invalid_op+0x14/0x70 > [ 549.454430] ? asm_exc_invalid_op+0x16/0x20 > [ 549.458619] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm] > [ 549.463565] r535_gsp_msg_recv+0x46/0x230 [nvkm] > [ 549.468257] r535_gsp_rpc_push+0x106/0x160 [nvkm] > [ 549.473033] r535_gsp_rpc_rm_ctrl_push+0x40/0x130 [nvkm] > [ 549.478422] nvidia_grid_init_vgpu_types+0xbc/0xe0 [nvkm] > [ 549.483899] nvidia_grid_init+0xb1/0xd0 [nvkm] > [ 549.488420] ? srso_alias_return_thunk+0x5/0xfbef5 > [ 549.493213] nvkm_device_pci_probe+0x305/0x420 [nvkm] > [ 549.498338] local_pci_probe+0x46/0xa0 > [ 549.502096] pci_call_probe+0x56/0x170 > [ 549.505851] pci_device_probe+0x79/0xf0 > [ 549.509690] ? driver_sysfs_add+0x59/0xc0 > [ 549.513702] really_probe+0xd9/0x380 > [ 549.517282] __driver_probe_device+0x78/0x150 > [ 549.521640] driver_probe_device+0x1e/0x90 > [ 549.525746] __driver_attach+0xd2/0x1c0 > [ 549.529594] ? __pfx___driver_attach+0x10/0x10 > [ 549.534045] bus_for_each_dev+0x78/0xd0 > [ 549.537893] bus_add_driver+0x112/0x210 > [ 549.541750] driver_register+0x5c/0x120 > [ 549.545596] ? __pfx_nvkm_init+0x10/0x10 [nvkm] > [ 549.550224] do_one_initcall+0x44/0x300 > [ 549.554063] ? do_init_module+0x23/0x240 > [ 549.557989] do_init_module+0x64/0x240 > > Calculate the available buffer page before rolling back based on > the result from the waiting.It looks like you hit this one while working on the VFIO stuff too. So, same question here, can we hit this case with "vanilla nouveau"?> > Fixes: 176fdcbddfd28 ("drm/nouveau/gsp/r535: add support for booting GSP-RM")Same as in patch 1.> Cc: Ben Skeggs <bskeggs at nvidia.com> > Cc: Karol Herbst <kherbst at redhat.com> > Cc: Lyude Paul <lyude at redhat.com> > Cc: Danilo Krummrich <dakr at redhat.com> > Cc: David Airlie <airlied at gmail.com> > Signed-off-by: Zhi Wang <zhiw at nvidia.com> > --- > drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c > index 736cde1987d0..49721935013b 100644 > --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c > +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c > @@ -161,7 +161,7 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv) > u64 *end; > u64 csum = 0; > int free, time = 1000000; > - u32 wptr, size; > + u32 wptr, size, step; > u32 off = 0; > > argc = ALIGN(GSP_MSG_HDR_SIZE + argc, GSP_PAGE_SIZE); > @@ -178,11 +178,13 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv) > cmd->checksum = upper_32_bits(csum) ^ lower_32_bits(csum); > > wptr = *gsp->cmdq.wptr; > +Please remove the addition of empty lines here...> do { > do { > free = *gsp->cmdq.rptr + gsp->cmdq.cnt - wptr - 1; > if (free >= gsp->cmdq.cnt) > free -= gsp->cmdq.cnt; > +and here.> if (free >= 1) > break; > > @@ -195,7 +197,9 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv) > } > > cqe = (void *)((u8 *)gsp->shm.cmdq.ptr + 0x1000 + wptr * 0x1000); > - size = min_t(u32, argc, (gsp->cmdq.cnt - wptr) * GSP_PAGE_SIZE); > + step = min_t(u32, free, (gsp->cmdq.cnt - wptr)); > + size = min_t(u32, argc, step * GSP_PAGE_SIZE); > + > memcpy(cqe, (u8 *)cmd + off, size); > > wptr += DIV_ROUND_UP(size, 0x1000); > -- > 2.34.1 >