Danilo Krummrich
2024-Oct-04 17:16 UTC
[PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer
On Sun, Sep 22, 2024 at 06:07:08AM -0700, Zhi Wang wrote:
> r535_gsp_cmdq_push() waits for the available page in the GSP cmdq
> buffer when handling a large RPC request. When it sees at least one
> available page in the cmdq, it quits the waiting with the amount of
> free buffer pages in the queue.
>
> Unfortunately, it always takes the [write pointer, buf_size) as
> available buffer pages before rolling back and wrongly calculates the
> size of the data should be copied. Thus, it can overwrite the RPC
> request that GSP is currently reading, which causes GSP hang due
> to corrupted RPC request:
>
> [ 549.209389] ------------[ cut here ]------------
> [ 549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> [ 549.225678] Modules linked in: nvkm(E+) gsp_log(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_timer(E) snd_seq_device(E) snd(E) soundcore(E) rfkill(E) qrtr(E) vfat(E) fat(E) ipmi_ssif(E) amd_atl(E) intel_rapl_msr(E) intel_rapl_common(E) mlx5_ib(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) ib_uverbs(E) kvm(E) ib_core(E) acpi_ipmi(E) ipmi_si(E) mxm_wmi(E) ipmi_devintf(E) rapl(E) i2c_piix4(E) wmi_bmof(E) joydev(E) ptdma(E) acpi_cpufreq(E) k10temp(E) pcspkr(E) ipmi_msghandler(E) xfs(E) libcrc32c(E) ast(E) i2c_algo_bit(E) crct10dif_pclmul(E) drm_shmem_helper(E) nvme_tcp(E) crc32_pclmul(E) ahci(E) drm_kms_helper(E) libahci(E) nvme_fabrics(E) crc32c_intel(E) nvme(E) cdc_ether(E) mlx5_core(E) nvme_core(E) usbnet(E) drm(E) libata(E) ccp(E) ghash_clmulni_intel(E) mii(E) t10_pi(E) mlxfw(E) sp5100_tco(E) psample(E) pci_hyperv_intf(E) wmi(E) dm_multipath(E) sunrpc(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) be2iscsi(E) bnx2i(E) cnic(E) uio(E) cxgb4i(E) cxgb4(E) tls(E) libcxgbi(E) libcxgb(E) qla4xxx(E)
> [ 549.225752] iscsi_boot_sysfs(E) iscsi_tcp(E) libiscsi_tcp(E) libiscsi(E) scsi_transport_iscsi(E) fuse(E) [last unloaded: gsp_log(E)]
> [ 549.326293] CPU: 8 PID: 6314 Comm: insmod Tainted: G E 6.9.0-rc6+ #1
> [ 549.334039] Hardware name: ASRockRack 1U1G-MILAN/N/ROMED8-NL, BIOS L3.12E 09/06/2022
> [ 549.341781] RIP: 0010:r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> [ 549.347343] Code: 08 00 00 89 da c1 e2 0c 48 8d ac 11 00 10 00 00 48 8b 0c 24 48 85 c9 74 1f c1 e0 0c 4c 8d 6d 30 83 e8 30 89 01 e9 68 ff ff ff <0f> 0b 49 c7 c5 92 ff ff ff e9 5a ff ff ff ba ff ff ff ff be c0 0c
> [ 549.366090] RSP: 0018:ffffacbccaaeb7d0 EFLAGS: 00010246
> [ 549.371315] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000923e28
> [ 549.378451] RDX: 0000000000000000 RSI: 0000000055555554 RDI: ffffacbccaaeb730
> [ 549.385590] RBP: 0000000000000001 R08: ffff8bd14d235f70 R09: ffff8bd14d235f70
> [ 549.392721] R10: 0000000000000002 R11: ffff8bd14d233864 R12: 0000000000000020
> [ 549.399854] R13: ffffacbccaaeb818 R14: 0000000000000020 R15: ffff8bb298c67000
> [ 549.406988] FS: 00007f5179244740(0000) GS:ffff8bd14d200000(0000) knlGS:0000000000000000
> [ 549.415076] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 549.420829] CR2: 00007fa844000010 CR3: 00000001567dc005 CR4: 0000000000770ef0
> [ 549.427963] PKRU: 55555554
> [ 549.430672] Call Trace:
> [ 549.433126] <TASK>
> [ 549.435233] ? __warn+0x7f/0x130
> [ 549.438473] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> [ 549.443426] ? report_bug+0x18a/0x1a0
> [ 549.447098] ? handle_bug+0x3c/0x70
> [ 549.450589] ? exc_invalid_op+0x14/0x70
> [ 549.454430] ? asm_exc_invalid_op+0x16/0x20
> [ 549.458619] ? r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
> [ 549.463565] r535_gsp_msg_recv+0x46/0x230 [nvkm]
> [ 549.468257] r535_gsp_rpc_push+0x106/0x160 [nvkm]
> [ 549.473033] r535_gsp_rpc_rm_ctrl_push+0x40/0x130 [nvkm]
> [ 549.478422] nvidia_grid_init_vgpu_types+0xbc/0xe0 [nvkm]
> [ 549.483899] nvidia_grid_init+0xb1/0xd0 [nvkm]
> [ 549.488420] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 549.493213] nvkm_device_pci_probe+0x305/0x420 [nvkm]
> [ 549.498338] local_pci_probe+0x46/0xa0
> [ 549.502096] pci_call_probe+0x56/0x170
> [ 549.505851] pci_device_probe+0x79/0xf0
> [ 549.509690] ? driver_sysfs_add+0x59/0xc0
> [ 549.513702] really_probe+0xd9/0x380
> [ 549.517282] __driver_probe_device+0x78/0x150
> [ 549.521640] driver_probe_device+0x1e/0x90
> [ 549.525746] __driver_attach+0xd2/0x1c0
> [ 549.529594] ? __pfx___driver_attach+0x10/0x10
> [ 549.534045] bus_for_each_dev+0x78/0xd0
> [ 549.537893] bus_add_driver+0x112/0x210
> [ 549.541750] driver_register+0x5c/0x120
> [ 549.545596] ? __pfx_nvkm_init+0x10/0x10 [nvkm]
> [ 549.550224] do_one_initcall+0x44/0x300
> [ 549.554063] ? do_init_module+0x23/0x240
> [ 549.557989] do_init_module+0x64/0x240
>
> Calculate the available buffer page before rolling back based on
> the result from the waiting.

It looks like you hit this one while working on the VFIO stuff too. So, same
question here, can we hit this case with "vanilla nouveau"?

>
> Fixes: 176fdcbddfd28 ("drm/nouveau/gsp/r535: add support for booting GSP-RM")

Same as in patch 1.

> Cc: Ben Skeggs <bskeggs at nvidia.com>
> Cc: Karol Herbst <kherbst at redhat.com>
> Cc: Lyude Paul <lyude at redhat.com>
> Cc: Danilo Krummrich <dakr at redhat.com>
> Cc: David Airlie <airlied at gmail.com>
> Signed-off-by: Zhi Wang <zhiw at nvidia.com>
> ---
> drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> index 736cde1987d0..49721935013b 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
> @@ -161,7 +161,7 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>          u64 *end;
>          u64 csum = 0;
>          int free, time = 1000000;
> -        u32 wptr, size;
> +        u32 wptr, size, step;
>          u32 off = 0;
>
>          argc = ALIGN(GSP_MSG_HDR_SIZE + argc, GSP_PAGE_SIZE);
> @@ -178,11 +178,13 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>          cmd->checksum = upper_32_bits(csum) ^ lower_32_bits(csum);
>
>          wptr = *gsp->cmdq.wptr;
> +

Please remove the addition of empty lines here...

>          do {
>                  do {
>                          free = *gsp->cmdq.rptr + gsp->cmdq.cnt - wptr - 1;
>                          if (free >= gsp->cmdq.cnt)
>                                  free -= gsp->cmdq.cnt;
> +

and here.

>                          if (free >= 1)
>                                  break;
>
> @@ -195,7 +197,9 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>                  }
>
>                  cqe = (void *)((u8 *)gsp->shm.cmdq.ptr + 0x1000 + wptr * 0x1000);
> -                size = min_t(u32, argc, (gsp->cmdq.cnt - wptr) * GSP_PAGE_SIZE);
> +                step = min_t(u32, free, (gsp->cmdq.cnt - wptr));
> +                size = min_t(u32, argc, step * GSP_PAGE_SIZE);
> +
>                  memcpy(cqe, (u8 *)cmd + off, size);
>
>                  wptr += DIV_ROUND_UP(size, 0x1000);
> --
> 2.34.1
>
Zhi Wang
2024-Oct-13 18:27 UTC
[PATCH 2/3] nvkm/gsp: correctly calculate the available space of the GSP cmdq buffer
On 04/10/2024 20.16, Danilo Krummrich wrote:

Hey Danilo. I am just back from my vacation. Sorry for the delay. See my
comments below.

> On Sun, Sep 22, 2024 at 06:07:08AM -0700, Zhi Wang wrote:
>> r535_gsp_cmdq_push() waits for the available page in the GSP cmdq
>> buffer when handling a large RPC request. When it sees at least one
>> available page in the cmdq, it quits the waiting with the amount of
>> free buffer pages in the queue.
>>
>> Unfortunately, it always takes the [write pointer, buf_size) as
>> available buffer pages before rolling back and wrongly calculates the
>> size of the data should be copied. Thus, it can overwrite the RPC
>> request that GSP is currently reading, which causes GSP hang due
>> to corrupted RPC request:
>>
>> [ 549.209389] ------------[ cut here ]------------
>> [ 549.214010] WARNING: CPU: 8 PID: 6314 at drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c:116 r535_gsp_msgq_wait+0xd0/0x190 [nvkm]
>> [...]
>>
>> Calculate the available buffer page before rolling back based on
>> the result from the waiting.
>
> It looks like you hit this one while working on the VFIO stuff too. So, same
> question here,

Yes. But these bugs are not specific to vGPU, because two-page GSP RPCs are
a valid RPC vehicle format in the GSP RPC protocol family. The fixes make
GSP RPC handling in nouveau more sophisticated. Other GSP RPCs can use this
vehicle format as well.

> can we hit this case with "vanilla nouveau"?

Not yet. But introducing new GSP RPCs that use this vehicle format (whether
related to vGPU or not) in nouveau might hit this bug later.

Out of curiosity, do we have any unit-test package or flow to test the
patches, such as a CI? I am using the Phoronix Test Suite on Ubuntu with a
PPA repo that has the latest mesa/drm userspace libraries. It would be nice
if I could align with others. :)

Thanks,
Zhi.

>>
>> Fixes: 176fdcbddfd28 ("drm/nouveau/gsp/r535: add support for booting GSP-RM")
>
> Same as in patch 1.
>
>> Cc: Ben Skeggs <bskeggs at nvidia.com>
>> Cc: Karol Herbst <kherbst at redhat.com>
>> Cc: Lyude Paul <lyude at redhat.com>
>> Cc: Danilo Krummrich <dakr at redhat.com>
>> Cc: David Airlie <airlied at gmail.com>
>> Signed-off-by: Zhi Wang <zhiw at nvidia.com>
>> ---
>> drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c | 8 ++++++--
>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
>> index 736cde1987d0..49721935013b 100644
>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/gsp/r535.c
>> @@ -161,7 +161,7 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>>          u64 *end;
>>          u64 csum = 0;
>>          int free, time = 1000000;
>> -        u32 wptr, size;
>> +        u32 wptr, size, step;
>>          u32 off = 0;
>>
>>          argc = ALIGN(GSP_MSG_HDR_SIZE + argc, GSP_PAGE_SIZE);
>> @@ -178,11 +178,13 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>>          cmd->checksum = upper_32_bits(csum) ^ lower_32_bits(csum);
>>
>>          wptr = *gsp->cmdq.wptr;
>> +
>
> Please remove the addition of empty lines here...
>
>>          do {
>>                  do {
>>                          free = *gsp->cmdq.rptr + gsp->cmdq.cnt - wptr - 1;
>>                          if (free >= gsp->cmdq.cnt)
>>                                  free -= gsp->cmdq.cnt;
>> +
>
> and here.
>
>>                          if (free >= 1)
>>                                  break;
>>
>> @@ -195,7 +197,9 @@ r535_gsp_cmdq_push(struct nvkm_gsp *gsp, void *argv)
>>                  }
>>
>>                  cqe = (void *)((u8 *)gsp->shm.cmdq.ptr + 0x1000 + wptr * 0x1000);
>> -                size = min_t(u32, argc, (gsp->cmdq.cnt - wptr) * GSP_PAGE_SIZE);
>> +                step = min_t(u32, free, (gsp->cmdq.cnt - wptr));
>> +                size = min_t(u32, argc, step * GSP_PAGE_SIZE);
>> +
>>                  memcpy(cqe, (u8 *)cmd + off, size);
>>
>>                  wptr += DIV_ROUND_UP(size, 0x1000);
>> --
>> 2.34.1
>>