Joel Fernandes
2025-Dec-07 02:26 UTC
[RFC 7/7] gpu: nova-core: load the scrubber ucode when vGPU support is enabled
Hi Zhi,

On 12/6/2025 7:42 AM, Zhi Wang wrote:
> To support the maximum vGPUs on the device that support vGPU, a larger
> WPR2 heap size is required. By setting the WPR2 heap size larger than 256MB
> the scrubber ucode image is required to scrub the FB memory before any
> other ucode image is executed.
>
> If not, the GSP firmware hangs when booting.
>
> When vGPU support is enabled, execute the scrubber ucode image to scrub the
> FB memory before executing any other ucode images.
>
[..]
> pub(crate) const fn create(
> diff --git a/drivers/gpu/nova-core/firmware/booter.rs b/drivers/gpu/nova-core/firmware/booter.rs
> index f107f753214a..f622c9b960de 100644
> --- a/drivers/gpu/nova-core/firmware/booter.rs
> +++ b/drivers/gpu/nova-core/firmware/booter.rs
> @@ -269,6 +269,7 @@ fn new_booter(dev: &device::Device<device::Bound>, data: &[u8]) -> Result<Self>
>
>  #[derive(Copy, Clone, Debug, PartialEq)]
>  pub(crate) enum BooterKind {
> +    Scrubber,
>      Loader,
>      #[expect(unused)]
>      Unloader,
> @@ -286,6 +287,7 @@ pub(crate) fn new(
>          bar: &Bar0,
>      ) -> Result<Self> {
>          let fw_name = match kind {
> +            BooterKind::Scrubber => "scrubber",
>              BooterKind::Loader => "booter_load",
>              BooterKind::Unloader => "booter_unload",
>          };
> diff --git a/drivers/gpu/nova-core/gsp/boot.rs b/drivers/gpu/nova-core/gsp/boot.rs
> index ec006c26f19f..8ef79433f017 100644
> --- a/drivers/gpu/nova-core/gsp/boot.rs
> +++ b/drivers/gpu/nova-core/gsp/boot.rs
> @@ -151,6 +151,33 @@ pub(crate) fn boot(
>
>          Self::run_fwsec_frts(dev, gsp_falcon, bar, &bios, &fb_layout)?;

Could you elaborate on how the timeout below works? See comment below.

>
> +        if vgpu_support {
> +            let scrubber = BooterFirmware::new(
> +                dev,
> +                BooterKind::Scrubber,
> +                chipset,
> +                FIRMWARE_VERSION,
> +                sec2_falcon,
> +                bar,
> +            )?;
> +
> +            sec2_falcon.reset(bar)?;
> +            sec2_falcon.dma_load(bar, &scrubber)?;
> +
> +            let (mbox0, mbox1) = sec2_falcon.boot(bar, None, None)?;

boot() already returns -ETIMEDOUT via wait_till_halted()->read_poll_timeout().

The wait there is 2 seconds. I assume the scrubber would have completed by then.

> +
> +            dev_dbg!(
> +                pdev.as_ref(),
> +                "SEC2 MBOX0: {:#x}, MBOX1{:#x}\n",
> +                mbox0,
> +                mbox1
> +            );
> +
> +            if !regs::NV_PGC6_BSI_SECURE_SCRATCH_15::read(bar).scrubber_completed() {
> +                return Err(ETIMEDOUT);

So in which situation do you get to this point (!scrubber_completed)?
Basically, I am not sure ETIMEDOUT is the right error to return here, because
boot() already returns ETIMEDOUT after waiting for the halt.

If you still want to return ETIMEDOUT here, then it sounds like you're waiting
for scrubbing beyond the wait already done by boot(). If so, shouldn't you use
read_poll_timeout() here?

Perhaps something like:

    read_poll_timeout(
        || Ok(regs::NV_PGC6_BSI_SECURE_SCRATCH_15::read(bar).scrubber_completed()),
        |val: &bool| *val,
        Delta::from_millis(10),
        Delta::from_secs(5),
    )?;

Thanks.
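
For readers unfamiliar with the polling pattern suggested above, here is a minimal userspace sketch of what "poll a status bit until it is set or a deadline passes" could look like. It is not the kernel's read_poll_timeout(); `poll_until`, `PollError`, and the simulated completion bit are hypothetical names made up for illustration, and the intervals are arbitrary.

```rust
use std::cell::Cell;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Why a poll gave up. Hypothetical stand-in for an errno such as ETIMEDOUT.
#[derive(Debug, PartialEq)]
enum PollError {
    TimedOut,
}

/// Hypothetical userspace helper: call `read` until `done` accepts the
/// value, sleeping `interval` between attempts, or fail after `timeout`.
fn poll_until<T>(
    mut read: impl FnMut() -> T,
    done: impl Fn(&T) -> bool,
    interval: Duration,
    timeout: Duration,
) -> Result<T, PollError> {
    let deadline = Instant::now() + timeout;
    loop {
        let val = read();
        if done(&val) {
            return Ok(val);
        }
        if Instant::now() >= deadline {
            return Err(PollError::TimedOut);
        }
        sleep(interval);
    }
}

fn main() {
    // Simulated "scrubber completed" bit: reads as true from the third read on.
    let reads = Cell::new(0u32);
    let read_done_bit = || {
        reads.set(reads.get() + 1);
        reads.get() >= 3
    };

    let res = poll_until(
        read_done_bit,
        |done| *done,
        Duration::from_millis(1),
        Duration::from_millis(500),
    );
    println!("{:?} after {} reads", res, reads.get());
}
```

The point of the pattern is that a timeout error is only meaningful when the code actually waited; a single read of the bit, as in the patch, cannot time out.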
Zhi Wang
2025-Dec-09 14:05 UTC
[RFC 7/7] gpu: nova-core: load the scrubber ucode when vGPU support is enabled
On Sat, 6 Dec 2025 21:26:12 -0500 Joel Fernandes <joelagnelf at nvidia.com> wrote:
> Hi Zhi,
>
> On 12/6/2025 7:42 AM, Zhi Wang wrote:

snip

>
> boot() already returns -ETIMEDOUT via
> wait_till_halted()->read_poll_timeout().
>
> The wait there is 2 seconds. I assume the scrubber would have
> completed by then.
>
> > +
> > +            dev_dbg!(
> > +                pdev.as_ref(),
> > +                "SEC2 MBOX0: {:#x}, MBOX1{:#x}\n",
> > +                mbox0,
> > +                mbox1
> > +            );
> > +
> > +            if
> > !regs::NV_PGC6_BSI_SECURE_SCRATCH_15::read(bar).scrubber_completed()
> > {
> > +                return Err(ETIMEDOUT);
>
> So under which situation do you get to this point
> (!scrubber_completed) ? Basically I am not sure if ETIMEDOUT is the
> right error to return here, because boot() already returns ETIMEDOUT
> by waiting for the halt.
>
> If you still want return ETIMEDOUT here, then it sounds like you're
> waiting for scrubbing beyond the waiting already done by boot(). If
> so, then shouldn't you need to use read_poll_timeout() here?
>
> perhaps something like:
>
> read_poll_timeout(
> ||
> Ok(regs::NV_PGC6_BSI_SECURE_SCRATCH_15::read(bar).scrubber_completed()),
> |val: &bool| *val, Delta::from_millis(10),
> Delta::from_secs(5),
> )?;
>

This is identical to the implementation in OpenRM [1]. According to that
part of the code, I think the scrubber runs as part of the binary's boot
process. When it signals that the firmware booted successfully, the
scrubbing should be done. Let me change to another errno.

[1] https://github.com/NVIDIA/open-gpu-kernel-modules/blob/a5bfb10e75a4046c5d991c65f49b5d29151e68cf/src/nvidia/src/kernel/gpu/gsp/arch/ada/kernel_gsp_ad102.c#L49

> Thanks.
>
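
The distinction discussed in this exchange (a boot wait that times out versus a falcon that halted without setting the completion bit) can be sketched in plain Rust as two separate error cases. This is only an illustration under assumptions: `ScrubError`, its variants, and `check_scrubber` are hypothetical names, and the thread does not say which errno will replace ETIMEDOUT.

```rust
/// Hypothetical error type separating the two failure modes.
#[derive(Debug, PartialEq)]
enum ScrubError {
    /// The falcon never halted within the boot wait (the ETIMEDOUT case).
    BootTimedOut,
    /// The falcon halted, but the completion bit was not set afterwards.
    NotCompleted,
}

/// `halted`: whether the (simulated) boot wait saw the falcon halt.
/// `completed_bit`: the (simulated) scrubber-completed bit read after the halt.
fn check_scrubber(halted: bool, completed_bit: bool) -> Result<(), ScrubError> {
    if !halted {
        return Err(ScrubError::BootTimedOut);
    }
    if !completed_bit {
        // No further waiting happens here, so this is not reported as a timeout.
        return Err(ScrubError::NotCompleted);
    }
    Ok(())
}

fn main() {
    for (halted, bit) in [(true, true), (true, false), (false, false)] {
        println!("halted={halted} bit={bit} -> {:?}", check_scrubber(halted, bit));
    }
}
```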