thr3ads.net - Nouveau - [PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned [Dec 2025]

If this information is useful, please help other people find it:
Share via:

Alexandre Courbot

2025-Dec-04 14:43 UTC

[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned

On Thu Dec 4, 2025 at 3:31 AM JST, Timur Tabi wrote:> On Wed, 2025-12-03 at 20:54 +0900, Alexandre Courbot wrote:
>> On Tue Dec 2, 2025 at 8:25 AM JST, Timur Tabi wrote:
>> > On Wed, 2025-11-19 at 12:36 +0900, Alexandre Courbot wrote:
>> > > You can use the `Alignment` type here, as the rest of the
code does:
>> > > 
>> > > ??? let size = num::usize_as_u64(obj.size())
>> > > ??????? .align_up(Alignment::new::<GSP_PAGE_SIZE>())?;
>> > > 
>> > > Now `align_up` returns an error in case of overflow, that we
will need
>> > > to pass down to the caller by changing the return type of
`new`. It is a
>> > > bit annoying, but better than the behavior of
`next_mutiple_of` in such
>> > > a case, which is to panic. :)
>> > 
>> > I see your point, but these are u64s that we're talking
about.? The only way next_mutiple_of()
>> > can
>> > panic is if obj.size() is greater than 0xFFFFFFFFFFFFF000, which
is not possible.?
>> > 
>> > I would say in this case, a panic is preferable to a convoluted
error return that will never be
>> > exercised, because failure here indicates a coding error, not an
input error.
>> 
>> The input data is a usize, so technically we could get an input that
>> triggers that error.
>
> Actually, I still say it's not possible.  
>
> Say I change the code to this, so that .next_multiple_of is called on a u64
instead of a usize:
>
> 	let size = num::usize_as_u64(obj.size()).next_multiple_of(GSP_PAGE_SIZE);
>
> Again, the only way this can fail is if the allocated object being passed
in is almost 16 exabytes
> in size, which is physically impossible.
>
>> I know it's a very edge case, and clearly indicates a bug, but the
>> general rule is: don't panic the kernel. And in Rust, if possible,
don't
>> even let me compiler insert panic-handling code. If you don't want
to
>> change the return type of the method, then maybe use `unwrap_or` and
>> `inspect_err` to print an error before returning e.g. `0`.
>
> How about this: if .next_multiple_of(GSP_PAGE_SIZE) does return an error,
I'll just assign size to
> obj.size() as-is?  After all, at about 16GB/second for DMA, it will take
about 31 years to DMA all
> that memory, so I will have long since retired before that bug shows up.
Please allow me to commend you for doing the computation, that really
cracked me up. :D Maybe we need a `ECAREEROVERFLOW` error code for cases
like this.

And yeah, you are abolutely right, but my point was more about not
having the code to handle panic conditions generated. Maybe I am
thinking too much ahead, but I dream of a future where we could make
guarantees like "this function never panics" and have the compiler
complain if it does. So as a matter of principle I like to avoid having
these, especially when they cannot happen in practice.

So something like using `pr_warn` looks reasonable to me as a last
resort.

... or maybe we can address the problem differently. Reading your commit
log again:

  GSP-RM insists that the 'size' parameter of the
  LibosMemoryRegionInitArgument struct be aligned to 4KB.      

sounds to me like "it is a bug if `size` is not aligned to 4KB to begin
with". Could that be a correct interpretation?

Because if we align up past the valid data of the object, then what are
we copying? Granted, `CoherentAllocation` will likely have an aligned
size, but that's a lucky implementation detail. So maybe we can just
downright return an error if the size is not aligned, which would solve
the panic problem.

Or we fix the problem when allocating the `CoherentAllocation`, making
sure the filler data exists and is zeroes, and providing a valid `size`
from the beginning.

Timur Tabi

2025-Dec-04 21:18 UTC

head link

[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned

On Thu, 2025-12-04 at 23:43 +0900, Alexandre Courbot
wrote:> And yeah, you are abolutely right, but my point was more about not
> having the code to handle panic conditions generated. Maybe I am
> thinking too much ahead, but I dream of a future where we could make
> guarantees like "this function never panics" and have the
compiler
> complain if it does. So as a matter of principle I like to avoid having
> these, especially when they cannot happen in practice.
I can totally agree with that, but I don't know how to tell the compiler
that obj.size() will never
be close to 2^64.

The issue only occurs with RMARGS, the last entry in the LibosArgs array.  That
is a 72-byte struct,
IIRC.  All of the other entries in the array are logging buffers that are
already 4K aligned.
> So something like using `pr_warn` looks reasonable to me as a last
> resort.
> 
> ... or maybe we can address the problem differently. Reading your commit
> log again:
> 
> ? GSP-RM insists that the 'size' parameter of the
> ? LibosMemoryRegionInitArgument struct be aligned to 4KB.????? 
> 
> sounds to me like "it is a bug if `size` is not aligned to 4KB to
begin
> with". Could that be a correct interpretation?
Short answer: yes.  I had to dig real deep into the firmware to find this
restriction.  GSP-RM has
this check:

    #define LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL      0x1000ULL

    INIT_PANIC_IF((init_args->size &
(LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL - 1)) != 0,
                  LibosPanicReasonInvalidConfiguration, init_args->size,
                  "Unaligned size specified");

For some reason, this code exists only in the Turing/GA100 version of GSP-RM
(aka Libos V2).  The
GA102+ version of GSP-RM (Libos V3) has completely different code for parsing
the LibosArgs table
that doesn't check the size for alignment.
> Because if we align up past the valid data of the object, then what are
> we copying? Granted, `CoherentAllocation` will likely have an aligned
> size, but that's a lucky implementation detail. So maybe we can just
> downright return an error if the size is not aligned, which would solve
> the panic problem.
> 
> Or we fix the problem when allocating the `CoherentAllocation`, making
> sure the filler data exists and is zeroes, and providing a valid `size`
> from the beginning.
Yes, that is better.  In fact, now that I think about it, rounding up the size
was a work-around
that I implemented and then forgot about.  The correct approach is to do what
Nouveau does and
allocate 4K for RMARGS:

		ret = nvkm_gsp_mem_ctor(gsp, 0x1000, &gsp->rmargs);

Unfortunately, I don't know how to get Nova to allocate a larger block of
memory for an object than
its size.  The RMARGS struct is defined in bindings.rs:

pub struct GSP_ARGUMENTS_CACHED {
    pub messageQueueInitArguments: MESSAGE_QUEUE_INIT_ARGUMENTS,
    pub srInitArguments: GSP_SR_INIT_ARGUMENTS,
    pub gpuInstance: u32_,
    pub bDmemStack: u8_,
    pub __bindgen_padding_0: [u8; 7usize],
    pub profilerArgs: GSP_ARGUMENTS_CACHED__bindgen_ty_1,
}

which I believe is 72 bytes.

It is allocated here:

        let rmargs =
CoherentAllocation::<GspArgumentsCached>::alloc_coherent(
            dev,
            1,
            GFP_KERNEL | __GFP_ZERO,
        )?;


The "1" is a count field.  Unfortunately, 4096/72 is not an even
number, so I can't just change
`count` to another number.

If you can tell me how to fix the CoherentAllocation allocation call so that it
allocates 4K, that
would be a better fix for this problem.  It's only necessary for
Turing/GA100, however.

Nouveau - Dec 2025 - [PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned

[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned

[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned