Alexandre Courbot
2025-Dec-04 14:43 UTC
[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned
On Thu Dec 4, 2025 at 3:31 AM JST, Timur Tabi wrote:> On Wed, 2025-12-03 at 20:54 +0900, Alexandre Courbot wrote: >> On Tue Dec 2, 2025 at 8:25 AM JST, Timur Tabi wrote: >> > On Wed, 2025-11-19 at 12:36 +0900, Alexandre Courbot wrote: >> > > You can use the `Alignment` type here, as the rest of the code does: >> > > >> > > ??? let size = num::usize_as_u64(obj.size()) >> > > ??????? .align_up(Alignment::new::<GSP_PAGE_SIZE>())?; >> > > >> > > Now `align_up` returns an error in case of overflow, that we will need >> > > to pass down to the caller by changing the return type of `new`. It is a >> > > bit annoying, but better than the behavior of `next_mutiple_of` in such >> > > a case, which is to panic. :) >> > >> > I see your point, but these are u64s that we're talking about.? The only way next_mutiple_of() >> > can >> > panic is if obj.size() is greater than 0xFFFFFFFFFFFFF000, which is not possible.? >> > >> > I would say in this case, a panic is preferable to a convoluted error return that will never be >> > exercised, because failure here indicates a coding error, not an input error. >> >> The input data is a usize, so technically we could get an input that >> triggers that error. > > Actually, I still say it's not possible. > > Say I change the code to this, so that .next_multiple_of is called on a u64 instead of a usize: > > let size = num::usize_as_u64(obj.size()).next_multiple_of(GSP_PAGE_SIZE); > > Again, the only way this can fail is if the allocated object being passed in is almost 16 exabytes > in size, which is physically impossible. > >> I know it's a very edge case, and clearly indicates a bug, but the >> general rule is: don't panic the kernel. And in Rust, if possible, don't >> even let me compiler insert panic-handling code. If you don't want to >> change the return type of the method, then maybe use `unwrap_or` and >> `inspect_err` to print an error before returning e.g. `0`. > > How about this: if .next_multiple_of(GSP_PAGE_SIZE) does return an error, I'll just assign size to > obj.size() as-is? After all, at about 16GB/second for DMA, it will take about 31 years to DMA all > that memory, so I will have long since retired before that bug shows up.Please allow me to commend you for doing the computation, that really cracked me up. :D Maybe we need a `ECAREEROVERFLOW` error code for cases like this. And yeah, you are abolutely right, but my point was more about not having the code to handle panic conditions generated. Maybe I am thinking too much ahead, but I dream of a future where we could make guarantees like "this function never panics" and have the compiler complain if it does. So as a matter of principle I like to avoid having these, especially when they cannot happen in practice. So something like using `pr_warn` looks reasonable to me as a last resort. ... or maybe we can address the problem differently. Reading your commit log again: GSP-RM insists that the 'size' parameter of the LibosMemoryRegionInitArgument struct be aligned to 4KB. sounds to me like "it is a bug if `size` is not aligned to 4KB to begin with". Could that be a correct interpretation? Because if we align up past the valid data of the object, then what are we copying? Granted, `CoherentAllocation` will likely have an aligned size, but that's a lucky implementation detail. So maybe we can just downright return an error if the size is not aligned, which would solve the panic problem. Or we fix the problem when allocating the `CoherentAllocation`, making sure the filler data exists and is zeroes, and providing a valid `size` from the beginning.
Timur Tabi
2025-Dec-04 21:18 UTC
[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned
On Thu, 2025-12-04 at 23:43 +0900, Alexandre Courbot wrote:> And yeah, you are abolutely right, but my point was more about not > having the code to handle panic conditions generated. Maybe I am > thinking too much ahead, but I dream of a future where we could make > guarantees like "this function never panics" and have the compiler > complain if it does. So as a matter of principle I like to avoid having > these, especially when they cannot happen in practice.I can totally agree with that, but I don't know how to tell the compiler that obj.size() will never be close to 2^64. The issue only occurs with RMARGS, the last entry in the LibosArgs array. That is a 72-byte struct, IIRC. All of the other entries in the array are logging buffers that are already 4K aligned.> So something like using `pr_warn` looks reasonable to me as a last > resort. > > ... or maybe we can address the problem differently. Reading your commit > log again: > > ? GSP-RM insists that the 'size' parameter of the > ? LibosMemoryRegionInitArgument struct be aligned to 4KB.????? > > sounds to me like "it is a bug if `size` is not aligned to 4KB to begin > with". Could that be a correct interpretation?Short answer: yes. I had to dig real deep into the firmware to find this restriction. GSP-RM has this check: #define LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL 0x1000ULL INIT_PANIC_IF((init_args->size & (LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL - 1)) != 0, LibosPanicReasonInvalidConfiguration, init_args->size, "Unaligned size specified"); For some reason, this code exists only in the Turing/GA100 version of GSP-RM (aka Libos V2). The GA102+ version of GSP-RM (Libos V3) has completely different code for parsing the LibosArgs table that doesn't check the size for alignment.> Because if we align up past the valid data of the object, then what are > we copying? Granted, `CoherentAllocation` will likely have an aligned > size, but that's a lucky implementation detail. So maybe we can just > downright return an error if the size is not aligned, which would solve > the panic problem. > > Or we fix the problem when allocating the `CoherentAllocation`, making > sure the filler data exists and is zeroes, and providing a valid `size` > from the beginning.Yes, that is better. In fact, now that I think about it, rounding up the size was a work-around that I implemented and then forgot about. The correct approach is to do what Nouveau does and allocate 4K for RMARGS: ret = nvkm_gsp_mem_ctor(gsp, 0x1000, &gsp->rmargs); Unfortunately, I don't know how to get Nova to allocate a larger block of memory for an object than its size. The RMARGS struct is defined in bindings.rs: pub struct GSP_ARGUMENTS_CACHED { pub messageQueueInitArguments: MESSAGE_QUEUE_INIT_ARGUMENTS, pub srInitArguments: GSP_SR_INIT_ARGUMENTS, pub gpuInstance: u32_, pub bDmemStack: u8_, pub __bindgen_padding_0: [u8; 7usize], pub profilerArgs: GSP_ARGUMENTS_CACHED__bindgen_ty_1, } which I believe is 72 bytes. It is allocated here: let rmargs = CoherentAllocation::<GspArgumentsCached>::alloc_coherent( dev, 1, GFP_KERNEL | __GFP_ZERO, )?; The "1" is a count field. Unfortunately, 4096/72 is not an even number, so I can't just change `count` to another number. If you can tell me how to fix the CoherentAllocation allocation call so that it allocates 4K, that would be a better fix for this problem. It's only necessary for Turing/GA100, however.