Timur Tabi
2025-Dec-04 21:18 UTC
[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned
On Thu, 2025-12-04 at 23:43 +0900, Alexandre Courbot wrote:> And yeah, you are abolutely right, but my point was more about not > having the code to handle panic conditions generated. Maybe I am > thinking too much ahead, but I dream of a future where we could make > guarantees like "this function never panics" and have the compiler > complain if it does. So as a matter of principle I like to avoid having > these, especially when they cannot happen in practice.I can totally agree with that, but I don't know how to tell the compiler that obj.size() will never be close to 2^64. The issue only occurs with RMARGS, the last entry in the LibosArgs array. That is a 72-byte struct, IIRC. All of the other entries in the array are logging buffers that are already 4K aligned.> So something like using `pr_warn` looks reasonable to me as a last > resort. > > ... or maybe we can address the problem differently. Reading your commit > log again: > > ? GSP-RM insists that the 'size' parameter of the > ? LibosMemoryRegionInitArgument struct be aligned to 4KB.????? > > sounds to me like "it is a bug if `size` is not aligned to 4KB to begin > with". Could that be a correct interpretation?Short answer: yes. I had to dig real deep into the firmware to find this restriction. GSP-RM has this check: #define LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL 0x1000ULL INIT_PANIC_IF((init_args->size & (LIBOS_PAGETABLE_SEARCH_GRANULARITY_SMALL - 1)) != 0, LibosPanicReasonInvalidConfiguration, init_args->size, "Unaligned size specified"); For some reason, this code exists only in the Turing/GA100 version of GSP-RM (aka Libos V2). The GA102+ version of GSP-RM (Libos V3) has completely different code for parsing the LibosArgs table that doesn't check the size for alignment.> Because if we align up past the valid data of the object, then what are > we copying? Granted, `CoherentAllocation` will likely have an aligned > size, but that's a lucky implementation detail. So maybe we can just > downright return an error if the size is not aligned, which would solve > the panic problem. > > Or we fix the problem when allocating the `CoherentAllocation`, making > sure the filler data exists and is zeroes, and providing a valid `size` > from the beginning.Yes, that is better. In fact, now that I think about it, rounding up the size was a work-around that I implemented and then forgot about. The correct approach is to do what Nouveau does and allocate 4K for RMARGS: ret = nvkm_gsp_mem_ctor(gsp, 0x1000, &gsp->rmargs); Unfortunately, I don't know how to get Nova to allocate a larger block of memory for an object than its size. The RMARGS struct is defined in bindings.rs: pub struct GSP_ARGUMENTS_CACHED { pub messageQueueInitArguments: MESSAGE_QUEUE_INIT_ARGUMENTS, pub srInitArguments: GSP_SR_INIT_ARGUMENTS, pub gpuInstance: u32_, pub bDmemStack: u8_, pub __bindgen_padding_0: [u8; 7usize], pub profilerArgs: GSP_ARGUMENTS_CACHED__bindgen_ty_1, } which I believe is 72 bytes. It is allocated here: let rmargs = CoherentAllocation::<GspArgumentsCached>::alloc_coherent( dev, 1, GFP_KERNEL | __GFP_ZERO, )?; The "1" is a count field. Unfortunately, 4096/72 is not an even number, so I can't just change `count` to another number. If you can tell me how to fix the CoherentAllocation allocation call so that it allocates 4K, that would be a better fix for this problem. It's only necessary for Turing/GA100, however.
Timur Tabi
2025-Dec-04 21:45 UTC
[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned
On Thu, 2025-12-04 at 21:18 +0000, Timur Tabi wrote:> If you can tell me how to fix the CoherentAllocation allocation call so that it allocates 4K, that > would be a better fix for this problem.? It's only necessary for Turing/GA100, however.Actually, I think I figured it out. /// Arguments for GSP startup. #[repr(C)] pub(crate) struct GspArgumentsCached( bindings::GSP_ARGUMENTS_CACHED, [u8; GSP_PAGE_SIZE - core::mem::size_of::<bindings::GSP_ARGUMENTS_CACHED>()], // Padding to make it 4K ); impl GspArgumentsCached { /// Creates the arguments for starting the GSP up using `cmdq` as its command queue. pub(crate) fn new(cmdq: &Cmdq) -> Self { Self( bindings::GSP_ARGUMENTS_CACHED { messageQueueInitArguments: MessageQueueInitArguments::new(cmdq).0, bDmemStack: 1, ..Default::default() }, [0u8; GSP_PAGE_SIZE - core::mem::size_of::<bindings::GSP_ARGUMENTS_CACHED>()], ) } } If you think this is okay, I'll put it in v3. I had to remove the #[repr(transparent)]. Is that okay? The code compiles and seems to work.