Timur Tabi
2025-Dec-03 18:31 UTC
[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned
On Wed, 2025-12-03 at 20:54 +0900, Alexandre Courbot wrote:> On Tue Dec 2, 2025 at 8:25 AM JST, Timur Tabi wrote: > > On Wed, 2025-11-19 at 12:36 +0900, Alexandre Courbot wrote: > > > You can use the `Alignment` type here, as the rest of the code does: > > > > > > ??? let size = num::usize_as_u64(obj.size()) > > > ??????? .align_up(Alignment::new::<GSP_PAGE_SIZE>())?; > > > > > > Now `align_up` returns an error in case of overflow, that we will need > > > to pass down to the caller by changing the return type of `new`. It is a > > > bit annoying, but better than the behavior of `next_mutiple_of` in such > > > a case, which is to panic. :) > > > > I see your point, but these are u64s that we're talking about.? The only way next_mutiple_of() > > can > > panic is if obj.size() is greater than 0xFFFFFFFFFFFFF000, which is not possible.? > > > > I would say in this case, a panic is preferable to a convoluted error return that will never be > > exercised, because failure here indicates a coding error, not an input error. > > The input data is a usize, so technically we could get an input that > triggers that error.Actually, I still say it's not possible. Say I change the code to this, so that .next_multiple_of is called on a u64 instead of a usize: let size = num::usize_as_u64(obj.size()).next_multiple_of(GSP_PAGE_SIZE); Again, the only way this can fail is if the allocated object being passed in is almost 16 exabytes in size, which is physically impossible.> I know it's a very edge case, and clearly indicates a bug, but the > general rule is: don't panic the kernel. And in Rust, if possible, don't > even let me compiler insert panic-handling code. If you don't want to > change the return type of the method, then maybe use `unwrap_or` and > `inspect_err` to print an error before returning e.g. `0`.How about this: if .next_multiple_of(GSP_PAGE_SIZE) does return an error, I'll just assign size to obj.size() as-is? After all, at about 16GB/second for DMA, it will take about 31 years to DMA all that memory, so I will have long since retired before that bug shows up.
Alexandre Courbot
2025-Dec-04 14:43 UTC
[PATCH 10/11] gpu: nova-core: LibosMemoryRegionInitArgument size must be page aligned
On Thu Dec 4, 2025 at 3:31 AM JST, Timur Tabi wrote:> On Wed, 2025-12-03 at 20:54 +0900, Alexandre Courbot wrote: >> On Tue Dec 2, 2025 at 8:25 AM JST, Timur Tabi wrote: >> > On Wed, 2025-11-19 at 12:36 +0900, Alexandre Courbot wrote: >> > > You can use the `Alignment` type here, as the rest of the code does: >> > > >> > > ??? let size = num::usize_as_u64(obj.size()) >> > > ??????? .align_up(Alignment::new::<GSP_PAGE_SIZE>())?; >> > > >> > > Now `align_up` returns an error in case of overflow, that we will need >> > > to pass down to the caller by changing the return type of `new`. It is a >> > > bit annoying, but better than the behavior of `next_mutiple_of` in such >> > > a case, which is to panic. :) >> > >> > I see your point, but these are u64s that we're talking about.? The only way next_mutiple_of() >> > can >> > panic is if obj.size() is greater than 0xFFFFFFFFFFFFF000, which is not possible.? >> > >> > I would say in this case, a panic is preferable to a convoluted error return that will never be >> > exercised, because failure here indicates a coding error, not an input error. >> >> The input data is a usize, so technically we could get an input that >> triggers that error. > > Actually, I still say it's not possible. > > Say I change the code to this, so that .next_multiple_of is called on a u64 instead of a usize: > > let size = num::usize_as_u64(obj.size()).next_multiple_of(GSP_PAGE_SIZE); > > Again, the only way this can fail is if the allocated object being passed in is almost 16 exabytes > in size, which is physically impossible. > >> I know it's a very edge case, and clearly indicates a bug, but the >> general rule is: don't panic the kernel. And in Rust, if possible, don't >> even let me compiler insert panic-handling code. If you don't want to >> change the return type of the method, then maybe use `unwrap_or` and >> `inspect_err` to print an error before returning e.g. `0`. > > How about this: if .next_multiple_of(GSP_PAGE_SIZE) does return an error, I'll just assign size to > obj.size() as-is? After all, at about 16GB/second for DMA, it will take about 31 years to DMA all > that memory, so I will have long since retired before that bug shows up.Please allow me to commend you for doing the computation, that really cracked me up. :D Maybe we need a `ECAREEROVERFLOW` error code for cases like this. And yeah, you are abolutely right, but my point was more about not having the code to handle panic conditions generated. Maybe I am thinking too much ahead, but I dream of a future where we could make guarantees like "this function never panics" and have the compiler complain if it does. So as a matter of principle I like to avoid having these, especially when they cannot happen in practice. So something like using `pr_warn` looks reasonable to me as a last resort. ... or maybe we can address the problem differently. Reading your commit log again: GSP-RM insists that the 'size' parameter of the LibosMemoryRegionInitArgument struct be aligned to 4KB. sounds to me like "it is a bug if `size` is not aligned to 4KB to begin with". Could that be a correct interpretation? Because if we align up past the valid data of the object, then what are we copying? Granted, `CoherentAllocation` will likely have an aligned size, but that's a lucky implementation detail. So maybe we can just downright return an error if the size is not aligned, which would solve the panic problem. Or we fix the problem when allocating the `CoherentAllocation`, making sure the filler data exists and is zeroes, and providing a valid `size` from the beginning.