Jason Gunthorpe
2025-Feb-27 19:23 UTC
[RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation
On Thu, Feb 27, 2025 at 06:32:15PM +0100, Danilo Krummrich wrote:> On Thu, Feb 27, 2025 at 08:55:09AM -0800, Boqun Feng wrote: > > On Thu, Feb 27, 2025 at 12:17:33PM -0400, Jason Gunthorpe wrote: > > > > > I still wonder why you couldn't also have these reliable reference > > > counts rooted on the device driver instead of only on the module. > > > > > > > You could put reliable reference counts anywhere you want, as long as it > > reflects the resource dependencies. > > Right, as I explained in a different reply, the signature for PCI driver probe() > looks like this: > > fn probe(pdev: &mut pci::Device, _info: &Self::IdInfo) -> Result<Pin<KBox<Self>>> > > The returned Pin<KBox<Self>> has the lifetime of the driver being bound to the > device. > > Which means a driver can bind things to this lifetime. But, it isn't forced to, > it can also put things into an Arc and share it with the rest of the world.This statement right here seems to be the fundamental problem. The design pattern says that 'share it with the rest of the world' is a bug. A driver following the pattern cannot do that, it must contain the driver objects within the driver scope and free them. In C we inspect for this manually, and check for it with kmemleak progamatically. It appears to me that the main issue here is that nobody has figured out how to make rust have rules that can enforce that design pattern. Have the compiler prevent the driver author from incorrectly extending the lifetime of a driver-object beyond the driver's inherent scope, ie that Self object above. Instead we get this:> If something is crucial to be bound to the lifetime of a driver being bound to a > device (i.e. device resources), you have to expose it as Devres<T>.Which creates a costly way to work around this missing design pattern by adding runtime checks to every single access of T in all the operational threads. Failable rcu_lock across every batch of register access. The reason the kernel has these design patterns of shutdown then destroy is to avoid that runtime overhead! We optimize by swapping fine grained locks for coarse locks that probably already exist. It is a valid pattern, works well and has alot of APIs designed to support it. This giant thread started because people were objecting to the cost and usability of the runtime checks on the operational paths. So, I think you can say it can't be done, that the alternative is only a little worse. Sad, but OK, but let's please acknowledge the limitation. Jason
Boqun Feng
2025-Feb-27 21:26 UTC
[RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation
On Thu, Feb 27, 2025 at 03:23:21PM -0400, Jason Gunthorpe wrote:> On Thu, Feb 27, 2025 at 06:32:15PM +0100, Danilo Krummrich wrote: > > On Thu, Feb 27, 2025 at 08:55:09AM -0800, Boqun Feng wrote: > > > On Thu, Feb 27, 2025 at 12:17:33PM -0400, Jason Gunthorpe wrote: > > > > > > > I still wonder why you couldn't also have these reliable reference > > > > counts rooted on the device driver instead of only on the module. > > > > > > > > > > You could put reliable reference counts anywhere you want, as long as it > > > reflects the resource dependencies. > > > > Right, as I explained in a different reply, the signature for PCI driver probe() > > looks like this: > > > > fn probe(pdev: &mut pci::Device, _info: &Self::IdInfo) -> Result<Pin<KBox<Self>>> > > > > The returned Pin<KBox<Self>> has the lifetime of the driver being bound to the > > device. > > > > Which means a driver can bind things to this lifetime. But, it isn't forced to, > > it can also put things into an Arc and share it with the rest of the world. > > This statement right here seems to be the fundamental problem. > > The design pattern says that 'share it with the rest of the world' is > a bug. A driver following the pattern cannot do that, it must contain > the driver objects within the driver scope and free them. In C weI cannot speak for Danilo, but IIUC, the 'share it with the rest of the world' things are the ones that drivers can share, for example, I suppose (not a network expert) a NIC driver can share the packet object with the upper layer of netowrk.> inspect for this manually, and check for it with kmemleakIn Rust, it's better (of course, depending on your PoV ;-)), because your driver or module data structures need to track the things they use (otherwise they will be cancelled and maybe freed, e.g. the hrtimer case). So you have that part covered by compiler. But could there be corner cases? Probably. We will just resolve that case by case.> progamatically. > > It appears to me that the main issue here is that nobody has figured > out how to make rust have rules that can enforce that design pattern. >Most of the cases, it should be naturally achieved, because you already bind the objects into your module or driver, otherwise they would be already cancelled and freed. Handwavingly, it provides a "data/type-oriented" resource management instead of "oh I have to remember to call this function before module unload". Again, I believe there are and will be corner cases, but happy to look into them.> Have the compiler prevent the driver author from incorrectly extending > the lifetime of a driver-object beyond the driver's inherent scope, ie > that Self object above. >Compilers can help in the cases where they know which objects are belong to a driver/module. So I think in Rust you can have the "design pattern", the difference is instead of putting cancel/free functions carefully in some remove() function, you will need to (still!) carefully arrange the fields in your driver/module data structure, and you can have more fine grained control by writting the drop() function for the driver/module data structure.> Instead we get this: > > > If something is crucial to be bound to the lifetime of a driver being bound to a > > device (i.e. device resources), you have to expose it as Devres<T>. >I feel I'm still missing some contexts why Devres<T> is related to the "design pattern", so I will just skip this part for now... Hope we are on the same page of the "design pattern" in Rust? Regards, Boqun> Which creates a costly way to work around this missing design pattern > by adding runtime checks to every single access of T in all the > operational threads. Failable rcu_lock across every batch of register > access. >[...]
Jason Gunthorpe
2025-Feb-27 22:00 UTC
[RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation
On Thu, Feb 27, 2025 at 01:25:10PM -0800, Boqun Feng wrote:> > The design pattern says that 'share it with the rest of the world' is > > a bug. A driver following the pattern cannot do that, it must contain > > the driver objects within the driver scope and free them. In C we > > I cannot speak for Danilo, but IIUC, the 'share it with the rest of the > world' things are the ones that drivers can share, for example, I > suppose (not a network expert) a NIC driver can share the packet object > with the upper layer of netowrk.I'm having a bit of trouble parsing this sentence.. In your example a NIC driver passing a packet into the network stack would still be within the world of the driver. Outside the world would be assigning the packet to a global variable. It is a very similar idea to what you explained about the module lifetime, just instead of module __exit being the terminal point, it is driver remove().> > It appears to me that the main issue here is that nobody has figured > > out how to make rust have rules that can enforce that design pattern. > > Most of the cases, it should be naturally achieved, because you already > bind the objects into your module or driver, otherwise they would be > already cancelled and freed.I'm getting the feeling you can probably naturally achieve the required destructors, but I think Danillo is concerned that since it isn't *mandatory* it isn't safe/sound.> So I think in Rust you can have the "design pattern", the difference is > instead of putting cancel/free functions carefully in some remove() > function, you will need to (still!) carefully arrange the fields in your > driver/module data structure, and you can have more fine grained control > by writting the drop() function for the driver/module data structure.That all makes sense, but again, the challenge seems to be making that mandatory so it is all safe. If you can show people a sketch how you think that could work it would probably help.> I feel I'm still missing some contexts why Devres<T> is related to the > "design pattern", so I will just skip this part for now... Hope we are > on the same page of the "design pattern" in Rust?There is a requirement that a certain C destructor function be *guaranteed* to be called during remove(). Ie if I told you that the C functions *required* hrtimer_cancel() be called in the device_driver remove for correctness, how could you accomplish this? And avoid an UAF of the hrtimer from other threads? And do it without adding new locks. And prevent the driver author from escaping these requirements. Then you have the design pattern I'm talking about. Is it clearer? Jason