Jason Gunthorpe
2025-Feb-25 22:57 UTC
[RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation
On Tue, Feb 25, 2025 at 04:02:28PM -0500, Joel Fernandes wrote:
> > Besides that I don't really see why we can't just re-acquire it after we sleep?
> > Rust provides good options to implement it ergonomically I think.
> >
> > > > Another colleague told me RDMA also uses SRCU for a similar purpose as well.
> > >
> > > See the reasoning against SRCU from Sima [1], what's the reasoning of RDMA?
> > >
> > > [1] https://lore.kernel.org/nouveau/Z7XVfnnrRKrtQbB6 at phenom.ffwll.local/
>
> For RDMA, I will ask Jason Gunthorpe to chime in, I CC'd him. Jason, correct
> me if I'm wrong about the RDMA user but this is what I recollect discussing
> with you.

In RDMA, SRCU is not unbounded. It is limited to a system call duration,
and we don't have system calls that become time unbounded inside
drivers.

The motivation for RDMA was not really hotplug, but to support kernel
module upgrade. Some data center HA users were very interested in this.
To achieve it the driver module itself cannot have an elevated module
refcount. This requires swapping the module refcount for a sleepable RW
lock, like SRCU or an rwsem, protecting all driver callbacks. [1]

To be very clear, in RDMA you can open /dev/infiniband/uverbsX, run an
ioctl on the FD, and then successfully rmmod the driver module while the
FD is open and while the ioctl is running. Any in-flight driver op will
complete, future ioctls will fail, and the module unload will complete.

So, from my perspective, this revocable idea would completely destroy
the actual production purpose we built the fast hot-plug machinery for.
It does not guarantee that driver threads are fenced prior to completing
remove. Instead it must rely on the FD itself to hold the module
refcount on the driver to keep the .text alive while driver callbacks
continue to be called. Making the module refcount linked to userspace
closing a FD renders the module unloadable in practice.
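As a rough userspace analogy of the scheme Jason describes (not the RDMA API; the type and op names below are invented for illustration), swapping the module refcount for a sleepable RW lock can be sketched in plain Rust, with a std RwLock standing in for SRCU/rwsem:

```rust
use std::sync::RwLock;

// Hypothetical driver op table; names are illustrative only.
struct DriverOps {
    query: fn(u32) -> u32,
}

struct Registration {
    // In the kernel this role is played by SRCU or an rwsem; a std
    // RwLock models the same read-mostly, sleepable exclusion here.
    ops: RwLock<Option<DriverOps>>,
}

impl Registration {
    fn new(ops: DriverOps) -> Self {
        Registration { ops: RwLock::new(Some(ops)) }
    }

    // Ioctl path: enter the read side; fails once the driver is fenced.
    fn ioctl(&self, arg: u32) -> Result<u32, &'static str> {
        let guard = self.ops.read().unwrap();
        match &*guard {
            // The driver callback runs entirely under the read lock.
            Some(ops) => Ok((ops.query)(arg)),
            // The FD stays open, but future ioctls fail cleanly.
            None => Err("driver disassociated"),
        }
    }

    // rmmod path: taking the write lock waits out all in-flight ops;
    // afterwards no driver code can be entered again, so the module
    // .text can be safely unloaded.
    fn fence(&self) {
        *self.ops.write().unwrap() = None;
    }
}
```

The key property is that fence() both waits for running callbacks and blocks new ones, so unload never races with driver code, without any per-FD module refcount.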
The common driver shutdown process in the kernel, that is well tested
and copied, makes the driver single threaded during the remove()
callback. Effectively, instead of trying to micro-revoke individual
resources, we revoke all concurrent threads, and then it is completely
safe to destroy all the resources. This also guarantees that after
completing remove() there is no Execute After Free risk to the driver
code.

SRCU/rwsem across all driver ops function pointer calls is part of this
scheme, but so are things like canceling/flushing work queues, blocking
new work submission, preventing interrupts, removing sysfs files (they
block concurrent threads internally), synchronizing any possibly
outstanding RCU callbacks, and more.

So, I'd suggest that if you have system calls that wait, the typical
expected solution would be to shoot down the waits during a remove event
so they become time bounded.

> > > I have heard some concern around whether Rust is changing the driver model when
> > > it comes to driver detach / driver remove. Can you elaborate maybe a bit about
> > > how Rust changes that mechanism versus C, when it comes to that?
> >
> > I think that one is simple, Rust does *not* change the driver model.

I think this resource-revoke idea is deviating from the normal expected
driver model by allowing driver code to continue to run in other threads
once remove() completes. That is definitely abnormal at least.

It is not necessarily *wrong*, but it sure is weird, and as I explained
above it has bad system level properties.

Further, it seems to me there is a very DRM-specific issue at work here:
"time unbounded driver callbacks". A weird solution to this should not
be baked into the common core kernel Rust bindings, breaking the working
model of all other subsystems that don't have that problem.

> > Similarly you can have custom functions for short sequences of I/O ops, or use
> > closures. I don't understand the concern.
>
> Yeah, this is certainly possible.
> I think one concern is similar to what you raised on the other thread
> you shared [1]:
>
> "Maybe we even want to replace it with SRCU entirely to ensure that drivers
> can't stall the RCU grace period for too long by accident."

I'd be worried about introducing a whole bunch more untestable failure
paths in drivers. Especially in areas like work queue submit that are
designed not to fail [2]. Non-failing work queues are a critical
property that I've relied on countless times. I'm not sure you even
*can* recover from this correctly in all cases.

Then in the other email did you say that even some memory allocations go
into this scheme? Yikes!

Further, hiding a synchronize_rcu() in a devm destructor [3], once per
revocable object, is awful. If you imagine having an RCU around each of
your revocable objects, how many synchronize_rcu()s is devm going to
call post-remove()? On a busy server synchronize_rcu() is known to take
a long time. So it is easy to imagine driver remove times going into
many tens of seconds for no good reason. Maybe even multiple minutes if
the driver ends up with many of these objects.

[1] - Module .text is not unplugged from the kernel until all probed
drivers affiliated with that module have completed their remove
operations.

[2] - It is important that drivers shut down all their concurrency in
work queues during remove because work queues do not hold the module
refcount. The only way the .text lifecycle works for drivers using work
queues is to rely on [1] to protect against Execute After Free.

[3] - Personally I agree with Laurent's points and I strongly dislike
devm. I'm really surprised to see Rust using it; I imagined Rust has
sufficiently strong object lifecycle management that it would not be
needed :(

Jason
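The cost concern about per-object grace periods can be made concrete with a toy model: `synchronize()` below is a stand-in for synchronize_rcu() (a sleep models its latency; the function names are invented for illustration). Paying one grace-period wait per revocable object scales linearly, while revoking everything first and issuing a single wait does not:

```rust
use std::time::{Duration, Instant};

// Stand-in for synchronize_rcu(): one grace period of fixed cost.
fn synchronize(grace: Duration) {
    std::thread::sleep(grace);
}

// One wait per object, as a per-object devm destructor would do:
// teardown cost is n_objects * grace_period.
fn teardown_per_object(n_objects: u32, grace: Duration) -> Duration {
    let start = Instant::now();
    for _ in 0..n_objects {
        synchronize(grace);
    }
    start.elapsed()
}

// Revoke every object first (cheap pointer clears), then issue a
// single wait that covers all readers at once.
fn teardown_batched(n_objects: u32, grace: Duration) -> Duration {
    let start = Instant::now();
    for _ in 0..n_objects {
        // clear each object's pointer: no grace period needed yet
    }
    synchronize(grace);
    start.elapsed()
}
```

With a 20 ms "grace period" and 10 objects, the per-object variant takes at least 200 ms while the batched one takes roughly one grace period, which is the optimization Danilo alludes to later in the thread.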
Danilo Krummrich
2025-Feb-25 23:26 UTC
[RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation
On Tue, Feb 25, 2025 at 06:57:56PM -0400, Jason Gunthorpe wrote:
> I think this resource-revoke idea is deviating from the normal
> expected driver model by allowing driver code to continue to run in
> other threads once remove completes. That is definitely abnormal at
> least.

No, it simply guarantees that once remove() completed, the pointer to
the resource can't be accessed anymore and the resource can't be kept
alive (which includes the actual memory mapping as well as the allocated
resource region).

It also solves the unplug problem, where ioctls can't access the
resource anymore after remove(). This is indeed a problem that does not
affect all subsystems.

> It is not necessarily *wrong*, but it sure is weird and as I explained
> above it has bad system level properties.
>
> Further, it seems to me there is a very unique DRM specific issue at
> work "time unbounded driver callbacks". A weird solution to this
> should not be baked into the common core kernel rust bindings and
> break the working model of all other subsystems that don't have that
> problem.
>
> > > Similarly you can have custom functions for short sequences of I/O ops, or use
> > > closures. I don't understand the concern.
> >
> > Yeah, this is certainly possible. I think one concern is similar to what you
> > raised on the other thread you shared [1]:
> >
> > "Maybe we even want to replace it with SRCU entirely to ensure that drivers
> > can't stall the RCU grace period for too long by accident."
>
> I'd be worried about introducing a whole bunch more untestable failure
> paths in drivers. Especially in areas like work queue submit that are
> designed not to fail [2]. Non-failing work queues is a critical property
> that I've relied on countless times. I'm not sure you even *can* recover
> from this correctly in all cases.
>
> Then in the other email did you say that even some memory allocations
> go into this scheme? Yikes!

"For instance, let's take devm_kzalloc().
Once the device is detached from the driver the memory allocated with
this function is freed automatically. The additional step in Rust is
that we'd not only free the memory, but also revoke the access to the
pointer that has been returned by devm_kzalloc() for the driver, such
that it can't be used by accident anymore."

This was just an analogy to explain what we're doing here. Obviously,
memory allocations can be managed by Rust's object lifetime management.

The reason we have Devres for device resources is that the lifetime of a
pci::Bar is *not* bound to the object lifetime directly, but to the
lifetime of the binding between a device and a driver.

That's why it needs to be revoked (which forcefully drops the object)
when the device is unbound, *not* when the pci::Bar object is dropped
regularly. That's all the magic we're doing here.

And again, this is not a change to the device / driver model. It is
making use of the device / driver model to ensure safety.

> Further, hiding a synchronize_rcu in a devm destructor [3], once per
> revocable object is awful. If you imagine having a rcu around each of
> your revocable objects, how many synchronize_rcu()s is devm going to
> call post-remove()?

As many as you have MMIO mappings in your driver. But we can probably
optimize this to just a single synchronize_rcu().
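The revoke-on-unbind behavior described above can be sketched in userspace Rust. This is only an analogy: the in-kernel Revocable type uses RCU on the read side, while this sketch uses a std RwLock to stay self-contained, and the closure-based accessor mirrors the "short sequences of I/O ops, or use closures" point from earlier in the thread:

```rust
use std::sync::RwLock;

// Minimal sketch of the Revocable idea: the wrapped object is dropped
// when the device is unbound, not when the last handle goes away.
struct Revocable<T> {
    inner: RwLock<Option<T>>,
}

impl<T> Revocable<T> {
    fn new(value: T) -> Self {
        Revocable { inner: RwLock::new(Some(value)) }
    }

    // Run a short sequence of ops against the resource; fails once the
    // resource has been revoked, so no path can reach a dangling
    // mapping after the device/driver binding is gone.
    fn try_access<R>(&self, f: impl FnOnce(&T) -> R) -> Option<R> {
        self.inner.read().unwrap().as_ref().map(f)
    }

    // Called when the device is unbound: forcefully drop the value.
    // For a pci::Bar this is where the mapping would be torn down and
    // the resource region released.
    fn revoke(&self) {
        *self.inner.write().unwrap() = None;
    }
}
```

After revoke(), every subsequent try_access() returns None, which is the compile-time-enforced analogue of "in C you keep a void pointer to an unmapped address".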
Danilo Krummrich
2025-Feb-25 23:45 UTC
[RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation
On Tue, Feb 25, 2025 at 06:57:56PM -0400, Jason Gunthorpe wrote:
> The common driver shutdown process in the kernel, that is well tested
> and copied, makes the driver single threaded during the remove()
> callback.

All devres callbacks run in the same callchain: __device_release_driver()
first calls remove() and then all the devres callbacks, where we revoke
the pci::Bar, by which it gets dropped and hence the BAR is unmapped and
the resource regions are freed.

It's not different from C drivers, except that in C you don't lose
access to the void pointer that still points to the (unmapped) MMIO
address.
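The ordering Danilo describes can be sketched as a toy model (the type and method names below are invented; they only mirror the shape of the unbind callchain, not the real driver-core code): remove() runs first, then the registered devres callbacks fire last-in-first-out, all in one thread.

```rust
// Toy model of the unbind callchain: one device holding a stack of
// devres release actions, each of which logs what it tears down.
struct Device {
    devres: Vec<Box<dyn FnOnce(&mut Vec<String>)>>,
}

impl Device {
    fn new() -> Self {
        Device { devres: Vec::new() }
    }

    // Analogue of registering a devm-managed resource.
    fn devm_add(&mut self, action: impl FnOnce(&mut Vec<String>) + 'static) {
        self.devres.push(Box::new(action));
    }

    // Analogue of __device_release_driver(): call remove(), then run
    // the devres callbacks in reverse registration order (LIFO).
    fn release_driver(
        mut self,
        remove: impl FnOnce(&mut Vec<String>),
    ) -> Vec<String> {
        let mut log = Vec::new();
        remove(&mut log);
        while let Some(action) = self.devres.pop() {
            action(&mut log);
        }
        log
    }
}
```

Running it with a "revoke bar" action registered after an "unmap bar" action shows remove() completing before either devres step, which is the single-threaded ordering the message relies on.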