Boris Brezillon
2025-Aug-21 11:51 UTC
[PATCH v4 1/7] drm/panthor: Add support for atomic page table updates
On Wed, 16 Jul 2025 16:43:24 +0100
Steven Price <steven.price at arm.com> wrote:

> >>> I think we need to briefly take vm->op_lock to ensure synchronisation
> >>> but that doesn't seem a big issue. Or perhaps there's a good reason that
> >>> I'm missing?
> >>
> >> I think you're right, all other accesses to locked_region are guarded by
> >> op_lock. GPU job submit pokes vm_active concurrently with vm_bind jobs
> >> doing region {,un}locks.
> >
> > Actually no, that's not necessary. Access to locked_region is protected
> > by slots_lock, which is held here. Trying to lock vm->op_lock would also
> > be detrimental here, because these locks are often taken together and
> > slots_lock is taken after op_lock, so taking op_lock here would be
> > extremely deadlock-prone.
>
> It would obviously be necessary to acquire vm->op_lock before
> as.slots_lock as you say to avoid deadlocks. Note that as soon as
> as.slots_lock is held vm->op_lock can be dropped.

Yeah, lock ordering is not an issue, because we take slots_lock in this
function, so we're in full control of the ordering. And I wouldn't even
consider releasing op_lock as soon as we acquire slots_lock, because:

- that makes things harder to reason about
- the locked section is not blocking on any sort of external event
- the locked section is pretty straightforward (so no excessive delays
  expected here)

> I just find the current approach a little odd, and unless there's a good
> reason for it would prefer that we don't enable a VM on a new address
> space while there's an outstanding vm_bind still running. Obviously if
> there's a good reason (e.g. we really do expect long running vm_bind
> operations) then that just needs documenting in the commit message. But
> I'm not aware that's the case here.

I fully agree here. If there's no obvious reason not to serialize
vm_active() on VM bind ops, I'd opt for taking the VM op_lock and
calling it a day.
And I honestly can't think of any:

- the VM op logic is all synchronous/non-blocking
- it's expected to be fast
- AS rotation is something I hope is not happening too often, otherwise
  we'll have other things to worry about (the whole CSG slot scheduling
  logic is quite involved, and I'd expect the BIND-while-making-AS-active
  case to be rare enough that it becomes noise in the overall overhead of
  kernel-side GPU scheduling happening in Panthor)

> Although in general I'm a bit wary of relying on the whole lock region
> feature - previous GPUs had errata. But maybe I'm being over cautious
> there.

We're heavily relying on it already to allow updates of the VM while the
GPU is executing stuff. If that's problematic on v10+, I'd rather know
early :D.

Regards,

Boris
Steven Price
2025-Aug-21 15:02 UTC
[PATCH v4 1/7] drm/panthor: Add support for atomic page table updates
On 21/08/2025 12:51, Boris Brezillon wrote:
> On Wed, 16 Jul 2025 16:43:24 +0100
> Steven Price <steven.price at arm.com> wrote:

[...]

>> Although in general I'm a bit wary of relying on the whole lock region
>> feature - previous GPUs had errata. But maybe I'm being over cautious
>> there.
>
> We're heavily relying on it already to allow updates of the VM while
> the GPU is executing stuff. If that's problematic on v10+, I'd rather
> know early :D.

I think I'm just scarred by my experiences over a decade ago... ;)

I'm not aware of any issues with the modern[1] GPUs. The issue used to
be that the lock region could get accidentally unlocked by a cache flush
from another source - specifically the "cache flush on job start" flag.
It's also not a major issue if you keep the page tables consistent: the
lock region in theory allows a region to be in an inconsistent state,
but generally there's no need for that. AFAIK we mostly keep the tables
consistent anyway.

Thanks,
Steve

[1] Which in this context means well over a decade - it's somewhat
scary how long I've been doing this!

> Regards,
>
> Boris