Danilo Krummrich
2023-Oct-03 16:55 UTC
[Nouveau] [PATCH drm-misc-next v5 4/6] drm/gpuvm: track/lock/validate external/evicted objects
It seems like we're mostly aligned on this series, except for the key controversy we're discussing for a few versions now: locking of the internal lists. Hence, let's just re-iterate the options we have to get this out of the way. (1) The spinlock dance. This basically works for every use case, updating the VA space from the IOCTL, from the fence signaling path or anywhere else. However, it has the downside of requiring spin_lock() / spin_unlock() for *each* list element when locking all external objects and validating all evicted objects. Typically, the amount of extobjs and evicted objects shouldn't be excessive, but there might be exceptions, e.g. Xe. (2) The dma-resv lock dance. This is convinient for drivers updating the VA space from a VM_BIND ioctl() and is especially efficient if such drivers have a huge amount of external and/or evicted objects to manage. However, the downsides are that it requires a few tricks in drivers updating the VA space from the fence signaling path (e.g. job_run()). Design wise, I'm still skeptical that it is a good idea to protect internal data structures with external locks in a way that it's not clear to callers that a certain function would access one of those resources and hence needs protection. E.g. it is counter intuitive that drm_gpuvm_bo_put() would require both the dma-resv lock of the corresponding object and the VM's dma-resv lock held. (Additionally, there were some concerns from amdgpu regarding flexibility in terms of using GPUVM for non-VM_BIND uAPIs and compute, however, AFAICS those discussions did not complete and to me it's still unclear why it wouldn't work.) (3) Simply use an internal mutex per list. This adds a tiny (IMHO negligible) overhead for drivers updating the VA space from a VM_BIND ioctl(), namely a *single* mutex_lock()/mutex_unlock() when locking all external objects and validating all evicted objects. And it still requires some tricks for drivers updating the VA space from the fence signaling path. However, it's as simple as it can be and hence way less error prone as well as self-contained and hence easy to use. Additionally, it's flexible in a way that we don't have any expections on drivers to already hold certain locks that the driver in some situation might not be able to acquire in the first place. (4) Arbitrary combinations of the above. For instance, the current V5 implements both (1) and (2) (as either one or the other). But also (1) and (3) (as in (1) additionally to (3)) would be an option, where a driver could opt-in for the spinlock dance in case it updates the VA space from the fence signaling path. I also considered a few other options as well, however, they don't seem to be flexible enough. For instance, as by now we could use SRCU for the external object list. However, this falls apart once a driver wants to remove and re-add extobjs for the same VM_BO instance. (For the same reason it wouldn't work for evicted objects.) Personally, after seeing the weird implications of (1), (2) and a combination of both, I tend to go with (3). Optionally, with an opt-in for (1). The reason for the latter is that with (3) the weirdness of (1) by its own mostly disappears. Please let me know what you think, and, of course, other ideas than the mentioned ones above are still welcome. - Danilo On Tue, Oct 03, 2023 at 04:21:43PM +0200, Boris Brezillon wrote:> On Tue, 03 Oct 2023 14:25:56 +0200 > Thomas Hellstr?m <thomas.hellstrom at linux.intel.com> wrote: > > > > > > +/** > > > > > + * get_next_vm_bo_from_list() - get the next vm_bo element > > > > > + * @__gpuvm: The GPU VM > > > > > + * @__list_name: The name of the list we're iterating on > > > > > + * @__local_list: A pointer to the local list used to store > > > > > already iterated items > > > > > + * @__prev_vm_bo: The previous element we got from > > > > > drm_gpuvm_get_next_cached_vm_bo() > > > > > + * > > > > > + * This helper is here to provide lockless list iteration. > > > > > Lockless as in, the > > > > > + * iterator releases the lock immediately after picking the > > > > > first element from > > > > > + * the list, so list insertion deletion can happen concurrently. > > > > > + * > > > > > + * Elements popped from the original list are kept in a local > > > > > list, so removal > > > > > + * and is_empty checks can still happen while we're iterating > > > > > the list. > > > > > + */ > > > > > +#define get_next_vm_bo_from_list(__gpuvm, __list_name, > > > > > __local_list, __prev_vm_bo)?????\ > > > > > +???????({??????????????????????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????struct drm_gpuvm_bo *__vm_bo > > > > > NULL;????????????????????????????????????\ > > > > > +???????????????????????????????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????drm_gpuvm_bo_put(__prev_vm_bo);?????????????????? > > > > > ???????????????????????\ > > > > > +???????????????????????????????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????spin_lock(&(__gpuvm)- > > > > > >__list_name.lock);????????????????????????????????\? > > > > > > > > Here we unconditionally take the spinlocks while iterating, and the > > > > main > > > > point of DRM_GPUVM_RESV_PROTECTED was really to avoid that? > > > > > > > > > > > > > +???????????????if (!(__gpuvm)- > > > > > >__list_name.local_list)?????????????????????????????????\ > > > > > +???????????????????????(__gpuvm)->__list_name.local_list > > > > > __local_list;???????????????\ > > > > > +???????????????else????????????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????????????WARN_ON((__gpuvm)->__list_name.local_list > > > > > != __local_list);?????\ > > > > > +???????????????????????????????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????while (!list_empty(&(__gpuvm)->__list_name.list)) > > > > > {?????????????????????\ > > > > > +???????????????????????__vm_bo = list_first_entry(&(__gpuvm)- > > > > > >__list_name.list,????????\ > > > > > +????????????????????????????????????????????????? struct > > > > > drm_gpuvm_bo,?????????????????\ > > > > > +????????????????????????????????????????????????? > > > > > list.entry.__list_name);?????????????\ > > > > > +???????????????????????if (kref_get_unless_zero(&__vm_bo->kref)) > > > > > {? > > > > And unnecessarily grab a reference in the RESV_PROTECTED case. > > > > > ????????????????????????\ > > > > > +???????????????????????????????list_move_tail(&(__vm_bo)- > > > > > >list.entry.__list_name,??????\ > > > > > +????????????????????????????????????????????? > > > > > __local_list);???????????????????????????\ > > > > > +???????????????????????????????break;??????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????????????} else > > > > > {????????????????????????????????????????????????????????\ > > > > > +???????????????????????????????list_del_init(&(__vm_bo)- > > > > > >list.entry.__list_name);??????\ > > > > > +???????????????????????????????__vm_bo > > > > > NULL;?????????????????????????????????????????\ > > > > > +???????????????????????}???????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????}???????????????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????spin_unlock(&(__gpuvm)- > > > > > >__list_name.lock);??????????????????????????????\ > > > > > +???????????????????????????????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????????????__vm_bo;????????????????????????????????????????? > > > > > ???????????????????????\ > > > > > +???????})? > > > > > > > > IMHO this lockless list iteration looks very complex and should be > > > > pretty difficult to maintain while moving forward, also since it > > > > pulls > > > > the gpuvm_bos off the list, list iteration needs to be protected by > > > > an > > > > outer lock anyway. > > > > > > As being partly responsible for this convoluted list iterator, I must > > > say I agree with you. There's so many ways this can go wrong if the > > > user doesn't call it the right way, or doesn't protect concurrent > > > list > > > iterations with a separate lock (luckily, this is a private > > > iterator). I > > > mean, it works, so there's certainly a way to get it right, but gosh, > > > this is so far from the simple API I had hoped for. > > > > > > > Also from what I understand from Boris, the extobj > > > > list would typically not need the fine-grained locking; only the > > > > evict > > > > list? > > > > > > Right, I'm adding the gpuvm_bo to extobj list in the ioctl path, when > > > the GEM and VM resvs are held, and I'm deferring the > > > drm_gpuvm_bo_put() > > > call to a work that's not in the dma-signalling path. This being > > > said, > > > I'm still not comfortable with the > > > > > > gem = drm_gem_object_get(vm_bo->gem); > > > dma_resv_lock(gem->resv); > > > drm_gpuvm_bo_put(vm_bo); > > > dma_resv_unlock(gem->resv); > > > drm_gem_object_put(gem); > > > > > > dance that's needed to avoid a UAF when the gpuvm_bo is the last GEM > > > owner, not to mention that drm_gpuva_unlink() calls > > > drm_gpuvm_bo_put() > > > after making sure the GEM gpuvm_list lock is held, but this lock > > > might > > > differ from the resv lock (custom locking so we can call > > > gpuvm_unlink() in the dma-signalling path). So we now have paths > > > where > > > drm_gpuvm_bo_put() are called with the resv lock held, and others > > > where > > > they are not, and that only works because we're relying on the the > > > fact > > > those drm_gpuvm_bo_put() calls won't make the refcount drop to zero, > > > because the deferred vm_bo_put() work still owns a vm_bo ref. > > > > I'm not sure I follow to 100% here, but in the code snippet above it's > > pretty clear to me that it needs to hold an explicit gem object > > reference when calling dma_resv_unlock(gem->resv). Each time you copy a > > referenced pointer (here from vm_bo->gem to gem) you need to up the > > refcount unless you make sure (by locks or other means) that the source > > of the copy has a strong refcount and stays alive, so that's no weird > > action to me. Could possibly add a drm_gpuvm_bo_get_gem() to access the > > gem member (and that also takes a refcount) for driver users to avoid > > the potential pitfall. > > Except this is only needed because of the GEM-resv-must-be-held locking > constraint that was added on vm_bo_put(). I mean, the usual way we do > object un-referencing is by calling _put() and letting the internal > logic undo things when the refcount drops to zero. If the object needs > to be removed from some list, it's normally the responsibility of the > destruction method to lock the list, remove the object and unlock the > list. Now, we have a refcounted object that's referenced by vm_bo, and > whose lock needs to be taken when the destruction happens, which leads > to this weird dance described above, when, in normal situations, we'd > just call drm_gpuvm_bo_put(vm_bo) and let drm_gpuvm do its thing. > > > > > > > > > All these tiny details add to the overall complexity of this common > > > layer, and to me, that's not any better than the > > > get_next_vm_bo_from_list() complexity you were complaining about > > > (might > > > be even worth, because this sort of things leak to users). > > > > > > Having an internal lock partly solves that, in that the locking of > > > the > > > extobj list is now entirely orthogonal to the GEM that's being > > > removed > > > from this list, and we can lock/unlock internally without forcing the > > > caller to take weird actions to make sure things don't explode. Don't > > > get me wrong, I get that this locking overhead is not acceptable for > > > Xe, but I feel like we're turning drm_gpuvm into a white elephant > > > that > > > only few people will get right. > > > > I tend to agree, but to me the big complication comes from the async > > (dma signalling path) state updates. > > I don't deny updating the VM state from the dma signalling path adds > some amount of complexity, but the fact we're trying to use dma_resv > locks for everything, including protection of internal datasets doesn't > help. Anyway, I think both of us are biased when it comes to judging > which approach adds the most complexity :P. > > Also note that, right now, the only thing I'd like to be able to update > from the dma signalling path is the VM mapping tree. Everything else > (drm_gpuva_[un]link(), add/remove extobj), we could do outside this > path: > > - for MAP operations, we could call drm_gpuva_link() in the ioctl path > (we'd just need to initialize the drm_gpuva object) > - for MAP operations, we're already calling drm_gpuvm_bo_obtain() from > the ioctl path > - for UNMAP operations, we could add the drm_gpuva_unlink() call to the > VM op cleanup worker > > The only problem we'd have is that drm_gpuva_link() needs to be called > inside drm_gpuvm_ops::sm_step_remap() when a remap with next/prev !> NULL occurs, otherwise we lose track of these mappings. > > > > > Let's say for example we have a lower level lock for the gem object's > > gpuvm_bo list. Some drivers grab it from the dma fence signalling path, > > other drivers need to access all vm's of a bo to grab their dma_resv > > locks using a WW transaction. There will be problems, although probably > > solveable. > > To me, the gpuvm extobj vm_bo list is just an internal list and has an > internal lock associated. The lock that's protecting the GEM vm_bo list > is a bit different in that the driver gets to decide when a vm_bo is > inserted/removed by calling drm_gpuvm_[un]link(), and can easily make > sure the lock is held when this happens, while the gpuvm internal lists > are kinda transparently updated (for instance, the first caller of > drm_gpuvm_bo_obtain() adds the vm_bo to the extobj and the last vm_bo > owner calling drm_gpuvm_bo_put() removes it from this list, which is > certainly not obvious based on the name of these functions). > > If we want to let drivers iterate over the extobj/evict lists, and > assuming they are considered internal lists maintained by the core and > protected with an internal lock, we should indeed provide iteration > helpers that: > > 1/ make sure all the necessary external locks are held (VM resv, I > guess) > 2/ make sure the internal lock is not held during iteration (the sort > of snapshot list trick you're using for the evict list in Xe) > > > > > Also it seems that if we are to maintain two modes here, for > > > > reasonably clean code we'd need two separate instances of > > > > get_next_bo_from_list(). > > > > > > > > For the !RESV_PROTECTED case, perhaps one would want to consider > > > > the > > > > solution used currently in xe, where the VM maintains two evict > > > > lists. > > > > One protected by a spinlock and one protected by the VM resv. When > > > > the > > > > VM resv is locked to begin list traversal, the spinlock is locked > > > > *once* > > > > and the spinlock-protected list is looped over and copied into the > > > > resv > > > > protected one. For traversal, the resv protected one is used. > > > > > > Oh, so you do have the same sort of trick where you move the entire > > > list to another list, such that you can let other paths update the > > > list > > > while you're iterating your own snapshot. That's interesting... > > > > Yes, it's instead of the "evicted" bool suggested here. I thought the > > latter would be simpler. Although that remains to be seen after all > > use-cases are implemented. > > > > But in general I think the concept of copying from a staging list to > > another with different protection rather than traversing the first list > > and unlocking between items is a good way of solving the locking > > inversion problem with minimal overhead. We use it also for O(1) > > userptr validation. > > That's more or less the idea behind get_next_vm_bo_from_list() except > it's dequeuing one element at a time, instead of moving all items at > once. Note that, if you allow concurrent removal protected only by the > spinlock, you still need to take/release this spinlock when iterating > over elements of this snapshot list, because all the remover needs to > remove an element is the element itself, and it doesn't care in which > list it's currently inserted (real or snapshot/staging list), so you'd > be iterating over a moving target if you don't protect the iteration > with the spinlock. >
Thomas Hellström
2023-Oct-03 17:37 UTC
[Nouveau] [PATCH drm-misc-next v5 4/6] drm/gpuvm: track/lock/validate external/evicted objects
Hi, Danilo On Tue, 2023-10-03 at 18:55 +0200, Danilo Krummrich wrote:> It seems like we're mostly aligned on this series, except for the key > controversy we're discussing for a few versions now: locking of the > internal > lists. Hence, let's just re-iterate the options we have to get this > out of the > way. > > (1) The spinlock dance. This basically works for every use case, > updating the VA > ??? space from the IOCTL, from the fence signaling path or anywhere > else. > ??? However, it has the downside of requiring spin_lock() / > spin_unlock() for > ??? *each* list element when locking all external objects and > validating all > ??? evicted objects. Typically, the amount of extobjs and evicted > objects > ??? shouldn't be excessive, but there might be exceptions, e.g. Xe. > > (2) The dma-resv lock dance. This is convinient for drivers updating > the VA > ??? space from a VM_BIND ioctl() and is especially efficient if such > drivers > ??? have a huge amount of external and/or evicted objects to manage. > However, > ??? the downsides are that it requires a few tricks in drivers > updating the VA > ??? space from the fence signaling path (e.g. job_run()). Design > wise, I'm still > ??? skeptical that it is a good idea to protect internal data > structures with > ??? external locks in a way that it's not clear to callers that a > certain > ??? function would access one of those resources and hence needs > protection. > ??? E.g. it is counter intuitive that drm_gpuvm_bo_put() would > require both the > ??? dma-resv lock of the corresponding object and the VM's dma-resv > lock held. > ??? (Additionally, there were some concerns from amdgpu regarding > flexibility in > ??? terms of using GPUVM for non-VM_BIND uAPIs and compute, however, > AFAICS > ??? those discussions did not complete and to me it's still unclear > why it > ??? wouldn't work.) > > (3) Simply use an internal mutex per list. This adds a tiny (IMHO > negligible) > ??? overhead for drivers updating the VA space from a VM_BIND > ioctl(), namely > ??? a *single* mutex_lock()/mutex_unlock() when locking all external > objects > ??? and validating all evicted objects. And it still requires some > tricks for > ??? drivers updating the VA space from the fence signaling path. > However, it's > ??? as simple as it can be and hence way less error prone as well as > ??? self-contained and hence easy to use. Additionally, it's flexible > in a way > ??? that we don't have any expections on drivers to already hold > certain locks > ??? that the driver in some situation might not be able to acquire in > the first > ??? place.Such an overhead is fully OK IMO, But didn't we conclude at some point that using a mutex in this way isn't possible due to the fact that validate() needs to be able to lock dma_resv, and then we have dma_resv()->mutex->dma_resv()?> > (4) Arbitrary combinations of the above. For instance, the current V5 > implements > ??? both (1) and (2) (as either one or the other). But also (1) and > (3) (as in > ??? (1) additionally to (3)) would be an option, where a driver could > opt-in for > ??? the spinlock dance in case it updates the VA space from the fence > signaling > ??? path. > > I also considered a few other options as well, however, they don't > seem to be > flexible enough. For instance, as by now we could use SRCU for the > external > object list. However, this falls apart once a driver wants to remove > and re-add > extobjs for the same VM_BO instance. (For the same reason it wouldn't > work for > evicted objects.) > > Personally, after seeing the weird implications of (1), (2) and a > combination of > both, I tend to go with (3). Optionally, with an opt-in for (1). The > reason for > the latter is that with (3) the weirdness of (1) by its own mostly > disappears. > > Please let me know what you think, and, of course, other ideas than > the > mentioned ones above are still welcome.Personally, after converting xe to version 5, I think it's pretty convenient for the driver, (although had to add the evict trick), so I think I'd vote for this, even if not currently using the opt-in for (1). /Thomas> > - Danilo > > On Tue, Oct 03, 2023 at 04:21:43PM +0200, Boris Brezillon wrote: > > On Tue, 03 Oct 2023 14:25:56 +0200 > > Thomas Hellstr?m <thomas.hellstrom at linux.intel.com> wrote: > > > > > > > > +/** > > > > > > + * get_next_vm_bo_from_list() - get the next vm_bo element > > > > > > + * @__gpuvm: The GPU VM > > > > > > + * @__list_name: The name of the list we're iterating on > > > > > > + * @__local_list: A pointer to the local list used to > > > > > > store > > > > > > already iterated items > > > > > > + * @__prev_vm_bo: The previous element we got from > > > > > > drm_gpuvm_get_next_cached_vm_bo() > > > > > > + * > > > > > > + * This helper is here to provide lockless list iteration. > > > > > > Lockless as in, the > > > > > > + * iterator releases the lock immediately after picking > > > > > > the > > > > > > first element from > > > > > > + * the list, so list insertion deletion can happen > > > > > > concurrently. > > > > > > + * > > > > > > + * Elements popped from the original list are kept in a > > > > > > local > > > > > > list, so removal > > > > > > + * and is_empty checks can still happen while we're > > > > > > iterating > > > > > > the list. > > > > > > + */ > > > > > > +#define get_next_vm_bo_from_list(__gpuvm, __list_name, > > > > > > __local_list, __prev_vm_bo)?????\ > > > > > > +???????({????????????????????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????struct drm_gpuvm_bo *__vm_bo > > > > > > NULL;????????????????????????????????????\ > > > > > > +?????????????????????????????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????drm_gpuvm_bo_put(__prev_vm_bo);???????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +?????????????????????????????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????spin_lock(&(__gpuvm)-? > > > > > > > __list_name.lock);????????????????????????????????\??? > > > > > > > > > > Here we unconditionally take the spinlocks while iterating, > > > > > and the > > > > > main > > > > > point of DRM_GPUVM_RESV_PROTECTED was really to avoid that? > > > > > > > > > > ? > > > > > > +???????????????if (!(__gpuvm)-? > > > > > > > __list_name.local_list)?????????????????????????????????\ > > > > > > > ? > > > > > > +???????????????????????(__gpuvm)->__list_name.local_list > > > > > > __local_list;???????????????\ > > > > > > +???????????????else??????????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????????????WARN_ON((__gpuvm)- > > > > > > >__list_name.local_list > > > > > > != __local_list);?????\ > > > > > > +?????????????????????????????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????while (!list_empty(&(__gpuvm)- > > > > > > >__list_name.list)) > > > > > > {?????????????????????\ > > > > > > +???????????????????????__vm_bo > > > > > > list_first_entry(&(__gpuvm)-? > > > > > > > __list_name.list,????????\? > > > > > > +????????????????????????????????????????????????? struct > > > > > > drm_gpuvm_bo,?????????????????\ > > > > > > +????????????????????????????????????????????????? > > > > > > list.entry.__list_name);?????????????\ > > > > > > +???????????????????????if (kref_get_unless_zero(&__vm_bo- > > > > > > >kref)) > > > > > > {??? > > > > > And unnecessarily grab a reference in the RESV_PROTECTED > > > > > case.? > > > > > > ????????????????????????\ > > > > > > +???????????????????????????????list_move_tail(&(__vm_bo)-? > > > > > > > list.entry.__list_name,??????\? > > > > > > +????????????????????????????????????????????? > > > > > > __local_list);???????????????????????????\ > > > > > > +???????????????????????????????break;????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????????????} else > > > > > > {????????????????????????????????????????????????????????\ > > > > > > +???????????????????????????????list_del_init(&(__vm_bo)-? > > > > > > > list.entry.__list_name);??????\? > > > > > > +???????????????????????????????__vm_bo > > > > > > NULL;?????????????????????????????????????????\ > > > > > > +???????????????????????}?????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????}?????????????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????spin_unlock(&(__gpuvm)-? > > > > > > > __list_name.lock);??????????????????????????????\? > > > > > > +?????????????????????????????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????????????__vm_bo;??????????????????????????????????? > > > > > > ?????? > > > > > > ???????????????????????\ > > > > > > +???????})??? > > > > > > > > > > IMHO this lockless list iteration looks very complex and > > > > > should be > > > > > pretty difficult to maintain while moving forward, also since > > > > > it > > > > > pulls > > > > > the gpuvm_bos off the list, list iteration needs to be > > > > > protected by > > > > > an > > > > > outer lock anyway.? > > > > > > > > As being partly responsible for this convoluted list iterator, > > > > I must > > > > say I agree with you. There's so many ways this can go wrong if > > > > the > > > > user doesn't call it the right way, or doesn't protect > > > > concurrent > > > > list > > > > iterations with a separate lock (luckily, this is a private > > > > iterator). I > > > > mean, it works, so there's certainly a way to get it right, but > > > > gosh, > > > > this is so far from the simple API I had hoped for. > > > > ? > > > > > Also from what I understand from Boris, the extobj > > > > > list would typically not need the fine-grained locking; only > > > > > the > > > > > evict > > > > > list?? > > > > > > > > Right, I'm adding the gpuvm_bo to extobj list in the ioctl > > > > path, when > > > > the GEM and VM resvs are held, and I'm deferring the > > > > drm_gpuvm_bo_put() > > > > call to a work that's not in the dma-signalling path. This > > > > being > > > > said, > > > > I'm still not comfortable with the > > > > > > > > gem = drm_gem_object_get(vm_bo->gem); > > > > dma_resv_lock(gem->resv); > > > > drm_gpuvm_bo_put(vm_bo); > > > > dma_resv_unlock(gem->resv); > > > > drm_gem_object_put(gem); > > > > > > > > dance that's needed to avoid a UAF when the gpuvm_bo is the > > > > last GEM > > > > owner, not to mention that drm_gpuva_unlink() calls > > > > drm_gpuvm_bo_put() > > > > after making sure the GEM gpuvm_list lock is held, but this > > > > lock > > > > might > > > > differ from the resv lock (custom locking so we can call > > > > gpuvm_unlink() in the dma-signalling path). So we now have > > > > paths > > > > where > > > > drm_gpuvm_bo_put() are called with the resv lock held, and > > > > others > > > > where > > > > they are not, and that only works because we're relying on the > > > > the > > > > fact > > > > those drm_gpuvm_bo_put() calls won't make the refcount drop to > > > > zero, > > > > because the deferred vm_bo_put() work still owns a vm_bo ref.? > > > > > > I'm not sure I follow to 100% here, but in the code snippet above > > > it's > > > pretty clear to me that it needs to hold an explicit gem object > > > reference when calling dma_resv_unlock(gem->resv). Each time you > > > copy a > > > referenced pointer (here from vm_bo->gem to gem) you need to up > > > the > > > refcount unless you make sure (by locks or other means) that the > > > source > > > of the copy has a strong refcount and stays alive, so that's no > > > weird > > > action to me. Could possibly add a drm_gpuvm_bo_get_gem() to > > > access the > > > gem member (and that also takes a refcount) for driver users to > > > avoid > > > the potential pitfall. > > > > Except this is only needed because of the GEM-resv-must-be-held > > locking > > constraint that was added on vm_bo_put(). I mean, the usual way we > > do > > object un-referencing is by calling _put() and letting the internal > > logic undo things when the refcount drops to zero. If the object > > needs > > to be removed from some list, it's normally the responsibility of > > the > > destruction method to lock the list, remove the object and unlock > > the > > list. Now, we have a refcounted object that's referenced by vm_bo, > > and > > whose lock needs to be taken when the destruction happens, which > > leads > > to this weird dance described above, when, in normal situations, > > we'd > > just call drm_gpuvm_bo_put(vm_bo) and let drm_gpuvm do its thing. > > > > > > > > > > > > > All these tiny details add to the overall complexity of this > > > > common > > > > layer, and to me, that's not any better than the > > > > get_next_vm_bo_from_list() complexity you were complaining > > > > about > > > > (might > > > > be even worth, because this sort of things leak to users). > > > > > > > > Having an internal lock partly solves that, in that the locking > > > > of > > > > the > > > > extobj list is now entirely orthogonal to the GEM that's being > > > > removed > > > > from this list, and we can lock/unlock internally without > > > > forcing the > > > > caller to take weird actions to make sure things don't explode. > > > > Don't > > > > get me wrong, I get that this locking overhead is not > > > > acceptable for > > > > Xe, but I feel like we're turning drm_gpuvm into a white > > > > elephant > > > > that > > > > only few people will get right.? > > > > > > I tend to agree, but to me the big complication comes from the > > > async > > > (dma signalling path) state updates. > > > > I don't deny updating the VM state from the dma signalling path > > adds > > some amount of complexity, but the fact we're trying to use > > dma_resv > > locks for everything, including protection of internal datasets > > doesn't > > help. Anyway, I think both of us are biased when it comes to > > judging > > which approach adds the most complexity :P. > > > > Also note that, right now, the only thing I'd like to be able to > > update > > from the dma signalling path is the VM mapping tree. Everything > > else > > (drm_gpuva_[un]link(), add/remove extobj), we could do outside this > > path: > > > > - for MAP operations, we could call drm_gpuva_link() in the ioctl > > path > > ? (we'd just need to initialize the drm_gpuva object) > > - for MAP operations, we're already calling drm_gpuvm_bo_obtain() > > from > > ? the ioctl path > > - for UNMAP operations, we could add the drm_gpuva_unlink() call to > > the > > ? VM op cleanup worker > > > > The only problem we'd have is that drm_gpuva_link() needs to be > > called > > inside drm_gpuvm_ops::sm_step_remap() when a remap with next/prev > > !> > NULL occurs, otherwise we lose track of these mappings. > > > > > > > > Let's say for example we have a lower level lock for the gem > > > object's > > > gpuvm_bo list. Some drivers grab it from the dma fence signalling > > > path, > > > other drivers need to access all vm's of a bo to grab their > > > dma_resv > > > locks using a WW transaction. There will be problems, although > > > probably > > > solveable. > > > > To me, the gpuvm extobj vm_bo list is just an internal list and has > > an > > internal lock associated. The lock that's protecting the GEM vm_bo > > list > > is a bit different in that the driver gets to decide when a vm_bo > > is > > inserted/removed by calling drm_gpuvm_[un]link(), and can easily > > make > > sure the lock is held when this happens, while the gpuvm internal > > lists > > are kinda transparently updated (for instance, the first caller of > > drm_gpuvm_bo_obtain() adds the vm_bo to the extobj and the last > > vm_bo > > owner calling drm_gpuvm_bo_put() removes it from this list, which > > is > > certainly not obvious based on the name of these functions). > > > > If we want to let drivers iterate over the extobj/evict lists, and > > assuming they are considered internal lists maintained by the core > > and > > protected with an internal lock, we should indeed provide iteration > > helpers that: > > > > 1/ make sure all the necessary external locks are held (VM resv, I > > ?? guess) > > 2/ make sure the internal lock is not held during iteration (the > > sort > > ?? of snapshot list trick you're using for the evict list in Xe) > > > > > > > Also it seems that if we are to maintain two modes here, for > > > > > reasonably clean code we'd need two separate instances of > > > > > get_next_bo_from_list(). > > > > > > > > > > For the !RESV_PROTECTED case, perhaps one would want to > > > > > consider > > > > > the > > > > > solution used currently in xe, where the VM maintains two > > > > > evict > > > > > lists. > > > > > One protected by a spinlock and one protected by the VM resv. > > > > > When > > > > > the > > > > > VM resv is locked to begin list traversal, the spinlock is > > > > > locked > > > > > *once* > > > > > and the spinlock-protected list is looped over and copied > > > > > into the > > > > > resv > > > > > protected one. For traversal, the resv protected one is > > > > > used.? > > > > > > > > Oh, so you do have the same sort of trick where you move the > > > > entire > > > > list to another list, such that you can let other paths update > > > > the > > > > list > > > > while you're iterating your own snapshot. That's > > > > interesting...? > > > > > > Yes, it's instead of the "evicted" bool suggested here. I thought > > > the > > > latter would be simpler. Although that remains to be seen after > > > all > > > use-cases are implemented. > > > > > > But in general I think the concept of copying from a staging list > > > to > > > another with different protection rather than traversing the > > > first list > > > and unlocking between items is a good way of solving the locking > > > inversion problem with minimal overhead. We use it also for O(1) > > > userptr validation. > > > > That's more or less the idea behind get_next_vm_bo_from_list() > > except > > it's dequeuing one element at a time, instead of moving all items > > at > > once. Note that, if you allow concurrent removal protected only by > > the > > spinlock, you still need to take/release this spinlock when > > iterating > > over elements of this snapshot list, because all the remover needs > > to > > remove an element is the element itself, and it doesn't care in > > which > > list it's currently inserted (real or snapshot/staging list), so > > you'd > > be iterating over a moving target if you don't protect the > > iteration > > with the spinlock. > > >
Thomas Hellström
2023-Oct-03 18:57 UTC
[Nouveau] [PATCH drm-misc-next v5 4/6] drm/gpuvm: track/lock/validate external/evicted objects
On 10/3/23 18:55, Danilo Krummrich wrote:> It seems like we're mostly aligned on this series, except for the key > controversy we're discussing for a few versions now: locking of the internal > lists. Hence, let's just re-iterate the options we have to get this out of the > way. > > (1) The spinlock dance. This basically works for every use case, updating the VA > space from the IOCTL, from the fence signaling path or anywhere else. > However, it has the downside of requiring spin_lock() / spin_unlock() for > *each* list element when locking all external objects and validating all > evicted objects. Typically, the amount of extobjs and evicted objects > shouldn't be excessive, but there might be exceptions, e.g. Xe. > > (2) The dma-resv lock dance. This is convinient for drivers updating the VA > space from a VM_BIND ioctl() and is especially efficient if such drivers > have a huge amount of external and/or evicted objects to manage. However, > the downsides are that it requires a few tricks in drivers updating the VA > space from the fence signaling path (e.g. job_run()). Design wise, I'm still > skeptical that it is a good idea to protect internal data structures with > external locks in a way that it's not clear to callers that a certain > function would access one of those resources and hence needs protection. > E.g. it is counter intuitive that drm_gpuvm_bo_put() would require both the > dma-resv lock of the corresponding object and the VM's dma-resv lock held. > (Additionally, there were some concerns from amdgpu regarding flexibility in > terms of using GPUVM for non-VM_BIND uAPIs and compute, however, AFAICS > those discussions did not complete and to me it's still unclear why it > wouldn't work.) > > (3) Simply use an internal mutex per list. This adds a tiny (IMHO negligible) > overhead for drivers updating the VA space from a VM_BIND ioctl(), namely > a *single* mutex_lock()/mutex_unlock() when locking all external objects > and validating all evicted objects. And it still requires some tricks for > drivers updating the VA space from the fence signaling path. However, it's > as simple as it can be and hence way less error prone as well as > self-contained and hence easy to use. Additionally, it's flexible in a way > that we don't have any expections on drivers to already hold certain locks > that the driver in some situation might not be able to acquire in the first > place. > > (4) Arbitrary combinations of the above. For instance, the current V5 implements > both (1) and (2) (as either one or the other). But also (1) and (3) (as in > (1) additionally to (3)) would be an option, where a driver could opt-in for > the spinlock dance in case it updates the VA space from the fence signaling > path. > > I also considered a few other options as well, however, they don't seem to be > flexible enough. For instance, as by now we could use SRCU for the external > object list. However, this falls apart once a driver wants to remove and re-add > extobjs for the same VM_BO instance. (For the same reason it wouldn't work for > evicted objects.) > > Personally, after seeing the weird implications of (1), (2) and a combination of > both, I tend to go with (3). Optionally, with an opt-in for (1). The reason for > the latter is that with (3) the weirdness of (1) by its own mostly disappears. > > Please let me know what you think, and, of course, other ideas than the > mentioned ones above are still welcome. > > - Danilo >Here are the locking principles Daniel put together and Dave once called out for us to be applying when reviewing DRM code. These were prompted by very fragile and hard to understand locking patterns in the i915 driver and I think the xe vm_bind locking design was made with these in mind, (not sure exactly who wrote what, though so can't say for sure). https://blog.ffwll.ch/2022/07/locking-engineering.html https://blog.ffwll.ch/2022/08/locking-hierarchy.html At least to me, this motivates using the resv design unless we strictly need lower level locks that are taken in the eviction paths or userptr invalidation paths, but doesn't rule out spinlocks or lock dropping tricks where these are really necessary. But pretty much rules out RCU / SRCU from what I can tell. It also calls for documenting how individual members of structs are protected when ever possible. Thanks, Thomas