Luca Barbieri
2010-Jan-18 18:47 UTC
[Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
ttm_bo_delayed_delete has a race condition, because after we do:

	kref_put(&nentry->list_kref, ttm_bo_release_list);

we are not holding the list lock and not holding any reference to
objects, and thus every bo in the list can be removed and freed at this
point. However, we then use the next pointer we stored, which is not
guaranteed to be valid.

This was apparently the cause of some Nouveau oopses I experienced.

This patch rewrites the function so that it keeps the reference to
nentry until nentry itself is freed and we already got a reference to
nentry->next.

It should now be correct and free of races, but please double check this.

Signed-off-by: Luca Barbieri <luca at luca-barbieri.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c |   58 +++++++++++++++++------------------
 1 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 2920f9a..1daa2f1 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -523,52 +523,42 @@ static int ttm_bo_cleanup_refs(struct ttm_buffer_object *bo, bool remove_all)
 static int ttm_bo_delayed_delete(struct ttm_bo_device *bdev, bool remove_all)
 {
 	struct ttm_bo_global *glob = bdev->glob;
-	struct ttm_buffer_object *entry, *nentry;
-	struct list_head *list, *next;
-	int ret;
+	struct ttm_buffer_object *entry;
+	int ret = 0;
 
 	spin_lock(&glob->lru_lock);
-	list_for_each_safe(list, next, &bdev->ddestroy) {
-		entry = list_entry(list, struct ttm_buffer_object, ddestroy);
-		nentry = NULL;
+	if (list_empty(&bdev->ddestroy)) {
+		spin_unlock(&glob->lru_lock);
+		return 0;
+	}
 
-		/*
-		 * Protect the next list entry from destruction while we
-		 * unlock the lru_lock.
-		 */
+	entry = list_first_entry(&bdev->ddestroy,
+		struct ttm_buffer_object, ddestroy);
+	kref_get(&entry->list_kref);
 
-		if (next != &bdev->ddestroy) {
-			nentry = list_entry(next, struct ttm_buffer_object,
-					    ddestroy);
+	for (;;) {
+		struct ttm_buffer_object *nentry = NULL;
+
+		if (!list_empty(&entry->ddestroy)
+			&& entry->ddestroy.next != &bdev->ddestroy) {
+			nentry = list_entry(entry->ddestroy.next,
+				struct ttm_buffer_object, ddestroy);
 			kref_get(&nentry->list_kref);
 		}
-		kref_get(&entry->list_kref);
 
 		spin_unlock(&glob->lru_lock);
 		ret = ttm_bo_cleanup_refs(entry, remove_all);
 		kref_put(&entry->list_kref, ttm_bo_release_list);
+		entry = nentry;
 
-		spin_lock(&glob->lru_lock);
-		if (nentry) {
-			bool next_onlist = !list_empty(next);
-			spin_unlock(&glob->lru_lock);
-			kref_put(&nentry->list_kref, ttm_bo_release_list);
-			spin_lock(&glob->lru_lock);
-			/*
-			 * Someone might have raced us and removed the
-			 * next entry from the list. We don't bother restarting
-			 * list traversal.
-			 */
-
-			if (!next_onlist)
-				break;
-		}
-		if (ret)
+		if (ret || !entry)
 			break;
+
+		spin_lock(&glob->lru_lock);
 	}
-	ret = !list_empty(&bdev->ddestroy);
-	spin_unlock(&glob->lru_lock);
+
+	if (entry)
+		kref_put(&entry->list_kref, ttm_bo_release_list);
 
 	return ret;
 }
-- 
1.6.3.3
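[Editor's note: a minimal userspace sketch of the reference-pinning traversal pattern the patch adopts, with a toy refcount standing in for kref and the lock elided; all names here are illustrative, none are the kernel's. The point is the ordering: take a reference on the next node before the current one can be freed, so the stored next pointer stays valid.]

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	int refs;               /* toy stand-in for kref */
	struct node *next;      /* singly linked, for brevity */
};

static int frees;               /* counts destructor calls for the demo */

static void node_get(struct node *n) { n->refs++; }

static void node_put(struct node *n)
{
	if (--n->refs == 0) {
		frees++;
		free(n);
	}
}

/*
 * Walk the list doing "work" on each node with the (imaginary) list lock
 * dropped.  The next node is pinned before the current one is released,
 * mirroring the kref_get(&nentry->list_kref) in the patch.
 */
static int traverse(struct node *head)
{
	int visited = 0;
	struct node *entry = head;

	if (!entry)
		return 0;
	node_get(entry);                /* pin the first entry */
	while (entry) {
		struct node *nentry = entry->next;
		if (nentry)
			node_get(nentry); /* pin next BEFORE "unlocking" */
		/* ...lock dropped here; others may unlink/free nodes,
		 * but our references keep entry and nentry valid... */
		visited++;
		node_put(entry);        /* cleanup drops the list's ref */
		node_put(entry);        /* drop our pin; may free entry */
		entry = nentry;         /* still safe: we pinned it */
	}
	return visited;
}

static struct node *make_list(int n)
{
	struct node *head = NULL;
	while (n--) {
		struct node *node = malloc(sizeof(*node));
		node->refs = 1;         /* the list's own reference */
		node->next = head;
		head = node;
	}
	return head;
}

int demo(void)
{
	frees = 0;
	struct node *head = make_list(3);
	int visited = traverse(head);
	/* every node was visited and every node was freed exactly once */
	return (visited == 3 && frees == 3) ? 0 : -1;
}
```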
Thomas Hellstrom
2010-Jan-18 19:40 UTC
[Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
Luca,

Good catch. Some comments inline:

Luca Barbieri wrote:
> +	entry = list_first_entry(&bdev->ddestroy,
> +		struct ttm_buffer_object, ddestroy);
> +	kref_get(&entry->list_kref);
>
> -	if (next != &bdev->ddestroy) {
> -		nentry = list_entry(next, struct ttm_buffer_object,
> -			ddestroy);
> +	for (;;) {
> +		struct ttm_buffer_object *nentry = NULL;
> +
> +		if (!list_empty(&entry->ddestroy)
> +			&& entry->ddestroy.next != &bdev->ddestroy) {
> +			nentry = list_entry(entry->ddestroy.next,
> +				struct ttm_buffer_object, ddestroy);
>
I'm not sure it's considered ok to access the list_head members directly
in new code. Would nentry = list_first_entry(&entry->ddestroy, ....) work?

> 			kref_get(&nentry->list_kref);
> 		}
>
else unlock and break? That would save an unnecessary call to
ttm_bo_cleanup_refs.

> -	kref_get(&entry->list_kref);
>
> 	spin_unlock(&glob->lru_lock);
> 	ret = ttm_bo_cleanup_refs(entry, remove_all);
> 	kref_put(&entry->list_kref, ttm_bo_release_list);
> +	entry = nentry;
>
Here nentry may have been removed from the list by another process, which
would trigger the unnecessary call, mentioned above.

/Thomas
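[Editor's note: the suggestion works because list_first_entry(ptr, type, member) is defined as list_entry((ptr)->next, type, member), so applying it to &entry->ddestroy names the node after entry without touching ->next directly. A self-contained sketch with the list macros re-implemented locally (the struct name is illustrative, not the kernel's):]

```c
#include <stddef.h>

/* Minimal local re-implementation of the macros under discussion */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_first_entry(ptr, type, member) \
	list_entry((ptr)->next, type, member)

struct bo_demo {                /* illustrative stand-in for ttm_buffer_object */
	int id;
	struct list_head ddestroy;
};

int demo(void)
{
	struct bo_demo a = { .id = 42 };
	struct list_head head;

	/* hand-rolled two-element circular list: head <-> a */
	head.next = &a.ddestroy;
	head.prev = &a.ddestroy;
	a.ddestroy.next = &head;
	a.ddestroy.prev = &head;

	/* direct ->next access, as in the patch */
	struct bo_demo *via_next =
		list_entry(head.next, struct bo_demo, ddestroy);
	/* the reviewer's suggested spelling */
	struct bo_demo *via_first =
		list_first_entry(&head, struct bo_demo, ddestroy);

	/* both name the same object */
	return (via_next == &a && via_first == &a) ? 0 : -1;
}
```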
Luca Barbieri
2010-Jan-18 22:33 UTC
[Nouveau] [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete
> Would nentry = list_first_entry(&entry->ddestroy, ....) work?

Yes. It seems a bit less intuitive, but if that's the accepted practice,
let's do that instead.

> Here nentry may have been removed from the list by another process,
> which would trigger the unnecessary call, mentioned above.

You are right. I attached a revised patch. It's only compile tested, but
the changes are small and it should hopefully work.

Note that in principle we could remove the special-case code for the
list head, but that would require pretending the list head is actually
inside a ttm_buffer_object and adding a flag to not do the
unlock/cleanup/put/lock on it, which seems bad.

The whole function seems more complicated than needed, but I can't find
a way to do it with less code. If we could keep glob->lru_lock while
calling ttm_bo_cleanup_refs, things would be much simpler, but that
would require intrusive changes and may not be possible.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-drm-ttm-Fix-race-condition-in-ttm_bo_delayed_delete-.patch
Type: text/x-diff
Size: 3345 bytes
Desc: not available
Url: http://lists.freedesktop.org/archives/nouveau/attachments/20100118/1d9fc7ea/attachment.patch
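[Editor's note: the recheck the review asks for relies on a property of list_del_init: a node removed that way points back at itself, so list_empty on the node's own list_head reveals, once the lock is re-taken, that another thread already cleaned it up and the call can be skipped. A minimal userspace sketch with locally re-implemented list helpers; names are illustrative, not the kernel's.]

```c
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

/* unlink and re-initialize, so emptiness of the node is detectable later */
static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	INIT_LIST_HEAD(e);
}

static void list_add(struct list_head *e, struct list_head *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

int demo(void)
{
	struct list_head head, a;
	int cleanup_calls = 0;

	INIT_LIST_HEAD(&head);
	list_add(&a, &head);

	/* another thread races us and cleans the entry up first */
	list_del_init(&a);

	/* our turn, lock re-taken: the node's empty head says "skip" */
	if (!list_empty(&a))
		cleanup_calls++;  /* would call the cleanup function here */

	return cleanup_calls;     /* 0: the unnecessary call was avoided */
}
```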
Apparently Analogous Threads
- [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v2)
- [PATCH] drm/ttm: Fix race condition in ttm_bo_delayed_delete (v3, final)
- [RFC PATCH v1 12/16] drm/ttm: flip the switch, and convert to dma_fence
- [PATCH 13/17] drm/ttm: flip the switch, and convert to dma_fence
- [RFC PATCH v1 06/16] drm/ttm: kill fence_lock