Alexandre Courbot
2014-Jun-24 09:54 UTC
[Nouveau] [PATCH v2 0/3] drm/ttm: nouveau: memory coherency for ARM
For this v2 I have fixed the patches that are non-controversial (all Lucas' :)) and am resubmitting them in the hope that they will get merged. This just leaves the issue of mapping Nouveau's system-memory buffers to be solved. This issue is quite complex, so let me summarize the situation and the data I have at hand.

ARM caching is like a quantum world where Murphy's law constantly applies: sometimes things that you don't expect to work happen to work, but never the ones you would like to. On ARM the accepted wisdom is that all CPU mappings of the same physical memory must share the same attributes, and that failure to ensure this results in undefined behavior. I have heard voices claiming that recent ARM SoCs are not affected by this, but I have yet to see hard evidence that this is indeed the case.

Most (or all) physical memory happens to be already mapped, cached, in the lowmem area of the virtual address space. This means that if we want to be safe with respect to the rule mentioned above, we must perform all subsequent mappings of that memory the same way. Nouveau currently performs its memory mappings cached, since on PCI it can rely on the bus to snoop and invalidate addresses written by the CPU - something that we don't have on ARM shared-memory systems.

I had proposed a (bad) patch in the previous revision of this series that added a way to flush the CPU cache after each write (http://lists.freedesktop.org/archives/dri-devel/2014-May/059893.html), but it did not trigger a lot of approval. Instead, it was suggested to map such BOs write-combined. This would break the "one mapping type only" rule, but I gave it a try by changing the TTM_PL_TT manager's authorized caching to WC. This immediately resulted in breakage of applications using the GPU. Digging a little further, I noticed that kernel mappings could be performed WC or uncached without any problem, but that user-space mappings *must* be cached under all circumstances. Failure to do so results in invalid pushbuffers being sent to the GPU, messed-up vertices, and other corruption. Uncached mappings result in the same breakage.

So, to summarize our options for GK20A:

1) Keeping mappings of TTM buffers cached seems to be the safest option, as it is consistent with the lowmem mapping that likely already covers the memory of our buffers. But we will have to flush kernel CPU writes to these buffers one way or another.

2) Changing the kernel mappings to WC or uncached seems to be safe. However, user-space mappings must still be cached or inconsistencies happen. This double policy for kernel and user-space mappings is not implemented in TTM, and nothing so far suggests that it should be.

And that's the state where we are. I am not considering the other possibilities (carving memory out of lowmem, etc.) as they have already been discussed many times by people much smarter than me (e.g. http://lists.linaro.org/pipermail/linaro-mm-sig/2011-April/000003.html) and it seems that the issue is still with us nonetheless.

At this point, suggestions towards option 1) or 2) (or pointers to where I could have screwed up in my understanding of ARM mappings) are welcome. And in the meantime, let's try to get the 3 patches below merged!
Changes since v1:
- Removed conditional compilation for Nouveau cache sync handler
- Refactored nouveau_gem_ioctl_cpu_prep() into a new function to keep
  buffer cache management in nouveau_bo.c

Lucas Stach (3):
  drm/ttm: recognize ARM arch in ioprot handler
  drm/ttm: introduce dma cache sync helpers
  drm/nouveau: hook up cache sync functions

 drivers/gpu/drm/nouveau/nouveau_bo.c  | 47 +++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.h  |  1 +
 drivers/gpu/drm/nouveau/nouveau_gem.c | 10 +++-----
 drivers/gpu/drm/ttm/ttm_bo_util.c     |  2 +-
 drivers/gpu/drm/ttm/ttm_tt.c          | 25 +++++++++++++++++++
 include/drm/ttm/ttm_bo_driver.h       | 28 +++++++++++++++++++++
 6 files changed, 105 insertions(+), 8 deletions(-)

-- 
2.0.0
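For concreteness, the WC experiment mentioned in the cover letter (changing the TT manager's authorized caching) amounts to something like the following sketch against the TTM structures of this era; the fields are the real ttm_mem_type_manager members, but the function itself and its placement in a driver's init_mem_type() hook are assumptions for illustration:

/*
 * Sketch only: restrict the TT memory type to non-cached mappings, as
 * would be done from a driver's init_mem_type() hook. This is the
 * change that, per the cover letter, immediately broke user-space on
 * GK20A.
 */
static void restrict_tt_caching_to_wc(struct ttm_mem_type_manager *man)
{
	man->available_caching = TTM_PL_FLAG_UNCACHED | TTM_PL_FLAG_WC;
	man->default_caching = TTM_PL_FLAG_WC;
}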
Alexandre Courbot
2014-Jun-24 09:54 UTC
[Nouveau] [PATCH v2 1/3] drm/ttm: recognize ARM arch in ioprot handler
From: Lucas Stach <dev at lynxeye.de>

Signed-off-by: Lucas Stach <dev at lynxeye.de>
Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 1df856f78568..30e5d90cb7bc 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -500,7 +500,7 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
 		pgprot_val(tmp) |= _PAGE_GUARDED;
 	}
 #endif
-#if defined(__ia64__)
+#if defined(__ia64__) || defined(__arm__)
 	if (caching_flags & TTM_PL_FLAG_WC)
 		tmp = pgprot_writecombine(tmp);
 	else
-- 
2.0.0
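For context on where this hunk takes effect: ttm_io_prot() computes the page protection TTM uses when mapping a buffer whose placement is not cached. Before this patch, ARM fell through all the arch branches and the cached pgprot passed in was returned unchanged. A simplified caller, in the spirit of TTM's kmap path (illustrative, not a quote of the actual code):

/*
 * Sketch of a kernel-side mapping of a TT buffer. With the hunk above,
 * requesting TTM_PL_FLAG_WC on ARM now yields a write-combined pgprot
 * instead of the unmodified (cached) one passed in.
 */
pgprot_t prot = ttm_io_prot(mem->placement, PAGE_KERNEL);
map->virtual = vmap(ttm->pages, num_pages, 0, prot);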
Alexandre Courbot
2014-Jun-24 09:54 UTC
[Nouveau] [PATCH v2 2/3] drm/ttm: introduce dma cache sync helpers
From: Lucas Stach <dev at lynxeye.de>

On architectures for which access to GPU memory is non-coherent,
caches need to be flushed and invalidated explicitly at the
appropriate places. Introduce two small helpers to make things
easy for TTM-based drivers.

Signed-off-by: Lucas Stach <dev at lynxeye.de>
Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drivers/gpu/drm/ttm/ttm_tt.c    | 25 +++++++++++++++++++++++++
 include/drm/ttm/ttm_bo_driver.h | 28 ++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 75f319090043..66c16ad35f70 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -38,6 +38,7 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/dma-mapping.h>
 #include <drm/drm_cache.h>
 #include <drm/drm_mem_util.h>
 #include <drm/ttm/ttm_module.h>
@@ -248,6 +249,30 @@ void ttm_dma_tt_fini(struct ttm_dma_tt *ttm_dma)
 }
 EXPORT_SYMBOL(ttm_dma_tt_fini);
 
+void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
+				      struct device *dev)
+{
+	unsigned long i;
+
+	for (i = 0; i < ttm_dma->ttm.num_pages; i++) {
+		dma_sync_single_for_device(dev, ttm_dma->dma_address[i],
+					   PAGE_SIZE, DMA_TO_DEVICE);
+	}
+}
+EXPORT_SYMBOL(ttm_dma_tt_cache_sync_for_device);
+
+void ttm_dma_tt_cache_sync_for_cpu(struct ttm_dma_tt *ttm_dma,
+				   struct device *dev)
+{
+	unsigned long i;
+
+	for (i = 0; i < ttm_dma->ttm.num_pages; i++) {
+		dma_sync_single_for_cpu(dev, ttm_dma->dma_address[i],
+					PAGE_SIZE, DMA_FROM_DEVICE);
+	}
+}
+EXPORT_SYMBOL(ttm_dma_tt_cache_sync_for_cpu);
+
 void ttm_tt_unbind(struct ttm_tt *ttm)
 {
 	int ret;
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index a5183da3ef92..52fb709568fc 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/spinlock.h>
 #include <linux/reservation.h>
+#include <linux/device.h>
 
 struct ttm_backend_func {
 	/**
@@ -690,6 +691,33 @@ extern int ttm_tt_swapout(struct ttm_tt *ttm,
  */
 extern void ttm_tt_unpopulate(struct ttm_tt *ttm);
 
+/**
+ * ttm_dma_tt_cache_sync_for_device:
+ *
+ * @ttm_dma: A struct ttm_dma_tt of the type set up by ttm_dma_tt_init.
+ * @dev: A struct device representing the device to which to sync.
+ *
+ * This function will flush the CPU caches on arches where snooping in
+ * the TT is not available. On fully coherent arches this will turn
+ * into an (almost) noop. This makes sure that data written by the CPU
+ * is visible to the device.
+ */
+extern void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
+					     struct device *dev);
+
+/**
+ * ttm_dma_tt_cache_sync_for_cpu:
+ *
+ * @ttm_dma: A struct ttm_dma_tt of the type set up by ttm_dma_tt_init.
+ * @dev: A struct device representing the device from which to sync.
+ *
+ * This function will invalidate the CPU caches on arches where
+ * snooping in the TT is not available. On fully coherent arches this
+ * will turn into an (almost) noop. This makes sure that the CPU does
+ * not read any stale cached or prefetched data.
+ */
+extern void ttm_dma_tt_cache_sync_for_cpu(struct ttm_dma_tt *ttm_dma,
+					  struct device *dev);
+
 /*
  * ttm_bo.c
  */
-- 
2.0.0
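A usage sketch for the new helpers in a non-coherent TTM driver, mirroring what patch 3/3 does in Nouveau; the submit/wait functions and the dev pointer are placeholders, not real API:

/* CPU is done writing: write back dirty cache lines so the GPU reads
 * the data just written. */
ttm_dma_tt_cache_sync_for_device((struct ttm_dma_tt *)bo->ttm, dev);
submit_to_gpu(bo);			/* placeholder for submission */

/* GPU is done writing: invalidate CPU caches so no stale cached or
 * prefetched lines are read back. */
wait_for_gpu_idle(bo);			/* placeholder for a fence wait */
ttm_dma_tt_cache_sync_for_cpu((struct ttm_dma_tt *)bo->ttm, dev);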
Alexandre Courbot
2014-Jun-24 09:54 UTC
[Nouveau] [PATCH v2 3/3] drm/nouveau: hook up cache sync functions
From: Lucas Stach <dev at lynxeye.de>

Use the newly-introduced TTM cache sync functions in Nouveau.

Signed-off-by: Lucas Stach <dev at lynxeye.de>
[acourbot at nvidia.com: rearrange code, make platform-friendly]
Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c  | 47 +++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.h  |  1 +
 drivers/gpu/drm/nouveau/nouveau_gem.c | 10 +++-----
 3 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index b6dc85c614be..74c68c16e777 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -284,6 +284,34 @@ set_placement_range(struct nouveau_bo *nvbo, uint32_t type)
 	}
 }
 
+static void
+nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
+{
+	struct ttm_tt *ttm = nvbo->bo.ttm;
+	struct nouveau_device *device;
+
+	if (!ttm || ttm->caching_state != tt_cached)
+		return;
+	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
+	ttm_dma_tt_cache_sync_for_cpu((struct ttm_dma_tt *)ttm,
+				      nv_device_base(device));
+}
+
+static void
+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
+{
+	struct ttm_tt *ttm = nvbo->bo.ttm;
+
+	if (ttm && ttm->caching_state == tt_cached) {
+		struct nouveau_device *device;
+
+		device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
+
+		ttm_dma_tt_cache_sync_for_device((struct ttm_dma_tt *)ttm,
+						 nv_device_base(device));
+	}
+}
+
 void
 nouveau_bo_placement_set(struct nouveau_bo *nvbo, uint32_t type, uint32_t busy)
 {
@@ -407,6 +435,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 {
 	int ret;
 
+	nouveau_bo_sync_for_device(nvbo);
+
 	ret = ttm_bo_validate(&nvbo->bo, &nvbo->placement,
 			      interruptible, no_wait_gpu);
 	if (ret)
@@ -415,6 +445,23 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 	return 0;
 }
 
+int
+nouveau_bo_wait(struct nouveau_bo *nvbo, bool no_wait)
+{
+	int ret;
+
+	spin_lock(&nvbo->bo.bdev->fence_lock);
+	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
+	spin_unlock(&nvbo->bo.bdev->fence_lock);
+
+	if (ret)
+		return ret;
+
+	nouveau_bo_sync_for_cpu(nvbo);
+
+	return 0;
+}
+
 u16
 nouveau_bo_rd16(struct nouveau_bo *nvbo, unsigned index)
 {
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index ff17c1f432fc..a4e9052d54fd 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -81,6 +81,7 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
 void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *);
 int  nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
 			 bool no_wait_gpu);
+int  nouveau_bo_wait(struct nouveau_bo *, bool no_wait);
 
 struct nouveau_vma *
 nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index c90c0dc0afe8..916cb8ff568c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -884,19 +884,15 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
 {
 	struct drm_nouveau_gem_cpu_prep *req = data;
 	struct drm_gem_object *gem;
-	struct nouveau_bo *nvbo;
 	bool no_wait = !!(req->flags & NOUVEAU_GEM_CPU_PREP_NOWAIT);
-	int ret = -EINVAL;
+	int ret;
 
 	gem = drm_gem_object_lookup(dev, file_priv, req->handle);
 	if (!gem)
 		return -ENOENT;
-	nvbo = nouveau_gem_object(gem);
-
-	spin_lock(&nvbo->bo.bdev->fence_lock);
-	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
-	spin_unlock(&nvbo->bo.bdev->fence_lock);
+	ret = nouveau_bo_wait(nouveau_gem_object(gem), no_wait);
 	drm_gem_object_unreference_unlocked(gem);
+
 	return ret;
 }
-- 
2.0.0
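With this applied, the ordering seen by a user-space client is roughly the following simplified sketch; only the two functions touched above are real, the surrounding flow is paraphrased:

/* CPU_PREP ioctl path: wait for the GPU, then invalidate CPU caches
 * so subsequent CPU reads see what the GPU wrote. */
ret = nouveau_bo_wait(nvbo, no_wait);	       /* -> nouveau_bo_sync_for_cpu() */

/* Submission path: flush dirty CPU cache lines before placement is
 * validated, so the GPU sees what the CPU wrote. */
ret = nouveau_bo_validate(nvbo, true, false);  /* -> nouveau_bo_sync_for_device() */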
Russell King - ARM Linux
2014-Jun-24 10:02 UTC
[Nouveau] [PATCH v2 2/3] drm/ttm: introduce dma cache sync helpers
On Tue, Jun 24, 2014 at 06:54:26PM +0900, Alexandre Courbot wrote:
> From: Lucas Stach <dev at lynxeye.de>
>
> On architectures for which access to GPU memory is non-coherent,
> caches need to be flushed and invalidated explicitly at the
> appropriate places. Introduce two small helpers to make things
> easy for TTM-based drivers.

Have you run this with DMA API debugging enabled? I suspect you
haven't, and I recommend that you do.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up...
slowly improving, and getting towards what was expected from it.
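For reference, the facility Russell means is the kernel's DMA API debugging, which checks every dma_sync_single_for_*() call against the mapping it refers to and reports misuse (wrong direction, wrong size, an address that was never mapped) in the kernel log. It is a build-time option, with runtime state exposed through debugfs (paths assume debugfs is mounted in the usual location):

# Kernel configuration (build time):
CONFIG_DMA_API_DEBUG=y
# Violations are then reported in dmesg; accumulated state lives under:
#   /sys/kernel/debug/dma-api/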