Alexandre Courbot
2014-Oct-27 09:49 UTC
[Nouveau] [PATCH v5 0/4] drm: nouveau: memory coherency on ARM
It has been a couple of months since v4 - apologies for this. v4 did not
receive many comments, but this version addresses them and makes a new
attempt at pushing the critical bit for GK20A and Nouveau on ARM in
general.

As a reminder, this series addresses the memory coherency issue that we
are seeing on ARM platforms. Contrary to x86, where caches are snooped
and invalidated whenever the CPU writes to a GPU-accessed area (and
vice-versa), such accesses on ARM may leave the other accessor with a
stale, incoherent view of memory. To address this, patches 1-3 add the
ability to detect whether we are running on a non-coherent architecture,
implement a way to explicitly allocate coherent buffers using the DMA
API, and use it for GPFIFOs and fences. Patch 4 also uses the DMA API to
synchronize user-space allocated buffers when they are passed between the
CPU and the GPU.

Thanks to the feedback received on the previous revisions, I believe this
code looks rather good now. I have also tested it extensively and could
not see any buffer corruption issue anymore.

There is still one point which is not completely satisfying in my
opinion: TTMs for TTM-backed objects are allocated in
nouveau_sgdma_create_ttm() and populated in nouveau_ttm_tt_populate().
Coherently-allocated buffers need to use the ttm_dma API instead of the
pool-based TTM API, and whether an object is coherent or not is stored in
its instance of nouveau_bo. The problem is that neither
nouveau_sgdma_create_ttm() nor nouveau_ttm_tt_populate() have a way to
access the nouveau_bo they are working for. This is in particular a
problem for nouveau_ttm_tt_populate(), since we need to rely on a purely
TTM-based heuristic to decide how to allocate the memory. The heuristic
we are using works, but it makes the code harder to understand than if we
could simply access the nouveau_bo. nouveau_sgdma_create_ttm() always
allocates a ttm_dma_tt structure, which is wrong but happens to suit us
for now. Still, this part of the code could be rewritten much more
cleanly if only we could access the nouveau_bo instance in these
functions.

I proposed some time ago to address this by making the ttm_tt_create hook
take a pointer to a ttm_buffer_object instead of a ttm_bo_device. This
would still allow us to access the ttm_bo_device, while letting us
retrieve the nouveau_bo and store it into whatever structure we embed our
TTM into. For some reason David was not fond of the idea - I am taking
another chance at submitting it, since the issue is still not resolved
and leads to inferior-looking code in at least Nouveau.

Phew, sorry for the long cover letter - thanks if you have read until
here! :)

Changes since v4:
- Only use DMA API for sync, as suggested by Daniel

Alexandre Courbot (4):
  drm: introduce nv_device_is_cpu_coherent()
  drm: implement explicitly coherent BOs
  drm: allocate GPFIFOs and fences coherently
  drm: synchronize BOs when required

 drm/nouveau_bo.c           | 122 ++++++++++++++++++++++++++++++++++++++++++---
 drm/nouveau_bo.h           |   3 ++
 drm/nouveau_chan.c         |   2 +-
 drm/nouveau_gem.c          |  12 +++++
 drm/nv84_fence.c           |   4 +-
 lib/core/os.h              |   2 +
 nvkm/include/core/device.h |   6 +++
 7 files changed, 140 insertions(+), 11 deletions(-)

--
2.1.2
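To make the hook proposal above concrete, a nouveau_sgdma_create_ttm()
taking the buffer object could look roughly like the sketch below. This
is illustrative and not part of the series: the ttm_tt_init() and
ttm_dma_tt_init() signatures are recalled from the TTM API of this era,
nouveau_bo() is nouveau's existing container_of() helper, and the
ttm_backend_func wiring that a real backend needs is omitted.

	static struct ttm_tt *
	nouveau_sgdma_create_ttm(struct ttm_buffer_object *bo, unsigned long size,
				 uint32_t page_flags, struct page *dummy_read_page)
	{
		struct ttm_bo_device *bdev = bo->bdev;    /* device still reachable */
		struct nouveau_bo *nvbo = nouveau_bo(bo); /* container_of() helper */

		if (nvbo->force_coherent) {
			/* coherent BOs must be backed by a ttm_dma_tt */
			struct ttm_dma_tt *dma_tt = kzalloc(sizeof(*dma_tt), GFP_KERNEL);

			if (!dma_tt)
				return NULL;
			if (ttm_dma_tt_init(dma_tt, bdev, size, page_flags,
					    dummy_read_page)) {
				kfree(dma_tt);
				return NULL;
			}
			return &dma_tt->ttm;
		} else {
			/* other BOs can keep using the page-pool backend */
			struct ttm_tt *tt = kzalloc(sizeof(*tt), GFP_KERNEL);

			if (!tt)
				return NULL;
			if (ttm_tt_init(tt, bdev, size, page_flags, dummy_read_page)) {
				kfree(tt);
				return NULL;
			}
			return tt;
		}
	}

With the nouveau_bo in hand, nouveau_ttm_tt_populate() could likewise
test nvbo->force_coherent directly instead of relying on the
caching-state heuristic described above.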
Alexandre Courbot
2014-Oct-27 09:49 UTC
[Nouveau] [PATCH v5 1/4] drm: introduce nv_device_is_cpu_coherent()
Add a function that tells us whether a device is CPU-coherent, i.e.
whether accesses performed by the CPU on GPU-mapped buffers are
immediately visible on the GPU side, and vice-versa. For now, a device is
considered to be coherent if it uses the PCI bus on a non-ARM
architecture.

Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 lib/core/os.h              | 2 ++
 nvkm/include/core/device.h | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/lib/core/os.h b/lib/core/os.h
index fba9542292ac..79462eb2cfd4 100644
--- a/lib/core/os.h
+++ b/lib/core/os.h
@@ -101,6 +101,8 @@ typedef dma_addr_t resource_size_t;
 #define __printf(a,b)
 #define __user
 
+#define IS_ENABLED(x) (0)
+
 static inline int
 order_base_2(u64 base)
 {
diff --git a/nvkm/include/core/device.h b/nvkm/include/core/device.h
index 1d9d893929bb..0d839e1ddaf4 100644
--- a/nvkm/include/core/device.h
+++ b/nvkm/include/core/device.h
@@ -158,6 +158,12 @@ nv_device_is_pci(struct nouveau_device *device)
 	return device->pdev != NULL;
 }
 
+static inline bool
+nv_device_is_cpu_coherent(struct nouveau_device *device)
+{
+	return (!IS_ENABLED(CONFIG_ARM) && nv_device_is_pci(device));
+}
+
 static inline struct device *
 nv_device_base(struct nouveau_device *device)
 {
--
2.1.2
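As context for how this helper is meant to be consumed (patch 2 uses it
to gate the DMA-API allocation path, patch 4 to skip needless syncs), a
hypothetical caller could look like the sketch below. Only
nv_device_is_cpu_coherent() and nv_device_base() come from this patch;
the surrounding function is illustrative.

	static void
	example_flush_for_gpu(struct nouveau_device *device, dma_addr_t addr,
			      size_t size)
	{
		/* coherent devices (PCI on non-ARM) need no manual maintenance */
		if (nv_device_is_cpu_coherent(device))
			return;

		/* non-coherent platforms (e.g. GK20A) must flush CPU caches */
		dma_sync_single_for_device(nv_device_base(device), addr, size,
					   DMA_TO_DEVICE);
	}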
Alexandre Courbot
2014-Oct-27 09:49 UTC
[Nouveau] [PATCH v5 2/4] drm: implement explicitly coherent BOs
Allow nouveau_bo_new() to recognize the TTM_PL_FLAG_UNCACHED flag, which
means that we want the allocated BO to be perfectly coherent between the
CPU and the GPU. This is useful on non-coherent architectures for which
we do not want to manually sync some rarely-accessed buffers: typically,
fences and pushbuffers.

A TTM BO allocated with the TTM_PL_FLAG_UNCACHED flag on a non-coherent
architecture will be populated using the DMA API, and accesses to it
performed through the coherent mapping returned by dma_alloc_coherent().

Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drm/nouveau_bo.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 drm/nouveau_bo.h |  1 +
 2 files changed, 73 insertions(+), 8 deletions(-)

diff --git a/drm/nouveau_bo.c b/drm/nouveau_bo.c
index 9a8adeec80cd..ed9a6946f6d6 100644
--- a/drm/nouveau_bo.c
+++ b/drm/nouveau_bo.c
@@ -214,6 +214,9 @@ nouveau_bo_new(struct drm_device *dev, int size, int align,
 	nvbo->tile_flags = tile_flags;
 	nvbo->bo.bdev = &drm->ttm.bdev;
 
+	if (!nv_device_is_cpu_coherent(nvkm_device(&drm->device)))
+		nvbo->force_coherent = flags & TTM_PL_FLAG_UNCACHED;
+
 	nvbo->page_shift = 12;
 	if (drm->client.vm) {
 		if (!(flags & TTM_PL_FLAG_TT) && size > 256 * 1024)
@@ -291,8 +294,9 @@ void
 nouveau_bo_placement_set(struct nouveau_bo *nvbo, uint32_t type, uint32_t busy)
 {
 	struct ttm_placement *pl = &nvbo->placement;
-	uint32_t flags = TTM_PL_MASK_CACHING |
-			 (nvbo->pin_refcnt ? TTM_PL_FLAG_NO_EVICT : 0);
+	uint32_t flags = (nvbo->force_coherent ? TTM_PL_FLAG_UNCACHED :
+			  TTM_PL_MASK_CACHING) |
+			 (nvbo->pin_refcnt ? TTM_PL_FLAG_NO_EVICT : 0);
 
 	pl->placement = nvbo->placements;
 	set_placement_list(nvbo->placements, &pl->num_placement,
@@ -396,7 +400,14 @@ nouveau_bo_map(struct nouveau_bo *nvbo)
 	if (ret)
 		return ret;
 
-	ret = ttm_bo_kmap(&nvbo->bo, 0, nvbo->bo.mem.num_pages, &nvbo->kmap);
+	/*
+	 * TTM buffers allocated using the DMA API already have a mapping, let's
+	 * use it instead.
+	 */
+	if (!nvbo->force_coherent)
+		ret = ttm_bo_kmap(&nvbo->bo, 0, nvbo->bo.mem.num_pages,
+				  &nvbo->kmap);
+
 	ttm_bo_unreserve(&nvbo->bo);
 	return ret;
 }
@@ -404,7 +415,14 @@ nouveau_bo_map(struct nouveau_bo *nvbo)
 void
 nouveau_bo_unmap(struct nouveau_bo *nvbo)
 {
-	if (nvbo)
+	if (!nvbo)
+		return;
+
+	/*
+	 * TTM buffers allocated using the DMA API already had a coherent
+	 * mapping which we used, no need to unmap.
+	 */
+	if (!nvbo->force_coherent)
 		ttm_bo_kunmap(&nvbo->kmap);
 }
 
@@ -422,12 +440,36 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 	return 0;
 }
 
+static inline void *
+_nouveau_bo_mem_index(struct nouveau_bo *nvbo, unsigned index, void *mem, u8 sz)
+{
+	struct ttm_dma_tt *dma_tt;
+	u8 *m = mem;
+
+	index *= sz;
+
+	if (m) {
+		/* kmap'd address, return the corresponding offset */
+		m += index;
+	} else {
+		/* DMA-API mapping, lookup the right address */
+		dma_tt = (struct ttm_dma_tt *)nvbo->bo.ttm;
+		m = dma_tt->cpu_address[index / PAGE_SIZE];
+		m += index % PAGE_SIZE;
+	}
+
+	return m;
+}
+#define nouveau_bo_mem_index(o, i, m) _nouveau_bo_mem_index(o, i, m, sizeof(*m))
+
 u16
 nouveau_bo_rd16(struct nouveau_bo *nvbo, unsigned index)
 {
 	bool is_iomem;
 	u16 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
-	mem = &mem[index];
+
+	mem = nouveau_bo_mem_index(nvbo, index, mem);
+
 	if (is_iomem)
 		return ioread16_native((void __force __iomem *)mem);
 	else
@@ -439,7 +481,9 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
 {
 	bool is_iomem;
 	u16 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
-	mem = &mem[index];
+
+	mem = nouveau_bo_mem_index(nvbo, index, mem);
+
 	if (is_iomem)
 		iowrite16_native(val, (void __force __iomem *)mem);
 	else
@@ -451,7 +495,9 @@ nouveau_bo_rd32(struct nouveau_bo *nvbo, unsigned index)
 {
 	bool is_iomem;
 	u32 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
-	mem = &mem[index];
+
+	mem = nouveau_bo_mem_index(nvbo, index, mem);
+
 	if (is_iomem)
 		return ioread32_native((void __force __iomem *)mem);
 	else
@@ -463,7 +509,9 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
 {
 	bool is_iomem;
 	u32 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
-	mem = &mem[index];
+
+	mem = nouveau_bo_mem_index(nvbo, index, mem);
+
 	if (is_iomem)
 		iowrite32_native(val, (void __force __iomem *)mem);
 	else
@@ -1383,6 +1431,14 @@ nouveau_ttm_tt_populate(struct ttm_tt *ttm)
 	dev = drm->dev;
 	pdev = nv_device_base(device);
 
+	/*
+	 * Objects matching this condition have been marked as force_coherent,
+	 * so use the DMA API for them.
+	 */
+	if (!nv_device_is_cpu_coherent(device) &&
+	    ttm->caching_state == tt_uncached)
+		return ttm_dma_populate(ttm_dma, dev->dev);
+
 #if __OS_HAS_AGP
 	if (drm->agp.stat == ENABLED) {
 		return ttm_agp_tt_populate(ttm);
@@ -1440,6 +1496,14 @@ nouveau_ttm_tt_unpopulate(struct ttm_tt *ttm)
 	dev = drm->dev;
 	pdev = nv_device_base(device);
 
+	/*
+	 * Objects matching this condition have been marked as force_coherent,
+	 * so use the DMA API for them.
+	 */
+	if (!nv_device_is_cpu_coherent(device) &&
+	    ttm->caching_state == tt_uncached)
+		ttm_dma_unpopulate(ttm_dma, dev->dev);
+
 #if __OS_HAS_AGP
 	if (drm->agp.stat == ENABLED) {
 		ttm_agp_tt_unpopulate(ttm);
diff --git a/drm/nouveau_bo.h b/drm/nouveau_bo.h
index 22d2c764d80b..0f8bbd48a0b9 100644
--- a/drm/nouveau_bo.h
+++ b/drm/nouveau_bo.h
@@ -13,6 +13,7 @@ struct nouveau_bo {
 	u32 valid_domains;
 	struct ttm_place placements[3];
 	struct ttm_place busy_placements[3];
+	bool force_coherent;
 	struct ttm_bo_kmap_obj kmap;
 	struct list_head head;
--
2.1.2
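To show how the new flag is meant to be used (patch 3 below does exactly
this for the pushbuffer and the nv84 fence BO), here is a hypothetical
kernel-side caller. The nouveau_bo_* signatures are those used by this
series; the function itself, its size and its magic value are
illustrative only.

	/* Illustrative only: allocate, pin and map a coherent kernel BO. */
	static int
	example_coherent_bo(struct nouveau_drm *drm, struct nouveau_bo **pnvbo)
	{
		int ret;

		/* TTM_PL_FLAG_UNCACHED makes the BO force_coherent on ARM */
		ret = nouveau_bo_new(drm->dev, 0x1000, 0,
				     TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED,
				     0, 0, NULL, NULL, pnvbo);
		if (ret)
			return ret;

		ret = nouveau_bo_pin(*pnvbo, TTM_PL_FLAG_TT);
		if (ret == 0)
			ret = nouveau_bo_map(*pnvbo); /* reuses the DMA-API mapping */
		if (ret == 0)
			/* no cache maintenance needed, write is GPU-visible */
			nouveau_bo_wr32(*pnvbo, 0, 0xcafebabe);
		return ret;
	}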
Alexandre Courbot
2014-Oct-27 09:49 UTC
[Nouveau] [PATCH v5 3/4] drm: allocate GPFIFOs and fences coherently
Specify TTM_PL_FLAG_UNCACHED when allocating GPFIFOs and fences to allow
them to be safely accessed by the kernel without being synced on
non-coherent architectures.

Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drm/nouveau_chan.c | 2 +-
 drm/nv84_fence.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drm/nouveau_chan.c b/drm/nouveau_chan.c
index 77c81d6b45ee..0f3da86840f2 100644
--- a/drm/nouveau_chan.c
+++ b/drm/nouveau_chan.c
@@ -102,7 +102,7 @@ nouveau_channel_prep(struct nouveau_drm *drm, struct nvif_device *device,
 	chan->drm = drm;
 
 	/* allocate memory for dma push buffer */
-	target = TTM_PL_FLAG_TT;
+	target = TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED;
 	if (nouveau_vram_pushbuf)
 		target = TTM_PL_FLAG_VRAM;
 
diff --git a/drm/nv84_fence.c b/drm/nv84_fence.c
index d6c6c87c3f07..4d79be7558d8 100644
--- a/drm/nv84_fence.c
+++ b/drm/nv84_fence.c
@@ -246,8 +246,8 @@ nv84_fence_create(struct nouveau_drm *drm)
 
 	if (ret == 0)
 		ret = nouveau_bo_new(drm->dev, 16 * priv->base.contexts, 0,
-				     TTM_PL_FLAG_TT, 0, 0, NULL, NULL,
-				     &priv->bo_gart);
+				     TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED, 0,
+				     0, NULL, NULL, &priv->bo_gart);
 	if (ret == 0) {
 		ret = nouveau_bo_pin(priv->bo_gart, TTM_PL_FLAG_TT);
 		if (ret == 0) {
--
2.1.2
Alexandre Courbot
2014-Oct-27 09:49 UTC
[Nouveau] [PATCH v5 4/4] drm: synchronize BOs when required
On architectures for which access to GPU memory is non-coherent, caches
need to be flushed and invalidated explicitly when BO control changes
between the CPU and the GPU.

This patch adds buffer synchronization functions which invoke the
correct API (PCI or DMA) to ensure synchronization is effective.

Based on the TTM DMA cache helper patches by Lucas Stach.

Signed-off-by: Lucas Stach <dev at lynxeye.de>
Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drm/nouveau_bo.c  | 42 ++++++++++++++++++++++++++++++++++++++++++
 drm/nouveau_bo.h  |  2 ++
 drm/nouveau_gem.c | 12 ++++++++++++
 3 files changed, 56 insertions(+)

diff --git a/drm/nouveau_bo.c b/drm/nouveau_bo.c
index ed9a6946f6d6..d2a4768e3efd 100644
--- a/drm/nouveau_bo.c
+++ b/drm/nouveau_bo.c
@@ -426,6 +426,46 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
 		ttm_bo_kunmap(&nvbo->kmap);
 }
 
+void
+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
+{
+	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
+	struct nouveau_device *device = nvkm_device(&drm->device);
+	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
+	int i;
+
+	if (!ttm_dma)
+		return;
+
+	/* Don't waste time looping if the object is coherent */
+	if (nvbo->force_coherent)
+		return;
+
+	for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+		dma_sync_single_for_device(nv_device_base(device),
+			ttm_dma->dma_address[i], PAGE_SIZE, DMA_TO_DEVICE);
+}
+
+void
+nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
+{
+	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
+	struct nouveau_device *device = nvkm_device(&drm->device);
+	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
+	int i;
+
+	if (!ttm_dma)
+		return;
+
+	/* Don't waste time looping if the object is coherent */
+	if (nvbo->force_coherent)
+		return;
+
+	for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+		dma_sync_single_for_cpu(nv_device_base(device),
+			ttm_dma->dma_address[i], PAGE_SIZE, DMA_FROM_DEVICE);
+}
+
 int
 nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 		    bool no_wait_gpu)
@@ -437,6 +477,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 	if (ret)
 		return ret;
 
+	nouveau_bo_sync_for_device(nvbo);
+
 	return 0;
 }
 
diff --git a/drm/nouveau_bo.h b/drm/nouveau_bo.h
index 0f8bbd48a0b9..c827f233e41d 100644
--- a/drm/nouveau_bo.h
+++ b/drm/nouveau_bo.h
@@ -85,6 +85,8 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
 void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *, bool exclusive);
 int nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
 			bool no_wait_gpu);
+void nouveau_bo_sync_for_device(struct nouveau_bo *nvbo);
+void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo);
 
 struct nouveau_vma *
 nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
diff --git a/drm/nouveau_gem.c b/drm/nouveau_gem.c
index 36951ee4b157..42c34babc2e5 100644
--- a/drm/nouveau_gem.c
+++ b/drm/nouveau_gem.c
@@ -867,6 +867,7 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
 		else
 			ret = lret;
 	}
+	nouveau_bo_sync_for_cpu(nvbo);
 	drm_gem_object_unreference_unlocked(gem);
 
 	return ret;
@@ -876,6 +877,17 @@ int
 nouveau_gem_ioctl_cpu_fini(struct drm_device *dev, void *data,
 			   struct drm_file *file_priv)
 {
+	struct drm_nouveau_gem_cpu_fini *req = data;
+	struct drm_gem_object *gem;
+	struct nouveau_bo *nvbo;
+
+	gem = drm_gem_object_lookup(dev, file_priv, req->handle);
+	if (!gem)
+		return -ENOENT;
+	nvbo = nouveau_gem_object(gem);
+
+	nouveau_bo_sync_for_device(nvbo);
+	drm_gem_object_unreference_unlocked(gem);
 	return 0;
 }
--
2.1.2
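For reference, the ioctl pairing this patch completes looks roughly like
the userspace sketch below: CPU_PREP now ends with a sync_for_cpu and
CPU_FINI with a sync_for_device, so CPU writes made in between are safe
on non-coherent ARM. The ioctl structures and flags are from nouveau's
UAPI; the header path and the function itself are illustrative.

	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <drm/nouveau_drm.h>	/* header path may vary per distro */

	static int
	cpu_write_pattern(int fd, uint32_t handle, void *map, size_t len)
	{
		struct drm_nouveau_gem_cpu_prep prep = {
			.handle = handle,
			.flags  = NOUVEAU_GEM_CPU_PREP_WRITE,
		};
		struct drm_nouveau_gem_cpu_fini fini = { .handle = handle };

		/* kernel waits for the GPU, then syncs the BO for the CPU */
		if (ioctl(fd, DRM_IOCTL_NOUVEAU_GEM_CPU_PREP, &prep))
			return -1;

		memset(map, 0, len);	/* CPU owns the buffer here */

		/* kernel flushes CPU caches so the GPU sees the writes */
		return ioctl(fd, DRM_IOCTL_NOUVEAU_GEM_CPU_FINI, &fini);
	}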