Alexandre Courbot
2014-Jun-24 09:54 UTC
[Nouveau] [PATCH v2 0/3] drm/ttm: nouveau: memory coherency for ARM
For this v2 I have fixed the patches that are non-controversial (all Lucas' :)) and am resubmitting them in the hope that they will get merged. This leaves only the issue of mapping Nouveau's system-memory buffers to be solved. That issue is quite complex, so let me summarize the situation and the data I have at hand.

ARM caching is like a quantum world where Murphy's law constantly applies: sometimes things that you don't expect to work happen to work, but never the ones you would like to.

On ARM the accepted wisdom is that all CPU mappings of the same physical memory must share the same attributes, and that failure to follow this rule results in undefined behavior. I have heard voices claiming that recent ARM SoCs are not affected by this, but I have yet to see hard evidence that this is indeed the case.

Most (or all) physical memory is already mapped, cached, in the lowmem area of the virtual address space. This means that if we want to be safe with respect to the rule above, all subsequent mappings of that memory must be performed the same way. Nouveau currently maps its buffers cached, since on PCI it can rely on the bus to snoop and invalidate addresses written by the CPU - something we do not have on ARM shared-memory systems.

In the previous revision of this series I proposed a (bad) patch that added a way to flush the CPU cache after each write (http://lists.freedesktop.org/archives/dri-devel/2014-May/059893.html), but it did not trigger a lot of approval. Instead, it was suggested to map such BOs write-combined. This would break the "one mapping type only" rule, but I gave it a try by changing the TTM_PL_TT manager's authorized caching to WC. This immediately resulted in breakage of applications using the GPU. Digging a little further, I noticed that kernel mappings can be performed WC or uncached without any problem, but that user-space mappings *must* be cached under all circumstances. Failure to do so results in invalid pushbuffers being sent to the GPU, messed-up vertices, and other corruption. Uncached mappings result in the same breakage.

So, to summarize our options for GK20A:

1) Keeping mappings of TTM buffers cached seems to be the safest option, as it is consistent with the lowmem mapping that likely covers the memory of our buffers. But then kernel CPU writes to these buffers must be flushed explicitly one way or another (a sketch of what that flushing would involve follows below).

2) Changing the kernel mappings to WC or uncached seems to be safe. However, user-space mappings must still remain cached or inconsistencies appear. Such a double policy for kernel and user-space mappings is not implemented in TTM, and nothing so far suggests that it should be.

And that is where we stand. I am not considering the other possibilities (carving memory out of lowmem, etc.) since they have already been discussed many times by people much smarter than me (e.g. http://lists.linaro.org/pipermail/linaro-mm-sig/2011-April/000003.html) and the issue apparently remains nonetheless.

At this point, suggestions towards option 1) or 2) (or pointers to where I may have erred in my understanding of ARM mappings) are welcome. And in the meantime, let's try to get the three guys below merged!
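To illustrate option 1), the flushing would boil down to the streaming DMA API, along these lines (a minimal sketch only, not part of this series; 'dev' and 'dma_addrs' are placeholders for the driver's struct device and the buffer's per-page DMA addresses, which TTM already tracks in struct ttm_dma_tt):

#include <linux/dma-mapping.h>

/*
 * Sketch of option 1): keep CPU mappings cached and clean the CPU cache
 * explicitly after the CPU is done writing, so that the GPU reads
 * up-to-date data. Both parameters are stand-ins, not existing code.
 */
static void flush_cpu_writes(struct device *dev, dma_addr_t *dma_addrs,
			     unsigned long num_pages)
{
	unsigned long i;

	for (i = 0; i < num_pages; i++)
		dma_sync_single_for_device(dev, dma_addrs[i], PAGE_SIZE,
					   DMA_TO_DEVICE);
}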
Changes since v1:
- Removed conditional compilation for Nouveau cache sync handler
- Refactored nouveau_gem_ioctl_cpu_prep() into a new function to keep
  buffer cache management in nouveau_bo.c

Lucas Stach (3):
  drm/ttm: recognize ARM arch in ioprot handler
  drm/ttm: introduce dma cache sync helpers
  drm/nouveau: hook up cache sync functions

 drivers/gpu/drm/nouveau/nouveau_bo.c  | 47 +++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.h  |  1 +
 drivers/gpu/drm/nouveau/nouveau_gem.c | 10 +++-----
 drivers/gpu/drm/ttm/ttm_bo_util.c     |  2 +-
 drivers/gpu/drm/ttm/ttm_tt.c          | 25 +++++++++++++++++++
 include/drm/ttm/ttm_bo_driver.h       | 28 +++++++++++++++++++++
 6 files changed, 105 insertions(+), 8 deletions(-)

--
2.0.0
Alexandre Courbot
2014-Jun-24 09:54 UTC
[Nouveau] [PATCH v2 1/3] drm/ttm: recognize ARM arch in ioprot handler
From: Lucas Stach <dev at lynxeye.de>

Signed-off-by: Lucas Stach <dev at lynxeye.de>
Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 1df856f78568..30e5d90cb7bc 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -500,7 +500,7 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
 		pgprot_val(tmp) |= _PAGE_GUARDED;
 	}
 #endif
-#if defined(__ia64__)
+#if defined(__ia64__) || defined(__arm__)
 	if (caching_flags & TTM_PL_FLAG_WC)
 		tmp = pgprot_writecombine(tmp);
 	else
--
2.0.0
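For context, ttm_io_prot() is the helper TTM consults when computing page protections for its mappings; without this hunk, ARM falls through to the default and silently ignores WC/uncached requests. A rough illustration of a kernel-side path that benefits (the page array and wrapper are assumptions, the pattern mirrors TTM's kmap code):

#include <linux/vmalloc.h>
#include <drm/ttm/ttm_bo_driver.h>

/*
 * Illustration only: map TT pages into the kernel with the caching
 * attributes computed by ttm_io_prot(). With the patch above, asking
 * for write-combining on ARM now actually yields a WC mapping.
 */
static void *kmap_tt_wc(struct page **pages, unsigned int num_pages)
{
	pgprot_t prot = ttm_io_prot(TTM_PL_FLAG_WC, PAGE_KERNEL);

	return vmap(pages, num_pages, 0, prot);
}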
Alexandre Courbot
2014-Jun-24 09:54 UTC
[Nouveau] [PATCH v2 2/3] drm/ttm: introduce dma cache sync helpers
From: Lucas Stach <dev at lynxeye.de>

On architectures for which access to GPU memory is non-coherent,
caches need to be flushed and invalidated explicitly at the
appropriate places. Introduce two small helpers to make things
easy for TTM-based drivers.

Signed-off-by: Lucas Stach <dev at lynxeye.de>
Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drivers/gpu/drm/ttm/ttm_tt.c    | 25 +++++++++++++++++++++++++
 include/drm/ttm/ttm_bo_driver.h | 28 ++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 75f319090043..66c16ad35f70 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -38,6 +38,7 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/dma-mapping.h>
 #include <drm/drm_cache.h>
 #include <drm/drm_mem_util.h>
 #include <drm/ttm/ttm_module.h>
@@ -248,6 +249,30 @@ void ttm_dma_tt_fini(struct ttm_dma_tt *ttm_dma)
 }
 EXPORT_SYMBOL(ttm_dma_tt_fini);
 
+void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
+				      struct device *dev)
+{
+	unsigned long i;
+
+	for (i = 0; i < ttm_dma->ttm.num_pages; i++) {
+		dma_sync_single_for_device(dev, ttm_dma->dma_address[i],
+					   PAGE_SIZE, DMA_TO_DEVICE);
+	}
+}
+EXPORT_SYMBOL(ttm_dma_tt_cache_sync_for_device);
+
+void ttm_dma_tt_cache_sync_for_cpu(struct ttm_dma_tt *ttm_dma,
+				   struct device *dev)
+{
+	unsigned long i;
+
+	for (i = 0; i < ttm_dma->ttm.num_pages; i++) {
+		dma_sync_single_for_cpu(dev, ttm_dma->dma_address[i],
+					PAGE_SIZE, DMA_FROM_DEVICE);
+	}
+}
+EXPORT_SYMBOL(ttm_dma_tt_cache_sync_for_cpu);
+
 void ttm_tt_unbind(struct ttm_tt *ttm)
 {
 	int ret;
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index a5183da3ef92..52fb709568fc 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/spinlock.h>
 #include <linux/reservation.h>
+#include <linux/device.h>
 
 struct ttm_backend_func {
 	/**
@@ -690,6 +691,33 @@ extern int ttm_tt_swapout(struct ttm_tt *ttm,
  */
 extern void ttm_tt_unpopulate(struct ttm_tt *ttm);
 
+/**
+ * ttm_dma_tt_cache_sync_for_device:
+ *
+ * @ttm A struct ttm_tt of the type returned by ttm_dma_tt_init.
+ * @dev A struct device representing the device to which to sync.
+ *
+ * This function will flush the CPU caches on arches where snooping in the
+ * TT is not available. On fully coherent arches this will turn into an
+ * (almost) noop. This makes sure that data written by the CPU is visible
+ * to the device.
+ */
+extern void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
+					     struct device *dev);
+
+/**
+ * ttm_dma_tt_cache_sync_for_cpu:
+ *
+ * @ttm A struct ttm_tt of the type returned by ttm_dma_tt_init.
+ * @dev A struct device representing the device from which to sync.
+ *
+ * This function will invalidate the CPU caches on arches where snooping
+ * in the TT is not available. On fully coherent arches this will turn
+ * into an (almost) noop. This makes sure that the CPU does not read any
+ * stale cached or prefetched data.
+ */
+extern void ttm_dma_tt_cache_sync_for_cpu(struct ttm_dma_tt *ttm_dma,
+					  struct device *dev);
+
 /*
  * ttm_bo.c
  */
--
2.0.0
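The intended calling pattern for the two helpers, roughly (a sketch only; 'ttm_dma' and 'dev' stand for the driver's own struct ttm_dma_tt and DMA-capable struct device, and error handling is elided):

#include <drm/ttm/ttm_bo_driver.h>

/*
 * Sketch of the CPU/GPU ownership hand-off these helpers enable.
 * Both pointers are assumed to be the driver's own.
 */
static void example_roundtrip(struct ttm_dma_tt *ttm_dma, struct device *dev)
{
	/* CPU writes are done: clean the cache so the GPU sees them. */
	ttm_dma_tt_cache_sync_for_device(ttm_dma, dev);

	/* ... submit GPU work and wait on its fence here ... */

	/* GPU writes are done: invalidate stale CPU cache lines. */
	ttm_dma_tt_cache_sync_for_cpu(ttm_dma, dev);
}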
Alexandre Courbot
2014-Jun-24 09:54 UTC
[Nouveau] [PATCH v2 3/3] drm/nouveau: hook up cache sync functions
From: Lucas Stach <dev at lynxeye.de>

Use the newly-introduced TTM cache sync functions in Nouveau.

Signed-off-by: Lucas Stach <dev at lynxeye.de>
[acourbot at nvidia.com: rearrange code, make platform-friendly]
Signed-off-by: Alexandre Courbot <acourbot at nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c  | 47 +++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.h  |  1 +
 drivers/gpu/drm/nouveau/nouveau_gem.c | 10 +++-----
 3 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index b6dc85c614be..74c68c16e777 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -284,6 +284,34 @@ set_placement_range(struct nouveau_bo *nvbo, uint32_t type)
 	}
 }
 
+static void
+nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
+{
+	struct nouveau_device *device;
+	struct ttm_tt *ttm = nvbo->bo.ttm;
+
+	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
+
+	if (nvbo->bo.ttm && nvbo->bo.ttm->caching_state == tt_cached)
+		ttm_dma_tt_cache_sync_for_cpu((struct ttm_dma_tt *)nvbo->bo.ttm,
+					      nv_device_base(device));
+}
+
+static void
+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
+{
+	struct ttm_tt *ttm = nvbo->bo.ttm;
+
+	if (ttm && ttm->caching_state == tt_cached) {
+		struct nouveau_device *device;
+
+		device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
+
+		ttm_dma_tt_cache_sync_for_device((struct ttm_dma_tt *)ttm,
+						 nv_device_base(device));
+	}
+}
+
 void
 nouveau_bo_placement_set(struct nouveau_bo *nvbo, uint32_t type, uint32_t busy)
 {
@@ -407,6 +435,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 {
 	int ret;
 
+	nouveau_bo_sync_for_device(nvbo);
+
 	ret = ttm_bo_validate(&nvbo->bo, &nvbo->placement,
 			      interruptible, no_wait_gpu);
 	if (ret)
@@ -415,6 +445,23 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 	return 0;
 }
 
+int
+nouveau_bo_wait(struct nouveau_bo *nvbo, bool no_wait)
+{
+	int ret;
+
+	spin_lock(&nvbo->bo.bdev->fence_lock);
+	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
+	spin_unlock(&nvbo->bo.bdev->fence_lock);
+
+	if (ret)
+		return ret;
+
+	nouveau_bo_sync_for_cpu(nvbo);
+
+	return 0;
+}
+
 u16
 nouveau_bo_rd16(struct nouveau_bo *nvbo, unsigned index)
 {
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index ff17c1f432fc..a4e9052d54fd 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -81,6 +81,7 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
 void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *);
 int nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
 			bool no_wait_gpu);
+int nouveau_bo_wait(struct nouveau_bo *, bool no_wait);
 
 struct nouveau_vma *
 nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index c90c0dc0afe8..916cb8ff568c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -884,19 +884,15 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
 {
 	struct drm_nouveau_gem_cpu_prep *req = data;
 	struct drm_gem_object *gem;
-	struct nouveau_bo *nvbo;
 	bool no_wait = !!(req->flags & NOUVEAU_GEM_CPU_PREP_NOWAIT);
-	int ret = -EINVAL;
+	int ret;
 
 	gem = drm_gem_object_lookup(dev, file_priv, req->handle);
 	if (!gem)
 		return -ENOENT;
-	nvbo = nouveau_gem_object(gem);
 
-	spin_lock(&nvbo->bo.bdev->fence_lock);
-	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
-	spin_unlock(&nvbo->bo.bdev->fence_lock);
+	ret = nouveau_bo_wait(nouveau_gem_object(gem), no_wait);
 	drm_gem_object_unreference_unlocked(gem);
+
 	return ret;
 }
--
2.0.0
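Seen from the other side of the ioctl, the user-space pattern that ends up in nouveau_bo_wait() looks roughly like this (a sketch; struct drm_nouveau_gem_cpu_prep and NOUVEAU_GEM_CPU_PREP_NOWAIT come from the Nouveau UAPI as used above, while the include path, file descriptor, and handle are assumptions):

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <drm/nouveau_drm.h>	/* assumed install path of the UAPI header */

/*
 * Sketch: before reading a BO's contents through a CPU mapping, ask the
 * kernel to wait for the GPU - which, with this series, also invalidates
 * the CPU caches. 'fd' is an open DRM device, 'handle' a valid GEM handle;
 * both are assumptions of this example.
 */
static int bo_prep_for_cpu_read(int fd, uint32_t handle)
{
	struct drm_nouveau_gem_cpu_prep req;

	memset(&req, 0, sizeof(req));
	req.handle = handle;
	req.flags = 0;	/* NOUVEAU_GEM_CPU_PREP_NOWAIT would poll instead */

	return ioctl(fd, DRM_IOCTL_NOUVEAU_GEM_CPU_PREP, &req);
}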
Russell King - ARM Linux
2014-Jun-24 10:02 UTC
[Nouveau] [PATCH v2 2/3] drm/ttm: introduce dma cache sync helpers
On Tue, Jun 24, 2014 at 06:54:26PM +0900, Alexandre Courbot wrote:
> From: Lucas Stach <dev at lynxeye.de>
>
> On architectures for which access to GPU memory is non-coherent,
> caches need to be flushed and invalidated explicitly at the
> appropriate places. Introduce two small helpers to make things
> easy for TTM-based drivers.

Have you run this with DMA API debugging enabled? I suspect you haven't,
and I recommend that you do.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.