Thierry Reding
2018-May-30  08:03 UTC
[Nouveau] [PATCH v3 0/2] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping
From: Thierry Reding <treding at nvidia.com> An unfortunate interaction between the 32-bit ARM DMA/IOMMU mapping code and Tegra SMMU driver changes to support IOMMU groups introduced a boot- time regression on Tegra124. This was caught very late because none of the standard configurations that are tested on Tegra enable the ARM DMA/ IOMMU mapping code since it is not needed. The reason for the failure is that the GPU found on Tegra uses a special bit in physical addresses to determine whether or not a buffer is mapped through the SMMU. In order to achieve this, the Nouveau driver needs to explicitly understand which buffers are mapped through the SMMU and which aren't. Hiding usage of the SMMU behind the DMA API is bound to fail because the knowledge doesn't exist. Furthermore, the GPU has its own IOMMU and in most cases doesn't need buffers to be physically or virtually contiguous. One notable exception is for compressible buffers which need to be mapped with large pages, which in turn require all the small pages in a large page to be contiguous. This can be achieved with an SMMU mapping, though it isn't currently supported in Nouveau. Since Translating through the SMMU is unnecessary and can have a negative impact on performance for the common case, so we want to avoid it when possible. This series of patches adds a 32-bit ARM specific API that allows a driver to detach the device from the DMA/IOMMU mapping so that it can provide its own implementation for dealing with the SMMU. The second patch makes use of that new API in the Nouveau driver to fix the regression. Thierry Thierry Reding (2): ARM: dma-mapping: Implement arm_dma_iommu_detach_device() drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping arch/arm/include/asm/dma-mapping.h | 3 +++ arch/arm/mm/dma-mapping-nommu.c | 4 ++++ arch/arm/mm/dma-mapping.c | 16 ++++++++++++++++ .../gpu/drm/nouveau/nvkm/engine/device/tegra.c | 5 +++++ 4 files changed, 28 insertions(+) -- 2.17.0
Thierry Reding
2018-May-30  08:03 UTC
[Nouveau] [PATCH v3 1/2] ARM: dma-mapping: Implement arm_dma_iommu_detach_device()
From: Thierry Reding <treding at nvidia.com>
Implement this function to enable drivers from detaching from any IOMMU
domains that architecture code might have attached them to so that they
can take exclusive control of the IOMMU via the IOMMU API.
Signed-off-by: Thierry Reding <treding at nvidia.com>
---
Changes in v3:
- make API 32-bit ARM specific
- avoid extra local variable
Changes in v2:
- fix compilation
 arch/arm/include/asm/dma-mapping.h |  3 +++
 arch/arm/mm/dma-mapping-nommu.c    |  4 ++++
 arch/arm/mm/dma-mapping.c          | 16 ++++++++++++++++
 3 files changed, 23 insertions(+)
diff --git a/arch/arm/include/asm/dma-mapping.h
b/arch/arm/include/asm/dma-mapping.h
index 8436f6ade57d..5960e9f3a9d0 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -103,6 +103,9 @@ extern void arch_setup_dma_ops(struct device *dev, u64
dma_base, u64 size,
 #define arch_teardown_dma_ops arch_teardown_dma_ops
 extern void arch_teardown_dma_ops(struct device *dev);
 
+#define arm_dma_iommu_detach_device arm_dma_iommu_detach_device
+extern void arm_dma_iommu_detach_device(struct device *dev);
+
 /* do not use this function in a driver */
 static inline bool is_device_dma_coherent(struct device *dev)
 {
diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index f448a0663b10..eb781369377b 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -241,3 +241,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base,
u64 size,
 void arch_teardown_dma_ops(struct device *dev)
 {
 }
+
+void arm_dma_iommu_detach_device(struct device *dev)
+{
+}
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index af27f1c22d93..6d8af08b3e7d 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -2400,3 +2400,19 @@ void arch_teardown_dma_ops(struct device *dev)
 
 	arm_teardown_iommu_dma_ops(dev);
 }
+
+void arm_dma_iommu_detach_device(struct device *dev)
+{
+#ifdef CONFIG_ARM_DMA_USE_IOMMU
+	struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev);
+
+	if (!mapping)
+		return;
+
+	arm_iommu_release_mapping(mapping);
+	arm_iommu_detach_device(dev);
+
+	set_dma_ops(dev, arm_get_dma_map_ops(dev->archdata.dma_coherent));
+#endif
+}
+EXPORT_SYMBOL(arm_dma_iommu_detach_device);
-- 
2.17.0
Thierry Reding
2018-May-30  08:03 UTC
[Nouveau] [PATCH v3 2/2] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping
From: Thierry Reding <treding at nvidia.com> Depending on the kernel configuration, early ARM architecture setup code may have attached the GPU to a DMA/IOMMU mapping that transparently uses the IOMMU to back the DMA API. Tegra requires special handling for IOMMU backed buffers (a special bit in the GPU's MMU page tables indicates the memory path to take: via the SMMU or directly to the memory controller). Transparently backing DMA memory with an IOMMU prevents Nouveau from properly handling such memory accesses and causes memory access faults. As a side-note: buffers other than those allocated in instance memory don't need to be physically contiguous from the GPU's perspective since the GPU can map them into contiguous buffers using its own MMU. Mapping these buffers through the IOMMU is unnecessary and will even lead to performance degradation because of the additional translation. One exception to this are compressible buffers which need large pages. In order to enable these large pages, multiple small pages will have to be combined into one large (I/O virtually contiguous) mapping via the IOMMU. However, that is a topic outside the scope of this fix and isn't currently supported. An implementation will want to explicitly create these large pages in the Nouveau driver, so detaching from a DMA/IOMMU mapping would still be required. Signed-off-by: Thierry Reding <treding at nvidia.com> --- Changes in v3: - clarify the use of IOMMU mapping for compressible buffers - squash multiple patches into this drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c index 78597da6313a..d0538af1b967 100644 --- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c +++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c @@ -105,6 +105,11 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra *tdev) unsigned long pgsize_bitmap; int ret; +#if IS_ENABLED(CONFIG_ARM) + /* make sure we can use the IOMMU exclusively */ + arm_dma_iommu_detach_device(dev); +#endif + if (!tdev->func->iommu_bit) return; -- 2.17.0
Robin Murphy
2018-May-30  09:59 UTC
[Nouveau] [PATCH v3 1/2] ARM: dma-mapping: Implement arm_dma_iommu_detach_device()
On 30/05/18 09:03, Thierry Reding wrote:> From: Thierry Reding <treding at nvidia.com> > > Implement this function to enable drivers from detaching from any IOMMU > domains that architecture code might have attached them to so that they > can take exclusive control of the IOMMU via the IOMMU API. > > Signed-off-by: Thierry Reding <treding at nvidia.com> > --- > Changes in v3: > - make API 32-bit ARM specific > - avoid extra local variable > > Changes in v2: > - fix compilation > > arch/arm/include/asm/dma-mapping.h | 3 +++ > arch/arm/mm/dma-mapping-nommu.c | 4 ++++ > arch/arm/mm/dma-mapping.c | 16 ++++++++++++++++ > 3 files changed, 23 insertions(+) > > diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h > index 8436f6ade57d..5960e9f3a9d0 100644 > --- a/arch/arm/include/asm/dma-mapping.h > +++ b/arch/arm/include/asm/dma-mapping.h > @@ -103,6 +103,9 @@ extern void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, > #define arch_teardown_dma_ops arch_teardown_dma_ops > extern void arch_teardown_dma_ops(struct device *dev); > > +#define arm_dma_iommu_detach_device arm_dma_iommu_detach_device > +extern void arm_dma_iommu_detach_device(struct device *dev); > + > /* do not use this function in a driver */ > static inline bool is_device_dma_coherent(struct device *dev) > { > diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c > index f448a0663b10..eb781369377b 100644 > --- a/arch/arm/mm/dma-mapping-nommu.c > +++ b/arch/arm/mm/dma-mapping-nommu.c > @@ -241,3 +241,7 @@ void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size, > void arch_teardown_dma_ops(struct device *dev) > { > } > + > +void arm_dma_iommu_detach_device(struct device *dev) > +{ > +} > diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c > index af27f1c22d93..6d8af08b3e7d 100644 > --- a/arch/arm/mm/dma-mapping.c > +++ b/arch/arm/mm/dma-mapping.c > @@ -2400,3 +2400,19 @@ void arch_teardown_dma_ops(struct device *dev) > > arm_teardown_iommu_dma_ops(dev); > } > + > +void arm_dma_iommu_detach_device(struct device *dev) > +{ > +#ifdef CONFIG_ARM_DMA_USE_IOMMU > + struct dma_iommu_mapping *mapping = to_dma_iommu_mapping(dev); > + > + if (!mapping) > + return; > + > + arm_iommu_release_mapping(mapping);Potentially freeing the mapping before you try to operate on it is never the best idea. Plus arm_iommu_detach_device() already releases a reference appropriately anyway, so it's a double-free.> + arm_iommu_detach_device(dev); > + > + set_dma_ops(dev, arm_get_dma_map_ops(dev->archdata.dma_coherent)); > +#endif > +} > +EXPORT_SYMBOL(arm_dma_iommu_detach_device);I really don't see why we need an extra function that essentially just duplicates arm_iommu_detach_device(). The only real difference here is that here you reset the DMA ops more appropriately, but I think the existing function should be fixed to do that anyway, since set_dma_ops(dev, NULL) now just behaves as an unconditional fallback to the noncoherent arm_dma_ops, which clearly isn't always right. Robin.
Robin Murphy
2018-May-30  10:30 UTC
[Nouveau] [PATCH v3 2/2] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping
On 30/05/18 09:03, Thierry Reding wrote:> From: Thierry Reding <treding at nvidia.com> > > Depending on the kernel configuration, early ARM architecture setup code > may have attached the GPU to a DMA/IOMMU mapping that transparently uses > the IOMMU to back the DMA API. Tegra requires special handling for IOMMU > backed buffers (a special bit in the GPU's MMU page tables indicates the > memory path to take: via the SMMU or directly to the memory controller). > Transparently backing DMA memory with an IOMMU prevents Nouveau from > properly handling such memory accesses and causes memory access faults. > > As a side-note: buffers other than those allocated in instance memory > don't need to be physically contiguous from the GPU's perspective since > the GPU can map them into contiguous buffers using its own MMU. Mapping > these buffers through the IOMMU is unnecessary and will even lead to > performance degradation because of the additional translation. One > exception to this are compressible buffers which need large pages. In > order to enable these large pages, multiple small pages will have to be > combined into one large (I/O virtually contiguous) mapping via the > IOMMU. However, that is a topic outside the scope of this fix and isn't > currently supported. An implementation will want to explicitly create > these large pages in the Nouveau driver, so detaching from a DMA/IOMMU > mapping would still be required.I wonder if it might make sense to have a hook in iommu_attach_device() to notify the arch DMA API code when moving devices between unmanaged and DMA ops domains? That seems like it might be the most low-impact way to address the overall problem long-term.> Signed-off-by: Thierry Reding <treding at nvidia.com> > --- > Changes in v3: > - clarify the use of IOMMU mapping for compressible buffers > - squash multiple patches into this > > drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c > index 78597da6313a..d0538af1b967 100644 > --- a/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c > +++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/tegra.c > @@ -105,6 +105,11 @@ nvkm_device_tegra_probe_iommu(struct nvkm_device_tegra *tdev) > unsigned long pgsize_bitmap; > int ret; > > +#if IS_ENABLED(CONFIG_ARM)Wouldn't CONFIG_ARM_DMA_USE_IOMMU be even more appropriate?> + /* make sure we can use the IOMMU exclusively */ > + arm_dma_iommu_detach_device(dev);As before, I would just use the existing infrastructure the same way the Exynos DRM driver currently does in __exynos_iommu_attach() (albeit without then reattaching to another DMA ops mapping). Robin.> +#endif > + > if (!tdev->func->iommu_bit) > return; > >
Possibly Parallel Threads
- [PATCH v3 1/2] ARM: dma-mapping: Implement arm_dma_iommu_detach_device()
- [PATCH v3 0/2] drm/nouveau: tegra: Detach from ARM DMA/IOMMU mapping
- [PATCH v3 1/2] ARM: dma-mapping: Implement arm_dma_iommu_detach_device()
- [PATCH v3 1/2] ARM: dma-mapping: Implement arm_dma_iommu_detach_device()
- [PATCH v3 1/2] ARM: dma-mapping: Implement arm_dma_iommu_detach_device()