Ilia Mirkin
2015-Apr-16 19:31 UTC
[Nouveau] [PATCH 6/6] mmu: gk20a: implement IOMMU mapping for big pages
Two questions --

(a) What's the perf impact of doing this? Less work for the GPU MMU
but more work for the IOMMU...
(b) Would it be a good idea to do this for desktop GPUs that are on
CPUs with IOMMUs in them (VT-d and whatever the AMD one is)? Is there
some sort of shared API for this stuff that you should be (or are?)
using?

  -ilia

On Thu, Apr 16, 2015 at 7:06 AM, Vince Hsu <vinceh at nvidia.com> wrote:
> This patch uses IOMMU to aggregate (probably) discrete small pages as larger
> big page(s) and map it to GMMU.
>
> Signed-off-by: Vince Hsu <vinceh at nvidia.com>
> ---
>  drm/nouveau/nvkm/engine/device/gk104.c |   2 +-
>  drm/nouveau/nvkm/subdev/mmu/Kbuild     |   1 +
>  drm/nouveau/nvkm/subdev/mmu/gk20a.c    | 253 +++++++++++++++++++++++++++++++++
>  3 files changed, 255 insertions(+), 1 deletion(-)
>  create mode 100644 drm/nouveau/nvkm/subdev/mmu/gk20a.c
>
> diff --git a/drm/nouveau/nvkm/engine/device/gk104.c b/drm/nouveau/nvkm/engine/device/gk104.c
> index 6a9483f65d83..9ea48ba31c0d 100644
> --- a/drm/nouveau/nvkm/engine/device/gk104.c
> +++ b/drm/nouveau/nvkm/engine/device/gk104.c
> @@ -172,7 +172,7 @@ gk104_identify(struct nvkm_device *device)
>  		device->oclass[NVDEV_SUBDEV_LTC    ] =  gk104_ltc_oclass;
>  		device->oclass[NVDEV_SUBDEV_IBUS   ] = &gk20a_ibus_oclass;
>  		device->oclass[NVDEV_SUBDEV_INSTMEM] =  gk20a_instmem_oclass;
> -		device->oclass[NVDEV_SUBDEV_MMU    ] = &gf100_mmu_oclass;
> +		device->oclass[NVDEV_SUBDEV_MMU    ] = &gk20a_mmu_oclass;
>  		device->oclass[NVDEV_SUBDEV_BAR    ] = &gk20a_bar_oclass;
>  		device->oclass[NVDEV_ENGINE_DMAOBJ ] =  gf110_dmaeng_oclass;
>  		device->oclass[NVDEV_ENGINE_FIFO   ] =  gk20a_fifo_oclass;
> diff --git a/drm/nouveau/nvkm/subdev/mmu/Kbuild b/drm/nouveau/nvkm/subdev/mmu/Kbuild
> index 012c9db687b2..141302a8e933 100644
> --- a/drm/nouveau/nvkm/subdev/mmu/Kbuild
> +++ b/drm/nouveau/nvkm/subdev/mmu/Kbuild
> @@ -4,3 +4,4 @@ nvkm-y += nvkm/subdev/mmu/nv41.o
>  nvkm-y += nvkm/subdev/mmu/nv44.o
>  nvkm-y += nvkm/subdev/mmu/nv50.o
>  nvkm-y += nvkm/subdev/mmu/gf100.o
> +nvkm-y += nvkm/subdev/mmu/gk20a.o
> diff --git a/drm/nouveau/nvkm/subdev/mmu/gk20a.c b/drm/nouveau/nvkm/subdev/mmu/gk20a.c
> new file mode 100644
> index 000000000000..b444b73e208d
> --- /dev/null
> +++ b/drm/nouveau/nvkm/subdev/mmu/gk20a.c
> @@ -0,0 +1,253 @@
> +/*
> + * Copyright (c) 2015, NVIDIA CORPORATION. All rights reserved.
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> + * DEALINGS IN THE SOFTWARE.
> + */
> +
> +#include <subdev/fb.h>
> +#include <subdev/ltc.h>
> +#include <subdev/mmu.h>
> +
> +#ifdef __KERNEL__
> +#include <linux/iommu.h>
> +#include <nouveau_platform.h>
> +#endif
> +
> +#include "gf100.h"
> +
> +struct gk20a_mmu_priv {
> +	struct nvkm_mmu base;
> +};
> +
> +struct gk20a_mmu_iommu_mapping {
> +	struct nvkm_mm_node *node;
> +	u64 iova;
> +};
> +
> +extern const u8 gf100_pte_storage_type_map[256];
> +
> +static void
> +gk20a_vm_map(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,
> +	     struct nvkm_mem *mem, u32 pte, u64 list)
> +{
> +	u32 target = (vma->access & NV_MEM_ACCESS_NOSNOOP) ? 7 : 5;
> +	u64 phys;
> +
> +	pte <<= 3;
> +	phys = gf100_vm_addr(vma, list, mem->memtype, target);
> +
> +	if (mem->tag) {
> +		struct nvkm_ltc *ltc = nvkm_ltc(vma->vm->mmu);
> +		u32 tag = mem->tag->offset;
> +		phys |= (u64)tag << (32 + 12);
> +		ltc->tags_clear(ltc, tag, 1);
> +	}
> +
> +	nv_wo32(pgt, pte + 0, lower_32_bits(phys));
> +	nv_wo32(pgt, pte + 4, upper_32_bits(phys));
> +}
> +
> +static void
> +gk20a_vm_map_iommu(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,
> +		   struct nvkm_mem *mem, u32 pte, dma_addr_t *list,
> +		   void **priv)
> +{
> +	struct nvkm_vm *vm = vma->vm;
> +	struct nvkm_mmu *mmu = vm->mmu;
> +	struct nvkm_mm_node *node;
> +	struct nouveau_platform_device *plat;
> +	struct gk20a_mmu_iommu_mapping *p;
> +	int npages = 1 << (mmu->lpg_shift - mmu->spg_shift);
> +	int i, ret;
> +	u64 addr;
> +
> +	plat = nv_device_to_platform(nv_device(&mmu->base));
> +
> +	*priv = kzalloc(sizeof(struct gk20a_mmu_iommu_mapping), GFP_KERNEL);
> +	if (!*priv)
> +		return;
> +
> +	mutex_lock(&plat->gpu->iommu.mutex);
> +	ret = nvkm_mm_head(plat->gpu->iommu.mm,
> +			   0,
> +			   1,
> +			   npages,
> +			   npages,
> +			   (1 << mmu->lpg_shift) >> 12,
> +			   &node);
> +	mutex_unlock(&plat->gpu->iommu.mutex);
> +	if (ret)
> +		return;
> +
> +	for (i = 0; i < npages; i++, list++) {
> +		ret = iommu_map(plat->gpu->iommu.domain,
> +				(node->offset + i) << PAGE_SHIFT,
> +				*list,
> +				PAGE_SIZE,
> +				IOMMU_READ | IOMMU_WRITE);
> +
> +		if (ret < 0)
> +			return;
> +
> +		nv_trace(mmu, "IOMMU: IOVA=0x%016llx-> IOMMU -> PA=%016llx\n",
> +			 (u64)(node->offset + i) << PAGE_SHIFT, (u64)(*list));
> +	}
> +
> +	addr = (u64)node->offset << PAGE_SHIFT;
> +	addr |= BIT_ULL(plat->gpu->iommu.phys_addr_bit);
> +
> +	gk20a_vm_map(vma, pgt, mem, pte, addr);
> +
> +	p = *priv;
> +	p->node = node;
> +	p->iova = node->offset << PAGE_SHIFT;
> +}
> +
> +static void
> +gk20a_vm_map_sg_iommu(struct nvkm_vma *vma, struct nvkm_gpuobj *pgt,
> +		      struct nvkm_mem *mem, u32 pte, struct sg_page_iter *iter,
> +		      void **priv)
> +{
> +	struct nvkm_vm *vm = vma->vm;
> +	struct nvkm_mmu *mmu = vm->mmu;
> +	struct nvkm_mm_node *node;
> +	struct nouveau_platform_device *plat;
> +	struct gk20a_mmu_iommu_mapping *p;
> +	int npages = 1 << (mmu->lpg_shift - mmu->spg_shift);
> +	int i, ret;
> +	u64 addr;
> +
> +	plat = nv_device_to_platform(nv_device(&mmu->base));
> +
> +	*priv = kzalloc(sizeof(struct gk20a_mmu_iommu_mapping), GFP_KERNEL);
> +	if (!*priv)
> +		return;
> +
> +	mutex_lock(&plat->gpu->iommu.mutex);
> +	ret = nvkm_mm_head(plat->gpu->iommu.mm,
> +			   0,
> +			   1,
> +			   npages,
> +			   npages,
> +			   (1 << mmu->lpg_shift) >> 12,
> +			   &node);
> +	mutex_unlock(&plat->gpu->iommu.mutex);
> +	if (ret)
> +		return;
> +
> +	for (i = 0; i < npages; i++) {
> +		dma_addr_t phys = sg_page_iter_dma_address(iter);
> +
> +		ret = iommu_map(plat->gpu->iommu.domain,
> +				(node->offset + i) << PAGE_SHIFT,
> +				phys,
> +				PAGE_SIZE,
> +				IOMMU_READ | IOMMU_WRITE);
> +
> +		if (ret < 0)
> +			return;
> +
> +		nv_trace(mmu, "IOMMU: IOVA=0x%016llx-> IOMMU -> PA=%016llx\n",
> +			 (u64)(node->offset + i) << PAGE_SHIFT, (u64)phys);
> +
> +		if ((i < npages - 1) && !__sg_page_iter_next(iter)) {
> +			nv_error(mmu, "failed to iterate sg table\n");
> +			return;
> +		}
> +	}
> +
> +	addr = (u64)node->offset << PAGE_SHIFT;
> +	addr |= BIT_ULL(plat->gpu->iommu.phys_addr_bit);
> +
> +	gk20a_vm_map(vma, pgt, mem, pte, addr);
> +
> +	p = *priv;
> +	p->node = node;
> +	p->iova = node->offset << PAGE_SHIFT;
> +}
> +
> +static void
> +gk20a_vm_unmap_iommu(struct nvkm_vma *vma, void *priv)
> +{
> +	struct nvkm_vm *vm = vma->vm;
> +	struct nvkm_mmu *mmu = vm->mmu;
> +	struct nouveau_platform_device *plat;
> +	struct gk20a_mmu_iommu_mapping *p = priv;
> +	int ret;
> +
> +	plat = nv_device_to_platform(nv_device(&mmu->base));
> +
> +	ret = iommu_unmap(plat->gpu->iommu.domain, p->iova,
> +			  1 << mmu->lpg_shift);
> +	WARN(ret < 0, "failed to unmap IOMMU address 0x%16llx, ret=%d\n",
> +	     p->iova, ret);
> +
> +	mutex_lock(&plat->gpu->iommu.mutex);
> +	nvkm_mm_free(plat->gpu->iommu.mm, &p->node);
> +	mutex_unlock(&plat->gpu->iommu.mutex);
> +
> +	kfree(priv);
> +}
> +
> +static int
> +gk20a_mmu_ctor(struct nvkm_object *parent, struct nvkm_object *engine,
> +	       struct nvkm_oclass *oclass, void *data, u32 size,
> +	       struct nvkm_object **pobject)
> +{
> +	struct gk20a_mmu_priv *priv;
> +	struct nouveau_platform_device *plat;
> +	int ret;
> +
> +	ret = nvkm_mmu_create(parent, engine, oclass, "VM", "vm", &priv);
> +	*pobject = nv_object(priv);
> +	if (ret)
> +		return ret;
> +
> +	plat = nv_device_to_platform(nv_device(parent));
> +	if (plat->gpu->iommu.domain)
> +		priv->base.iommu_capable = true;
> +
> +	priv->base.limit = 1ULL << 40;
> +	priv->base.dma_bits = 40;
> +	priv->base.pgt_bits = 27 - 12;
> +	priv->base.spg_shift = 12;
> +	priv->base.lpg_shift = 17;
> +	priv->base.create = gf100_vm_create;
> +	priv->base.map_pgt = gf100_vm_map_pgt;
> +	priv->base.map = gf100_vm_map;
> +	priv->base.map_sg = gf100_vm_map_sg;
> +	priv->base.map_iommu = gk20a_vm_map_iommu;
> +	priv->base.unmap_iommu = gk20a_vm_unmap_iommu;
> +	priv->base.map_sg_iommu = gk20a_vm_map_sg_iommu;
> +	priv->base.unmap = gf100_vm_unmap;
> +	priv->base.flush = gf100_vm_flush;
> +
> +	return 0;
> +}
> +
> +struct nvkm_oclass
> +gk20a_mmu_oclass = {
> +	.handle = NV_SUBDEV(MMU, 0xea),
> +	.ofuncs = &(struct nvkm_ofuncs) {
> +		.ctor = gk20a_mmu_ctor,
> +		.dtor = _nvkm_mmu_dtor,
> +		.init = _nvkm_mmu_init,
> +		.fini = _nvkm_mmu_fini,
> +	},
> +};
> --
> 2.1.4
>
> _______________________________________________
> Nouveau mailing list
> Nouveau at lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/nouveau
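
The core pattern of the patch is: carve a big-page-sized, big-page-aligned
range out of a pre-reserved IOVA region, IOMMU-map each discrete small page
back-to-back into that range, then hand the GMMU the single contiguous
address as one large page. Below is a minimal sketch of that loop,
independent of the nvkm object model; the shifts match the patch (4 KiB
small pages, 128 KiB big pages), and the caller is assumed to have
allocated an aligned 'iova' range the way the patch does with
nvkm_mm_head().

#include <linux/iommu.h>
#include <linux/sizes.h>
#include <linux/types.h>

#define SPG_SHIFT	12				/* 4 KiB small pages */
#define LPG_SHIFT	17				/* 128 KiB big pages */
#define SPG_PER_LPG	(1 << (LPG_SHIFT - SPG_SHIFT))	/* 32 */

/*
 * Map SPG_PER_LPG discrete small pages back-to-back at 'iova' so that
 * the GMMU can cover the whole 128 KiB with a single large-page PTE.
 * 'iova' must be large-page aligned.
 */
static int map_big_page(struct iommu_domain *domain, unsigned long iova,
			const dma_addr_t *pages)
{
	int i, ret;

	for (i = 0; i < SPG_PER_LPG; i++) {
		ret = iommu_map(domain, iova + (i << SPG_SHIFT), pages[i],
				SZ_4K, IOMMU_READ | IOMMU_WRITE);
		if (ret) {
			/* Unwind whatever was mapped so far. */
			if (i)
				iommu_unmap(domain, iova, i << SPG_SHIFT);
			return ret;
		}
	}
	return 0;
}

Two notes on the patch relative to this sketch: its mapping loops return
early on iommu_map() failure without unmapping already-mapped pages or
releasing the nvkm_mm node, so the error paths leak; and after mapping it
sets one extra high bit (BIT_ULL(plat->gpu->iommu.phys_addr_bit)) in the
address handed to the GMMU, which on Tegra marks the access as one that
should be routed through the SMMU.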
Terje Bergstrom
2015-Apr-16 19:55 UTC
[Nouveau] [PATCH 6/6] mmu: gk20a: implement IOMMU mapping for big pages
On 04/16/2015 12:31 PM, Ilia Mirkin wrote:
> Two questions --
>
> (a) What's the perf impact of doing this? Less work for the GPU MMU
> but more work for the IOMMU...
> (b) Would it be a good idea to do this for desktop GPUs that are on
> CPUs with IOMMUs in them (VT-d and whatever the AMD one is)? Is there
> some sort of shared API for this stuff that you should be (or are?)
> using?

a) Using IOMMU mapping is the best way of getting contiguous post-GMMU
address spaces. The contiguity is required to be able to use frame buffer
compression, so the overall performance impact when compression is
factored in is about 20-30%.

If compression is left out of the equation, the impact of SMMU translation
and of small versus large pages should not be noticeable, but I haven't
measured it. We have measured large versus small pages with compression
disabled in both cases on gk20a, and the difference was noise.

An additional advantage is extra protection against the GPU accidentally
walking over kernel memory if the kernel driver has a bug.

b) This is a Tegra-specific mechanism, and for dGPU sysmem is handled
differently, so I don't have a good answer to that. I *believe* that on
dGPU sysmem does not support compression, so it would be a question of
memory protection, not performance.

(I'm hoping this email does not get added a corporate boilerplate - if it
does, I apologize and feel free to ignore)
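
The contiguity/compression link can be made concrete from the patch itself:
on gf100-family GPUs the compression tag line is a field of the page-table
entry, so one comptag covers everything behind one PTE. The patch's
gk20a_vm_map() builds exactly such a PTE, ORing the tag in at bit 44
(32 + 12). A simplified sketch of that assembly follows; field positions
mirror gk20a_vm_map() and nouveau's gf100 map code, and are shown for
illustration only.

#include <linux/types.h>

/*
 * Simplified gf100-family large-page PTE assembly, following
 * gf100_vm_addr() plus the comptag OR-in from the patch's gk20a_vm_map().
 * 'phys' is the IOMMU-contiguous base address of the 128 KiB region.
 */
static u64 build_large_pte(u64 phys, u32 target, u32 memtype, u32 comptag)
{
	u64 pte = phys >> 8;			/* address field */

	pte |= 0x00000001;			/* page present */
	pte |= (u64)target  << 32;		/* aperture: vidmem vs. sysmem */
	pte |= (u64)memtype << 36;		/* storage type ("kind") */
	pte |= (u64)comptag << (32 + 12);	/* comptag line for the big page */
	return pte;
}

Because the comptag field exists once per PTE, a surface mapped as 32
independent small-page PTEs has nowhere to carry a single tag line;
aggregating the pages behind the SMMU so one large-page PTE can cover them
is what makes compression reachable from discrete sysmem pages.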
Ilia Mirkin
2015-Apr-16 20:01 UTC
[Nouveau] [PATCH 6/6] mmu: gk20a: implement IOMMU mapping for big pages
On Thu, Apr 16, 2015 at 3:55 PM, Terje Bergstrom <tbergstrom at nvidia.com> wrote:
> On 04/16/2015 12:31 PM, Ilia Mirkin wrote:
>>
>> Two questions --
>>
>> (a) What's the perf impact of doing this? Less work for the GPU MMU
>> but more work for the IOMMU...
>> (b) Would it be a good idea to do this for desktop GPUs that are on
>> CPUs with IOMMUs in them (VT-d and whatever the AMD one is)? Is there
>> some sort of shared API for this stuff that you should be (or are?)
>> using?
>
> a) Using IOMMU mapping is the best way of getting contiguous post-GMMU
> address spaces. The contiguity is required to be able to use frame buffer
> compression, so the overall performance impact when compression is
> factored in is about 20-30%.
>
> If compression is left out of the equation, the impact of SMMU translation
> and of small versus large pages should not be noticeable, but I haven't
> measured it. We have measured large versus small pages with compression
> disabled in both cases on gk20a, and the difference was noise.

Ah, I never made the connection to compression. I had assumed it was
something done at a higher level by PGRAPH rather than at the PTE level
by the VM. [I did know that you had to set compression at the PTE level,
but didn't think that page size mattered.]

> An additional advantage is extra protection against the GPU accidentally
> walking over kernel memory if the kernel driver has a bug.

Yeah, IOMMUs are nice :)