thr3ads.net - Nouveau - [Nouveau] [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration [Jun 2020]

Ralph Campbell

2020-Jun-22 23:38 UTC

[Nouveau] [RESEND PATCH 0/3] nouveau: fixes for SVM

These are based on 5.8.0-rc2 and intended for Ben Skeggs' nouveau tree.
I believe the changes can be queued for 5.8-rcX after being reviewed.
These were part of a larger series but I'm resending them separately as
suggested by Jason Gunthorpe.
https://lore.kernel.org/linux-mm/20200619215649.32297-1-rcampbell@nvidia.com/
Note that in order to exercise/test patch 2 here, you will need a
kernel with patch 1 from the original series (the fix to mm/migrate.c).
It is safe to apply these changes before the fix to mm/migrate.c
though.

Ralph Campbell (3):
  nouveau: fix migrate page regression
  nouveau: fix mixed normal and device private page migration
  nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static

 drivers/gpu/drm/nouveau/nouveau_dmem.c         | 10 +++++++++-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c |  2 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c  |  2 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h  |  3 ---
 4 files changed, 11 insertions(+), 6 deletions(-)

-- 
2.20.1

Ralph Campbell

2020-Jun-22 23:38 UTC

[Nouveau] [RESEND PATCH 1/3] nouveau: fix migrate page regression

The patch to add zero page migration to GPU memory inadvertantly included
part of a future change which broke normal page migration to GPU memory
by copying too much data and corrupting GPU memory.
Fix this by only copying one page instead of a byte count.

Fixes: 9d4296a7d4b3 ("drm/nouveau/nouveau/hmm: fix migrate zero page to GPU")
Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e5c230d9ae24..cc9993837508 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -550,7 +550,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 					 DMA_BIDIRECTIONAL);
 		if (dma_mapping_error(dev, *dma_addr))
 			goto out_free_page;
-		if (drm->dmem->migrate.copy_func(drm, page_size(spage),
+		if (drm->dmem->migrate.copy_func(drm, 1,
 			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
 			goto out_dma_unmap;
 	} else {
-- 
2.20.1
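
The unit mismatch behind this regression can be illustrated outside the kernel. The sketch below is a user-space model, not nouveau code: model_copy_bytes() and model_copy_fits() are hypothetical helpers, and the 4 KiB page size is an assumption. It only assumes, as the one-line fix implies, that the second argument of the copy hook is a page count rather than a byte count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed 4 KiB pages */

/* Bytes moved by a hypothetical copy hook whose argument is a page count. */
static size_t model_copy_bytes(size_t npages)
{
    return npages * MODEL_PAGE_SIZE;
}

/* Does a request for npages stay inside a destination of dst_pages pages? */
static bool model_copy_fits(size_t npages, size_t dst_pages)
{
    return model_copy_bytes(npages) <= dst_pages * MODEL_PAGE_SIZE;
}
```

In this model, model_copy_bytes(1) is 4096, while passing the byte size of one page as the count, model_copy_bytes(MODEL_PAGE_SIZE), asks for 16777216 bytes (16 MiB). That overrun of a single destination page is the "copying too much data" symptom described in the commit message.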

Ralph Campbell

2020-Jun-22 23:38 UTC

[Nouveau] [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration

The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
migrate memory in the given address range to device private memory. The
source pages might already have been migrated to device private memory.
In that case, the source struct page is not checked to see if it is
a device private page and incorrectly computes the GPU's physical
address of local memory leading to data corruption.
Fix this by checking the source struct page and computing the correct
physical address.

Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index cc9993837508..f6a806ba3caa 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 	if (!(src & MIGRATE_PFN_MIGRATE))
 		goto out;
 
+	if (spage && is_device_private_page(spage)) {
+		paddr = nouveau_dmem_page_addr(spage);
+		*dma_addr = DMA_MAPPING_ERROR;
+		goto done;
+	}
+
 	dpage = nouveau_dmem_page_alloc_locked(drm);
 	if (!dpage)
 		goto out;
@@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 			goto out_free_page;
 	}
 
+done:
 	*pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
 		((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
 	if (src & MIGRATE_PFN_WRITE)
@@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 	struct migrate_vma args = {
 		.vma		= vma,
 		.start		= start,
+		.src_owner	= drm->dev,
 	};
 	unsigned long i;
 	u64 *pfns;
-- 
2.20.1
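
The control flow added by this patch can be sketched in user space. This is a simplified model under stated assumptions, not the driver code: struct model_page, struct model_result, model_page_at(), model_migrate_one() and the bump allocator are all hypothetical stand-ins. The model only shows the idea from the commit message: if the source page is already device private, reuse its VRAM address and skip allocation and copying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u            /* assumed 4 KiB pages */

/* Hypothetical model of a source page; not the kernel's struct page. */
struct model_page {
    bool device_private;   /* true if the data already lives in VRAM */
    uint64_t vram_addr;    /* meaningful only when device_private is true */
};

/* What the model reports back for one migrated page. */
struct model_result {
    uint64_t paddr;        /* VRAM address the GPU mapping should use */
    bool allocated;        /* a fresh VRAM page was taken */
    bool copied;           /* data was copied from host memory */
};

/* Trivial bump allocator standing in for the driver's VRAM allocator. */
static uint64_t model_next_free = 0x100000;

static struct model_page model_page_at(bool device_private, uint64_t vram_addr)
{
    struct model_page p = { device_private, vram_addr };
    return p;
}

static struct model_result model_migrate_one(struct model_page spage)
{
    struct model_result r = { 0, false, false };

    if (spage.device_private) {
        /* Already in VRAM: reuse its address, skip allocation and copy. */
        r.paddr = spage.vram_addr;
        return r;
    }

    /* Normal page: take a new VRAM page and copy the host data into it. */
    r.paddr = model_next_free;
    model_next_free += MODEL_PAGE_SIZE;
    r.allocated = true;
    r.copied = true;
    return r;
}
```

In the model, a device private source page leaves the allocator untouched, so a range that mixes normal and device private pages consumes new VRAM only for the normal ones.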

Ralph Campbell

2020-Jun-22 23:38 UTC

[Nouveau] [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static

The functions nvkm_vmm_ctor() and nvkm_mmu_ptp_get() are not called outside
of the file defining them so make them static.

Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
---
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c | 2 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c  | 2 +-
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h  | 3 ---
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
index ee11ccaf0563..de91e9a26172 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
@@ -61,7 +61,7 @@ nvkm_mmu_ptp_put(struct nvkm_mmu *mmu, bool force, struct nvkm_mmu_pt *pt)
 	kfree(pt);
 }
 
-struct nvkm_mmu_pt *
+static struct nvkm_mmu_pt *
 nvkm_mmu_ptp_get(struct nvkm_mmu *mmu, u32 size, bool zero)
 {
 	struct nvkm_mmu_pt *pt;
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
index 199f94e15c5f..67b00dcef4b8 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
@@ -1030,7 +1030,7 @@ nvkm_vmm_ctor_managed(struct nvkm_vmm *vmm, u64 addr, u64 size)
 	return 0;
 }
 
-int
+static int
 nvkm_vmm_ctor(const struct nvkm_vmm_func *func, struct nvkm_mmu *mmu,
 	      u32 pd_header, bool managed, u64 addr, u64 size,
 	      struct lock_class_key *key, const char *name,
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
index d3f8f916d0db..a2b179568970 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
@@ -163,9 +163,6 @@ int nvkm_vmm_new_(const struct nvkm_vmm_func *, struct nvkm_mmu *,
 		  u32 pd_header, bool managed, u64 addr, u64 size,
 		  struct lock_class_key *, const char *name,
 		  struct nvkm_vmm **);
-int nvkm_vmm_ctor(const struct nvkm_vmm_func *, struct nvkm_mmu *,
-		  u32 pd_header, bool managed, u64 addr, u64 size,
-		  struct lock_class_key *, const char *name, struct nvkm_vmm *);
 struct nvkm_vma *nvkm_vmm_node_search(struct nvkm_vmm *, u64 addr);
 struct nvkm_vma *nvkm_vmm_node_split(struct nvkm_vmm *, struct nvkm_vma *,
 				     u64 addr, u64 size);
-- 
2.20.1

John Hubbard

2020-Jun-23 00:30 UTC

[Nouveau] [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration

On 2020-06-22 16:38, Ralph Campbell wrote:
> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
> migrate memory in the given address range to device private memory. The
> source pages might already have been migrated to device private memory.
> In that case, the source struct page is not checked to see if it is
> a device private page and incorrectly computes the GPU's physical
> address of local memory leading to data corruption.
> Fix this by checking the source struct page and computing the correct
> physical address.
> 
> Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
> ---
>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index cc9993837508..f6a806ba3caa 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>   	if (!(src & MIGRATE_PFN_MIGRATE))
>   		goto out;
>   
> +	if (spage && is_device_private_page(spage)) {
> +		paddr = nouveau_dmem_page_addr(spage);
> +		*dma_addr = DMA_MAPPING_ERROR;
> +		goto done;
> +	}
> +
>   	dpage = nouveau_dmem_page_alloc_locked(drm);
>   	if (!dpage)
>   		goto out;
> @@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>   			goto out_free_page;
>   	}
>   
> +done:
>   	*pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
>   		((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
>   	if (src & MIGRATE_PFN_WRITE)
> @@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
>   	struct migrate_vma args = {
>   		.vma		= vma,
>   		.start		= start,
> +		.src_owner	= drm->dev,

Hi Ralph,

This .src_owner setting does look like a required fix, but it seems like
a completely separate fix from what is listed in this patch's commit
description, right? (It feels like a casualty of rearranging the patches.)


thanks,
-- 
John Hubbard
NVIDIA

John Hubbard

2020-Jun-23 00:51 UTC

[Nouveau] [RESEND PATCH 1/3] nouveau: fix migrate page regression

On 2020-06-22 16:38, Ralph Campbell wrote:
> The patch to add zero page migration to GPU memory inadvertantly included

inadvertently

> part of a future change which broke normal page migration to GPU memory
> by copying too much data and corrupting GPU memory.
> Fix this by only copying one page instead of a byte count.
> 
> Fixes: 9d4296a7d4b3 ("drm/nouveau/nouveau/hmm: fix migrate zero page to GPU")
> Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
> ---
>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index e5c230d9ae24..cc9993837508 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -550,7 +550,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>   					 DMA_BIDIRECTIONAL);
>   		if (dma_mapping_error(dev, *dma_addr))
>   			goto out_free_page;
> -		if (drm->dmem->migrate.copy_func(drm, page_size(spage),
> +		if (drm->dmem->migrate.copy_func(drm, 1,
>   			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
>   			goto out_dma_unmap;
>   	} else {
>

I Am Not A Nouveau Expert, nor is it really clear to me how
page_size(spage) came to contain something other than a page's worth of
byte count, but this fix looks accurate to me. It's better for
maintenance, too, because the function never intends to migrate "some
number of bytes". It intends to migrate exactly one page.

Hope I'm not missing something fundamental, but:

Reviewed-by: John Hubbard <jhubbard at nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

John Hubbard

2020-Jun-23 00:57 UTC

[Nouveau] [RESEND PATCH 3/3] nouveau: make nvkm_vmm_ctor() and nvkm_mmu_ptp_get() static

On 2020-06-22 16:38, Ralph Campbell wrote:
> The functions nvkm_vmm_ctor() and nvkm_mmu_ptp_get() are not called outside
> of the file defining them so make them static.
> 
> Signed-off-by: Ralph Campbell <rcampbell at nvidia.com>
> ---
>   drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c | 2 +-
>   drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c  | 2 +-
>   drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h  | 3 ---
>   3 files changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
> index ee11ccaf0563..de91e9a26172 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c
> @@ -61,7 +61,7 @@ nvkm_mmu_ptp_put(struct nvkm_mmu *mmu, bool force, struct nvkm_mmu_pt *pt)
>   	kfree(pt);
>   }
>   
> -struct nvkm_mmu_pt *
> +static struct nvkm_mmu_pt *
>   nvkm_mmu_ptp_get(struct nvkm_mmu *mmu, u32 size, bool zero)
>   {
>   	struct nvkm_mmu_pt *pt;
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> index 199f94e15c5f..67b00dcef4b8 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.c
> @@ -1030,7 +1030,7 @@ nvkm_vmm_ctor_managed(struct nvkm_vmm *vmm, u64 addr, u64 size)
>   	return 0;
>   }
>   
> -int
> +static int
>   nvkm_vmm_ctor(const struct nvkm_vmm_func *func, struct nvkm_mmu *mmu,
>   	      u32 pd_header, bool managed, u64 addr, u64 size,
>   	      struct lock_class_key *key, const char *name,
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> index d3f8f916d0db..a2b179568970 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h
> @@ -163,9 +163,6 @@ int nvkm_vmm_new_(const struct nvkm_vmm_func *, struct nvkm_mmu *,
>   		  u32 pd_header, bool managed, u64 addr, u64 size,
>   		  struct lock_class_key *, const char *name,
>   		  struct nvkm_vmm **);
> -int nvkm_vmm_ctor(const struct nvkm_vmm_func *, struct nvkm_mmu *,
> -		  u32 pd_header, bool managed, u64 addr, u64 size,
> -		  struct lock_class_key *, const char *name, struct nvkm_vmm *);
>   struct nvkm_vma *nvkm_vmm_node_search(struct nvkm_vmm *, u64 addr);
>   struct nvkm_vma *nvkm_vmm_node_split(struct nvkm_vmm *, struct nvkm_vma *,
>   				     u64 addr, u64 size);
> 

Looks accurate: the order within vmm.c (now that there is no .h
declaration) is still good, and I found no other uses of either function
within the linux.git tree, so


Reviewed-by: John Hubbard <jhubbard at nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

Christoph Hellwig

2020-Jun-24 07:23 UTC

[Nouveau] [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration

On Mon, Jun 22, 2020 at 04:38:53PM -0700, Ralph Campbell wrote:
> The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
> migrate memory in the given address range to device private memory. The
> source pages might already have been migrated to device private memory.
> In that case, the source struct page is not checked to see if it is
> a device private page and incorrectly computes the GPU's physical
> address of local memory leading to data corruption.
> Fix this by checking the source struct page and computing the correct
> physical address.

I'm really worried about all this delicate code to fix the mixed
ranges.  Can't we make it clear at the migrate_vma_* level if we want
to migrate from or to device private memory, and then skip all the work
for regions of memory that already are in the right place?  This might be
a little more work initially, but I think it leads to a much better
API.
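
One way to picture the suggestion above is a per-call direction that lets the core skip pages already in the right place. The sketch below is purely illustrative: enum model_direction, model_needs_migration() and model_count_work() are hypothetical names, not fields or functions of the migrate_vma API discussed in this thread. A bit mask stands in for the per-page "is device private" state of a range of up to 32 pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical direction selector; not an existing migrate_vma field. */
enum model_direction {
    MODEL_TO_DEVICE,   /* caller wants pages in device private memory */
    MODEL_TO_SYSTEM    /* caller wants pages back in system memory */
};

/* A page needs work only if it is not already where the caller wants it. */
static bool model_needs_migration(enum model_direction dir,
                                  bool is_device_private)
{
    return dir == MODEL_TO_DEVICE ? !is_device_private : is_device_private;
}

/*
 * Count the pages in a range that need work.
 * Bit i of dev_mask set means page i is already device private.
 */
static unsigned model_count_work(enum model_direction dir,
                                 uint32_t dev_mask, unsigned npages)
{
    unsigned i, work = 0;

    assert(npages <= 32);
    for (i = 0; i < npages; i++)
        if (model_needs_migration(dir, (dev_mask >> i) & 1u))
            work++;
    return work;
}
```

For a four-page range in which only page 0 is device private, model_count_work(MODEL_TO_DEVICE, 0x1u, 4) is 3 and model_count_work(MODEL_TO_SYSTEM, 0x1u, 4) is 1, so a driver written against such an interface would never see pages that are already in place.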
