Yonatan Maman
2024-Oct-15 15:23 UTC
[PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
From: Yonatan Maman <Ymaman at Nvidia.com>

hmm_range_fault() natively triggers a page fault on device private
pages, migrating them to RAM. In some cases, such as with RDMA devices,
the migration overhead between the device (e.g., GPU) and the CPU, and
vice-versa, significantly damages performance. Thus, enabling Peer-to-
Peer (P2P) DMA access for device private page might be crucial for
minimizing data transfer overhead.

This change introduces an API to support P2P connections for device
private pages by implementing the following:

 - Leveraging the struct pagemap_ops for P2P Page Callbacks. This
   callback involves mapping the page to MMIO and returning the
   corresponding PCI_P2P page.

 - Utilizing hmm_range_fault for Initializing P2P Connections. The API
   also adds the HMM_PFN_REQ_TRY_P2P flag option for the
   hmm_range_fault caller to initialize P2P. If set, hmm_range_fault
   attempts initializing the P2P connection first, if the owner device
   supports P2P, using p2p_page. In case of failure or lack of support,
   hmm_range_fault will continue with the regular flow of migrating the
   page to RAM.

This change does not affect previous use-cases of hmm_range_fault,
because both the caller and the page owner must explicitly request and
support it to initialize P2P connection.

Signed-off-by: Yonatan Maman <Ymaman at Nvidia.com>
Reviewed-by: Gal Shalom <GalShalom at Nvidia.com>
---
 include/linux/hmm.h      |  2 ++
 include/linux/memremap.h |  7 +++++++
 mm/hmm.c                 | 28 ++++++++++++++++++++++++++++
 3 files changed, 37 insertions(+)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..7154f5ed73a1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -41,6 +41,8 @@ enum hmm_pfn_flags {
 	/* Input flags */
 	HMM_PFN_REQ_FAULT = HMM_PFN_VALID,
 	HMM_PFN_REQ_WRITE = HMM_PFN_WRITE,
+	/* allow returning PCI P2PDMA pages */
+	HMM_PFN_REQ_ALLOW_P2P = 1,
 
 	HMM_PFN_FLAGS = 0xFFUL << HMM_PFN_ORDER_SHIFT,
 };
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143ade32c..0ecfd3d191fa 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -89,6 +89,13 @@ struct dev_pagemap_ops {
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
 
+	/*
+	 * Used for private (un-addressable) device memory only. Return a
+	 * corresponding struct page, that can be mapped to device
+	 * (e.g using dma_map_page)
+	 */
+	struct page *(*get_dma_page_for_device)(struct page *private_page);
+
 	/*
 	 * Handle the memory failure happens on a range of pfns. Notify the
 	 * processes who are using these pfns, and try to recover the data on
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229ae4a5a..987dd143d697 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -230,6 +230,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	unsigned long cpu_flags;
 	pte_t pte = ptep_get(ptep);
 	uint64_t pfn_req_flags = *hmm_pfn;
+	struct page *(*get_dma_page_handler)(struct page *private_page);
+	struct page *dma_page;
 
 	if (pte_none_mostly(pte)) {
 		required_fault =
@@ -257,6 +259,32 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		return 0;
 	}
 
+	/*
+	 * P2P for supported pages, and according to caller request
+	 * translate the private page to the match P2P page if it fails
+	 * continue with the regular flow
+	 */
+	if (is_device_private_entry(entry)) {
+		get_dma_page_handler =
+			pfn_swap_entry_to_page(entry)
+				->pgmap->ops->get_dma_page_for_device;
+		if ((hmm_vma_walk->range->default_flags &
+		     HMM_PFN_REQ_ALLOW_P2P) &&
+		    get_dma_page_handler) {
+			dma_page = get_dma_page_handler(
+				pfn_swap_entry_to_page(entry));
+			if (!IS_ERR(dma_page)) {
+				cpu_flags = HMM_PFN_VALID;
+				if (is_writable_device_private_entry(
+					    entry))
+					cpu_flags |= HMM_PFN_WRITE;
+				*hmm_pfn = page_to_pfn(dma_page) |
+					   cpu_flags;
+				return 0;
+			}
+		}
+	}
+
 	required_fault =
 		hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 	if (!required_fault) {
-- 
2.34.1
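
[Editorial note: for readers skimming the proposed API, a minimal caller-side
sketch of opting in to the new flag might look like the following. This is not
part of the patch; the function name, locking, and retry handling shown here
are illustrative assumptions only.]

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mmap_lock.h>

/*
 * Illustrative sketch only (not from the patch): a caller that wants P2P
 * pages where the owner supports them sets HMM_PFN_REQ_ALLOW_P2P in
 * default_flags and otherwise uses hmm_range_fault() as usual. The
 * mmu_interval_read_retry() loop that real callers need is elided.
 */
static int example_fault_range(struct mmu_interval_notifier *notifier,
			       unsigned long start, unsigned long end,
			       unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier		= notifier,
		.notifier_seq		= mmu_interval_read_begin(notifier),
		.start			= start,
		.end			= end,
		.hmm_pfns		= pfns,
		.default_flags		= HMM_PFN_REQ_FAULT |
					  HMM_PFN_REQ_ALLOW_P2P,
		.dev_private_owner	= NULL,	/* caller does not own the memory */
	};
	int ret;

	mmap_read_lock(notifier->mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(notifier->mm);
	return ret;
}

Since both the flag and the get_dma_page_for_device callback default to off,
callers and owners that never opt in keep the existing migrate-to-RAM
behaviour, which is the compatibility claim made in the commit message.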
Christoph Hellwig
2024-Oct-16 04:49 UTC
[PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
The subject does not make sense.  All P2P is on ZONE_DEVICE pages.  It
seems like this is about device private memory?

On Tue, Oct 15, 2024 at 06:23:45PM +0300, Yonatan Maman wrote:
> From: Yonatan Maman <Ymaman at Nvidia.com>
>
> hmm_range_fault() natively triggers a page fault on device private
> pages, migrating them to RAM.

That "natively" above doesn't make sense to me.

> In some cases, such as with RDMA devices,
> the migration overhead between the device (e.g., GPU) and the CPU, and
> vice-versa, significantly damages performance. Thus, enabling Peer-to-

s/damages/degrades/

> Peer (P2P) DMA access for device private page might be crucial for
> minimizing data transfer overhead.
>
> This change introduces an API to support P2P connections for device
> private pages by implementing the following:

"This change.." or "This patch.." is pointless, just explain what you
are doing.

> - Leveraging the struct pagemap_ops for P2P Page Callbacks. This
>   callback involves mapping the page to MMIO and returning the
>   corresponding PCI_P2P page.

While P2P uses the same underlying PCIe TLPs as MMIO, it is not MMIO by
definition, as memory mapped I/O is by definition about the CPU memory
mapping so that load and store instructions cause the I/O.  It also
uses very different concepts in Linux.

> - Utilizing hmm_range_fault for Initializing P2P Connections. The API

There is no concept of a "connection" in PCIe data transfers.

>   also adds the HMM_PFN_REQ_TRY_P2P flag option for the
>   hmm_range_fault caller to initialize P2P. If set, hmm_range_fault
>   attempts initializing the P2P connection first, if the owner device
>   supports P2P, using p2p_page. In case of failure or lack of support,
>   hmm_range_fault will continue with the regular flow of migrating the
>   page to RAM.

What is the need for the flag?  As far as I can tell from reading the
series, the P2P mapping is entirely transparent to the callers of
hmm_range_fault.

> +	/*
> +	 * Used for private (un-addressable) device memory only. Return a
> +	 * corresponding struct page, that can be mapped to device
> +	 * (e.g using dma_map_page)
> +	 */
> +	struct page *(*get_dma_page_for_device)(struct page *private_page);

We are talking about P2P memory here.  How do you manage to get a page
that dma_map_page can be used on?  All P2P memory needs to use the P2P
aware dma_map_sg as the pages for P2P memory are just fake zone device
pages.

> +	 * P2P for supported pages, and according to caller request
> +	 * translate the private page to the match P2P page if it fails
> +	 * continue with the regular flow
> +	 */
> +	if (is_device_private_entry(entry)) {
> +		get_dma_page_handler =
> +			pfn_swap_entry_to_page(entry)
> +				->pgmap->ops->get_dma_page_for_device;
> +		if ((hmm_vma_walk->range->default_flags &
> +		     HMM_PFN_REQ_ALLOW_P2P) &&
> +		    get_dma_page_handler) {
> +			dma_page = get_dma_page_handler(
> +				pfn_swap_entry_to_page(entry));

This is really messy.  You probably really want to share a branch with
the private page handling for the owner so that you only need a single
is_device_private_entry and can use a local variable to shortcut
finding the page.
Probably best done with a little helper; then this becomes:

static bool hmm_handle_device_private(struct hmm_range *range,
				      swp_entry_t entry,
				      unsigned long *hmm_pfn)
{
	struct page *page = pfn_swap_entry_to_page(entry);
	struct dev_pagemap *pgmap = page->pgmap;

	if (pgmap->owner == range->dev_private_owner) {
		*hmm_pfn = swp_offset_pfn(entry);
		goto found;
	}

	if (pgmap->ops->get_dma_page_for_device) {
		*hmm_pfn = page_to_pfn(
			pgmap->ops->get_dma_page_for_device(page));
		goto found;
	}

	return false;

found:
	*hmm_pfn |= HMM_PFN_VALID;
	if (is_writable_device_private_entry(entry))
		*hmm_pfn |= HMM_PFN_WRITE;
	return true;
}

which also makes it clear that returning a page from the method is not
that great, a PFN might work a lot better, e.g.

	unsigned long (*device_private_dma_pfn)(struct page *page);
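
[Editorial note: to make the shape of that refactor concrete, the call site in
hmm_vma_handle_pte() could then collapse to something like the sketch below.
This is an illustration of the suggestion above, not code from the mail.]

	if (!pte_present(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		/*
		 * One branch handles both the dev_private_owner case and
		 * the P2P case via the suggested helper; anything else
		 * falls through to the existing fault/migration handling.
		 */
		if (is_device_private_entry(entry) &&
		    hmm_handle_device_private(range, entry, hmm_pfn))
			return 0;
	}

The helper only ever turns the page into a PFN, which is what makes the
PFN-returning op suggested at the end a natural fit.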
Alistair Popple
2024-Oct-16 05:10 UTC
[PATCH v1 1/4] mm/hmm: HMM API for P2P DMA to device zone pages
Yonatan Maman <ymaman at nvidia.com> writes:

> From: Yonatan Maman <Ymaman at Nvidia.com>
>
> hmm_range_fault() natively triggers a page fault on device private
> pages, migrating them to RAM. In some cases, such as with RDMA devices,
> the migration overhead between the device (e.g., GPU) and the CPU, and
> vice-versa, significantly damages performance. Thus, enabling Peer-to-
> Peer (P2P) DMA access for device private page might be crucial for
> minimizing data transfer overhead.

[...]

> +	/*
> +	 * Used for private (un-addressable) device memory only. Return a
> +	 * corresponding struct page, that can be mapped to device
> +	 * (e.g using dma_map_page)
> +	 */
> +	struct page *(*get_dma_page_for_device)(struct page *private_page);

It would be nice to add some documentation about this feature to
Documentation/mm/hmm.rst.  In particular some notes on the page
lifetime/refcounting rules.

On that note how is the refcounting of the returned p2pdma page
expected to work?  We don't want the driver calling hmm_range_fault()
to be able to pin the page with eg. get_page(), so the returned p2pdma
page should have a zero refcount to enforce that.