Mats Petersson
2012-Nov-14 11:13 UTC
[RFC/PATCH] Improve speed of mapping guest memory into Dom0
Whilst investigating why the "guest not pingable" time during migration is too long, I found that the rate at which guest memory is copied on a localhost migration is quite slow. I tracked down that a large portion of the time copying the guest memory was actually spent mapping guest pages into Dom0 memory space, and then found that there are two hypervisor calls for every page: one to map the page in, and one to flush the TLB.

With some work, I managed to get it to map a larger number of pages at once (the complication here is that Xen isn't allocating machine physical pages "in order", which the original remap function expects). I originally prototyped this on a 2.6.32 kernel, and then ported it onto the 3.7rc kernel.

When microbenchmarking only the map/unmap of guest memory into Dom0, this provides a speed-up of around 10x for the mapping itself compared to the original 2.6 kernel, and a little less, around 7-8x, compared to 3.7 - in clock cycles, mapping a block of 1024 pages takes around 650k cycles with the new code, where the 3.7 kernel takes 5M and the 2.6 kernel around 7.6M. Comparing total localhost migration time, this gives a 15% improvement on a lightly loaded (idle) guest with 1GB memory; on a heavily loaded guest (aggressively dirtying 800MB of the 1GB guest memory) it gives more than 20% improvement. (This compares 3.7 "original" vs. 3.7 with the patch below - 3.7 is a little better than the 2.6 kernel in my benchmarks.)

The basic principle of the change is to pass a "list" of mfns (a pointer to a sequence of mfns) to a new xen_remap_domain_mfn_list function, instead of using the existing xen_remap_domain_mfn_range. This change should only affect the Xen parts of the kernel.

I have also found that the munmap() call used to unmap the guest memory from Dom0 is about 35% slower in the 3.7 kernel than in the 2.6 kernel (3.8M cycles vs 2.8M cycles). I think this could be made quicker by using a direct write of zero rather than the compare-exchange operation that is currently used [which traps into Xen and performs the compare & exchange] - the compare-exchange is required if the mapping is part of a mapped file that needs to write dirty pages out to the backing storage, but for "we've mapped guest memory into Dom0" there is no need to atomically check whether it's dirty [it shouldn't be dirty in the first place, as we only read from it]. Unfortunately, this is generic kernel code, so I fear it's hard to get changes approved. I have a feeling, however, that if the memory is not a memory-mapped file [with write allowed], it would be faster to write zero to the page-table entries even in native code.
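To illustrate the interface change, here is a rough, uncompiled sketch (not part of the patch itself) of what the privcmd caller side does before and after; "mfns", "nr" and "err_array" stand in for the mmapbatch ioctl data:

static int remap_batch_old(struct vm_area_struct *vma, unsigned long va,
                           unsigned long *mfns, int nr, int *err_array,
                           unsigned domid)
{
        int i;

        /* One xen_remap_domain_mfn_range() call per page: each call costs
         * a map hypercall plus a TLB-flush hypercall. */
        for (i = 0; i < nr; i++)
                err_array[i] = xen_remap_domain_mfn_range(vma,
                                va + i * PAGE_SIZE, mfns[i], 1,
                                vma->vm_page_prot, domid);
        return 0;
}

static int remap_batch_new(struct vm_area_struct *vma, unsigned long va,
                           unsigned long *mfns, int nr, int *err_array,
                           unsigned domid)
{
        /* One call covers the whole (non-contiguous) batch: PTE updates
         * are submitted REMAP_BATCH_SIZE at a time and the TLB is
         * flushed once at the end. */
        return xen_remap_domain_mfn_list(vma, va, mfns, nr, err_array,
                                         vma->vm_page_prot, domid);
}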
-- Mats

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dcf5f2d..b8c022c 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -2482,6 +2482,12 @@ struct remap_data {
         struct mmu_update *mmu_update;
 };
 
+struct remap_list_data {
+        unsigned long *mfn;
+        pgprot_t prot;
+        struct mmu_update *mmu_update;
+};
+
 static int remap_area_mfn_pte_fn(pte_t *ptep, pgtable_t token,
                                  unsigned long addr, void *data)
 {
@@ -2495,6 +2501,19 @@ static int remap_area_mfn_pte_fn(pte_t *ptep, pgtable_t token,
         return 0;
 }
 
+static int remap_area_mfn_list_pte_fn(pte_t *ptep, pgtable_t token,
+                                      unsigned long addr, void *data)
+{
+        struct remap_list_data *rmd = data;
+        pte_t pte = pte_mkspecial(pfn_pte(*rmd->mfn++, rmd->prot));
+
+        rmd->mmu_update->ptr = virt_to_machine(ptep).maddr;
+        rmd->mmu_update->val = pte_val_ma(pte);
+        rmd->mmu_update++;
+
+        return 0;
+}
+
 int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
                                unsigned long addr,
                                unsigned long mfn, int nr,
@@ -2542,3 +2561,77 @@ out:
         return err;
 }
 EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
+
+/* like xen_remap_domain_mfn_range, but does a list of mfn's, rather
+ * than the, for xen, quite useless, consecutive pages.
+ */
+int xen_remap_domain_mfn_list(struct vm_area_struct *vma,
+                              unsigned long addr,
+                              unsigned long *mfn, int nr,
+                              int *err_ptr,
+                              pgprot_t prot, unsigned domid)
+{
+        struct remap_list_data rmd;
+        struct mmu_update mmu_update[REMAP_BATCH_SIZE];
+        int batch;
+        int done;
+        unsigned long range;
+        int err = 0;
+
+        if (xen_feature(XENFEAT_auto_translated_physmap))
+                return -EINVAL;
+
+        prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP);
+
+        BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO)));
+
+        rmd.mfn = mfn;
+        rmd.prot = prot;
+
+        while (nr) {
+                batch = min(REMAP_BATCH_SIZE, nr);
+                range = (unsigned long)batch << PAGE_SHIFT;
+
+                rmd.mmu_update = mmu_update;
+                err = apply_to_page_range(vma->vm_mm, addr, range,
+                                          remap_area_mfn_list_pte_fn, &rmd);
+                if (err)
+                {
+                        printk("xen_remap_domain_mfn_list: apply_to_range: err=%d\n", err);
+                        goto out;
+                }
+
+                err = HYPERVISOR_mmu_update(mmu_update, batch, &done, domid);
+                if (err < 0)
+                {
+                        int i;
+                        /* TODO: We should remove this printk later */
+                        printk("xen_remap_domain_mfn_list: mmu_update: err=%d, done=%d, batch=%d\n", err, done, batch);
+                        err_ptr[done] = err;
+
+                        /* now do the remaining part of this batch */
+                        for(i = done+1; i < batch; i++)
+                        {
+                                int tmp_err = HYPERVISOR_mmu_update(&mmu_update[i], 1, NULL, domid);
+                                if (tmp_err < 0)
+                                {
+                                        err_ptr[i] = tmp_err;
+                                }
+                        }
+
+                        goto out;
+                }
+
+                nr -= batch;
+                addr += range;
+                err_ptr += batch;
+        }
+
+        err = 0;
+out:
+
+        xen_flush_tlb_all();
+
+        return err;
+}
+EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_list);
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 8adb9cc..b39a7b7 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -151,6 +151,41 @@ static int traverse_pages(unsigned nelem, size_t size,
         return ret;
 }
 
+/*
+ * Similar to traverse_pages, but use each page as a "block" of
+ * data to be processed as one unit.
+ */
+static int traverse_pages_block(unsigned nelem, size_t size,
+                                struct list_head *pos,
+                                int (*fn)(void *data, int nr, void *state),
+                                void *state)
+{
+        void *pagedata;
+        unsigned pageidx;
+        int ret = 0;
+
+        BUG_ON(size > PAGE_SIZE);
+
+        pageidx = PAGE_SIZE;
+        pagedata = NULL;        /* hush, gcc */
+
+        while (nelem) {
+                int nr = (PAGE_SIZE/size);
+                struct page *page;
+                if (nr > nelem)
+                        nr = nelem;
+                pos = pos->next;
+                page = list_entry(pos, struct page, lru);
+                pagedata = page_address(page);
+                ret = (*fn)(pagedata, nr, state);
+                if (ret)
+                        break;
+                nelem -= nr;
+        }
+
+        return ret;
+}
+
 struct mmap_mfn_state {
         unsigned long va;
         struct vm_area_struct *vma;
@@ -250,7 +285,7 @@ struct mmap_batch_state {
          * 0 for no errors
          * 1 if at least one error has happened (and no
          * -ENOENT errors have happened)
-         * -ENOENT if at least 1 -ENOENT has happened.
+         * -ENOENT if at least one -ENOENT has happened.
          */
         int global_error;
         /* An array for individual errors */
@@ -260,17 +295,17 @@ struct mmap_batch_state {
         xen_pfn_t __user *user_mfn;
 };
 
-static int mmap_batch_fn(void *data, void *state)
+static int mmap_batch_fn(void *data, int nr, void *state)
 {
         xen_pfn_t *mfnp = data;
+
         struct mmap_batch_state *st = state;
         int ret;
 
-        ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
-                                         st->vma->vm_page_prot, st->domain);
+        BUG_ON(nr < 0);
 
-        /* Store error code for second pass. */
-        *(st->err++) = ret;
+        ret = xen_remap_domain_mfn_list(st->vma, st->va & PAGE_MASK, mfnp, nr, st->err,
+                                        st->vma->vm_page_prot, st->domain);
 
         /* And see if it affects the global_error. */
         if (ret < 0) {
@@ -282,7 +317,7 @@ static int mmap_batch_fn(void *data, void *state)
                         st->global_error = 1;
                 }
         }
-        st->va += PAGE_SIZE;
+        st->va += PAGE_SIZE * nr;
 
         return 0;
 }
@@ -303,6 +338,7 @@ static int mmap_return_errors_v1(void *data, void *state)
         return __put_user(*mfnp, st->user_mfn++);
 }
 
+
 static struct vm_operations_struct privcmd_vm_ops;
 
 static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
@@ -319,6 +355,7 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
         if (!xen_initial_domain())
                 return -EPERM;
 
+
         switch (version) {
         case 1:
                 if (copy_from_user(&m, udata, sizeof(struct privcmd_mmapbatch)))
@@ -378,8 +415,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata, int version)
         state.err = err_array;
 
         /* mmap_batch_fn guarantees ret == 0 */
-        BUG_ON(traverse_pages(m.num, sizeof(xen_pfn_t),
-                              &pagelist, mmap_batch_fn, &state));
+        BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
+                                    &pagelist, mmap_batch_fn, &state));
 
         up_write(&mm->mmap_sem);
 
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 6a198e4..15ae4f7 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -29,4 +29,10 @@ int xen_remap_domain_mfn_range(struct vm_area_struct *vma,
                                unsigned long mfn, int nr,
                                pgprot_t prot, unsigned domid);
 
+int xen_remap_domain_mfn_list(struct vm_area_struct *vma,
+                              unsigned long addr,
+                              unsigned long *mfn, int nr,
+                              int *err_ptr,
+                              pgprot_t prot, unsigned domid);
+
 #endif /* INCLUDE_XEN_OPS_H */
David Vrabel
2012-Nov-14 13:08 UTC
Re: [RFC/PATCH] Improve speed of mapping guest memory into Dom0
Mats,

Your patch has been white-space damaged and will not apply. You should
use git send-email which will do the right thing. I've also CC'd Konrad
who is the maintainer for the Xen parts of the kernel.

On 14/11/12 11:13, Mats Petersson wrote:
> [a long, rambling commit message?]

The text as-is isn't really suitable for a commit message (too much
irrelevant stuff) and there is no suitable subject line.

> I have also found that the munmap() call used to unmap the guest memory
> from Dom0 is about 35% slower in 3.7 kernel than in the 2.6 kernel (3.8M
> cycles vs 2.8M cycles).

This performance reduction only occurs with 32-bit guests, as Xen then
traps-and-emulates both halves of the PTE write.

> I think this could be made quicker by using a
> direct write of zero rather than the compare exchange operation that is
> currently used [which traps into Xen, performs the compare & exchange] -

This is something I noticed but never got around to producing a patch.
How about this (uncompiled!) patch?

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1146,8 +1146,16 @@ again:
                                            page->index > details->last_index))
                                         continue;
                         }
-                        ptent = ptep_get_and_clear_full(mm, addr, pte,
-                                                        tlb->fullmm);
+                        /*
+                         * No need for the expensive atomic get and
+                         * clear for anonymous mappings as the dirty
+                         * and young bits are not used.
+                         */
+                        if (PageAnon(page))
+                                pte_clear(mm, addr, pte);
+                        else
+                                ptent = ptep_get_and_clear_full(mm, addr, pte,
+                                                                tlb->fullmm);
                         tlb_remove_tlb_entry(tlb, pte, addr);
                         if (unlikely(!page))
                                 continue;

Now for the patch itself. On the whole, the approach looks good and the
real-world performance improvements are nice. Specific comments inline.

> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -2542,3 +2561,77 @@ out:
>          return err;
>  }
>  EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
> +
> +/* like xen_remap_domain_mfn_range, but does a list of mfn's, rather
> + * than the, for xen, quite useless, consecutive pages.
> + */

/*
 * Like xen_remap_domain_mfn_range(), but more efficiently handles MFNs
 * that are not contiguous (which is common for a domain's memory).
 */

> +int xen_remap_domain_mfn_list(struct vm_area_struct *vma,
> +                              unsigned long addr,
> +                              unsigned long *mfn, int nr,
> +                              int *err_ptr,
> +                              pgprot_t prot, unsigned domid)

xen_remap_domain_mfn_array() ?  It's not taking a list.

> +{
> +        struct remap_list_data rmd;
> +        struct mmu_update mmu_update[REMAP_BATCH_SIZE];

This is surprisingly large (256 bytes) but I note that the existing
xen_remap_domain_mfn_range() does the same thing so I guess it's ok.

> +        int batch;
> +        int done;
> +        unsigned long range;
> +        int err = 0;
> +
> +        if (xen_feature(XENFEAT_auto_translated_physmap))
> +                return -EINVAL;
> +
> +        prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP);
> +
> +        BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO)));
> +
> +        rmd.mfn = mfn;
> +        rmd.prot = prot;
> +
> +        while (nr) {
> +                batch = min(REMAP_BATCH_SIZE, nr);
> +                range = (unsigned long)batch << PAGE_SHIFT;
> +
> +                rmd.mmu_update = mmu_update;
> +                err = apply_to_page_range(vma->vm_mm, addr, range,
> +                                          remap_area_mfn_list_pte_fn, &rmd);
> +                if (err)
> +                {
> +                        printk("xen_remap_domain_mfn_list: apply_to_range: err=%d\n", err);

Stray printk?

> +                        goto out;
> +                }
> +
> +                err = HYPERVISOR_mmu_update(mmu_update, batch, &done, domid);
> +                if (err < 0)
> +                {
> +                        int i;
> +                        /* TODO: We should remove this printk later */
> +                        printk("xen_remap_domain_mfn_list: mmu_update: err=%d, done=%d, batch=%d\n", err, done, batch);

Yes, you should...

> +                        err_ptr[done] = err;
> +
> +                        /* now do the remaining part of this batch */
> +                        for(i = done+1; i < batch; i++)
> +                        {
> +                                int tmp_err = HYPERVISOR_mmu_update(&mmu_update[i], 1, NULL, domid);
> +                                if (tmp_err < 0)
> +                                {
> +                                        err_ptr[i] = tmp_err;
> +                                }
> +                        }

There's no need to fall back to doing it page-by-page here. You can
still batch the remainder.

> +
> +                        goto out;
> +                }
> +
> +                nr -= batch;
> +                addr += range;
> +                err_ptr += batch;
> +        }
> +
> +        err = 0;
> +out:
> +
> +        xen_flush_tlb_all();

Probably not that important anymore since we would now do far fewer TLB
flushes, but this TLB flush is only needed if the PTEs being updated
were already present -- if they're all clear then the TLB flush can be
omitted entirely.

> +
> +        return err;
> +}
> +EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_list);
> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
> index 8adb9cc..b39a7b7 100644
> --- a/drivers/xen/privcmd.c
> +++ b/drivers/xen/privcmd.c
> @@ -151,6 +151,41 @@ static int traverse_pages(unsigned nelem, size_t size,
>          return ret;
>  }
>
> +/*
> + * Similar to traverse_pages, but use each page as a "block" of
> + * data to be processed as one unit.
> + */
> +static int traverse_pages_block(unsigned nelem, size_t size,
> +                                struct list_head *pos,
> +                                int (*fn)(void *data, int nr, void *state),
> +                                void *state)
> +{
> +        void *pagedata;
> +        unsigned pageidx;
> +        int ret = 0;
> +
> +        BUG_ON(size > PAGE_SIZE);
> +
> +        pageidx = PAGE_SIZE;
> +        pagedata = NULL;        /* hush, gcc */

What was gcc upset about? I don't see anything it could get confused
about.

> @@ -260,17 +295,17 @@ struct mmap_batch_state {
>          xen_pfn_t __user *user_mfn;
>  };
>
> -static int mmap_batch_fn(void *data, void *state)
> +static int mmap_batch_fn(void *data, int nr, void *state)
>  {
>          xen_pfn_t *mfnp = data;
> +
>          struct mmap_batch_state *st = state;
>          int ret;
>
> -        ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
> -                                         st->vma->vm_page_prot, st->domain);
> +        BUG_ON(nr < 0);

Is this BUG_ON() useful?

David
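Something along these lines (an uncompiled sketch; mmu_update, batch, done, err_ptr and domid are the patch's own variables, only "index" is new here) would keep batching after a failure rather than dropping to one hypercall per page:

        int index = 0;

        while (index < batch) {
                int done;
                int err = HYPERVISOR_mmu_update(&mmu_update[index],
                                                batch - index, &done, domid);
                if (err >= 0)
                        break;          /* rest of the batch went in */

                /* entries [index, index + done) succeeded;
                 * entry index + done is the one that failed */
                err_ptr[index + done] = err;
                index += done + 1;      /* resume after the failing entry */
        }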
Ian Campbell
2012-Nov-14 13:16 UTC
Re: [RFC/PATCH] Improve speed of mapping guest memory into Dom0
On Wed, 2012-11-14 at 13:08 +0000, David Vrabel wrote:
> > +
> > +/* like xen_remap_domain_mfn_range, but does a list of mfn's, rather
> > + * than the, for xen, quite useless, consecutive pages.
> > + */
>
> /*
>  * Like xen_remap_domain_mfn_range(), but more efficiently handles MFNs
>  * that are not contiguous (which is common for a domain's memory).
>  */

Can we not fix xen_remap_domain_mfn (and modify the small number of
callers to suit) rather than duplicating most of the functionality with
a slight difference?

Ian.
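As a rough illustration of that suggestion (an uncompiled sketch, not what the posted patch does; the mfn_arr field is invented here), the two pte callbacks could collapse into one if the existing remap_data carried either a starting MFN or a per-page array:

struct remap_data {
        unsigned long mfn;              /* next MFN, contiguous case */
        unsigned long *mfn_arr;         /* per-page MFNs, or NULL */
        pgprot_t prot;
        struct mmu_update *mmu_update;
};

static int remap_area_mfn_pte_fn(pte_t *ptep, pgtable_t token,
                                 unsigned long addr, void *data)
{
        struct remap_data *rmd = data;
        /* take the next MFN from the array if one was supplied,
         * otherwise continue the contiguous range */
        unsigned long mfn = rmd->mfn_arr ? *rmd->mfn_arr++ : rmd->mfn++;
        pte_t pte = pte_mkspecial(pfn_pte(mfn, rmd->prot));

        rmd->mmu_update->ptr = virt_to_machine(ptep).maddr;
        rmd->mmu_update->val = pte_val_ma(pte);
        rmd->mmu_update++;

        return 0;
}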
Mats Petersson
2012-Nov-14 14:04 UTC
Re: [RFC/PATCH] Improve speed of mapping guest memory into Dom0
On 14/11/12 13:08, David Vrabel wrote:
> Mats,
>
> Your patch has been white-space damaged and will not apply. You should
> use git send-email which will do the right thing. I've also CC'd Konrad
> who is the maintainer for the Xen parts of the kernel.

Thanks for the advice. Will use git send-email for V2 of this patch.

> On 14/11/12 11:13, Mats Petersson wrote:
>> [a long, rambling commit message?]
> The text as-is isn't really suitable for a commit message (too much
> irrelevant stuff) and there is no suitable subject line.
>
>> I have also found that the munmap() call used to unmap the guest memory
>> from Dom0 is about 35% slower in 3.7 kernel than in the 2.6 kernel (3.8M
>> cycles vs 2.8M cycles).
> This performance reduction only occurs with 32-bit guests, as Xen then
> traps-and-emulates both halves of the PTE write.
>
>> I think this could be made quicker by using a
>> direct write of zero rather than the compare exchange operation that is
>> currently used [which traps into Xen, performs the compare & exchange] -
> This is something I noticed but never got around to producing a patch.
> How about this (uncompiled!) patch?
>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1146,8 +1146,16 @@ again:
>                                             page->index > details->last_index))
>                                          continue;
>                          }
> -                        ptent = ptep_get_and_clear_full(mm, addr, pte,
> -                                                        tlb->fullmm);
> +                        /*
> +                         * No need for the expensive atomic get and
> +                         * clear for anonymous mappings as the dirty
> +                         * and young bits are not used.
> +                         */
> +                        if (PageAnon(page))
> +                                pte_clear(mm, addr, pte);
> +                        else
> +                                ptent = ptep_get_and_clear_full(mm, addr, pte,
> +                                                                tlb->fullmm);
>                          tlb_remove_tlb_entry(tlb, pte, addr);
>                          if (unlikely(!page))
>                                  continue;

I came up with something very similar, however found that page is
sometimes NULL, which causes problems. if (page && PageAnon(page)) will
probably work... Trying that now.

> Now for the patch itself. On the whole, the approach looks good and the
> real-world performance improvements are nice. Specific comments inline.
>
>> --- a/arch/x86/xen/mmu.c
>> +++ b/arch/x86/xen/mmu.c
>> @@ -2542,3 +2561,77 @@ out:
>>          return err;
>>  }
>>  EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_range);
>> +
>> +/* like xen_remap_domain_mfn_range, but does a list of mfn's, rather
>> + * than the, for xen, quite useless, consecutive pages.
>> + */
> /*
>  * Like xen_remap_domain_mfn_range(), but more efficiently handles MFNs
>  * that are not contiguous (which is common for a domain's memory).
>  */

Will update.

>> +int xen_remap_domain_mfn_list(struct vm_area_struct *vma,
>> +                              unsigned long addr,
>> +                              unsigned long *mfn, int nr,
>> +                              int *err_ptr,
>> +                              pgprot_t prot, unsigned domid)
> xen_remap_domain_mfn_array() ?  It's not taking a list.

Yes, better name. Although, perhaps following Ian Campbell's suggestion
and changing xen_remap_domain_mfn_range to "do the right thing" is even
better. I will have a quick look at how difficult that may be - I feel
it may be rather simpler than I first thought, as there is only one call
to this function in privcmd.c. The arm/xen/enlighten.c will probably
also need fixing, I suppose...

>> +{
>> +        struct remap_list_data rmd;
>> +        struct mmu_update mmu_update[REMAP_BATCH_SIZE];
> This is surprisingly large (256 bytes) but I note that the existing
> xen_remap_domain_mfn_range() does the same thing so I guess it's ok.

I would actually like to increase it, but that would probably require
allocation. One HV call per 32

>> +        int batch;
>> +        int done;
>> +        unsigned long range;
>> +        int err = 0;
>> +
>> +        if (xen_feature(XENFEAT_auto_translated_physmap))
>> +                return -EINVAL;
>> +
>> +        prot = __pgprot(pgprot_val(prot) | _PAGE_IOMAP);
>> +
>> +        BUG_ON(!((vma->vm_flags & (VM_PFNMAP | VM_IO)) == (VM_PFNMAP | VM_IO)));
>> +
>> +        rmd.mfn = mfn;
>> +        rmd.prot = prot;
>> +
>> +        while (nr) {
>> +                batch = min(REMAP_BATCH_SIZE, nr);
>> +                range = (unsigned long)batch << PAGE_SHIFT;
>> +
>> +                rmd.mmu_update = mmu_update;
>> +                err = apply_to_page_range(vma->vm_mm, addr, range,
>> +                                          remap_area_mfn_list_pte_fn, &rmd);
>> +                if (err)
>> +                {
>> +                        printk("xen_remap_domain_mfn_list: apply_to_range: err=%d\n", err);
> Stray printk?

Yes.

>> +                        goto out;
>> +                }
>> +
>> +                err = HYPERVISOR_mmu_update(mmu_update, batch, &done, domid);
>> +                if (err < 0)
>> +                {
>> +                        int i;
>> +                        /* TODO: We should remove this printk later */
>> +                        printk("xen_remap_domain_mfn_list: mmu_update: err=%d, done=%d, batch=%d\n", err, done, batch);
> Yes, you should...
>
>> +                        err_ptr[done] = err;
>> +
>> +                        /* now do the remaining part of this batch */
>> +                        for(i = done+1; i < batch; i++)
>> +                        {
>> +                                int tmp_err = HYPERVISOR_mmu_update(&mmu_update[i], 1, NULL, domid);
>> +                                if (tmp_err < 0)
>> +                                {
>> +                                        err_ptr[i] = tmp_err;
>> +                                }
>> +                        }
> There's no need to fall back to doing it page-by-page here. You can
> still batch the remainder.

Yes, I agree.

>> +
>> +                        goto out;
>> +                }
>> +
>> +                nr -= batch;
>> +                addr += range;
>> +                err_ptr += batch;
>> +        }
>> +
>> +        err = 0;
>> +out:
>> +
>> +        xen_flush_tlb_all();
> Probably not that important anymore since we would now do far fewer TLB
> flushes, but this TLB flush is only needed if the PTEs being updated
> were already present -- if they're all clear then the TLB flush can be
> omitted entirely.

The 2.6 kernel actually had a check BUG_ON(!pte_none(pte)) [or something
like that], but the 3.x kernel seems to have "lost" that. Either way,
once per, say, 1024 pages isn't measurable in my benchmarks.

>
>> +
>> +        return err;
>> +}
>> +EXPORT_SYMBOL_GPL(xen_remap_domain_mfn_list);
>> diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
>> index 8adb9cc..b39a7b7 100644
>> --- a/drivers/xen/privcmd.c
>> +++ b/drivers/xen/privcmd.c
>> @@ -151,6 +151,41 @@ static int traverse_pages(unsigned nelem, size_t size,
>>          return ret;
>>  }
>>
>> +/*
>> + * Similar to traverse_pages, but use each page as a "block" of
>> + * data to be processed as one unit.
>> + */
>> +static int traverse_pages_block(unsigned nelem, size_t size,
>> +                                struct list_head *pos,
>> +                                int (*fn)(void *data, int nr, void *state),
>> +                                void *state)
>> +{
>> +        void *pagedata;
>> +        unsigned pageidx;
>> +        int ret = 0;
>> +
>> +        BUG_ON(size > PAGE_SIZE);
>> +
>> +        pageidx = PAGE_SIZE;
>> +        pagedata = NULL;        /* hush, gcc */
> What was gcc upset about? I don't see anything it could get confused
> about.

This comes from the original traverse_pages code, which does this.
I will see if it compiles without warnings if I remove the
initialization.

>
>> @@ -260,17 +295,17 @@ struct mmap_batch_state {
>>          xen_pfn_t __user *user_mfn;
>>  };
>>
>> -static int mmap_batch_fn(void *data, void *state)
>> +static int mmap_batch_fn(void *data, int nr, void *state)
>>  {
>>          xen_pfn_t *mfnp = data;
>> +
>>          struct mmap_batch_state *st = state;
>>          int ret;
>>
>> -        ret = xen_remap_domain_mfn_range(st->vma, st->va & PAGE_MASK, *mfnp, 1,
>> -                                         st->vma->vm_page_prot, st->domain);
>> +        BUG_ON(nr < 0);
> Is this BUG_ON() useful?

I think very bad things will happen when nr is less than zero, but I
guess they will be pretty obvious... So perhaps just letting it die from
the consequences. It does "while(nr)" in the remap function, and then
min(REMAP_BATCH_SIZE, nr) and various other things that will definitely
not work as expected if nr is negative. I thought it's much easier to
detect when it's gone wrong if you assert early tho'.

-- Mats

>
> David
>
>
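For reference, an uncompiled sketch of the "only flush if something was present" idea - the any_present field is not in the posted patch and is only illustrative:

struct remap_list_data {
        unsigned long *mfn;
        pgprot_t prot;
        struct mmu_update *mmu_update;
        bool any_present;
};

static int remap_area_mfn_list_pte_fn(pte_t *ptep, pgtable_t token,
                                      unsigned long addr, void *data)
{
        struct remap_list_data *rmd = data;
        pte_t pte = pte_mkspecial(pfn_pte(*rmd->mfn++, rmd->prot));

        if (!pte_none(*ptep))           /* an old translation may be cached */
                rmd->any_present = true;

        rmd->mmu_update->ptr = virt_to_machine(ptep).maddr;
        rmd->mmu_update->val = pte_val_ma(pte);
        rmd->mmu_update++;

        return 0;
}

        /* ... and at the end of xen_remap_domain_mfn_list(): */
        if (rmd.any_present)
                xen_flush_tlb_all();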
David Vrabel
2012-Nov-14 16:39 UTC
Re: [RFC/PATCH] Improve speed of mapping guest memory into Dom0
On 14/11/12 13:08, David Vrabel wrote:
> On 14/11/12 11:13, Mats Petersson wrote:
>
>> I have also found that the munmap() call used to unmap the guest memory
>> from Dom0 is about 35% slower in 3.7 kernel than in the 2.6 kernel (3.8M
>> cycles vs 2.8M cycles).
>
> This performance reduction only occurs with 32-bit guests, as Xen then
> traps-and-emulates both halves of the PTE write.
>
>> I think this could be made quicker by using a
>> direct write of zero rather than the compare exchange operation that is
>> currently used [which traps into Xen, performs the compare & exchange] -
>
> This is something I noticed but never got around to producing a patch.
> How about this (uncompiled!) patch?
>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1146,8 +1146,16 @@ again:
>                                             page->index > details->last_index))
>                                          continue;
>                          }
> -                        ptent = ptep_get_and_clear_full(mm, addr, pte,
> -                                                        tlb->fullmm);
> +                        /*
> +                         * No need for the expensive atomic get and
> +                         * clear for anonymous mappings as the dirty
> +                         * and young bits are not used.
> +                         */
> +                        if (PageAnon(page))

The mapping might not be backed by pages (e.g., foreign mappings) so:

        if (!page || PageAnon(page))

> +                                pte_clear(mm, addr, pte);
> +                        else
> +                                ptent = ptep_get_and_clear_full(mm, addr, pte,
> +                                                                tlb->fullmm);
>                          tlb_remove_tlb_entry(tlb, pte, addr);
>                          if (unlikely(!page))
>                                  continue;

David
Mats Petersson
2012-Nov-14 16:43 UTC
Re: [RFC/PATCH] Improve speed of mapping guest memory into Dom0
On 14/11/12 16:39, David Vrabel wrote:
> On 14/11/12 13:08, David Vrabel wrote:
>> On 14/11/12 11:13, Mats Petersson wrote:
>>
>>> I have also found that the munmap() call used to unmap the guest memory
>>> from Dom0 is about 35% slower in 3.7 kernel than in the 2.6 kernel (3.8M
>>> cycles vs 2.8M cycles).
>> This performance reduction only occurs with 32-bit guests, as Xen then
>> traps-and-emulates both halves of the PTE write.
>>
>>> I think this could be made quicker by using a
>>> direct write of zero rather than the compare exchange operation that is
>>> currently used [which traps into Xen, performs the compare & exchange] -
>> This is something I noticed but never got around to producing a patch.
>> How about this (uncompiled!) patch?
>>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1146,8 +1146,16 @@ again:
>>                                             page->index > details->last_index))
>>                                          continue;
>>                          }
>> -                        ptent = ptep_get_and_clear_full(mm, addr, pte,
>> -                                                        tlb->fullmm);
>> +                        /*
>> +                         * No need for the expensive atomic get and
>> +                         * clear for anonymous mappings as the dirty
>> +                         * and young bits are not used.
>> +                         */
>> +                        if (PageAnon(page))
> The mapping might not be backed by pages (e.g., foreign mappings) so:
>
>         if (!page || PageAnon(page))

Indeed, this works fine - it now takes just under 500k cycles to "unmap"
1024 pages, compared to 3800k cycles with the original code.

-- Mats

>
>> +                                pte_clear(mm, addr, pte);
>> +                        else
>> +                                ptent = ptep_get_and_clear_full(mm, addr, pte,
>> +                                                                tlb->fullmm);
>>                          tlb_remove_tlb_entry(tlb, pte, addr);
>>                          if (unlikely(!page))
>>                                  continue;
> David
>
>