Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 0 of 9] [RFC] p2m fine-grained concurrency control
This patch series is an RFC on p2m fine-grained locking. The p2m (on x86) is accessed today in an unsafe manner: lookups hold no locks or refs, so operations like paging or sharing can change the entries out from under them, and even the pages mapped as a result of a lookup may disappear.

This is an attempt at a solution. The gist is to lock 2MB-aligned ranges, exclusively, both for lookups and for modifications. Callers external to the p2m also get a ref on the underlying mfn. This prevents the p2m from being modified while the caller is relying on the translation, and ensures liveness of the underlying page. It also creates protected critical regions within which the caller can bump the ref count of a page (e.g. while establishing a mapping) without being exposed to races. Locking of 2MB ranges is recursive, and we also allow a global lock on the full p2m for heavy-handed operations like log-dirty.

There are plenty of design choices to discuss; the hope is to foster some input and progress on this. Some of the questions below will make sense once you go through the patches:

- Is locking on a 4KB basis necessary? (guess: no)
- We do some ugly things to fit 512 spinlocks in a page...
- Can we hold an entry "captive" for the lifetime of a foreign mapping? Will that not collide with globally-locking p2m operations such as log-dirty? We've decided no and yes, so far.
- Is our current implementation for holding the global p2m lock in a non-exclusive manner too heavy-handed on barriers and spinlocks? Could we just get away with atomics?
- We've considered reader/writer locks. But many code paths require promotions not known a priori, and a deadlock-free promotion is risky to achieve. The semantics of exclusive locking simply make it easier (hah!) to reason about.
- I'm unclear on the lifespan of some pointers in the nested hvm code (e.g. nv_vvmcx). For p2m purposes, the entries are locked and unlocked in different functions, which I'm not sure happen in pairs within the same scheduler slice. Would that violate in_atomic()?
- Note that the last patch is massive. There is no way around modifying all callers of p2m queries.
Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> xen/arch/x86/mm/mm-locks.h | 27 +- xen/arch/x86/mm/mm-locks.h | 27 + xen/arch/x86/mm/mm-locks.h | 11 + xen/arch/x86/mm/p2m-pod.c | 40 +- xen/include/asm-x86/p2m.h | 5 + xen/arch/x86/mm/mm-locks.h | 9 + xen/arch/x86/mm/p2m-pod.c | 145 +++++--- xen/arch/x86/mm/p2m-pt.c | 3 + xen/arch/x86/mm/p2m.c | 7 +- xen/include/asm-x86/p2m.h | 25 +- xen/arch/x86/mm/hap/private.h | 1 + xen/arch/x86/mm/mm-locks.h | 20 +- xen/arch/x86/mm/p2m-ept.c | 1 + xen/arch/x86/mm/p2m-lock.h | 613 +++++++++++++++++++++++++++++++++++++ xen/arch/x86/mm/p2m-pod.c | 1 + xen/arch/x86/mm/p2m-pt.c | 1 + xen/arch/x86/mm/p2m.c | 24 +- xen/include/asm-x86/p2m.h | 3 +- xen/arch/x86/mm/p2m-ept.c | 15 +- xen/arch/x86/mm/p2m-lock.h | 11 +- xen/arch/x86/mm/p2m-pt.c | 82 +++- xen/arch/x86/mm/p2m.c | 38 ++ xen/include/asm-x86/p2m.h | 40 +-- xen/arch/x86/mm/hap/hap.c | 2 +- xen/arch/x86/mm/hap/nested_hap.c | 21 +- xen/arch/x86/mm/p2m-ept.c | 26 +- xen/arch/x86/mm/p2m-pod.c | 42 +- xen/arch/x86/mm/p2m-pt.c | 20 +- xen/arch/x86/mm/p2m.c | 185 +++++++---- xen/include/asm-ia64/mm.h | 5 + xen/include/asm-x86/p2m.h | 45 ++- xen/arch/x86/cpu/mcheck/vmce.c | 7 +- xen/arch/x86/debug.c | 7 +- xen/arch/x86/domain.c | 24 +- xen/arch/x86/domctl.c | 9 +- xen/arch/x86/hvm/emulate.c | 25 +- xen/arch/x86/hvm/hvm.c | 126 ++++++- xen/arch/x86/hvm/mtrr.c | 2 +- xen/arch/x86/hvm/nestedhvm.c | 2 +- xen/arch/x86/hvm/stdvga.c | 4 +- xen/arch/x86/hvm/svm/nestedsvm.c | 12 +- xen/arch/x86/hvm/svm/svm.c | 11 +- xen/arch/x86/hvm/viridian.c | 4 + xen/arch/x86/hvm/vmx/vmx.c | 13 +- xen/arch/x86/hvm/vmx/vvmx.c | 11 +- xen/arch/x86/mm.c | 126 ++++++- xen/arch/x86/mm/guest_walk.c | 11 + xen/arch/x86/mm/hap/guest_walk.c | 15 +- xen/arch/x86/mm/mem_event.c | 28 +- xen/arch/x86/mm/mem_sharing.c | 23 +- xen/arch/x86/mm/shadow/common.c | 4 +- xen/arch/x86/mm/shadow/multi.c | 67 +++- xen/arch/x86/physdev.c | 9 + xen/arch/x86/traps.c | 17 +- xen/common/grant_table.c | 27 +- xen/common/memory.c | 9 + xen/common/tmem_xen.c | 21 +- xen/include/asm-x86/hvm/hvm.h | 5 +- xen/include/asm-x86/hvm/vmx/vvmx.h | 1 + 59 files changed, 1714 insertions(+), 401 deletions(-) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
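[Editorial note] To make the caller discipline described in the cover letter concrete, here is a rough caller-side sketch. It is not code from the series; it borrows the get_p2m_gfn()/put_p2m_gfn() shortcuts that patch 5 introduces, and the lookup call is a schematic stand-in for whatever query API the final patch settles on.

    /* Sketch only: lock the 2MB range covering gfn, translate, take a page
     * ref inside the critical region, then drop the range lock. The page
     * reference keeps the frame live after the lock is released. */
    static mfn_t lookup_and_ref(struct p2m_domain *p2m, unsigned long gfn,
                                p2m_type_t *t)
    {
        mfn_t mfn;

        get_p2m_gfn(p2m, gfn);                /* exclusive lock on the 2MB range  */

        mfn = gfn_to_mfn_query(p2m, gfn, t);  /* translation is stable while held */
        if ( mfn_valid(mfn_x(mfn)) &&
             !get_page(mfn_to_page(mfn), p2m->domain) )
            mfn = _mfn(INVALID_MFN);          /* could not take a ref             */

        put_p2m_gfn(p2m, gfn);                /* caller put_page()s when done     */

        return mfn;
    }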
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 1 of 9] Refactor mm-lock ordering constructs
xen/arch/x86/mm/mm-locks.h | 27 +++++++++++++++++++-------- 1 files changed, 19 insertions(+), 8 deletions(-) The mm layer has a construct to enforce locks are taken in a pre- defined order, and thus avert deadlock. Refactor pieces of this code for later use, no functional changes. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r a33af75083c7 -r 25b9a9966368 xen/arch/x86/mm/mm-locks.h --- a/xen/arch/x86/mm/mm-locks.h +++ b/xen/arch/x86/mm/mm-locks.h @@ -28,6 +28,7 @@ /* Per-CPU variable for enforcing the lock ordering */ DECLARE_PER_CPU(int, mm_lock_level); +#define __get_lock_level() (this_cpu(mm_lock_level)) static inline void mm_lock_init(mm_lock_t *l) { @@ -42,22 +43,32 @@ static inline int mm_locked_by_me(mm_loc return (l->lock.recurse_cpu == current->processor); } +/* If you see this crash, the numbers printed are lines in this file + * where the offending locks are declared. */ +#define __check_lock_level(l) \ +do { \ + if ( unlikely(__get_lock_level()) > (l) ) \ + panic("mm locking order violation: %i > %i\n", \ + __get_lock_level(), (l)); \ +} while(0) + +#define __set_lock_level(l) \ +do { \ + __get_lock_level() = (l); \ +} while(0) + static inline void _mm_lock(mm_lock_t *l, const char *func, int level, int rec) { - /* If you see this crash, the numbers printed are lines in this file - * where the offending locks are declared. */ - if ( unlikely(this_cpu(mm_lock_level) > level) ) - panic("mm locking order violation: %i > %i\n", - this_cpu(mm_lock_level), level); + __check_lock_level(level); spin_lock_recursive(&l->lock); if ( l->lock.recurse_cnt == 1 ) { l->locker_function = func; - l->unlock_level = this_cpu(mm_lock_level); + l->unlock_level = __get_lock_level(); } else if ( (unlikely(!rec)) ) panic("mm lock already held by %s\n", l->locker_function); - this_cpu(mm_lock_level) = level; + __set_lock_level(level); } /* This wrapper uses the line number to express the locking order below */ #define declare_mm_lock(name) \ @@ -72,7 +83,7 @@ static inline void mm_unlock(mm_lock_t * if ( l->lock.recurse_cnt == 1 ) { l->locker_function = "nobody"; - this_cpu(mm_lock_level) = l->unlock_level; + __set_lock_level(l->unlock_level); } spin_unlock_recursive(&l->lock); } _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
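[Editorial note] The ordering check refactored above is easiest to see in isolation. Below is a minimal standalone C model (not Xen code) of the same idea: each lock gets a level, which in Xen is the __LINE__ of its declaration in mm-locks.h; a per-CPU variable records the level currently held; and acquiring a lock whose level is lower than the one currently held is flagged as an ordering violation. The levels 100 and 200 are arbitrary example values.

    #include <stdio.h>
    #include <stdlib.h>

    static int mm_lock_level;            /* stands in for this_cpu(mm_lock_level) */

    static void check_lock_level(int level)
    {
        if (mm_lock_level > level) {
            fprintf(stderr, "mm locking order violation: %d > %d\n",
                    mm_lock_level, level);
            exit(1);                     /* Xen would panic() here */
        }
    }

    static int mm_lock(int level)        /* returns the level to restore on unlock */
    {
        int unlock_level;
        check_lock_level(level);
        /* spin_lock_recursive(...) would go here */
        unlock_level = mm_lock_level;
        mm_lock_level = level;
        return unlock_level;
    }

    static void mm_unlock(int unlock_level)
    {
        /* spin_unlock_recursive(...) would go here */
        mm_lock_level = unlock_level;
    }

    int main(void)
    {
        int l1, l2;

        l1 = mm_lock(100);               /* e.g. a lock declared earlier in the file */
        l2 = mm_lock(200);               /* e.g. a lock declared later: allowed      */
        mm_unlock(l2);
        mm_unlock(l1);

        mm_lock(200);
        mm_lock(100);                    /* wrong order: trips the check             */
        return 0;
    }

The second pair of acquisitions trips the check; that is the point at which Xen would panic() and print the two declaration line numbers.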
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 2 of 9] Declare an order-enforcing construct for external locks used in the mm layer
xen/arch/x86/mm/mm-locks.h | 27 +++++++++++++++++++++++++++ 1 files changed, 27 insertions(+), 0 deletions(-) Declare an order-enforcing construct for a lock used in the mm layer that is not of type mm_lock_t. This is useful whenever the mm layer takes locks from other subsystems, or locks not implemented as mm_lock_t. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 25b9a9966368 -r c915609e4235 xen/arch/x86/mm/mm-locks.h --- a/xen/arch/x86/mm/mm-locks.h +++ b/xen/arch/x86/mm/mm-locks.h @@ -70,6 +70,18 @@ static inline void _mm_lock(mm_lock_t *l panic("mm lock already held by %s\n", l->locker_function); __set_lock_level(level); } + +static inline void _mm_enforce_order_lock_pre(int level) +{ + __check_lock_level(level); +} + +static inline void _mm_enforce_order_lock_post(int level, int *unlock_level) +{ + *unlock_level = __get_lock_level(); + __set_lock_level(level); +} + /* This wrapper uses the line number to express the locking order below */ #define declare_mm_lock(name) \ static inline void mm_lock_##name(mm_lock_t *l, const char *func, int rec)\ @@ -78,6 +90,16 @@ static inline void _mm_lock(mm_lock_t *l #define mm_lock(name, l) mm_lock_##name(l, __func__, 0) #define mm_lock_recursive(name, l) mm_lock_##name(l, __func__, 1) +/* This wrapper is intended for "external" locks which do not use + * the mm_lock_t types. Such locks inside the mm code are also subject + * to ordering constraints. They cannot be recursive (yet, additional + * bookkepping is necessary) */ +#define declare_mm_order_constraint(name) \ + static inline void mm_enforce_order_lock_pre_##name(void) \ + { _mm_enforce_order_lock_pre(__LINE__); } \ + static inline void mm_enforce_order_lock_post_##name(int *unlock_level) \ + { _mm_enforce_order_lock_post(__LINE__, unlock_level); } \ + static inline void mm_unlock(mm_lock_t *l) { if ( l->lock.recurse_cnt == 1 ) @@ -88,6 +110,11 @@ static inline void mm_unlock(mm_lock_t * spin_unlock_recursive(&l->lock); } +static inline void mm_enforce_order_unlock(int unlock_level) +{ + __set_lock_level(unlock_level); +} + /************************************************************************ * * * To avoid deadlocks, these locks _MUST_ be taken in the order they''re * _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
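[Editorial note] A minimal sketch of how a subsystem lock that is not an mm_lock_t would be bracketed with these hooks. The lock name "foo" and its struct are hypothetical; the next patch in the series instantiates exactly this pattern for the per-domain page_alloc lock. Note that the declare_mm_order_constraint(foo) line has to live in mm-locks.h at the position that reflects foo's place in the lock order, since __LINE__ supplies the ordering level.

    /* Hypothetical external lock "foo"; in the series this is the
     * per-domain page_alloc lock. */
    declare_mm_order_constraint(foo)

    static inline void lock_foo(struct foo_state *f)
    {
        mm_enforce_order_lock_pre_foo();                    /* check the order   */
        spin_lock(&f->lock);                                /* the external lock */
        mm_enforce_order_lock_post_foo(&f->unlock_level);   /* save prior level  */
    }

    static inline void unlock_foo(struct foo_state *f)
    {
        mm_enforce_order_unlock(f->unlock_level);           /* restore the level */
        spin_unlock(&f->lock);
    }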
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 3 of 9] Enforce ordering constraints for the page alloc lock in the PoD code
xen/arch/x86/mm/mm-locks.h | 11 +++++++++++ xen/arch/x86/mm/p2m-pod.c | 40 +++++++++++++++++++++++++++------------- xen/include/asm-x86/p2m.h | 5 +++++ 3 files changed, 43 insertions(+), 13 deletions(-) The page alloc lock is sometimes used in the PoD code, with an explicit expectation of ordering. Use our ordering constructs in the mm layer to enforce this. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r c915609e4235 -r 332775f72a30 xen/arch/x86/mm/mm-locks.h --- a/xen/arch/x86/mm/mm-locks.h +++ b/xen/arch/x86/mm/mm-locks.h @@ -155,6 +155,17 @@ declare_mm_lock(p2m) #define p2m_unlock(p) mm_unlock(&(p)->lock) #define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) +/* Page alloc lock (per-domain) + * + * This is an external lock, not represented by an mm_lock_t. However, + * pod code uses it in conjunction with the p2m lock, and expecting + * the ordering which we enforce here */ + +declare_mm_order_constraint(page_alloc) +#define page_alloc_mm_pre_lock() mm_enforce_order_lock_pre_page_alloc() +#define page_alloc_mm_post_lock(l) mm_enforce_order_lock_post_page_alloc(&(l)) +#define page_alloc_mm_unlock(l) mm_enforce_order_unlock((l)) + /* Paging lock (per-domain) * * For shadow pagetables, this lock protects diff -r c915609e4235 -r 332775f72a30 xen/arch/x86/mm/p2m-pod.c --- a/xen/arch/x86/mm/p2m-pod.c +++ b/xen/arch/x86/mm/p2m-pod.c @@ -45,6 +45,20 @@ #define superpage_aligned(_x) (((_x)&(SUPERPAGE_PAGES-1))==0) +/* Enforce lock ordering when grabbing the "external" page_alloc lock */ +static inline void lock_page_alloc(struct p2m_domain *p2m) +{ + page_alloc_mm_pre_lock(); + spin_lock(&(p2m->domain->page_alloc_lock)); + page_alloc_mm_post_lock(p2m->pod.page_alloc_unlock_level); +} + +static inline void unlock_page_alloc(struct p2m_domain *p2m) +{ + page_alloc_mm_unlock(p2m->pod.page_alloc_unlock_level); + spin_unlock(&(p2m->domain->page_alloc_lock)); +} + /* * Populate-on-demand functionality */ @@ -100,7 +114,7 @@ p2m_pod_cache_add(struct p2m_domain *p2m unmap_domain_page(b); } - spin_lock(&d->page_alloc_lock); + lock_page_alloc(p2m); /* First, take all pages off the domain list */ for(i=0; i < 1 << order ; i++) @@ -128,7 +142,7 @@ p2m_pod_cache_add(struct p2m_domain *p2m * This may cause "zombie domains" since the page will never be freed. */ BUG_ON( d->arch.relmem != RELMEM_not_started ); - spin_unlock(&d->page_alloc_lock); + unlock_page_alloc(p2m); return 0; } @@ -245,7 +259,7 @@ p2m_pod_set_cache_target(struct p2m_doma /* Grab the lock before checking that pod.super is empty, or the last * entries may disappear before we grab the lock. 
*/ - spin_lock(&d->page_alloc_lock); + lock_page_alloc(p2m); if ( (p2m->pod.count - pod_target) > SUPERPAGE_PAGES && !page_list_empty(&p2m->pod.super) ) @@ -257,7 +271,7 @@ p2m_pod_set_cache_target(struct p2m_doma ASSERT(page != NULL); - spin_unlock(&d->page_alloc_lock); + unlock_page_alloc(p2m); /* Then free them */ for ( i = 0 ; i < (1 << order) ; i++ ) @@ -378,7 +392,7 @@ p2m_pod_empty_cache(struct domain *d) BUG_ON(!d->is_dying); spin_barrier(&p2m->lock.lock); - spin_lock(&d->page_alloc_lock); + lock_page_alloc(p2m); while ( (page = page_list_remove_head(&p2m->pod.super)) ) { @@ -403,7 +417,7 @@ p2m_pod_empty_cache(struct domain *d) BUG_ON(p2m->pod.count != 0); - spin_unlock(&d->page_alloc_lock); + unlock_page_alloc(p2m); } int @@ -417,7 +431,7 @@ p2m_pod_offline_or_broken_hit(struct pag if ( !(d = page_get_owner(p)) || !(p2m = p2m_get_hostp2m(d)) ) return 0; - spin_lock(&d->page_alloc_lock); + lock_page_alloc(p2m); bmfn = mfn_x(page_to_mfn(p)); page_list_for_each_safe(q, tmp, &p2m->pod.super) { @@ -448,12 +462,12 @@ p2m_pod_offline_or_broken_hit(struct pag } } - spin_unlock(&d->page_alloc_lock); + unlock_page_alloc(p2m); return 0; pod_hit: page_list_add_tail(p, &d->arch.relmem_list); - spin_unlock(&d->page_alloc_lock); + unlock_page_alloc(p2m); return 1; } @@ -994,7 +1008,7 @@ p2m_pod_demand_populate(struct p2m_domai if ( q == p2m_guest && gfn > p2m->pod.max_guest ) p2m->pod.max_guest = gfn; - spin_lock(&d->page_alloc_lock); + lock_page_alloc(p2m); if ( p2m->pod.count == 0 ) goto out_of_memory; @@ -1008,7 +1022,7 @@ p2m_pod_demand_populate(struct p2m_domai BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0); - spin_unlock(&d->page_alloc_lock); + unlock_page_alloc(p2m); gfn_aligned = (gfn >> order) << order; @@ -1040,7 +1054,7 @@ p2m_pod_demand_populate(struct p2m_domai return 0; out_of_memory: - spin_unlock(&d->page_alloc_lock); + unlock_page_alloc(p2m); printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n", __func__, d->tot_pages, p2m->pod.entry_count); @@ -1049,7 +1063,7 @@ out_fail: return -1; remap_and_retry: BUG_ON(order != PAGE_ORDER_2M); - spin_unlock(&d->page_alloc_lock); + unlock_page_alloc(p2m); /* Remap this 2-meg region in singleton chunks */ gfn_aligned = (gfn>>order)<<order; diff -r c915609e4235 -r 332775f72a30 xen/include/asm-x86/p2m.h --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -270,6 +270,10 @@ struct p2m_domain { * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid * double-demand-populating of pages, the page_alloc lock to * protect moving stuff from the PoD cache to the domain page list. + * + * We enforce this lock ordering through a construct in mm-locks.h. + * This demands, however, that we store the previous lock-ordering + * level in effect before grabbing the page_alloc lock. */ struct { struct page_list_head super, /* List of superpages */ @@ -279,6 +283,7 @@ struct p2m_domain { unsigned reclaim_super; /* Last gpfn of a scan */ unsigned reclaim_single; /* Last gpfn of a scan */ unsigned max_guest; /* gpfn of max guest demand-populate */ + int page_alloc_unlock_level; /* To enforce lock ordering */ } pod; }; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 4 of 9] Rework locking in the PoD layer
xen/arch/x86/mm/mm-locks.h | 9 ++ xen/arch/x86/mm/p2m-pod.c | 145 +++++++++++++++++++++++++++------------------ xen/arch/x86/mm/p2m-pt.c | 3 + xen/arch/x86/mm/p2m.c | 7 +- xen/include/asm-x86/p2m.h | 25 ++----- 5 files changed, 113 insertions(+), 76 deletions(-) The PoD layer has a fragile locking discipline. It relies on the p2m being globally locked, and it also relies on the page alloc lock to protect some of its data structures. Replace this all by an explicit pod lock: per p2m, order enforced. Two consequences: - Critical sections in the pod code protected by the page alloc lock are now reduced to modifications of the domain page list. - When the p2m lock becomes fine-grained, there are no assumptions broken in the PoD layer. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/mm-locks.h --- a/xen/arch/x86/mm/mm-locks.h +++ b/xen/arch/x86/mm/mm-locks.h @@ -155,6 +155,15 @@ declare_mm_lock(p2m) #define p2m_unlock(p) mm_unlock(&(p)->lock) #define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) +/* PoD lock (per-p2m-table) + * + * Protects private PoD data structs. */ + +declare_mm_lock(pod) +#define pod_lock(p) mm_lock(pod, &(p)->pod.lock) +#define pod_unlock(p) mm_unlock(&(p)->pod.lock) +#define pod_locked_by_me(p) mm_locked_by_me(&(p)->pod.lock) + /* Page alloc lock (per-domain) * * This is an external lock, not represented by an mm_lock_t. However, diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m-pod.c --- a/xen/arch/x86/mm/p2m-pod.c +++ b/xen/arch/x86/mm/p2m-pod.c @@ -63,6 +63,7 @@ static inline void unlock_page_alloc(str * Populate-on-demand functionality */ +/* PoD lock held on entry */ static int p2m_pod_cache_add(struct p2m_domain *p2m, struct page_info *page, @@ -114,43 +115,42 @@ p2m_pod_cache_add(struct p2m_domain *p2m unmap_domain_page(b); } + /* First, take all pages off the domain list */ lock_page_alloc(p2m); - - /* First, take all pages off the domain list */ for(i=0; i < 1 << order ; i++) { p = page + i; page_list_del(p, &d->page_list); } - /* Then add the first one to the appropriate populate-on-demand list */ - switch(order) - { - case PAGE_ORDER_2M: - page_list_add_tail(page, &p2m->pod.super); /* lock: page_alloc */ - p2m->pod.count += 1 << order; - break; - case PAGE_ORDER_4K: - page_list_add_tail(page, &p2m->pod.single); /* lock: page_alloc */ - p2m->pod.count += 1; - break; - default: - BUG(); - } - /* Ensure that the PoD cache has never been emptied. * This may cause "zombie domains" since the page will never be freed. */ BUG_ON( d->arch.relmem != RELMEM_not_started ); unlock_page_alloc(p2m); + /* Then add the first one to the appropriate populate-on-demand list */ + switch(order) + { + case PAGE_ORDER_2M: + page_list_add_tail(page, &p2m->pod.super); + p2m->pod.count += 1 << order; + break; + case PAGE_ORDER_4K: + page_list_add_tail(page, &p2m->pod.single); + p2m->pod.count += 1; + break; + default: + BUG(); + } + return 0; } /* Get a page of size order from the populate-on-demand cache. Will break * down 2-meg pages into singleton pages automatically. Returns null if - * a superpage is requested and no superpages are available. Must be called - * with the d->page_lock held. */ + * a superpage is requested and no superpages are available. 
*/ +/* PoD lock held on entry */ static struct page_info * p2m_pod_cache_get(struct p2m_domain *p2m, unsigned long order) { @@ -185,7 +185,7 @@ static struct page_info * p2m_pod_cache_ case PAGE_ORDER_2M: BUG_ON( page_list_empty(&p2m->pod.super) ); p = page_list_remove_head(&p2m->pod.super); - p2m->pod.count -= 1 << order; /* Lock: page_alloc */ + p2m->pod.count -= 1 << order; break; case PAGE_ORDER_4K: BUG_ON( page_list_empty(&p2m->pod.single) ); @@ -197,16 +197,19 @@ static struct page_info * p2m_pod_cache_ } /* Put the pages back on the domain page_list */ + lock_page_alloc(p2m); for ( i = 0 ; i < (1 << order); i++ ) { BUG_ON(page_get_owner(p + i) != p2m->domain); page_list_add_tail(p + i, &p2m->domain->page_list); } + unlock_page_alloc(p2m); return p; } /* Set the size of the cache, allocating or freeing as necessary. */ +/* PoD lock held on entry */ static int p2m_pod_set_cache_target(struct p2m_domain *p2m, unsigned long pod_target, int preemptible) { @@ -259,8 +262,6 @@ p2m_pod_set_cache_target(struct p2m_doma /* Grab the lock before checking that pod.super is empty, or the last * entries may disappear before we grab the lock. */ - lock_page_alloc(p2m); - if ( (p2m->pod.count - pod_target) > SUPERPAGE_PAGES && !page_list_empty(&p2m->pod.super) ) order = PAGE_ORDER_2M; @@ -271,8 +272,6 @@ p2m_pod_set_cache_target(struct p2m_doma ASSERT(page != NULL); - unlock_page_alloc(p2m); - /* Then free them */ for ( i = 0 ; i < (1 << order) ; i++ ) { @@ -348,7 +347,7 @@ p2m_pod_set_mem_target(struct domain *d, int ret = 0; unsigned long populated; - p2m_lock(p2m); + pod_lock(p2m); /* P == B: Nothing to do. */ if ( p2m->pod.entry_count == 0 ) @@ -377,7 +376,7 @@ p2m_pod_set_mem_target(struct domain *d, ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/); out: - p2m_unlock(p2m); + pod_unlock(p2m); return ret; } @@ -390,7 +389,7 @@ p2m_pod_empty_cache(struct domain *d) /* After this barrier no new PoD activities can happen. */ BUG_ON(!d->is_dying); - spin_barrier(&p2m->lock.lock); + spin_barrier(&p2m->pod.lock.lock); lock_page_alloc(p2m); @@ -431,7 +430,8 @@ p2m_pod_offline_or_broken_hit(struct pag if ( !(d = page_get_owner(p)) || !(p2m = p2m_get_hostp2m(d)) ) return 0; - lock_page_alloc(p2m); + pod_lock(p2m); + bmfn = mfn_x(page_to_mfn(p)); page_list_for_each_safe(q, tmp, &p2m->pod.super) { @@ -462,12 +462,14 @@ p2m_pod_offline_or_broken_hit(struct pag } } - unlock_page_alloc(p2m); + pod_unlock(p2m); return 0; pod_hit: + lock_page_alloc(p2m); page_list_add_tail(p, &d->arch.relmem_list); unlock_page_alloc(p2m); + pod_unlock(p2m); return 1; } @@ -486,9 +488,9 @@ p2m_pod_offline_or_broken_replace(struct if ( unlikely(!p) ) return; - p2m_lock(p2m); + pod_lock(p2m); p2m_pod_cache_add(p2m, p, PAGE_ORDER_4K); - p2m_unlock(p2m); + pod_unlock(p2m); return; } @@ -512,6 +514,7 @@ p2m_pod_decrease_reservation(struct doma int steal_for_cache = 0; int pod = 0, nonpod = 0, ram = 0; + pod_lock(p2m); /* If we don''t have any outstanding PoD entries, let things take their * course */ @@ -521,11 +524,10 @@ p2m_pod_decrease_reservation(struct doma /* Figure out if we need to steal some freed memory for our cache */ steal_for_cache = ( p2m->pod.entry_count > p2m->pod.count ); - p2m_lock(p2m); audit_p2m(p2m, 1); if ( unlikely(d->is_dying) ) - goto out_unlock; + goto out; /* See what''s in here. */ /* FIXME: Add contiguous; query for PSE entries? */ @@ -547,14 +549,14 @@ p2m_pod_decrease_reservation(struct doma /* No populate-on-demand? Don''t need to steal anything? 
Then we''re done!*/ if(!pod && !steal_for_cache) - goto out_unlock; + goto out_audit; if ( !nonpod ) { /* All PoD: Mark the whole region invalid and tell caller * we''re done. */ set_p2m_entry(p2m, gpfn, _mfn(INVALID_MFN), order, p2m_invalid, p2m->default_access); - p2m->pod.entry_count-=(1<<order); /* Lock: p2m */ + p2m->pod.entry_count-=(1<<order); BUG_ON(p2m->pod.entry_count < 0); ret = 1; goto out_entry_check; @@ -577,7 +579,7 @@ p2m_pod_decrease_reservation(struct doma if ( t == p2m_populate_on_demand ) { set_p2m_entry(p2m, gpfn + i, _mfn(INVALID_MFN), 0, p2m_invalid, p2m->default_access); - p2m->pod.entry_count--; /* Lock: p2m */ + p2m->pod.entry_count--; BUG_ON(p2m->pod.entry_count < 0); pod--; } @@ -613,11 +615,11 @@ out_entry_check: p2m_pod_set_cache_target(p2m, p2m->pod.entry_count, 0/*can''t preempt*/); } -out_unlock: +out_audit: audit_p2m(p2m, 1); - p2m_unlock(p2m); out: + pod_unlock(p2m); return ret; } @@ -630,20 +632,24 @@ void p2m_pod_dump_data(struct domain *d) /* Search for all-zero superpages to be reclaimed as superpages for the - * PoD cache. Must be called w/ p2m lock held, page_alloc lock not held. */ -static int + * PoD cache. Must be called w/ pod lock held, page_alloc lock not held. */ +static void p2m_pod_zero_check_superpage(struct p2m_domain *p2m, unsigned long gfn) { mfn_t mfn, mfn0 = _mfn(INVALID_MFN); p2m_type_t type, type0 = 0; unsigned long * map = NULL; - int ret=0, reset = 0; + int success = 0, reset = 0; int i, j; int max_ref = 1; struct domain *d = p2m->domain; if ( !superpage_aligned(gfn) ) - goto out; + return; + + /* If we were enforcing ordering against p2m locks, this is a place + * to drop the PoD lock and re-acquire it once we''re done mucking with + * the p2m. */ /* Allow an extra refcount for one shadow pt mapping in shadowed domains */ if ( paging_mode_shadow(d) ) @@ -751,19 +757,24 @@ p2m_pod_zero_check_superpage(struct p2m_ __trace_var(TRC_MEM_POD_ZERO_RECLAIM, 0, sizeof(t), &t); } - /* Finally! We''ve passed all the checks, and can add the mfn superpage - * back on the PoD cache, and account for the new p2m PoD entries */ - p2m_pod_cache_add(p2m, mfn_to_page(mfn0), PAGE_ORDER_2M); - p2m->pod.entry_count += SUPERPAGE_PAGES; + success = 1; + out_reset: if ( reset ) set_p2m_entry(p2m, gfn, mfn0, 9, type0, p2m->default_access); out: - return ret; + if ( success ) + { + /* Finally! We''ve passed all the checks, and can add the mfn superpage + * back on the PoD cache, and account for the new p2m PoD entries */ + p2m_pod_cache_add(p2m, mfn_to_page(mfn0), PAGE_ORDER_2M); + p2m->pod.entry_count += SUPERPAGE_PAGES; + } } +/* On entry, PoD lock is held */ static void p2m_pod_zero_check(struct p2m_domain *p2m, unsigned long *gfns, int count) { @@ -775,6 +786,8 @@ p2m_pod_zero_check(struct p2m_domain *p2 int i, j; int max_ref = 1; + /* Also the right time to drop pod_lock if enforcing ordering against p2m_lock */ + /* Allow an extra refcount for one shadow pt mapping in shadowed domains */ if ( paging_mode_shadow(d) ) max_ref++; @@ -841,7 +854,6 @@ p2m_pod_zero_check(struct p2m_domain *p2 if( *(map[i]+j) != 0 ) break; - unmap_domain_page(map[i]); /* See comment in p2m_pod_zero_check_superpage() re gnttab * check timing. 
*/ @@ -849,8 +861,15 @@ p2m_pod_zero_check(struct p2m_domain *p2 { set_p2m_entry(p2m, gfns[i], mfns[i], PAGE_ORDER_4K, types[i], p2m->default_access); + unmap_domain_page(map[i]); + map[i] = NULL; } - else + } + + /* Finally, add to cache */ + for ( i=0; i < count; i++ ) + { + if ( map[i] ) { if ( tb_init_done ) { @@ -867,6 +886,8 @@ p2m_pod_zero_check(struct p2m_domain *p2 __trace_var(TRC_MEM_POD_ZERO_RECLAIM, 0, sizeof(t), &t); } + unmap_domain_page(map[i]); + /* Add to cache, and account for the new p2m PoD entry */ p2m_pod_cache_add(p2m, mfn_to_page(mfns[i]), PAGE_ORDER_4K); p2m->pod.entry_count++; @@ -876,6 +897,7 @@ p2m_pod_zero_check(struct p2m_domain *p2 } #define POD_SWEEP_LIMIT 1024 +/* Only one CPU at a time is guaranteed to enter a sweep */ static void p2m_pod_emergency_sweep_super(struct p2m_domain *p2m) { @@ -964,7 +986,8 @@ p2m_pod_demand_populate(struct p2m_domai ASSERT(p2m_locked_by_me(p2m)); - /* This check is done with the p2m lock held. This will make sure that + pod_lock(p2m); + /* This check is done with the pod lock held. This will make sure that * even if d->is_dying changes under our feet, p2m_pod_empty_cache() * won''t start until we''re done. */ if ( unlikely(d->is_dying) ) @@ -974,6 +997,7 @@ p2m_pod_demand_populate(struct p2m_domai * 1GB region to 2MB chunks for a retry. */ if ( order == PAGE_ORDER_1G ) { + pod_unlock(p2m); gfn_aligned = (gfn >> order) << order; /* Note that we are supposed to call set_p2m_entry() 512 times to * split 1GB into 512 2MB pages here. But We only do once here because @@ -983,6 +1007,7 @@ p2m_pod_demand_populate(struct p2m_domai set_p2m_entry(p2m, gfn_aligned, _mfn(0), PAGE_ORDER_2M, p2m_populate_on_demand, p2m->default_access); audit_p2m(p2m, 1); + /* This is because the ept/pt caller locks the p2m recursively */ p2m_unlock(p2m); return 0; } @@ -996,11 +1021,15 @@ p2m_pod_demand_populate(struct p2m_domai /* If we''re low, start a sweep */ if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) + /* Note that sweeps scan other ranges in the p2m. In an scenario + * in which p2m locks are order-enforced wrt pod lock and p2m + * locks are fine grained, this will result in deadlock */ p2m_pod_emergency_sweep_super(p2m); if ( page_list_empty(&p2m->pod.single) && ( ( order == PAGE_ORDER_4K ) || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) ) + /* Same comment regarding deadlock applies */ p2m_pod_emergency_sweep(p2m); } @@ -1008,8 +1037,6 @@ p2m_pod_demand_populate(struct p2m_domai if ( q == p2m_guest && gfn > p2m->pod.max_guest ) p2m->pod.max_guest = gfn; - lock_page_alloc(p2m); - if ( p2m->pod.count == 0 ) goto out_of_memory; @@ -1022,8 +1049,6 @@ p2m_pod_demand_populate(struct p2m_domai BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0); - unlock_page_alloc(p2m); - gfn_aligned = (gfn >> order) << order; set_p2m_entry(p2m, gfn_aligned, mfn, order, p2m_ram_rw, p2m->default_access); @@ -1034,8 +1059,9 @@ p2m_pod_demand_populate(struct p2m_domai paging_mark_dirty(d, mfn_x(mfn) + i); } - p2m->pod.entry_count -= (1 << order); /* Lock: p2m */ + p2m->pod.entry_count -= (1 << order); BUG_ON(p2m->pod.entry_count < 0); + pod_unlock(p2m); if ( tb_init_done ) { @@ -1054,16 +1080,17 @@ p2m_pod_demand_populate(struct p2m_domai return 0; out_of_memory: - unlock_page_alloc(p2m); + pod_unlock(p2m); printk("%s: Out of populate-on-demand memory! 
tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n", __func__, d->tot_pages, p2m->pod.entry_count); domain_crash(d); out_fail: + pod_unlock(p2m); return -1; remap_and_retry: BUG_ON(order != PAGE_ORDER_2M); - unlock_page_alloc(p2m); + pod_unlock(p2m); /* Remap this 2-meg region in singleton chunks */ gfn_aligned = (gfn>>order)<<order; @@ -1133,9 +1160,11 @@ guest_physmap_mark_populate_on_demand(st rc = -EINVAL; else { - p2m->pod.entry_count += 1 << order; /* Lock: p2m */ + pod_lock(p2m); + p2m->pod.entry_count += 1 << order; p2m->pod.entry_count -= pod_count; BUG_ON(p2m->pod.entry_count < 0); + pod_unlock(p2m); } audit_p2m(p2m, 1); diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m-pt.c --- a/xen/arch/x86/mm/p2m-pt.c +++ b/xen/arch/x86/mm/p2m-pt.c @@ -1001,6 +1001,7 @@ void audit_p2m(struct p2m_domain *p2m, i if ( !paging_mode_translate(d) ) return; + pod_lock(p2m); //P2M_PRINTK("p2m audit starts\n"); test_linear = ( (d == current->domain) @@ -1247,6 +1248,8 @@ void audit_p2m(struct p2m_domain *p2m, i pmbad, mpbad); WARN(); } + + pod_unlock(p2m); } #endif /* P2M_AUDIT */ diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m.c --- a/xen/arch/x86/mm/p2m.c +++ b/xen/arch/x86/mm/p2m.c @@ -72,6 +72,7 @@ boolean_param("hap_2mb", opt_hap_2mb); static void p2m_initialise(struct domain *d, struct p2m_domain *p2m) { mm_lock_init(&p2m->lock); + mm_lock_init(&p2m->pod.lock); INIT_LIST_HEAD(&p2m->np2m_list); INIT_PAGE_LIST_HEAD(&p2m->pages); INIT_PAGE_LIST_HEAD(&p2m->pod.super); @@ -506,8 +507,10 @@ guest_physmap_add_entry(struct domain *d rc = -EINVAL; else { - p2m->pod.entry_count -= pod_count; /* Lock: p2m */ + pod_lock(p2m); + p2m->pod.entry_count -= pod_count; BUG_ON(p2m->pod.entry_count < 0); + pod_unlock(p2m); } } @@ -1125,8 +1128,10 @@ p2m_flush_table(struct p2m_domain *p2m) /* "Host" p2m tables can have shared entries &c that need a bit more * care when discarding them */ ASSERT(p2m_is_nestedp2m(p2m)); + pod_lock(p2m); ASSERT(page_list_empty(&p2m->pod.super)); ASSERT(page_list_empty(&p2m->pod.single)); + pod_unlock(p2m); /* This is no longer a valid nested p2m for any address space */ p2m->cr3 = CR3_EADDR; diff -r 332775f72a30 -r 981073d78f7f xen/include/asm-x86/p2m.h --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -257,24 +257,13 @@ struct p2m_domain { unsigned long max_mapped_pfn; /* Populate-on-demand variables - * NB on locking. {super,single,count} are - * covered by d->page_alloc_lock, since they''re almost always used in - * conjunction with that functionality. {entry_count} is covered by - * the domain p2m lock, since it''s almost always used in conjunction - * with changing the p2m tables. * - * At this point, both locks are held in two places. In both, - * the order is [p2m,page_alloc]: - * + p2m_pod_decrease_reservation() calls p2m_pod_cache_add(), - * which grabs page_alloc - * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid - * double-demand-populating of pages, the page_alloc lock to - * protect moving stuff from the PoD cache to the domain page list. - * - * We enforce this lock ordering through a construct in mm-locks.h. - * This demands, however, that we store the previous lock-ordering - * level in effect before grabbing the page_alloc lock. - */ + * All variables are protected with the pod lock. We cannot rely on + * the p2m lock if it''s turned into a fine-grained lock. + * We only use the domain page_alloc lock for additions and + * deletions to the domain''s page list. 
Because we use it nested + * within the PoD lock, we enforce it''s ordering (by remembering + * the unlock level). */ struct { struct page_list_head super, /* List of superpages */ single; /* Non-super lists */ @@ -283,6 +272,8 @@ struct p2m_domain { unsigned reclaim_super; /* Last gpfn of a scan */ unsigned reclaim_single; /* Last gpfn of a scan */ unsigned max_guest; /* gpfn of max guest demand-populate */ + mm_lock_t lock; /* Locking of private pod structs, * + * not relying on the p2m lock. */ int page_alloc_unlock_level; /* To enforce lock ordering */ } pod; }; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 5 of 9] Fine-grained concurrency control structure for the p2m
xen/arch/x86/mm/hap/private.h | 1 + xen/arch/x86/mm/mm-locks.h | 20 +- xen/arch/x86/mm/p2m-ept.c | 1 + xen/arch/x86/mm/p2m-lock.h | 613 ++++++++++++++++++++++++++++++++++++++++++ xen/arch/x86/mm/p2m-pod.c | 1 + xen/arch/x86/mm/p2m-pt.c | 1 + xen/arch/x86/mm/p2m.c | 24 +- xen/include/asm-x86/p2m.h | 3 +- 8 files changed, 652 insertions(+), 12 deletions(-) Introduce a fine-grained concurrency control structure for the p2m. This allows for locking 2M-aligned chunks of the p2m at a time, exclusively. Recursive locking is allowed. Global locking of the whole p2m is also allowed for certain operations. Simple deadlock detection heuristics are put in place. Note the patch creates backwards-compatible shortcuts that will lock the p2m globally. So it should remain functionally identical to what is currently in place. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 981073d78f7f -r a23e1262b124 xen/arch/x86/mm/hap/private.h --- a/xen/arch/x86/mm/hap/private.h +++ b/xen/arch/x86/mm/hap/private.h @@ -21,6 +21,7 @@ #define __HAP_PRIVATE_H__ #include "../mm-locks.h" +#include "../p2m-lock.h" /********************************************/ /* GUEST TRANSLATION FUNCS */ diff -r 981073d78f7f -r a23e1262b124 xen/arch/x86/mm/mm-locks.h --- a/xen/arch/x86/mm/mm-locks.h +++ b/xen/arch/x86/mm/mm-locks.h @@ -146,14 +146,22 @@ declare_mm_lock(nestedp2m) /* P2M lock (per-p2m-table) * - * This protects all updates to the p2m table. Updates are expected to - * be safe against concurrent reads, which do *not* require the lock. */ + * This protects all updates to the p2m table. + * + * In 64 bit mode we disable this because the lock becomes fine-grained, + * and several code paths cause inversion/deadlock: + * -- PoD sweeps + * -- mem_sharing_unshare_page + * -- generally widespread recursive locking, which we don''t support + * (yet, I guess) on an "external" mm lock. */ +#ifndef __x86_64__ declare_mm_lock(p2m) -#define p2m_lock(p) mm_lock(p2m, &(p)->lock) -#define p2m_lock_recursive(p) mm_lock_recursive(p2m, &(p)->lock) -#define p2m_unlock(p) mm_unlock(&(p)->lock) -#define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) +#define _p2m_lock(p) mm_lock(p2m, &(p)->lock) +#define _p2m_lock_recursive(p) mm_lock_recursive(p2m, &(p)->lock) +#define _p2m_unlock(p) mm_unlock(&(p)->lock) +#define _p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) +#endif /* __x86_64__ */ /* PoD lock (per-p2m-table) * diff -r 981073d78f7f -r a23e1262b124 xen/arch/x86/mm/p2m-ept.c --- a/xen/arch/x86/mm/p2m-ept.c +++ b/xen/arch/x86/mm/p2m-ept.c @@ -33,6 +33,7 @@ #include <xen/softirq.h> #include "mm-locks.h" +#include "p2m-lock.h" #define atomic_read_ept_entry(__pepte) \ ( (ept_entry_t) { .epte = atomic_read64(&(__pepte)->epte) } ) diff -r 981073d78f7f -r a23e1262b124 xen/arch/x86/mm/p2m-lock.h --- /dev/null +++ b/xen/arch/x86/mm/p2m-lock.h @@ -0,0 +1,613 @@ +/****************************************************************************** + * arch/x86/mm/p2m-lock.h + * + * Fine-grained locking of the p2m. Allow for concurrent updates to different + * regions of the p2m. Serially synchronize updates and lookups. Mutex + * access on p2m entries while a CPU is using them. + * + * Copyright (c) 2011 Andres Lagar-Cavilla, GridCentric Inc. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef _XEN_P2M_LOCK_H +#define _XEN_P2M_LOCK_H + +#include <xen/config.h> +#include <xen/lib.h> +/* See comment about space consideration for spinlocks below */ +#define NDEBUG +#undef LOCK_PROFILE +#include <xen/spinlock.h> +#include <asm/atomic.h> +#include <xen/xmalloc.h> +#include <xen/paging.h> +#include <asm/page.h> +#include <asm/p2m.h> +#include "mm-locks.h" + +/* Rationale: + * + * The motivating scenario is one in which you have at least three CPUs + * operating on likely disjoint regions of the p2m: a paging utility, a sharing + * utility, and the domU vcpu. With yet another p2m-heavy utility (mem + * access?), and/or a migrate/remus utility, the number of CPUs operating + * on disjoint regions increases. Not to mention multi-vcpu domUs. + * + * Therefore, p2m concurrency control is achieved through a hierarchical + * tree of locks, to allow all these CPUs to work without bothering each other. + * (Without disallowing any other cases such as single-vcpu domU) + * + * Leafs in the tree of locks are represented by spinlocks. + * + * Inner nodes (or uppper levels), are represented by a spinlock and a count. + * The count indicates how many CPUs are locking a node beneath. + * + * A cpu holds a leaf by grabbing the spinlock, and not letting go of it. On its + * way to the leaf, for each inner node, it grabs the spinlock, increases the + * count, and releases the spinlock. + * + * Leaf levels are recursive, the same CPU can lock them again. + * + * A cpu holds an inner node in exclusive mode by busy-waiting until the count + * is zero, grabbing the spinlock, and not letting go of it. + * + * Unlocks work by releasing the current spinlock, and working your way up: + * grab spinlock, decrease count, release. + * + * No locker can be preempted. For that reason, there are no atomic promotions: + * you would end up with promoters deadlocking on their way up the tree. + * + * Today, there are effectively two levels: the global lock (an inner node), and + * 2M locks, leaf locks for contiguous, aligned, 2M extents (akin to superpages). + * + * The global level can be held exclusively for big hammer operations such as + * log dirty (re)set. + * + * For non-global locking, the global lock is grabbed non-exclusively. At each + * 1G boundary we allocate, if we hadn''t before, the corresponding set of 512 + * 2M locks. Allocation of 2M locks is itself protected by a regular + * spinlock (this is rare enough). Allocation functions on-demand because + * we can''t really know a priori the "total" size of the p2m. + * + * It is expected that every query or modification to the p2m will lock the + * appropriate range. Leafs are recurisve for this reason: commonly you query a + * range and then you modify it. + * + * Conversely, all callers of queries and modifications, once done, need to undo + * their locking. + * + * Because we mimic the page table structure of a 512-radix tree, we run into + * space considerations with the spinlocks in this tree. So we need to be careful + * about space. 
+ * + * For 32bit code, we currently bail out and default to one big lock. Sorry Atom :( + * + * Also note that the p2m tree of locks is included in the ordering constraints + * enforced by mm-locks.h. It is treated as an "external" lock in that code. + * + */ + +#define P2M_ORDER_GLOBAL ~0U + +/* The 32 bit case serves as a concise summary of the external API */ +#ifndef __x86_64__ +/* For 32 bits we default to one big lock */ +typedef struct __p2m_lock { + mm_lock_t lock; +} p2m_lock_t; + +static inline int p2m_lock_init(struct p2m_domain *p2m) +{ + p2m_lock_t *p2ml = xmalloc(p2m_lock_t); + if ( !p2ml ) + return -ENOMEM; + mm_lock_init(&p2ml->lock); + p2m->lock = p2ml; + return 0; +} + +static inline void get_p2m(struct p2m_domain *p2m, unsigned long gfn, unsigned int order) +{ + _p2m_lock(p2m->lock); +} + +static inline void put_p2m(struct p2m_domain *p2m, unsigned long gfn, unsigned int order) +{ + _p2m_unlock(p2m->lock); +} + +static inline void p2m_lock_destroy(struct p2m_domain *p2m) +{ + xfree(p2m->lock); + p2m->lock = NULL; +} + +/* Backwards compatiblity */ +#define p2m_lock(p) _p2m_lock((p)->lock) +#define p2m_lock_recursive(p) _p2m_lock_recursive((p)->lock) +#define p2m_locked_by_me(p) _p2m_locked_by_me((p)->lock) +#define p2m_unlock(p) _p2m_unlock((p)->lock) + +#else /* __x86_64__ */ +/* If we were to have inner locks (say 1G locks, then the space considerations + * outlined below for leaf locks would also apply here. */ +typedef struct p2m_inner_lock { + spinlock_t lock; + atomic_t count; +} p2m_inner_lock_t; + +static inline void init_p2m_inner_lock(p2m_inner_lock_t *inner) +{ + spin_lock_init(&inner->lock); + _atomic_set(inner->count, 0); +} + +/* We cannot risk reusing the code in common/spinlock.c, because it may + * have been compiled with LOCK_DEBUG or LOCK_PROFILE. This is unfortunate. */ +static inline void lock_p2m_inner(p2m_inner_lock_t *inner) +{ + spin_lock(&inner->lock); +} + +static inline void unlock_p2m_inner(p2m_inner_lock_t *inner) +{ + spin_unlock(&inner->lock); +} + +static inline void get_p2m_inner(p2m_inner_lock_t *inner) +{ + lock_p2m_inner(inner); + atomic_inc(&inner->count); + unlock_p2m_inner(inner); +} + +static inline void put_p2m_inner(p2m_inner_lock_t *inner) +{ + lock_p2m_inner(inner); + atomic_dec(&inner->count); + unlock_p2m_inner(inner); +} + +/* XXX Consider starvation here */ +static inline void get_p2m_inner_exclusive(p2m_inner_lock_t *inner) +{ + int count; +retry: + while (1) + { + mb(); + count = atomic_read(&inner->count); + if ( count == 0 ) + break; + cpu_relax(); + } + + spin_lock(&inner->lock); + mb(); + count = atomic_read(&inner->count); + if ( count ) + { + spin_unlock(&inner->lock); + goto retry; + } + /* We leave holding the spinlock */ +} + +static inline void put_p2m_inner_exclusive(p2m_inner_lock_t *inner) +{ + spin_unlock(&inner->lock); +} + +/* Because we operate under page-table sizing constraints, we need to be + * extremely conscious about the space we''re taking up. So we become somewhat + * re-inventers of the wheel, and we disable many things. 
*/ +typedef struct p2m_leaf_lock { + raw_spinlock_t raw; + u16 recurse_cpu:12; + u16 recurse_cnt:4; +/* Padding to confine each inner lock to its own word */ +#define LEAF_PAD 4 + uint8_t pad[LEAF_PAD]; +} __attribute__((packed)) p2m_leaf_lock_t; + +/* BUILD_BUG_ON(sizeof(p2m_leaf_lock_t) != sizeof(unsigned long)); */ + +static inline void init_p2m_leaf_lock(p2m_leaf_lock_t *lock) +{ + *lock = (p2m_leaf_lock_t) { _RAW_SPIN_LOCK_UNLOCKED, 0xfffu, 0, { } }; +} + +static inline int __p2m_spin_trylock_recursive(p2m_leaf_lock_t *lock) +{ + int cpu = smp_processor_id(); + + if ( likely(lock->recurse_cpu != cpu) ) + { + if ( !_raw_spin_trylock(&lock->raw) ) + return 0; + preempt_disable(); + lock->recurse_cpu = cpu; + } + + lock->recurse_cnt++; + return 1; +} + +static inline void lock_p2m_leaf(p2m_leaf_lock_t *lock) +{ + while ( !__p2m_spin_trylock_recursive(lock) ) + cpu_relax(); +} + +static inline void unlock_p2m_leaf(p2m_leaf_lock_t *lock) +{ + if ( likely(--lock->recurse_cnt == 0) ) + { + lock->recurse_cpu = 0xfffu; + preempt_enable(); + _raw_spin_unlock(&lock->raw); + } +} + +/* Deadlock book-keeping, see below */ +#define MAX_LOCK_DEPTH 16 + +/* The lock structure */ +typedef struct __p2m_lock +{ + /* To enforce ordering in mm-locks */ + int unlock_level; + /* To protect on-demand allocation of locks + * (yeah you heard that right) */ + spinlock_t alloc_lock; + /* Global lock */ + p2m_inner_lock_t global; + /* 2M locks. Allocate on demand: fun */ + p2m_leaf_lock_t **locks_2m; + /* Book-keeping for deadlock detection. Could be a per-cpu. */ + unsigned long deadlock_guard[NR_CPUS][MAX_LOCK_DEPTH + 1]; + uint8_t lock_depth[NR_CPUS]; + /* Is anybody holding this exclusively */ + unsigned int exclusive_holder; + /* Order of pages allocates for first level of locks_2m */ + uint8_t order; +} p2m_lock_t; + +#define EXCLUSIVE_CPU_NULL ~0U + +/* Some deadlock book-keeping. Say CPU A holds a lock on range A, CPU B holds a + * lock on range B. Now, CPU A wants to lock range B and vice-versa. Deadlock. + * We detect this by remembering the start of the current locked range. + * We keep a fairly small stack of guards (8), because we don''t anticipate + * a great deal of recursive locking because (a) recursive locking is rare + * (b) it is evil (c) only PoD seems to do it (is PoD therefore evil?) */ + +#define DEADLOCK_NULL ~0UL + +#define CURRENT_GUARD(l) ((l)->deadlock_guard[current->processor] \ + [(l)->lock_depth[current->processor]]) + +#define DEADLOCK_CHECK(cond, action, _f, _a...) 
\ +do { \ + if ( (cond) ) \ + { \ + printk(_f, ##_a); \ + action; \ + } \ +} while(0) + +static inline void push_guard(p2m_lock_t *p2ml, unsigned long gfn) +{ + int cpu = current->processor; + + DEADLOCK_CHECK(((p2ml->lock_depth[cpu] + 1) > MAX_LOCK_DEPTH), + BUG(), "CPU %u exceeded deadlock depth\n", cpu); + + p2ml->lock_depth[cpu]++; + p2ml->deadlock_guard[cpu][p2ml->lock_depth[cpu]] = gfn; +} + +static inline void pop_guard(p2m_lock_t *p2ml) +{ + int cpu = current->processor; + + DEADLOCK_CHECK((!p2ml->lock_depth[cpu] == 0), BUG(), + "CPU %u underflow deadlock depth\n", cpu); + + p2ml->lock_depth[cpu]--; +} + +static inline int p2m_lock_init(struct p2m_domain *p2m) +{ + unsigned int i; + p2m_lock_t *p2ml; + + p2ml = xmalloc(p2m_lock_t); + if ( !p2ml ) + return -ENOMEM; + + memset(p2ml, 0, sizeof(p2m_lock_t)); + + spin_lock_init(&p2ml->alloc_lock); + init_p2m_inner_lock(&p2ml->global); + + p2ml->locks_2m = alloc_xenheap_page(); + if ( !p2ml->locks_2m ) + { + xfree(p2ml); + return -ENOMEM; + } + memset(p2ml->locks_2m, 0, PAGE_SIZE); + + for (i = 0; i < NR_CPUS; i++) + p2ml->deadlock_guard[i][0] = DEADLOCK_NULL; + + p2ml->exclusive_holder = EXCLUSIVE_CPU_NULL; + + p2m->lock = p2ml; + return 0; +} + +/* Conversion macros for aligned boundaries */ +#define gfn_to_superpage(g, o) (((g) & (~((1 << (o)) - 1))) >> (o)) +#define gfn_to_1g_sp(gfn) gfn_to_superpage(gfn, PAGE_ORDER_1G) +#define gfn_to_2m_sp(gfn) gfn_to_superpage(gfn, PAGE_ORDER_2M) +#define gfn_1g_to_2m(gfn_1g) ((gfn_1g) << (PAGE_ORDER_1G - PAGE_ORDER_2M)) +#define gfn_1g_to_last_2m(gfn_1g) (gfn_1g_to_2m(gfn_1g) + \ + ((1 << (PAGE_ORDER_1G - PAGE_ORDER_2M)) - 1)) +#define gfn_1g_to_4k(gfn_1g) ((gfn_1g) << PAGE_ORDER_1G) +#define gfn_1g_to_last_4k(gfn_1g) (gfn_1g_to_4k(gfn_1g) + ((1 << PAGE_ORDER_1G) - 1)) + +/* Global lock accessors. Global lock is our only "inner" node. 
*/ +#define p2m_exclusive_locked_by_me(l) \ + ((l)->lock->exclusive_holder == current->processor) + +static inline void get_p2m_global_exclusive(struct p2m_domain *p2m) +{ + p2m_lock_t *p2ml = p2m->lock; + DEADLOCK_CHECK((CURRENT_GUARD(p2ml) != DEADLOCK_NULL), BUG(), + "P2M DEADLOCK: cpu %u prev range start %lx trying global\n", + (unsigned) current->processor, CURRENT_GUARD(p2ml)); + + get_p2m_inner_exclusive(&p2ml->global); + p2ml->exclusive_holder = current->processor; +} + +static inline void put_p2m_global_exclusive(struct p2m_domain *p2m) +{ + p2m_lock_t *p2ml = p2m->lock; + p2ml->exclusive_holder = EXCLUSIVE_CPU_NULL; + put_p2m_inner_exclusive(&p2ml->global); +} + +/* Not to be confused with shortcut for external use */ +static inline void __get_p2m_global(struct p2m_domain *p2m) +{ + get_p2m_inner(&p2m->lock->global); +} + +/* Not to be confused with shortcut for external use */ +static inline void __put_p2m_global(struct p2m_domain *p2m) +{ + put_p2m_inner(&p2m->lock->global); +} + +/* 2M lock accessors */ +static inline p2m_leaf_lock_t *__get_2m_lock(p2m_lock_t *p2ml, + unsigned long gfn_1g, unsigned long gfn_2m) +{ + p2m_leaf_lock_t *lock_2m_l1; + BUG_ON(gfn_1g >= (1 << PAGETABLE_ORDER)); + BUG_ON(gfn_2m >= (1 << PAGETABLE_ORDER)); + lock_2m_l1 = p2ml->locks_2m[gfn_1g]; + BUG_ON(lock_2m_l1 == NULL); + return (lock_2m_l1 + gfn_2m); +} + +static inline void get_p2m_2m(struct p2m_domain *p2m, unsigned long gfn_1g, + unsigned long gfn_2m) +{ + lock_p2m_leaf(__get_2m_lock(p2m->lock, gfn_1g, gfn_2m)); +} + +static inline void put_p2m_2m(struct p2m_domain *p2m, unsigned long gfn_1g, + unsigned long gfn_2m) +{ + unlock_p2m_leaf(__get_2m_lock(p2m->lock, gfn_1g, gfn_2m)); +} + +/* Allocate 2M locks we may not have allocated yet for this 1G superpage */ +static inline int alloc_locks_2m(struct p2m_domain *p2m, unsigned long gfn_1g) +{ + p2m_lock_t *p2ml = p2m->lock; + + /* With a single page for l1, we cover a gfn space of 512GB (39 bits) + * Given that current x86_64 processors physically address 40 bits, + * we''re in no immediate danger of overflowing this table for a domU. + * If necessary, the l1 itself can grow subject to proper locking + * on the p2ml->alloc_lock */ + + /* Quick test for common case */ + if ( likely(p2ml->locks_2m[gfn_1g] != NULL) ) + return 0; + + spin_lock(&(p2ml->alloc_lock)); + + if ( likely(p2ml->locks_2m[gfn_1g] == NULL) ) + { + unsigned long j; + p2m_leaf_lock_t *p = alloc_xenheap_page(); + if ( !p ) + { + spin_unlock(&(p2ml->alloc_lock)); + return -ENOMEM; + } + + for (j = 0; j < (1 << PAGETABLE_ORDER); j++) + init_p2m_leaf_lock(&p[j]); + + p2ml->locks_2m[gfn_1g] = p; + } + + spin_unlock(&(p2ml->alloc_lock)); + return 0; +} + +static inline unsigned long __get_last_gfn(unsigned long gfn, unsigned int order) +{ + /* Underflow */ + unsigned long last_gfn = gfn + (1 << order) - 1; + BUG_ON(last_gfn < gfn); + return last_gfn; +} + +static inline void get_p2m(struct p2m_domain *p2m, unsigned long gfn, unsigned int order) +{ + unsigned long last_gfn, first_1g, last_1g, first_2m, last_2m, i, j; + p2m_lock_t *p2ml = p2m->lock; + + /* Holders of the p2m in exclusive mode can lock sub ranges. We make that a no-op. + * however, locking exclusively again is considered rude and tasteless. 
*/ + if ( (p2m_exclusive_locked_by_me(p2m)) && (order != P2M_ORDER_GLOBAL) ) + return; + + DEADLOCK_CHECK(((CURRENT_GUARD(p2ml) != DEADLOCK_NULL) && + (CURRENT_GUARD(p2ml) > gfn)), WARN(), + "P2M DEADLOCK: cpu %d prev range start %lx new range start %lx", + current->processor, CURRENT_GUARD(p2ml), gfn); + + preempt_disable(); + + if ( order == P2M_ORDER_GLOBAL ) { + get_p2m_global_exclusive(p2m); + goto get_p2m_out; + } + + __get_p2m_global(p2m); + /* We''re non-preemptible. We''ve disallowed global p2m locking. We + * will now (allocate and) lock all relevant 2M leafs */ + + last_gfn = __get_last_gfn(gfn, order); + first_1g = gfn_to_1g_sp(gfn); + last_1g = gfn_to_1g_sp(last_gfn); + + for (i = first_1g; i <= last_1g; i++) + { + first_2m = (gfn_1g_to_4k(i) > gfn) ? gfn_1g_to_2m(i) : gfn_to_2m_sp(gfn); + last_2m = min(gfn_to_2m_sp(last_gfn), gfn_1g_to_last_2m(i)); + + if ( alloc_locks_2m(p2m, i) ) + { + /* There really isn''t much we can do at this point */ + panic("Fine-grained p2m locking failed to alloc 2M locks" + " for 1G page %lx, domain %hu\n", i, p2m->domain->domain_id); + } + + for (j = first_2m; j <= last_2m; j++) + { + get_p2m_2m(p2m, i, j & ((1 << PAGETABLE_ORDER) - 1)); + } + } + +get_p2m_out: + push_guard(p2ml, gfn); +} + +/* Conversely to the get method, we unlock all leafs pro-actively here */ +static inline void put_p2m(struct p2m_domain *p2m, unsigned long gfn, unsigned int order) +{ + unsigned long last_gfn, first_1g, last_1g, first_2m, last_2m, i, j; + p2m_lock_t *p2ml = p2m->lock; + + last_gfn = __get_last_gfn(gfn, order); + + /* See comment about exclusive holders recursively locking sub-ranges in get_p2m */ + if ( (p2m_exclusive_locked_by_me(p2m)) && (order != P2M_ORDER_GLOBAL) ) + return; + + if ( order == P2M_ORDER_GLOBAL ) + { + put_p2m_global_exclusive(p2m); + goto cleanup; + } + + first_1g = gfn_to_1g_sp(gfn); + last_1g = gfn_to_1g_sp(last_gfn); + + for (i = first_1g; i <= last_1g; i++) + { + first_2m = (gfn_1g_to_4k(i) > gfn) ? 
gfn_1g_to_2m(i) : gfn_to_2m_sp(gfn); + last_2m = min(gfn_to_2m_sp(last_gfn), gfn_1g_to_last_2m(i)); + + for (j = first_2m; j <= last_2m; j++) + { + put_p2m_2m(p2m, i, j & ((1 << PAGETABLE_ORDER) - 1)); + } + } + + __put_p2m_global(p2m); + +cleanup: + pop_guard(p2ml); + preempt_enable(); +} + +static inline void p2m_lock_destroy(struct p2m_domain *p2m) +{ + unsigned int i; + p2m_lock_t *p2ml = p2m->lock; + + get_p2m_global_exclusive(p2m); + + for (i = 0; i < (1 << PAGETABLE_ORDER); i++) + if ( p2ml->locks_2m[i] ) + free_xenheap_page(p2ml->locks_2m[i]); + + free_xenheap_page(p2ml->locks_2m); + + put_p2m_global_exclusive(p2m); + + xfree(p2ml); + p2m->lock = NULL; +} + +/* Backwards compatibility */ +#define p2m_lock(p) get_p2m((p), 0, P2M_ORDER_GLOBAL) +#define p2m_unlock(p) put_p2m((p), 0, P2M_ORDER_GLOBAL) +#define p2m_locked_by_me(p) p2m_exclusive_locked_by_me((p)) +/* There is no backwards compatibility for this, unless we make the + * global lock recursive */ +#define p2m_lock_recursive(p) ((void)0) + +#endif /* __x86_64__ */ + +/* Commonly-used shortcus */ +#define get_p2m_global(p2m) get_p2m((p2m), 0, P2M_ORDER_GLOBAL) +#define put_p2m_global(p2m) put_p2m((p2m), 0, P2M_ORDER_GLOBAL) + +#define get_p2m_gfn(p2m, gfn) get_p2m((p2m), (gfn), 0) +#define put_p2m_gfn(p2m, gfn) put_p2m((p2m), (gfn), 0) + +#endif /* _XEN_P2M_LOCK_H */ + +/* + * Local variables: + * mode: C + * c-set-style: "BSD" + * c-basic-offset: 4 + * indent-tabs-mode: nil + * End: + */ diff -r 981073d78f7f -r a23e1262b124 xen/arch/x86/mm/p2m-pod.c --- a/xen/arch/x86/mm/p2m-pod.c +++ b/xen/arch/x86/mm/p2m-pod.c @@ -34,6 +34,7 @@ #include <asm/hvm/svm/amd-iommu-proto.h> #include "mm-locks.h" +#include "p2m-lock.h" /* Override macros from asm/page.h to make them work with mfn_t */ #undef mfn_to_page diff -r 981073d78f7f -r a23e1262b124 xen/arch/x86/mm/p2m-pt.c --- a/xen/arch/x86/mm/p2m-pt.c +++ b/xen/arch/x86/mm/p2m-pt.c @@ -39,6 +39,7 @@ #include <asm/hvm/svm/amd-iommu-proto.h> #include "mm-locks.h" +#include "p2m-lock.h" /* Override macros from asm/page.h to make them work with mfn_t */ #undef mfn_to_page diff -r 981073d78f7f -r a23e1262b124 xen/arch/x86/mm/p2m.c --- a/xen/arch/x86/mm/p2m.c +++ b/xen/arch/x86/mm/p2m.c @@ -38,6 +38,7 @@ #include <asm/hvm/svm/amd-iommu-proto.h> #include "mm-locks.h" +#include "p2m-lock.h" /* turn on/off 1GB host page table support for hap, default on */ static bool_t __read_mostly opt_hap_1gb = 1; @@ -69,9 +70,12 @@ boolean_param("hap_2mb", opt_hap_2mb); /* Init the datastructures for later use by the p2m code */ -static void p2m_initialise(struct domain *d, struct p2m_domain *p2m) +static int p2m_initialise(struct domain *d, struct p2m_domain *p2m) { - mm_lock_init(&p2m->lock); + if (p2m_lock_init(p2m)) + { + return -ENOMEM; + } mm_lock_init(&p2m->pod.lock); INIT_LIST_HEAD(&p2m->np2m_list); INIT_PAGE_LIST_HEAD(&p2m->pages); @@ -89,7 +93,7 @@ static void p2m_initialise(struct domain else p2m_pt_init(p2m); - return; + return 0; } static int @@ -103,7 +107,11 @@ p2m_init_nestedp2m(struct domain *d) d->arch.nested_p2m[i] = p2m = xzalloc(struct p2m_domain); if (p2m == NULL) return -ENOMEM; - p2m_initialise(d, p2m); + if (p2m_initialise(d, p2m)) + { + xfree(p2m); + return -ENOMEM; + } p2m->write_p2m_entry = nestedp2m_write_p2m_entry; list_add(&p2m->np2m_list, &p2m_get_hostp2m(d)->np2m_list); } @@ -118,7 +126,11 @@ int p2m_init(struct domain *d) p2m_get_hostp2m(d) = p2m = xzalloc(struct p2m_domain); if ( p2m == NULL ) return -ENOMEM; - p2m_initialise(d, p2m); + if (p2m_initialise(d, p2m)) + { + 
xfree(p2m); + return -ENOMEM; + } /* Must initialise nestedp2m unconditionally * since nestedhvm_enabled(d) returns false here. @@ -331,6 +343,7 @@ static void p2m_teardown_nestedp2m(struc uint8_t i; for (i = 0; i < MAX_NESTEDP2M; i++) { + p2m_lock_destroy(d->arch.nested_p2m[i]); xfree(d->arch.nested_p2m[i]); d->arch.nested_p2m[i] = NULL; } @@ -338,6 +351,7 @@ static void p2m_teardown_nestedp2m(struc void p2m_final_teardown(struct domain *d) { + p2m_lock_destroy(d->arch.p2m); /* Iterate over all p2m tables per domain */ xfree(d->arch.p2m); d->arch.p2m = NULL; diff -r 981073d78f7f -r a23e1262b124 xen/include/asm-x86/p2m.h --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -187,9 +187,10 @@ typedef enum { #define p2m_is_broken(_t) (p2m_to_mask(_t) & P2M_BROKEN_TYPES) /* Per-p2m-table state */ +struct __p2m_lock; struct p2m_domain { /* Lock that protects updates to the p2m */ - mm_lock_t lock; + struct __p2m_lock *lock; /* Shadow translated domain: p2m mapping */ pagetable_t phys_table; _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
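[Editorial note] The range-to-leaf-lock arithmetic in get_p2m()/put_p2m() is compact but dense. The following standalone C model (not Xen code) reproduces the index computation with the same macros and prints which 2M leaf locks would be taken for a few example ranges; page orders follow Xen's convention (order 9 is 2MB, order 18 is 1GB).

    #include <stdio.h>

    #define PAGE_ORDER_2M     9
    #define PAGE_ORDER_1G     18
    #define PAGETABLE_ORDER   9

    #define gfn_to_superpage(g, o)  (((g) & ~((1UL << (o)) - 1)) >> (o))
    #define gfn_to_1g_sp(g)         gfn_to_superpage(g, PAGE_ORDER_1G)
    #define gfn_to_2m_sp(g)         gfn_to_superpage(g, PAGE_ORDER_2M)
    #define gfn_1g_to_2m(g1)        ((g1) << (PAGE_ORDER_1G - PAGE_ORDER_2M))
    #define gfn_1g_to_last_2m(g1)   (gfn_1g_to_2m(g1) + \
                                     ((1UL << (PAGE_ORDER_1G - PAGE_ORDER_2M)) - 1))
    #define gfn_1g_to_4k(g1)        ((g1) << PAGE_ORDER_1G)

    static void show_locked_leafs(unsigned long gfn, unsigned int order)
    {
        unsigned long last_gfn = gfn + (1UL << order) - 1;
        unsigned long i, j;

        printf("gfn %#lx order %u -> gfns %#lx..%#lx\n", gfn, order, gfn, last_gfn);
        for (i = gfn_to_1g_sp(gfn); i <= gfn_to_1g_sp(last_gfn); i++) {
            unsigned long first_2m = (gfn_1g_to_4k(i) > gfn) ? gfn_1g_to_2m(i)
                                                             : gfn_to_2m_sp(gfn);
            unsigned long last_2m = gfn_to_2m_sp(last_gfn);
            if (last_2m > gfn_1g_to_last_2m(i))
                last_2m = gfn_1g_to_last_2m(i);
            for (j = first_2m; j <= last_2m; j++)
                printf("  lock locks_2m[%lu][%lu]\n",
                       i, j & ((1UL << PAGETABLE_ORDER) - 1));
        }
    }

    int main(void)
    {
        show_locked_leafs(0x3ff, 0);     /* single 4K gfn: one 2M leaf         */
        show_locked_leafs(0x7fe00, 9);   /* one aligned 2MB extent             */
        show_locked_leafs(0x3fe00, 10);  /* 4MB range crossing a 1GB boundary  */
        return 0;
    }

The third call shows a range that crosses a 1GB boundary, which is why get_p2m() iterates over 1GB indices and may allocate a second page of leaf locks on demand.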
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 6 of 9] Protect superpage splitting in implementation-dependent traversals
xen/arch/x86/mm/p2m-ept.c | 15 +++++++- xen/arch/x86/mm/p2m-lock.h | 11 +++++- xen/arch/x86/mm/p2m-pt.c | 82 ++++++++++++++++++++++++++++++++++----------- 3 files changed, 84 insertions(+), 24 deletions(-) In both pt and ept modes, the p2m trees can map 1GB superpages. Because our locks work on a 2MB superpage basis, without proper locking we could have two simultaneous traversals to two different 2MB ranges split the 1GB superpage in a racy manner. Fix this with the existing alloc_lock in the superpage structure. We allow 1GB-grained locking for a future implementation -- we just default to a global lock in all cases currently. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r a23e1262b124 -r 0a97d62c2d41 xen/arch/x86/mm/p2m-ept.c --- a/xen/arch/x86/mm/p2m-ept.c +++ b/xen/arch/x86/mm/p2m-ept.c @@ -163,6 +163,7 @@ static int ept_set_middle_entry(struct p } /* free ept sub tree behind an entry */ +/* Lock on this superpage (if any) held on entry */ static void ept_free_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry, int level) { /* End if the entry is a leaf entry. */ @@ -181,6 +182,7 @@ static void ept_free_entry(struct p2m_do p2m_free_ptp(p2m, mfn_to_page(ept_entry->mfn)); } +/* Lock on this superpage held on entry */ static int ept_split_super_page(struct p2m_domain *p2m, ept_entry_t *ept_entry, int level, int target) { @@ -315,6 +317,7 @@ ept_set_entry(struct p2m_domain *p2m, un int needs_sync = 1; struct domain *d = p2m->domain; ept_entry_t old_entry = { .epte = 0 }; + p2m_lock_t *p2ml = p2m->lock; /* * the caller must make sure: @@ -361,6 +364,8 @@ ept_set_entry(struct p2m_domain *p2m, un * with a leaf entry (a 1GiB or 2MiB page), and handle things appropriately. */ + if ( target == 2 ) + lock_p2m_1G(p2ml, index); if ( i == target ) { /* We reached the target level. */ @@ -373,7 +378,7 @@ ept_set_entry(struct p2m_domain *p2m, un /* If we''re replacing a non-leaf entry with a leaf entry (1GiB or 2MiB), * the intermediate tables will be freed below after the ept flush * - * Read-then-write is OK because we hold the p2m lock. */ + * Read-then-write is OK because we hold the 1G or 2M lock. 
*/ old_entry = *ept_entry; if ( mfn_valid(mfn_x(mfn)) || direct_mmio || p2m_is_paged(p2mt) || @@ -412,6 +417,8 @@ ept_set_entry(struct p2m_domain *p2m, un if ( !ept_split_super_page(p2m, &split_ept_entry, i, target) ) { ept_free_entry(p2m, &split_ept_entry, i); + if ( target == 2 ) + unlock_p2m_1G(p2ml, index); goto out; } @@ -440,7 +447,7 @@ ept_set_entry(struct p2m_domain *p2m, un /* the caller should take care of the previous page */ new_entry.mfn = mfn_x(mfn); - /* Safe to read-then-write because we hold the p2m lock */ + /* Safe to read-then-write because we hold the 1G or 2M lock */ if ( ept_entry->mfn == new_entry.mfn ) need_modify_vtd_table = 0; @@ -448,6 +455,8 @@ ept_set_entry(struct p2m_domain *p2m, un atomic_write_ept_entry(ept_entry, new_entry); } + if ( target == 2 ) + unlock_p2m_1G(p2ml, index); /* Track the highest gfn for which we have ever had a valid mapping */ if ( mfn_valid(mfn_x(mfn)) && @@ -642,6 +651,8 @@ static ept_entry_t ept_get_entry_content return content; } +/* This is called before crashing a domain, so we''re not particularly + * concerned with locking */ void ept_walk_table(struct domain *d, unsigned long gfn) { struct p2m_domain *p2m = p2m_get_hostp2m(d); diff -r a23e1262b124 -r 0a97d62c2d41 xen/arch/x86/mm/p2m-lock.h --- a/xen/arch/x86/mm/p2m-lock.h +++ b/xen/arch/x86/mm/p2m-lock.h @@ -109,6 +109,11 @@ typedef struct __p2m_lock { mm_lock_t lock; } p2m_lock_t; +/* We do not need sub-locking on 1G superpages because we have a + * global lock */ +#define lock_p2m_1G(l, gfn) ((void)l) +#define unlock_p2m_1G(l, gfn) ((void)l) + static inline int p2m_lock_init(struct p2m_domain *p2m) { p2m_lock_t *p2ml = xmalloc(p2m_lock_t); @@ -271,7 +276,8 @@ typedef struct __p2m_lock /* To enforce ordering in mm-locks */ int unlock_level; /* To protect on-demand allocation of locks - * (yeah you heard that right) */ + * (yeah you heard that right) + * Also protects 1GB superpage splitting. 
*/ spinlock_t alloc_lock; /* Global lock */ p2m_inner_lock_t global; @@ -429,6 +435,9 @@ static inline void put_p2m_2m(struct p2m unlock_p2m_leaf(__get_2m_lock(p2m->lock, gfn_1g, gfn_2m)); } +#define lock_p2m_1G(l, gfn) spin_lock(&l->alloc_lock) +#define unlock_p2m_1G(l, gfn) spin_unlock(&l->alloc_lock) + /* Allocate 2M locks we may not have allocated yet for this 1G superpage */ static inline int alloc_locks_2m(struct p2m_domain *p2m, unsigned long gfn_1g) { diff -r a23e1262b124 -r 0a97d62c2d41 xen/arch/x86/mm/p2m-pt.c --- a/xen/arch/x86/mm/p2m-pt.c +++ b/xen/arch/x86/mm/p2m-pt.c @@ -159,6 +159,7 @@ p2m_next_level(struct p2m_domain *p2m, m unsigned long *gfn_remainder, unsigned long gfn, u32 shift, u32 max, unsigned long type) { + p2m_lock_t *p2ml = p2m->lock; l1_pgentry_t *l1_entry; l1_pgentry_t *p2m_entry; l1_pgentry_t new_entry; @@ -207,6 +208,7 @@ p2m_next_level(struct p2m_domain *p2m, m ASSERT(l1e_get_flags(*p2m_entry) & (_PAGE_PRESENT|_PAGE_PSE)); /* split 1GB pages into 2MB pages */ + lock_p2m_1G(p2ml, *gfn_remainder >> shift); if ( type == PGT_l2_page_table && (l1e_get_flags(*p2m_entry) & _PAGE_PSE) ) { unsigned long flags, pfn; @@ -214,7 +216,10 @@ p2m_next_level(struct p2m_domain *p2m, m pg = p2m_alloc_ptp(p2m, PGT_l2_page_table); if ( pg == NULL ) + { + unlock_p2m_1G(p2ml, *gfn_remainder >> shift); return 0; + } flags = l1e_get_flags(*p2m_entry); pfn = l1e_get_pfn(*p2m_entry); @@ -233,9 +238,11 @@ p2m_next_level(struct p2m_domain *p2m, m p2m_add_iommu_flags(&new_entry, 2, IOMMUF_readable|IOMMUF_writable); p2m->write_p2m_entry(p2m, gfn, p2m_entry, *table_mfn, new_entry, 3); } - + unlock_p2m_1G(p2ml, *gfn_remainder >> shift); /* split single 2MB large page into 4KB page in P2M table */ + /* This does not necessitate locking because 2MB regions are locked + * exclusively */ if ( type == PGT_l1_page_table && (l1e_get_flags(*p2m_entry) & _PAGE_PSE) ) { unsigned long flags, pfn; @@ -297,6 +304,7 @@ p2m_set_entry(struct p2m_domain *p2m, un IOMMUF_readable|IOMMUF_writable: 0; unsigned long old_mfn = 0; + p2m_lock_t *p2ml = p2m->lock; if ( tb_init_done ) { @@ -326,7 +334,10 @@ p2m_set_entry(struct p2m_domain *p2m, un */ if ( page_order == PAGE_ORDER_1G ) { - l1_pgentry_t old_entry = l1e_empty(); + l1_pgentry_t old_entry; + lock_p2m_1G(p2ml, l3_table_offset(gfn)); + + old_entry = l1e_empty(); p2m_entry = p2m_find_entry(table, &gfn_remainder, gfn, L3_PAGETABLE_SHIFT - PAGE_SHIFT, L3_PAGETABLE_ENTRIES); @@ -358,7 +369,9 @@ p2m_set_entry(struct p2m_domain *p2m, un /* Free old intermediate tables if necessary */ if ( l1e_get_flags(old_entry) & _PAGE_PRESENT ) p2m_free_entry(p2m, &old_entry, page_order); + unlock_p2m_1G(p2ml, l3_table_offset(gfn)); } + /* * When using PAE Xen, we only allow 33 bits of pseudo-physical * address in translated guests (i.e. 8 GBytes). 
This restriction @@ -515,6 +528,7 @@ static mfn_t p2m_gfn_to_mfn_current(stru * XXX Once we start explicitly registering MMIO regions in the p2m * XXX we will return p2m_invalid for unmapped gfns */ + p2m_lock_t *p2ml = p2m->lock; l1_pgentry_t l1e = l1e_empty(), *p2m_entry; l2_pgentry_t l2e = l2e_empty(); int ret; @@ -543,6 +557,8 @@ pod_retry_l3: /* The read has succeeded, so we know that mapping exists */ if ( q != p2m_query ) { + /* We do not need to lock the 1G superpage here because PoD + * will do it by splitting */ if ( !p2m_pod_demand_populate(p2m, gfn, PAGE_ORDER_1G, q) ) goto pod_retry_l3; p2mt = p2m_invalid; @@ -558,6 +574,7 @@ pod_retry_l3: goto pod_retry_l2; } + lock_p2m_1G(p2ml, l3_table_offset(addr)); if ( l3e_get_flags(l3e) & _PAGE_PSE ) { p2mt = p2m_flags_to_type(l3e_get_flags(l3e)); @@ -571,8 +588,12 @@ pod_retry_l3: if ( page_order ) *page_order = PAGE_ORDER_1G; + unlock_p2m_1G(p2ml, l3_table_offset(addr)); goto out; } + unlock_p2m_1G(p2ml, l3_table_offset(addr)); +#else + (void)p2ml; /* gcc ... */ #endif /* * Read & process L2 @@ -691,6 +712,7 @@ p2m_gfn_to_mfn(struct p2m_domain *p2m, u paddr_t addr = ((paddr_t)gfn) << PAGE_SHIFT; l2_pgentry_t *l2e; l1_pgentry_t *l1e; + p2m_lock_t *p2ml = p2m->lock; ASSERT(paging_mode_translate(p2m->domain)); @@ -744,6 +766,8 @@ pod_retry_l3: { if ( q != p2m_query ) { + /* See previous comments on why there is no need to lock + * 1GB superpage here */ if ( !p2m_pod_demand_populate(p2m, gfn, PAGE_ORDER_1G, q) ) goto pod_retry_l3; } @@ -755,16 +779,23 @@ pod_retry_l3: } else if ( (l3e_get_flags(*l3e) & _PAGE_PSE) ) { - mfn = _mfn(l3e_get_pfn(*l3e) + - l2_table_offset(addr) * L1_PAGETABLE_ENTRIES + - l1_table_offset(addr)); - *t = p2m_flags_to_type(l3e_get_flags(*l3e)); - unmap_domain_page(l3e); + lock_p2m_1G(p2ml, l3_table_offset(addr)); + /* Retry to be sure */ + if ( (l3e_get_flags(*l3e) & _PAGE_PSE) ) + { + mfn = _mfn(l3e_get_pfn(*l3e) + + l2_table_offset(addr) * L1_PAGETABLE_ENTRIES + + l1_table_offset(addr)); + *t = p2m_flags_to_type(l3e_get_flags(*l3e)); + unmap_domain_page(l3e); - ASSERT(mfn_valid(mfn) || !p2m_is_ram(*t)); - if ( page_order ) - *page_order = PAGE_ORDER_1G; - return (p2m_is_valid(*t)) ? mfn : _mfn(INVALID_MFN); + ASSERT(mfn_valid(mfn) || !p2m_is_ram(*t)); + if ( page_order ) + *page_order = PAGE_ORDER_1G; + unlock_p2m_1G(p2ml, l3_table_offset(addr)); + return (p2m_is_valid(*t)) ? 
mfn : _mfn(INVALID_MFN); + } + unlock_p2m_1G(p2ml, l3_table_offset(addr)); } mfn = _mfn(l3e_get_pfn(*l3e)); @@ -852,6 +883,7 @@ static void p2m_change_type_global(struc l4_pgentry_t *l4e; unsigned long i4; #endif /* CONFIG_PAGING_LEVELS == 4 */ + p2m_lock_t *p2ml = p2m->lock; BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt)); BUG_ON(ot != nt && (ot == p2m_mmio_direct || nt == p2m_mmio_direct)); @@ -891,17 +923,25 @@ static void p2m_change_type_global(struc } if ( (l3e_get_flags(l3e[i3]) & _PAGE_PSE) ) { - flags = l3e_get_flags(l3e[i3]); - if ( p2m_flags_to_type(flags) != ot ) + lock_p2m_1G(p2ml, i3); + if ( (l3e_get_flags(l3e[i3]) & _PAGE_PSE) ) + { + flags = l3e_get_flags(l3e[i3]); + if ( p2m_flags_to_type(flags) != ot ) + { + unlock_p2m_1G(p2ml, i3); + continue; + } + mfn = l3e_get_pfn(l3e[i3]); + gfn = get_gpfn_from_mfn(mfn); + flags = p2m_type_to_flags(nt, _mfn(mfn)); + l1e_content = l1e_from_pfn(mfn, flags | _PAGE_PSE); + p2m->write_p2m_entry(p2m, gfn, + (l1_pgentry_t *)&l3e[i3], + l3mfn, l1e_content, 3); + unlock_p2m_1G(p2ml, i3); continue; - mfn = l3e_get_pfn(l3e[i3]); - gfn = get_gpfn_from_mfn(mfn); - flags = p2m_type_to_flags(nt, _mfn(mfn)); - l1e_content = l1e_from_pfn(mfn, flags | _PAGE_PSE); - p2m->write_p2m_entry(p2m, gfn, - (l1_pgentry_t *)&l3e[i3], - l3mfn, l1e_content, 3); - continue; + } } l2mfn = _mfn(l3e_get_pfn(l3e[i3])); _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
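The recurring idiom in this patch is peek, lock, re-check: the L3 entry is read without the 1G lock, and _PAGE_PSE is tested again once the lock is held, because a traversal of a neighbouring 2MB range may have split the 1GB page in between. With the global p2m lock the lock_p2m_1G()/unlock_p2m_1G() pair compiles to a no-op; in the fine-grained variant it maps onto the alloc_lock spinlock. The retry added to p2m_gfn_to_mfn() above, restated as a stand-alone helper purely for clarity (the helper itself is not part of the patch):

/* Returns the mfn backing addr if the L3 entry is still a 1GB superpage
 * once the 1G lock is held, INVALID_MFN otherwise. */
static mfn_t example_1g_lookup(struct p2m_domain *p2m, l3_pgentry_t *l3e,
                               paddr_t addr)
{
    p2m_lock_t *p2ml = p2m->lock;
    mfn_t mfn = _mfn(INVALID_MFN);

    if ( !(l3e_get_flags(*l3e) & _PAGE_PSE) )   /* unlocked peek */
        return mfn;

    lock_p2m_1G(p2ml, l3_table_offset(addr));
    /* Re-check under the lock: a concurrent split clears _PAGE_PSE */
    if ( l3e_get_flags(*l3e) & _PAGE_PSE )
        mfn = _mfn(l3e_get_pfn(*l3e) +
                   l2_table_offset(addr) * L1_PAGETABLE_ENTRIES +
                   l1_table_offset(addr));
    unlock_p2m_1G(p2ml, l3_table_offset(addr));

    return mfn;
}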
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 7 of 9] Refactor p2m get_entry accessor
xen/arch/x86/mm/p2m.c | 38 ++++++++++++++++++++++++++++++++++++++ xen/include/asm-x86/p2m.h | 40 ++-------------------------------------- 2 files changed, 40 insertions(+), 38 deletions(-) Move the main query accessor to the p2m outside of an inline and into the p2m code itself. This will allow for p2m internal locking to be added to the accessor later. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 0a97d62c2d41 -r 8a98179666de xen/arch/x86/mm/p2m.c --- a/xen/arch/x86/mm/p2m.c +++ b/xen/arch/x86/mm/p2m.c @@ -148,6 +148,44 @@ void p2m_change_entry_type_global(struct p2m_unlock(p2m); } +mfn_t gfn_to_mfn_type_p2m(struct p2m_domain *p2m, unsigned long gfn, + p2m_type_t *t, p2m_access_t *a, p2m_query_t q, + unsigned int *page_order) +{ + mfn_t mfn; + + if ( !p2m || !paging_mode_translate(p2m->domain) ) + { + /* Not necessarily true, but for non-translated guests, we claim + * it''s the most generic kind of memory */ + *t = p2m_ram_rw; + return _mfn(gfn); + } + + mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order); + +#ifdef __x86_64__ + if ( q == p2m_unshare && p2m_is_shared(*t) ) + { + ASSERT(!p2m_is_nestedp2m(p2m)); + mem_sharing_unshare_page(p2m->domain, gfn, 0); + mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order); + } +#endif + +#ifdef __x86_64__ + if (unlikely((p2m_is_broken(*t)))) + { + /* Return invalid_mfn to avoid caller''s access */ + mfn = _mfn(INVALID_MFN); + if (q == p2m_guest) + domain_crash(p2m->domain); + } +#endif + + return mfn; +} + int set_p2m_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, unsigned int page_order, p2m_type_t p2mt, p2m_access_t p2ma) { diff -r 0a97d62c2d41 -r 8a98179666de xen/include/asm-x86/p2m.h --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -305,45 +305,9 @@ struct p2m_domain *p2m_get_p2m(struct vc * If the lookup succeeds, the return value is != INVALID_MFN and * *page_order is filled in with the order of the superpage (if any) that * the entry was found in. */ -static inline mfn_t -gfn_to_mfn_type_p2m(struct p2m_domain *p2m, unsigned long gfn, +mfn_t gfn_to_mfn_type_p2m(struct p2m_domain *p2m, unsigned long gfn, p2m_type_t *t, p2m_access_t *a, p2m_query_t q, - unsigned int *page_order) -{ - mfn_t mfn; - - if ( !p2m || !paging_mode_translate(p2m->domain) ) - { - /* Not necessarily true, but for non-translated guests, we claim - * it''s the most generic kind of memory */ - *t = p2m_ram_rw; - return _mfn(gfn); - } - - mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order); - -#ifdef __x86_64__ - if ( q == p2m_unshare && p2m_is_shared(*t) ) - { - ASSERT(!p2m_is_nestedp2m(p2m)); - mem_sharing_unshare_page(p2m->domain, gfn, 0); - mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order); - } -#endif - -#ifdef __x86_64__ - if (unlikely((p2m_is_broken(*t)))) - { - /* Return invalid_mfn to avoid caller''s access */ - mfn = _mfn(INVALID_MFN); - if (q == p2m_guest) - domain_crash(p2m->domain); - } -#endif - - return mfn; -} - + unsigned int *page_order); /* General conversion function from gfn to mfn */ static inline mfn_t gfn_to_mfn_type(struct domain *d, _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
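The move changes no semantics by itself; the accessor keeps its signature and the gfn_to_mfn_* wrappers in p2m.h keep calling it. A hedged example of a direct caller (the helper and its local names are invented for illustration):

/* Illustrative caller of the now out-of-line accessor. */
static int example_query(struct domain *d, unsigned long gfn)
{
    struct p2m_domain *p2m = p2m_get_hostp2m(d);
    p2m_type_t t;
    p2m_access_t a;
    unsigned int page_order;
    mfn_t mfn;

    mfn = gfn_to_mfn_type_p2m(p2m, gfn, &t, &a, p2m_query, &page_order);

    if ( !mfn_valid(mfn) || !p2m_is_ram(t) )
        return 0;

    /* ... act on the translation ... */
    return 1;
}

Once patch 8 adds locking inside the accessor, the same caller additionally holds the lock on gfn and a reference on the page, and must finish with drop_p2m_gfn(p2m, gfn, mfn_x(mfn)).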
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 8 of 9] Modify all internal p2m functions to use the new fine-grained locking
xen/arch/x86/mm/hap/hap.c | 2 +- xen/arch/x86/mm/hap/nested_hap.c | 21 ++- xen/arch/x86/mm/p2m-ept.c | 26 +---- xen/arch/x86/mm/p2m-pod.c | 42 +++++-- xen/arch/x86/mm/p2m-pt.c | 20 +--- xen/arch/x86/mm/p2m.c | 185 ++++++++++++++++++++++++-------------- xen/include/asm-ia64/mm.h | 5 + xen/include/asm-x86/p2m.h | 45 +++++++++- 8 files changed, 217 insertions(+), 129 deletions(-) This patch only modifies code internal to the p2m, adding convenience macros, etc. It will yield a compiling code base but an incorrect hypervisor (external callers of queries into the p2m will not unlock). Next patch takes care of external callers, split done for the benefit of conciseness. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/hap/hap.c --- a/xen/arch/x86/mm/hap/hap.c +++ b/xen/arch/x86/mm/hap/hap.c @@ -861,7 +861,7 @@ hap_write_p2m_entry(struct vcpu *v, unsi old_flags = l1e_get_flags(*p); if ( nestedhvm_enabled(d) && (old_flags & _PAGE_PRESENT) - && !p2m_get_hostp2m(d)->defer_nested_flush ) { + && !atomic_read(&(p2m_get_hostp2m(d)->defer_nested_flush)) ) { /* We are replacing a valid entry so we need to flush nested p2ms, * unless the only change is an increase in access rights. */ mfn_t omfn = _mfn(l1e_get_pfn(*p)); diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/hap/nested_hap.c --- a/xen/arch/x86/mm/hap/nested_hap.c +++ b/xen/arch/x86/mm/hap/nested_hap.c @@ -105,8 +105,6 @@ nestedhap_fix_p2m(struct vcpu *v, struct ASSERT(p2m); ASSERT(p2m->set_entry); - p2m_lock(p2m); - /* If this p2m table has been flushed or recycled under our feet, * leave it alone. We''ll pick up the right one as we try to * vmenter the guest. */ @@ -122,11 +120,13 @@ nestedhap_fix_p2m(struct vcpu *v, struct gfn = (L2_gpa >> PAGE_SHIFT) & mask; mfn = _mfn((L0_gpa >> PAGE_SHIFT) & mask); + /* Not bumping refcount of pages underneath because we''re getting + * rid of whatever was there */ + get_p2m(p2m, gfn, page_order); rv = set_p2m_entry(p2m, gfn, mfn, page_order, p2mt, p2ma); + put_p2m(p2m, gfn, page_order); } - p2m_unlock(p2m); - if (rv == 0) { gdprintk(XENLOG_ERR, "failed to set entry for 0x%"PRIx64" -> 0x%"PRIx64"\n", @@ -146,19 +146,26 @@ nestedhap_walk_L0_p2m(struct p2m_domain mfn_t mfn; p2m_type_t p2mt; p2m_access_t p2ma; + int rc; /* walk L0 P2M table */ mfn = gfn_to_mfn_type_p2m(p2m, L1_gpa >> PAGE_SHIFT, &p2mt, &p2ma, p2m_query, page_order); + rc = NESTEDHVM_PAGEFAULT_ERROR; if ( p2m_is_paging(p2mt) || p2m_is_shared(p2mt) || !p2m_is_ram(p2mt) ) - return NESTEDHVM_PAGEFAULT_ERROR; + goto out; + rc = NESTEDHVM_PAGEFAULT_ERROR; if ( !mfn_valid(mfn) ) - return NESTEDHVM_PAGEFAULT_ERROR; + goto out; *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK); - return NESTEDHVM_PAGEFAULT_DONE; + rc = NESTEDHVM_PAGEFAULT_DONE; + +out: + drop_p2m_gfn(p2m, L1_gpa >> PAGE_SHIFT, mfn_x(mfn)); + return rc; } /* This function uses L2_gpa to walk the P2M page table in L1. If the diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/p2m-ept.c --- a/xen/arch/x86/mm/p2m-ept.c +++ b/xen/arch/x86/mm/p2m-ept.c @@ -43,29 +43,16 @@ #define is_epte_present(ept_entry) ((ept_entry)->epte & 0x7) #define is_epte_superpage(ept_entry) ((ept_entry)->sp) -/* Non-ept "lock-and-check" wrapper */ +/* Ept-specific check wrapper */ static int ept_pod_check_and_populate(struct p2m_domain *p2m, unsigned long gfn, ept_entry_t *entry, int order, p2m_query_t q) { - int r; - - /* This is called from the p2m lookups, which can happen with or - * without the lock hed. 
*/ - p2m_lock_recursive(p2m); - /* Check to make sure this is still PoD */ if ( entry->sa_p2mt != p2m_populate_on_demand ) - { - p2m_unlock(p2m); return 0; - } - r = p2m_pod_demand_populate(p2m, gfn, order, q); - - p2m_unlock(p2m); - - return r; + return p2m_pod_demand_populate(p2m, gfn, order, q); } static void ept_p2m_type_to_flags(ept_entry_t *entry, p2m_type_t type, p2m_access_t access) @@ -265,9 +252,9 @@ static int ept_next_level(struct p2m_dom ept_entry = (*table) + index; - /* ept_next_level() is called (sometimes) without a lock. Read + /* ept_next_level() is called (never) without a lock. Read * the entry once, and act on the "cached" entry after that to - * avoid races. */ + * avoid races. AAA */ e = atomic_read_ept_entry(ept_entry); if ( !is_epte_present(&e) ) @@ -733,7 +720,8 @@ void ept_change_entry_emt_with_range(str int order = 0; struct p2m_domain *p2m = p2m_get_hostp2m(d); - p2m_lock(p2m); + /* This is a global operation, essentially */ + get_p2m_global(p2m); for ( gfn = start_gfn; gfn <= end_gfn; gfn++ ) { int level = 0; @@ -773,7 +761,7 @@ void ept_change_entry_emt_with_range(str ept_set_entry(p2m, gfn, mfn, order, e.sa_p2mt, e.access); } } - p2m_unlock(p2m); + put_p2m_global(p2m); } /* diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/p2m-pod.c --- a/xen/arch/x86/mm/p2m-pod.c +++ b/xen/arch/x86/mm/p2m-pod.c @@ -102,8 +102,6 @@ p2m_pod_cache_add(struct p2m_domain *p2m } #endif - ASSERT(p2m_locked_by_me(p2m)); - /* * Pages from domain_alloc and returned by the balloon driver aren''t * guaranteed to be zero; but by reclaiming zero pages, we implicitly @@ -536,7 +534,7 @@ p2m_pod_decrease_reservation(struct doma { p2m_type_t t; - gfn_to_mfn_query(d, gpfn + i, &t); + gfn_to_mfn_query_unlocked(d, gpfn + i, &t); if ( t == p2m_populate_on_demand ) pod++; @@ -602,6 +600,7 @@ p2m_pod_decrease_reservation(struct doma nonpod--; ram--; } + drop_p2m_gfn(p2m, gpfn + i, mfn_x(mfn)); } /* If there are no more non-PoD entries, tell decrease_reservation() that @@ -661,12 +660,15 @@ p2m_pod_zero_check_superpage(struct p2m_ for ( i=0; i<SUPERPAGE_PAGES; i++ ) { - mfn = gfn_to_mfn_query(d, gfn + i, &type); - if ( i == 0 ) { + /* Only lock the p2m entry the first time, that will lock + * server for the whole superpage */ + mfn = gfn_to_mfn_query(d, gfn + i, &type); mfn0 = mfn; type0 = type; + } else { + mfn = gfn_to_mfn_query_unlocked(d, gfn + i, &type); } /* Conditions that must be met for superpage-superpage: @@ -773,6 +775,10 @@ out: p2m_pod_cache_add(p2m, mfn_to_page(mfn0), PAGE_ORDER_2M); p2m->pod.entry_count += SUPERPAGE_PAGES; } + + /* We got p2m locks once for the whole superpage, with the original + * mfn0. We drop it here. */ + drop_p2m_gfn(p2m, gfn, mfn_x(mfn0)); } /* On entry, PoD lock is held */ @@ -894,6 +900,12 @@ p2m_pod_zero_check(struct p2m_domain *p2 p2m->pod.entry_count++; } } + + /* Drop all p2m locks and references */ + for ( i=0; i<count; i++ ) + { + drop_p2m_gfn(p2m, gfns[i], mfn_x(mfns[i])); + } } @@ -928,7 +940,9 @@ p2m_pod_emergency_sweep_super(struct p2m p2m->pod.reclaim_super = i ? i - SUPERPAGE_PAGES : 0; } -#define POD_SWEEP_STRIDE 16 +/* Note that spinlock recursion counters have 4 bits, so 16 or higher + * will overflow a single 2M spinlock in a zero check. 
*/ +#define POD_SWEEP_STRIDE 15 static void p2m_pod_emergency_sweep(struct p2m_domain *p2m) { @@ -946,7 +960,7 @@ p2m_pod_emergency_sweep(struct p2m_domai /* FIXME: Figure out how to avoid superpages */ for ( i=p2m->pod.reclaim_single; i > 0 ; i-- ) { - gfn_to_mfn_query(p2m->domain, i, &t ); + gfn_to_mfn_query_unlocked(p2m->domain, i, &t ); if ( p2m_is_ram(t) ) { gfns[j] = i; @@ -974,6 +988,7 @@ p2m_pod_emergency_sweep(struct p2m_domai } +/* The gfn and order need to be locked in the p2m before you walk in here */ int p2m_pod_demand_populate(struct p2m_domain *p2m, unsigned long gfn, unsigned int order, @@ -985,8 +1000,6 @@ p2m_pod_demand_populate(struct p2m_domai mfn_t mfn; int i; - ASSERT(p2m_locked_by_me(p2m)); - pod_lock(p2m); /* This check is done with the pod lock held. This will make sure that * even if d->is_dying changes under our feet, p2m_pod_empty_cache() @@ -1008,8 +1021,6 @@ p2m_pod_demand_populate(struct p2m_domai set_p2m_entry(p2m, gfn_aligned, _mfn(0), PAGE_ORDER_2M, p2m_populate_on_demand, p2m->default_access); audit_p2m(p2m, 1); - /* This is because the ept/pt caller locks the p2m recursively */ - p2m_unlock(p2m); return 0; } @@ -1132,7 +1143,9 @@ guest_physmap_mark_populate_on_demand(st if ( rc != 0 ) return rc; - p2m_lock(p2m); + /* Pre-lock all the p2m entries. We don''t take refs to the + * pages, because there shouldn''t be any pages underneath. */ + get_p2m(p2m, gfn, order); audit_p2m(p2m, 1); P2M_DEBUG("mark pod gfn=%#lx\n", gfn); @@ -1140,7 +1153,8 @@ guest_physmap_mark_populate_on_demand(st /* Make sure all gpfns are unused */ for ( i = 0; i < (1UL << order); i++ ) { - omfn = gfn_to_mfn_query(d, gfn + i, &ot); + p2m_access_t a; + omfn = p2m->get_entry(p2m, gfn + i, &ot, &a, p2m_query, NULL); if ( p2m_is_ram(ot) ) { printk("%s: gfn_to_mfn returned type %d!\n", @@ -1169,9 +1183,9 @@ guest_physmap_mark_populate_on_demand(st } audit_p2m(p2m, 1); - p2m_unlock(p2m); out: + put_p2m(p2m, gfn, order); return rc; } diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/p2m-pt.c --- a/xen/arch/x86/mm/p2m-pt.c +++ b/xen/arch/x86/mm/p2m-pt.c @@ -487,31 +487,16 @@ out: } -/* Non-ept "lock-and-check" wrapper */ +/* PT-specific check wrapper */ static int p2m_pod_check_and_populate(struct p2m_domain *p2m, unsigned long gfn, l1_pgentry_t *p2m_entry, int order, p2m_query_t q) { - int r; - - /* This is called from the p2m lookups, which can happen with or - * without the lock hed. */ - p2m_lock_recursive(p2m); - audit_p2m(p2m, 1); - /* Check to make sure this is still PoD */ if ( p2m_flags_to_type(l1e_get_flags(*p2m_entry)) != p2m_populate_on_demand ) - { - p2m_unlock(p2m); return 0; - } - r = p2m_pod_demand_populate(p2m, gfn, order, q); - - audit_p2m(p2m, 1); - p2m_unlock(p2m); - - return r; + return p2m_pod_demand_populate(p2m, gfn, order, q); } /* Read the current domain''s p2m table (through the linear mapping). 
*/ @@ -894,6 +879,7 @@ static void p2m_change_type_global(struc if ( pagetable_get_pfn(p2m_get_pagetable(p2m)) == 0 ) return; + /* Checks for exclusive lock */ ASSERT(p2m_locked_by_me(p2m)); #if CONFIG_PAGING_LEVELS == 4 diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/p2m.c --- a/xen/arch/x86/mm/p2m.c +++ b/xen/arch/x86/mm/p2m.c @@ -143,9 +143,9 @@ void p2m_change_entry_type_global(struct p2m_type_t ot, p2m_type_t nt) { struct p2m_domain *p2m = p2m_get_hostp2m(d); - p2m_lock(p2m); + get_p2m_global(p2m); p2m->change_entry_type_global(p2m, ot, nt); - p2m_unlock(p2m); + put_p2m_global(p2m); } mfn_t gfn_to_mfn_type_p2m(struct p2m_domain *p2m, unsigned long gfn, @@ -162,12 +162,17 @@ mfn_t gfn_to_mfn_type_p2m(struct p2m_dom return _mfn(gfn); } + /* We take the lock for this single gfn. The caller has to put this lock */ + get_p2m_gfn(p2m, gfn); + mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order); #ifdef __x86_64__ if ( q == p2m_unshare && p2m_is_shared(*t) ) { ASSERT(!p2m_is_nestedp2m(p2m)); + /* p2m locking is recursive, so we won''t deadlock going + * into the sharing code */ mem_sharing_unshare_page(p2m->domain, gfn, 0); mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order); } @@ -179,13 +184,28 @@ mfn_t gfn_to_mfn_type_p2m(struct p2m_dom /* Return invalid_mfn to avoid caller''s access */ mfn = _mfn(INVALID_MFN); if (q == p2m_guest) + { + put_p2m_gfn(p2m, gfn); domain_crash(p2m->domain); + } } #endif + /* Take an extra reference to the page. It won''t disappear beneath us */ + if ( mfn_valid(mfn) ) + { + /* Use this because we don''t necessarily know who owns the page */ + if ( !page_get_owner_and_reference(mfn_to_page(mfn)) ) + { + mfn = _mfn(INVALID_MFN); + } + } + + /* We leave holding the p2m lock for this gfn */ return mfn; } +/* Appropriate locks held on entry */ int set_p2m_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, unsigned int page_order, p2m_type_t p2mt, p2m_access_t p2ma) { @@ -194,8 +214,6 @@ int set_p2m_entry(struct p2m_domain *p2m unsigned int order; int rc = 1; - ASSERT(p2m_locked_by_me(p2m)); - while ( todo ) { if ( hap_enabled(d) ) @@ -217,6 +235,18 @@ int set_p2m_entry(struct p2m_domain *p2m return rc; } +void drop_p2m_gfn(struct p2m_domain *p2m, unsigned long gfn, + unsigned long frame) +{ + mfn_t mfn = _mfn(frame); + /* For non-translated domains, locks are never taken */ + if ( !p2m || !paging_mode_translate(p2m->domain) ) + return; + if ( mfn_valid(mfn) ) + put_page(mfn_to_page(mfn)); + put_p2m_gfn(p2m, gfn); +} + struct page_info *p2m_alloc_ptp(struct p2m_domain *p2m, unsigned long type) { struct page_info *pg; @@ -262,12 +292,12 @@ int p2m_alloc_table(struct p2m_domain *p unsigned long gfn = -1UL; struct domain *d = p2m->domain; - p2m_lock(p2m); + get_p2m_global(p2m); if ( pagetable_get_pfn(p2m_get_pagetable(p2m)) != 0 ) { P2M_ERROR("p2m already allocated for this domain\n"); - p2m_unlock(p2m); + put_p2m_global(p2m); return -EINVAL; } @@ -283,7 +313,7 @@ int p2m_alloc_table(struct p2m_domain *p if ( p2m_top == NULL ) { - p2m_unlock(p2m); + put_p2m_global(p2m); return -ENOMEM; } @@ -295,7 +325,7 @@ int p2m_alloc_table(struct p2m_domain *p P2M_PRINTK("populating p2m table\n"); /* Initialise physmap tables for slot zero. Other code assumes this. 
*/ - p2m->defer_nested_flush = 1; + atomic_set(&p2m->defer_nested_flush, 1); if ( !set_p2m_entry(p2m, 0, _mfn(INVALID_MFN), 0, p2m_invalid, p2m->default_access) ) goto error; @@ -323,10 +353,10 @@ int p2m_alloc_table(struct p2m_domain *p } spin_unlock(&p2m->domain->page_alloc_lock); } - p2m->defer_nested_flush = 0; + atomic_set(&p2m->defer_nested_flush, 0); P2M_PRINTK("p2m table initialised (%u pages)\n", page_count); - p2m_unlock(p2m); + put_p2m_global(p2m); return 0; error_unlock: @@ -334,7 +364,7 @@ error_unlock: error: P2M_PRINTK("failed to initialize p2m table, gfn=%05lx, mfn=%" PRI_mfn "\n", gfn, mfn_x(mfn)); - p2m_unlock(p2m); + put_p2m_global(p2m); return -ENOMEM; } @@ -354,26 +384,28 @@ void p2m_teardown(struct p2m_domain *p2m if (p2m == NULL) return; + get_p2m_global(p2m); + #ifdef __x86_64__ for ( gfn=0; gfn < p2m->max_mapped_pfn; gfn++ ) { - mfn = gfn_to_mfn_type_p2m(p2m, gfn, &t, &a, p2m_query, NULL); + mfn = p2m->get_entry(p2m, gfn, &t, &a, p2m_query, NULL); if ( mfn_valid(mfn) && (t == p2m_ram_shared) ) { ASSERT(!p2m_is_nestedp2m(p2m)); + /* The p2m allows an exclusive global holder to recursively + * lock sub-ranges. For this. */ BUG_ON(mem_sharing_unshare_page(d, gfn, MEM_SHARING_DESTROY_GFN)); } } #endif - p2m_lock(p2m); - p2m->phys_table = pagetable_null(); while ( (pg = page_list_remove_head(&p2m->pages)) ) d->arch.paging.free_page(d, pg); - p2m_unlock(p2m); + put_p2m_global(p2m); } static void p2m_teardown_nestedp2m(struct domain *d) @@ -401,6 +433,7 @@ void p2m_final_teardown(struct domain *d } +/* Locks held on entry */ static void p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn, unsigned long mfn, unsigned int page_order) @@ -438,11 +471,11 @@ guest_physmap_remove_page(struct domain unsigned long mfn, unsigned int page_order) { struct p2m_domain *p2m = p2m_get_hostp2m(d); - p2m_lock(p2m); + get_p2m(p2m, gfn, page_order); audit_p2m(p2m, 1); p2m_remove_page(p2m, gfn, mfn, page_order); audit_p2m(p2m, 1); - p2m_unlock(p2m); + put_p2m(p2m, gfn, page_order); } int @@ -480,7 +513,7 @@ guest_physmap_add_entry(struct domain *d if ( rc != 0 ) return rc; - p2m_lock(p2m); + get_p2m(p2m, gfn, page_order); audit_p2m(p2m, 0); P2M_DEBUG("adding gfn=%#lx mfn=%#lx\n", gfn, mfn); @@ -488,12 +521,13 @@ guest_physmap_add_entry(struct domain *d /* First, remove m->p mappings for existing p->m mappings */ for ( i = 0; i < (1UL << page_order); i++ ) { - omfn = gfn_to_mfn_query(d, gfn + i, &ot); + p2m_access_t a; + omfn = p2m->get_entry(p2m, gfn + i, &ot, &a, p2m_query, NULL); if ( p2m_is_grant(ot) ) { /* Really shouldn''t be unmapping grant maps this way */ + put_p2m(p2m, gfn, page_order); domain_crash(d); - p2m_unlock(p2m); return -EINVAL; } else if ( p2m_is_ram(ot) ) @@ -523,11 +557,12 @@ guest_physmap_add_entry(struct domain *d && (ogfn != INVALID_M2P_ENTRY) && (ogfn != gfn + i) ) { + p2m_access_t a; /* This machine frame is already mapped at another physical * address */ P2M_DEBUG("aliased! 
mfn=%#lx, old gfn=%#lx, new gfn=%#lx\n", mfn + i, ogfn, gfn + i); - omfn = gfn_to_mfn_query(d, ogfn, &ot); + omfn = p2m->get_entry(p2m, ogfn, &ot, &a, p2m_query, NULL); if ( p2m_is_ram(ot) ) { ASSERT(mfn_valid(omfn)); @@ -567,7 +602,7 @@ guest_physmap_add_entry(struct domain *d } audit_p2m(p2m, 1); - p2m_unlock(p2m); + put_p2m(p2m, gfn, page_order); return rc; } @@ -579,18 +614,19 @@ p2m_type_t p2m_change_type(struct domain p2m_type_t ot, p2m_type_t nt) { p2m_type_t pt; + p2m_access_t a; mfn_t mfn; struct p2m_domain *p2m = p2m_get_hostp2m(d); BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt)); - p2m_lock(p2m); + get_p2m_gfn(p2m, gfn); - mfn = gfn_to_mfn_query(d, gfn, &pt); + mfn = p2m->get_entry(p2m, gfn, &pt, &a, p2m_query, NULL); if ( pt == ot ) set_p2m_entry(p2m, gfn, mfn, 0, nt, p2m->default_access); - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); return pt; } @@ -608,20 +644,23 @@ void p2m_change_type_range(struct domain BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt)); - p2m_lock(p2m); - p2m->defer_nested_flush = 1; + atomic_set(&p2m->defer_nested_flush, 1); + /* We''ve been given a number instead of an order, so lock each + * gfn individually */ for ( gfn = start; gfn < end; gfn++ ) { - mfn = gfn_to_mfn_query(d, gfn, &pt); + p2m_access_t a; + get_p2m_gfn(p2m, gfn); + mfn = p2m->get_entry(p2m, gfn, &pt, &a, p2m_query, NULL); if ( pt == ot ) set_p2m_entry(p2m, gfn, mfn, 0, nt, p2m->default_access); + put_p2m_gfn(p2m, gfn); } - p2m->defer_nested_flush = 0; + atomic_set(&p2m->defer_nested_flush, 0); if ( nestedhvm_enabled(d) ) p2m_flush_nestedp2m(d); - p2m_unlock(p2m); } @@ -631,17 +670,18 @@ set_mmio_p2m_entry(struct domain *d, uns { int rc = 0; p2m_type_t ot; + p2m_access_t a; mfn_t omfn; struct p2m_domain *p2m = p2m_get_hostp2m(d); if ( !paging_mode_translate(d) ) return 0; - p2m_lock(p2m); - omfn = gfn_to_mfn_query(d, gfn, &ot); + get_p2m_gfn(p2m, gfn); + omfn = p2m->get_entry(p2m, gfn, &ot, &a, p2m_query, NULL); if ( p2m_is_grant(ot) ) { - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); domain_crash(d); return 0; } @@ -654,11 +694,11 @@ set_mmio_p2m_entry(struct domain *d, uns P2M_DEBUG("set mmio %lx %lx\n", gfn, mfn_x(mfn)); rc = set_p2m_entry(p2m, gfn, mfn, 0, p2m_mmio_direct, p2m->default_access); audit_p2m(p2m, 1); - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); if ( 0 == rc ) gdprintk(XENLOG_ERR, "set_mmio_p2m_entry: set_p2m_entry failed! mfn=%08lx\n", - mfn_x(gfn_to_mfn_query(d, gfn, &ot))); + mfn_x(gfn_to_mfn_query_unlocked(d, gfn, &ot))); return rc; } @@ -668,13 +708,14 @@ clear_mmio_p2m_entry(struct domain *d, u int rc = 0; mfn_t mfn; p2m_type_t t; + p2m_access_t a; struct p2m_domain *p2m = p2m_get_hostp2m(d); if ( !paging_mode_translate(d) ) return 0; - p2m_lock(p2m); - mfn = gfn_to_mfn_query(d, gfn, &t); + get_p2m_gfn(p2m, gfn); + mfn = p2m->get_entry(p2m, gfn, &t, &a, p2m_query, NULL); /* Do not use mfn_valid() here as it will usually fail for MMIO pages. 
*/ if ( (INVALID_MFN == mfn_x(mfn)) || (t != p2m_mmio_direct) ) @@ -687,8 +728,7 @@ clear_mmio_p2m_entry(struct domain *d, u audit_p2m(p2m, 1); out: - p2m_unlock(p2m); - + put_p2m_gfn(p2m, gfn); return rc; } @@ -698,13 +738,14 @@ set_shared_p2m_entry(struct domain *d, u struct p2m_domain *p2m = p2m_get_hostp2m(d); int rc = 0; p2m_type_t ot; + p2m_access_t a; mfn_t omfn; if ( !paging_mode_translate(p2m->domain) ) return 0; - p2m_lock(p2m); - omfn = gfn_to_mfn_query(p2m->domain, gfn, &ot); + get_p2m_gfn(p2m, gfn); + omfn = p2m->get_entry(p2m, gfn, &ot, &a, p2m_query, NULL); /* At the moment we only allow p2m change if gfn has already been made * sharable first */ ASSERT(p2m_is_shared(ot)); @@ -714,11 +755,11 @@ set_shared_p2m_entry(struct domain *d, u P2M_DEBUG("set shared %lx %lx\n", gfn, mfn_x(mfn)); rc = set_p2m_entry(p2m, gfn, mfn, 0, p2m_ram_shared, p2m->default_access); - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); if ( 0 == rc ) gdprintk(XENLOG_ERR, "set_shared_p2m_entry: set_p2m_entry failed! mfn=%08lx\n", - mfn_x(gfn_to_mfn_query(d, gfn, &ot))); + mfn_x(gfn_to_mfn_query_unlocked(d, gfn, &ot))); return rc; } @@ -732,7 +773,7 @@ int p2m_mem_paging_nominate(struct domai mfn_t mfn; int ret; - p2m_lock(p2m); + get_p2m_gfn(p2m, gfn); mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL); @@ -765,7 +806,7 @@ int p2m_mem_paging_nominate(struct domai ret = 0; out: - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); return ret; } @@ -778,7 +819,7 @@ int p2m_mem_paging_evict(struct domain * struct p2m_domain *p2m = p2m_get_hostp2m(d); int ret = -EINVAL; - p2m_lock(p2m); + get_p2m_gfn(p2m, gfn); /* Get mfn */ mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL); @@ -824,7 +865,7 @@ int p2m_mem_paging_evict(struct domain * put_page(page); out: - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); return ret; } @@ -863,7 +904,7 @@ void p2m_mem_paging_populate(struct doma req.type = MEM_EVENT_TYPE_PAGING; /* Fix p2m mapping */ - p2m_lock(p2m); + get_p2m_gfn(p2m, gfn); mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL); /* Allow only nominated or evicted pages to enter page-in path */ if ( p2mt == p2m_ram_paging_out || p2mt == p2m_ram_paged ) @@ -875,7 +916,7 @@ void p2m_mem_paging_populate(struct doma set_p2m_entry(p2m, gfn, mfn, 0, p2m_ram_paging_in_start, a); audit_p2m(p2m, 1); } - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); /* Pause domain if request came from guest and gfn has paging type */ if ( p2m_is_paging(p2mt) && v->domain->domain_id == d->domain_id ) @@ -908,7 +949,7 @@ int p2m_mem_paging_prep(struct domain *d struct p2m_domain *p2m = p2m_get_hostp2m(d); int ret = -ENOMEM; - p2m_lock(p2m); + get_p2m_gfn(p2m, gfn); mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL); @@ -931,7 +972,7 @@ int p2m_mem_paging_prep(struct domain *d ret = 0; out: - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); return ret; } @@ -949,12 +990,12 @@ void p2m_mem_paging_resume(struct domain /* Fix p2m entry if the page was not dropped */ if ( !(rsp.flags & MEM_EVENT_FLAG_DROP_PAGE) ) { - p2m_lock(p2m); + get_p2m_gfn(p2m, rsp.gfn); mfn = p2m->get_entry(p2m, rsp.gfn, &p2mt, &a, p2m_query, NULL); set_p2m_entry(p2m, rsp.gfn, mfn, 0, p2m_ram_rw, a); set_gpfn_from_mfn(mfn_x(mfn), rsp.gfn); audit_p2m(p2m, 1); - p2m_unlock(p2m); + put_p2m_gfn(p2m, rsp.gfn); } /* Unpause domain */ @@ -979,16 +1020,16 @@ void p2m_mem_access_check(unsigned long p2m_access_t p2ma; /* First, handle rx2rw conversion automatically */ - p2m_lock(p2m); + get_p2m_gfn(p2m, gfn); mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, p2m_query, NULL); if ( 
access_w && p2ma == p2m_access_rx2rw ) { p2m->set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2mt, p2m_access_rw); - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); return; } - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); /* Otherwise, check if there is a memory event listener, and send the message along */ res = mem_event_check_ring(d, &d->mem_access); @@ -1006,9 +1047,9 @@ void p2m_mem_access_check(unsigned long else { /* A listener is not required, so clear the access restrictions */ - p2m_lock(p2m); + get_p2m_gfn(p2m, gfn); p2m->set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2mt, p2m_access_rwx); - p2m_unlock(p2m); + put_p2m_gfn(p2m, gfn); } return; @@ -1064,7 +1105,7 @@ int p2m_set_mem_access(struct domain *d, { struct p2m_domain *p2m = p2m_get_hostp2m(d); unsigned long pfn; - p2m_access_t a; + p2m_access_t a, _a; p2m_type_t t; mfn_t mfn; int rc = 0; @@ -1095,17 +1136,20 @@ int p2m_set_mem_access(struct domain *d, return 0; } - p2m_lock(p2m); + /* Because we don''t get an order, rather a number, we need to lock each + * entry individually */ for ( pfn = start_pfn; pfn < start_pfn + nr; pfn++ ) { - mfn = gfn_to_mfn_query(d, pfn, &t); + get_p2m_gfn(p2m, pfn); + mfn = p2m->get_entry(p2m, pfn, &t, &_a, p2m_query, NULL); if ( p2m->set_entry(p2m, pfn, mfn, PAGE_ORDER_4K, t, a) == 0 ) { + put_p2m_gfn(p2m, pfn); rc = -ENOMEM; break; } + put_p2m_gfn(p2m, pfn); } - p2m_unlock(p2m); return rc; } @@ -1138,7 +1182,10 @@ int p2m_get_mem_access(struct domain *d, return 0; } + get_p2m_gfn(p2m, pfn); mfn = p2m->get_entry(p2m, pfn, &t, &a, p2m_query, NULL); + put_p2m_gfn(p2m, pfn); + if ( mfn_x(mfn) == INVALID_MFN ) return -ESRCH; @@ -1175,7 +1222,7 @@ p2m_flush_table(struct p2m_domain *p2m) struct domain *d = p2m->domain; void *p; - p2m_lock(p2m); + get_p2m_global(p2m); /* "Host" p2m tables can have shared entries &c that need a bit more * care when discarding them */ @@ -1203,7 +1250,7 @@ p2m_flush_table(struct p2m_domain *p2m) d->arch.paging.free_page(d, pg); page_list_add(top, &p2m->pages); - p2m_unlock(p2m); + put_p2m_global(p2m); } void @@ -1245,7 +1292,7 @@ p2m_get_nestedp2m(struct vcpu *v, uint64 p2m = nv->nv_p2m; if ( p2m ) { - p2m_lock(p2m); + get_p2m_global(p2m); if ( p2m->cr3 == cr3 || p2m->cr3 == CR3_EADDR ) { nv->nv_flushp2m = 0; @@ -1255,24 +1302,24 @@ p2m_get_nestedp2m(struct vcpu *v, uint64 hvm_asid_flush_vcpu(v); p2m->cr3 = cr3; cpu_set(v->processor, p2m->p2m_dirty_cpumask); - p2m_unlock(p2m); + put_p2m_global(p2m); nestedp2m_unlock(d); return p2m; } - p2m_unlock(p2m); + put_p2m_global(p2m); } /* All p2m''s are or were in use. Take the least recent used one, * flush it and reuse. */ p2m = p2m_getlru_nestedp2m(d, NULL); p2m_flush_table(p2m); - p2m_lock(p2m); + get_p2m_global(p2m); nv->nv_p2m = p2m; p2m->cr3 = cr3; nv->nv_flushp2m = 0; hvm_asid_flush_vcpu(v); cpu_set(v->processor, p2m->p2m_dirty_cpumask); - p2m_unlock(p2m); + put_p2m_global(p2m); nestedp2m_unlock(d); return p2m; diff -r 8a98179666de -r 471d4f2754d6 xen/include/asm-ia64/mm.h --- a/xen/include/asm-ia64/mm.h +++ b/xen/include/asm-ia64/mm.h @@ -561,6 +561,11 @@ extern u64 translate_domain_pte(u64 ptev ((get_gpfn_from_mfn((madr) >> PAGE_SHIFT) << PAGE_SHIFT) | \ ((madr) & ~PAGE_MASK)) +/* Because x86-specific p2m fine-grained lock releases are called from common + * code, we put a dummy placeholder here */ +#define drop_p2m_gfn(p, g, m) ((void)0) +#define drop_p2m_gfn_domain(p, g, m) ((void)0) + /* Internal use only: returns 0 in case of bad address. 
*/ extern unsigned long paddr_to_maddr(unsigned long paddr); diff -r 8a98179666de -r 471d4f2754d6 xen/include/asm-x86/p2m.h --- a/xen/include/asm-x86/p2m.h +++ b/xen/include/asm-x86/p2m.h @@ -220,7 +220,7 @@ struct p2m_domain { * tables on every host-p2m change. The setter of this flag * is responsible for performing the full flush before releasing the * host p2m''s lock. */ - int defer_nested_flush; + atomic_t defer_nested_flush; /* Pages used to construct the p2m */ struct page_list_head pages; @@ -298,6 +298,15 @@ struct p2m_domain *p2m_get_p2m(struct vc #define p2m_get_pagetable(p2m) ((p2m)->phys_table) +/* No matter what value you get out of a query, the p2m has been locked for + * that range. No matter what you do, you need to drop those locks. + * You need to pass back the mfn obtained when locking, not the new one, + * as the refcount of the original mfn was bumped. */ +void drop_p2m_gfn(struct p2m_domain *p2m, unsigned long gfn, + unsigned long mfn); +#define drop_p2m_gfn_domain(d, g, m) \ + drop_p2m_gfn(p2m_get_hostp2m((d)), (g), (m)) + /* Read a particular P2M table, mapping pages as we go. Most callers * should _not_ call this directly; use the other gfn_to_mfn_* functions * below unless you know you want to walk a p2m that isn''t a domain''s @@ -327,6 +336,28 @@ static inline mfn_t gfn_to_mfn_type(stru #define gfn_to_mfn_guest(d, g, t) gfn_to_mfn_type((d), (g), (t), p2m_guest) #define gfn_to_mfn_unshare(d, g, t) gfn_to_mfn_type((d), (g), (t), p2m_unshare) +/* This one applies to very specific situations in which you''re querying + * a p2m entry and will be done "immediately" (such as a printk or computing a + * return value). Use this only if there are no expectations of the p2m entry + * holding steady. */ +static inline mfn_t gfn_to_mfn_type_unlocked(struct domain *d, + unsigned long gfn, p2m_type_t *t, + p2m_query_t q) +{ + mfn_t mfn = gfn_to_mfn_type(d, gfn, t, q); + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); + return mfn; +} + +#define gfn_to_mfn_unlocked(d, g, t) \ + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_alloc) +#define gfn_to_mfn_query_unlocked(d, g, t) \ + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_query) +#define gfn_to_mfn_guest_unlocked(d, g, t) \ + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_guest) +#define gfn_to_mfn_unshare_unlocked(d, g, t) \ + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_unshare) + /* Compatibility function exporting the old untyped interface */ static inline unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn) { @@ -338,6 +369,15 @@ static inline unsigned long gmfn_to_mfn( return INVALID_MFN; } +/* Same comments apply re unlocking */ +static inline unsigned long gmfn_to_mfn_unlocked(struct domain *d, + unsigned long gpfn) +{ + unsigned long mfn = gmfn_to_mfn(d, gpfn); + drop_p2m_gfn_domain(d, gpfn, mfn); + return mfn; +} + /* General conversion function from mfn to gfn */ static inline unsigned long mfn_to_gfn(struct domain *d, mfn_t mfn) { @@ -521,7 +561,8 @@ static inline int p2m_gfn_check_limit( #define p2m_gfn_check_limit(d, g, o) 0 #endif -/* Directly set a p2m entry: only for use by p2m code */ +/* Directly set a p2m entry: only for use by p2m code. It expects locks to + * be held on entry */ int set_p2m_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, unsigned int page_order, p2m_type_t p2mt, p2m_access_t p2ma); _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
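The contract this patch establishes, and which patch 9 then applies to every external call site, is: a query leaves the range around the gfn locked and takes a reference on the returned page, and the caller must undo both by passing the mfn it got from the query to drop_p2m_gfn() or drop_p2m_gfn_domain(). A minimal sketch of an external caller under the new discipline (the function itself is hypothetical):

/* Hypothetical external caller, showing the lock/ref discipline described
 * in the p2m.h comment above. */
static void example_caller(struct domain *d, unsigned long gfn)
{
    p2m_type_t t;
    mfn_t mfn;

    mfn = gfn_to_mfn_query(d, gfn, &t);   /* locks the range, refs the page */

    if ( p2m_is_ram(t) && mfn_valid(mfn) )
    {
        /* While the lock and reference are held, the entry cannot change
         * and the page cannot disappear underneath us. */
    }

    /* Pass back the mfn obtained from the query, not a recomputed one. */
    drop_p2m_gfn_domain(d, gfn, mfn_x(mfn));
}

Callers that only need a momentary peek (a printk, computing a return value) can use the new gfn_to_mfn_query_unlocked() and related wrappers, which perform the drop internally.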
Andres Lagar-Cavilla
2011-Oct-27 04:33 UTC
[Xen-devel] [PATCH 9 of 9] Modify all call sites of queries into the p2m to use the new fine-grained locking
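One interface change in the patch below is worth calling out up front: hvm_unmap_guest_frame() now takes, besides the mapping, the address that was originally mapped and a flag saying whether that address is a virtual address or a gfn, so that it can re-derive the gfn and drop the p2m lock and page reference taken by __hvm_map_guest_frame(). A hedged sketch of the resulting pairing (the surrounding helper is invented for illustration):

/* Illustrative map/unmap pairing under the new interface. */
static void example_peek_guest_frame(unsigned long gfn)
{
    /* The map locks the p2m entry for gfn and takes a page reference... */
    void *p = hvm_map_guest_frame_ro(gfn);

    if ( p == NULL )
        return;

    /* ... read the mapped frame ... */

    /* ... and the unmap is told the gfn (is_va == 0) so it can drop both. */
    hvm_unmap_guest_frame(p, gfn, 0);
}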
xen/arch/x86/cpu/mcheck/vmce.c | 7 +- xen/arch/x86/debug.c | 7 +- xen/arch/x86/domain.c | 24 +++++- xen/arch/x86/domctl.c | 9 ++- xen/arch/x86/hvm/emulate.c | 25 ++++++- xen/arch/x86/hvm/hvm.c | 126 ++++++++++++++++++++++++++++++------ xen/arch/x86/hvm/mtrr.c | 2 +- xen/arch/x86/hvm/nestedhvm.c | 2 +- xen/arch/x86/hvm/stdvga.c | 4 +- xen/arch/x86/hvm/svm/nestedsvm.c | 12 ++- xen/arch/x86/hvm/svm/svm.c | 11 ++- xen/arch/x86/hvm/viridian.c | 4 + xen/arch/x86/hvm/vmx/vmx.c | 13 +++- xen/arch/x86/hvm/vmx/vvmx.c | 11 ++- xen/arch/x86/mm.c | 126 +++++++++++++++++++++++++++++++++--- xen/arch/x86/mm/guest_walk.c | 11 +++ xen/arch/x86/mm/hap/guest_walk.c | 15 +++- xen/arch/x86/mm/mem_event.c | 28 ++++++- xen/arch/x86/mm/mem_sharing.c | 23 +++++- xen/arch/x86/mm/shadow/common.c | 4 +- xen/arch/x86/mm/shadow/multi.c | 67 +++++++++++++++---- xen/arch/x86/physdev.c | 9 ++ xen/arch/x86/traps.c | 17 +++- xen/common/grant_table.c | 27 +++++++- xen/common/memory.c | 9 ++ xen/common/tmem_xen.c | 21 ++++- xen/include/asm-x86/hvm/hvm.h | 5 +- xen/include/asm-x86/hvm/vmx/vvmx.h | 1 + 28 files changed, 519 insertions(+), 101 deletions(-) This patch is humongous, unfortunately, given the dozens of call sites involved. For callers outside of the p2m code, we also perform a get_page on the resulting mfn of the query. This ensures that the caller, while operating on the gfn, has exclusive control of the p2m entry, and that the underlying mfn will not go away. We cannot enforce ordering of this fine-grained p2m lock at this point because there are some inversions present in the current code (pod sweeps, unshare page) that will take more time to unroot. Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/cpu/mcheck/vmce.c --- a/xen/arch/x86/cpu/mcheck/vmce.c +++ b/xen/arch/x86/cpu/mcheck/vmce.c @@ -574,6 +574,7 @@ int unmmap_broken_page(struct domain *d, { mfn_t r_mfn; p2m_type_t pt; + int rc; /* Always trust dom0''s MCE handler will prevent future access */ if ( d == dom0 ) @@ -585,14 +586,16 @@ int unmmap_broken_page(struct domain *d, if ( !is_hvm_domain(d) || !paging_mode_hap(d) ) return -ENOSYS; + rc = -1; r_mfn = gfn_to_mfn_query(d, gfn, &pt); if ( p2m_to_mask(pt) & P2M_UNMAP_TYPES) { ASSERT(mfn_x(r_mfn) == mfn_x(mfn)); p2m_change_type(d, gfn, pt, p2m_ram_broken); - return 0; + rc = 0; } + drop_p2m_gfn_domain(d, gfn, mfn_x(r_mfn)); - return -1; + return rc; } diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/debug.c --- a/xen/arch/x86/debug.c +++ b/xen/arch/x86/debug.c @@ -45,7 +45,8 @@ static unsigned long dbg_hvm_va2mfn(dbgva_t vaddr, struct domain *dp, int toaddr) { - unsigned long mfn, gfn; + unsigned long gfn; + mfn_t mfn; uint32_t pfec = PFEC_page_present; p2m_type_t gfntype; @@ -58,7 +59,7 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct dom return INVALID_MFN; } - mfn = mfn_x(gfn_to_mfn(dp, gfn, &gfntype)); + mfn = gfn_to_mfn_unlocked(dp, gfn, &gfntype); if ( p2m_is_readonly(gfntype) && toaddr ) { DBGP2("kdb:p2m_is_readonly: gfntype:%x\n", gfntype); @@ -66,7 +67,7 @@ dbg_hvm_va2mfn(dbgva_t vaddr, struct dom } DBGP2("X: vaddr:%lx domid:%d mfn:%lx\n", vaddr, dp->domain_id, mfn); - return mfn; + return mfn_x(mfn); } #if defined(__x86_64__) diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/domain.c --- a/xen/arch/x86/domain.c +++ b/xen/arch/x86/domain.c @@ -720,6 +720,7 @@ int arch_set_info_guest( struct vcpu *v, vcpu_guest_context_u c) { struct domain *d = v->domain; + unsigned long cr3_gfn; unsigned long cr3_pfn = INVALID_MFN; unsigned long flags, cr4; 
unsigned int i; @@ -931,7 +932,8 @@ int arch_set_info_guest( if ( !compat ) { - cr3_pfn = gmfn_to_mfn(d, xen_cr3_to_pfn(c.nat->ctrlreg[3])); + cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[3]); + cr3_pfn = gmfn_to_mfn(d, cr3_gfn); if ( !mfn_valid(cr3_pfn) || (paging_mode_refcounts(d) @@ -939,16 +941,18 @@ int arch_set_info_guest( : !get_page_and_type(mfn_to_page(cr3_pfn), d, PGT_base_page_table)) ) { + drop_p2m_gfn_domain(d, cr3_gfn, cr3_pfn); destroy_gdt(v); return -EINVAL; } v->arch.guest_table = pagetable_from_pfn(cr3_pfn); - + drop_p2m_gfn_domain(d, cr3_gfn, cr3_pfn); #ifdef __x86_64__ if ( c.nat->ctrlreg[1] ) { - cr3_pfn = gmfn_to_mfn(d, xen_cr3_to_pfn(c.nat->ctrlreg[1])); + cr3_gfn = xen_cr3_to_pfn(c.nat->ctrlreg[1]); + cr3_pfn = gmfn_to_mfn(d, cr3_gfn); if ( !mfn_valid(cr3_pfn) || (paging_mode_refcounts(d) @@ -962,11 +966,13 @@ int arch_set_info_guest( put_page(mfn_to_page(cr3_pfn)); else put_page_and_type(mfn_to_page(cr3_pfn)); + drop_p2m_gfn_domain(d, cr3_gfn, cr3_pfn); destroy_gdt(v); return -EINVAL; } v->arch.guest_table_user = pagetable_from_pfn(cr3_pfn); + drop_p2m_gfn_domain(d, cr3_gfn, cr3_pfn); } else if ( !(flags & VGCF_in_kernel) ) { @@ -978,7 +984,8 @@ int arch_set_info_guest( { l4_pgentry_t *l4tab; - cr3_pfn = gmfn_to_mfn(d, compat_cr3_to_pfn(c.cmp->ctrlreg[3])); + cr3_gfn = compat_cr3_to_pfn(c.cmp->ctrlreg[3]); + cr3_pfn = gmfn_to_mfn(d, cr3_gfn); if ( !mfn_valid(cr3_pfn) || (paging_mode_refcounts(d) @@ -986,6 +993,7 @@ int arch_set_info_guest( : !get_page_and_type(mfn_to_page(cr3_pfn), d, PGT_l3_page_table)) ) { + drop_p2m_gfn_domain(d, cr3_gfn, cr3_pfn); destroy_gdt(v); return -EINVAL; } @@ -993,6 +1001,7 @@ int arch_set_info_guest( l4tab = __va(pagetable_get_paddr(v->arch.guest_table)); *l4tab = l4e_from_pfn( cr3_pfn, _PAGE_PRESENT|_PAGE_RW|_PAGE_USER|_PAGE_ACCESSED); + drop_p2m_gfn_domain(d, cr3_gfn, cr3_pfn); #endif } @@ -1058,11 +1067,12 @@ unmap_vcpu_info(struct vcpu *v) * event doesn''t get missed. 
*/ static int -map_vcpu_info(struct vcpu *v, unsigned long mfn, unsigned offset) +map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset) { struct domain *d = v->domain; void *mapping; vcpu_info_t *new_info; + unsigned long mfn; int i; if ( offset > (PAGE_SIZE - sizeof(vcpu_info_t)) ) @@ -1075,7 +1085,7 @@ map_vcpu_info(struct vcpu *v, unsigned l if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) ) return -EINVAL; - mfn = gmfn_to_mfn(d, mfn); + mfn = gmfn_to_mfn(d, gfn); if ( !mfn_valid(mfn) || !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) ) return -EINVAL; @@ -1084,6 +1094,7 @@ map_vcpu_info(struct vcpu *v, unsigned l if ( mapping == NULL ) { put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(d, gfn, mfn); return -ENOMEM; } @@ -1113,6 +1124,7 @@ map_vcpu_info(struct vcpu *v, unsigned l for ( i = 0; i < BITS_PER_EVTCHN_WORD(d); i++ ) set_bit(i, &vcpu_info(v, evtchn_pending_sel)); + drop_p2m_gfn_domain(d, gfn, mfn); return 0; } diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/domctl.c --- a/xen/arch/x86/domctl.c +++ b/xen/arch/x86/domctl.c @@ -235,6 +235,7 @@ long arch_do_domctl( type = XEN_DOMCTL_PFINFO_XTAB; arr[j] = type; + drop_p2m_gfn_domain(d, arr[j], mfn); } if ( copy_to_guest_offset(domctl->u.getpageframeinfo3.array, @@ -299,6 +300,7 @@ long arch_do_domctl( for ( j = 0; j < k; j++ ) { struct page_info *page; + unsigned long gfn = arr32[j]; unsigned long mfn = gmfn_to_mfn(d, arr32[j]); page = mfn_to_page(mfn); @@ -310,8 +312,10 @@ long arch_do_domctl( unlikely(is_xen_heap_mfn(mfn)) ) arr32[j] |= XEN_DOMCTL_PFINFO_XTAB; else if ( xsm_getpageframeinfo(page) != 0 ) + { + drop_p2m_gfn_domain(d, gfn, mfn); continue; - else if ( likely(get_page(page, d)) ) + } else if ( likely(get_page(page, d)) ) { unsigned long type = 0; @@ -339,6 +343,7 @@ long arch_do_domctl( else arr32[j] |= XEN_DOMCTL_PFINFO_XTAB; + drop_p2m_gfn_domain(d, gfn, mfn); } if ( copy_to_guest_offset(domctl->u.getpageframeinfo2.array, @@ -431,6 +436,7 @@ long arch_do_domctl( if ( !mfn_valid(mfn) || !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) ) { + drop_p2m_gfn_domain(d, gmfn, mfn); rcu_unlock_domain(d); break; } @@ -443,6 +449,7 @@ long arch_do_domctl( put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(d, gmfn, mfn); rcu_unlock_domain(d); } break; diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/emulate.c --- a/xen/arch/x86/hvm/emulate.c +++ b/xen/arch/x86/hvm/emulate.c @@ -66,10 +66,14 @@ static int hvmemul_do_io( if ( p2m_is_paging(p2mt) ) { p2m_mem_paging_populate(curr->domain, ram_gfn); + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_RETRY; } if ( p2m_is_shared(p2mt) ) + { + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_RETRY; + } /* * Weird-sized accesses have undefined behaviour: we discard writes @@ -81,6 +85,7 @@ static int hvmemul_do_io( ASSERT(p_data != NULL); /* cannot happen with a REP prefix */ if ( dir == IOREQ_READ ) memset(p_data, ~0, size); + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_UNHANDLEABLE; } @@ -98,7 +103,10 @@ static int hvmemul_do_io( paddr_t pa = curr->arch.hvm_vcpu.mmio_large_write_pa; unsigned int bytes = curr->arch.hvm_vcpu.mmio_large_write_bytes; if ( (addr >= pa) && ((addr + size) <= (pa + bytes)) ) + { + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_OKAY; + } } else { @@ -108,6 +116,7 @@ static int hvmemul_do_io( { memcpy(p_data, &curr->arch.hvm_vcpu.mmio_large_read[addr - pa], size); + 
drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_OKAY; } } @@ -120,15 +129,22 @@ static int hvmemul_do_io( case HVMIO_completed: curr->arch.hvm_vcpu.io_state = HVMIO_none; if ( p_data == NULL ) + { + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_UNHANDLEABLE; + } goto finish_access; case HVMIO_dispatched: /* May have to wait for previous cycle of a multi-write to complete. */ if ( is_mmio && !value_is_ptr && (dir == IOREQ_WRITE) && (addr == (curr->arch.hvm_vcpu.mmio_large_write_pa + curr->arch.hvm_vcpu.mmio_large_write_bytes)) ) + { + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_RETRY; + } default: + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_UNHANDLEABLE; } @@ -136,6 +152,7 @@ static int hvmemul_do_io( { gdprintk(XENLOG_WARNING, "WARNING: io already pending (%d)?\n", p->state); + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_UNHANDLEABLE; } @@ -186,7 +203,10 @@ static int hvmemul_do_io( } if ( rc != X86EMUL_OKAY ) + { + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return rc; + } finish_access: if ( p_data != NULL ) @@ -221,6 +241,7 @@ static int hvmemul_do_io( } } + drop_p2m_gfn_domain(curr->domain, ram_gfn, mfn_x(ram_mfn)); return X86EMUL_OKAY; } @@ -669,12 +690,12 @@ static int hvmemul_rep_movs( if ( rc != X86EMUL_OKAY ) return rc; - (void)gfn_to_mfn(current->domain, sgpa >> PAGE_SHIFT, &p2mt); + (void)gfn_to_mfn_unlocked(current->domain, sgpa >> PAGE_SHIFT, &p2mt); if ( !p2m_is_ram(p2mt) && !p2m_is_grant(p2mt) ) return hvmemul_do_mmio( sgpa, reps, bytes_per_rep, dgpa, IOREQ_READ, df, NULL); - (void)gfn_to_mfn(current->domain, dgpa >> PAGE_SHIFT, &p2mt); + (void)gfn_to_mfn_unlocked(current->domain, dgpa >> PAGE_SHIFT, &p2mt); if ( !p2m_is_ram(p2mt) && !p2m_is_grant(p2mt) ) return hvmemul_do_mmio( dgpa, reps, bytes_per_rep, sgpa, IOREQ_WRITE, df, NULL); diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/hvm.c --- a/xen/arch/x86/hvm/hvm.c +++ b/xen/arch/x86/hvm/hvm.c @@ -357,24 +357,35 @@ static int hvm_set_ioreq_page( mfn = mfn_x(gfn_to_mfn_unshare(d, gmfn, &p2mt)); if ( !p2m_is_ram(p2mt) ) + { + drop_p2m_gfn_domain(d, gmfn, mfn); return -EINVAL; + } if ( p2m_is_paging(p2mt) ) { p2m_mem_paging_populate(d, gmfn); + drop_p2m_gfn_domain(d, gmfn, mfn); return -ENOENT; } if ( p2m_is_shared(p2mt) ) + { + drop_p2m_gfn_domain(d, gmfn, mfn); return -ENOENT; + } ASSERT(mfn_valid(mfn)); page = mfn_to_page(mfn); if ( !get_page_and_type(page, d, PGT_writable_page) ) + { + drop_p2m_gfn_domain(d, gmfn, mfn); return -EINVAL; + } va = map_domain_page_global(mfn); if ( va == NULL ) { put_page_and_type(page); + drop_p2m_gfn_domain(d, gmfn, mfn); return -ENOMEM; } @@ -385,12 +396,14 @@ static int hvm_set_ioreq_page( spin_unlock(&iorp->lock); unmap_domain_page_global(va); put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(d, gmfn, mfn); return -EINVAL; } iorp->va = va; iorp->page = page; + drop_p2m_gfn_domain(d, gmfn, mfn); spin_unlock(&iorp->lock); domain_unpause(d); @@ -1182,6 +1195,7 @@ int hvm_hap_nested_page_fault(unsigned l mfn_t mfn; struct vcpu *v = current; struct p2m_domain *p2m; + int rc; /* On Nested Virtualization, walk the guest page table. * If this succeeds, all is fine. 
@@ -1251,8 +1265,8 @@ int hvm_hap_nested_page_fault(unsigned l if ( violation ) { p2m_mem_access_check(gpa, gla_valid, gla, access_r, access_w, access_x); - - return 1; + rc = 1; + goto out_put_p2m; } } @@ -1264,7 +1278,8 @@ int hvm_hap_nested_page_fault(unsigned l { if ( !handle_mmio() ) hvm_inject_exception(TRAP_gp_fault, 0, 0); - return 1; + rc = 1; + goto out_put_p2m; } #ifdef __x86_64__ @@ -1277,7 +1292,8 @@ int hvm_hap_nested_page_fault(unsigned l { ASSERT(!p2m_is_nestedp2m(p2m)); mem_sharing_unshare_page(p2m->domain, gfn, 0); - return 1; + rc = 1; + goto out_put_p2m; } #endif @@ -1291,7 +1307,8 @@ int hvm_hap_nested_page_fault(unsigned l */ paging_mark_dirty(v->domain, mfn_x(mfn)); p2m_change_type(v->domain, gfn, p2m_ram_logdirty, p2m_ram_rw); - return 1; + rc = 1; + goto out_put_p2m; } /* Shouldn''t happen: Maybe the guest was writing to a r/o grant mapping? */ @@ -1300,10 +1317,14 @@ int hvm_hap_nested_page_fault(unsigned l gdprintk(XENLOG_WARNING, "trying to write to read-only grant mapping\n"); hvm_inject_exception(TRAP_gp_fault, 0, 0); - return 1; + rc = 1; + goto out_put_p2m; } - return 0; + rc = 0; +out_put_p2m: + drop_p2m_gfn(p2m, gfn, mfn_x(mfn)); + return rc; } int hvm_handle_xsetbv(u64 new_bv) @@ -1530,6 +1551,7 @@ int hvm_set_cr0(unsigned long value) if ( !p2m_is_ram(p2mt) || !mfn_valid(mfn) || !get_page(mfn_to_page(mfn), v->domain)) { + drop_p2m_gfn_domain(v->domain, gfn, mfn); gdprintk(XENLOG_ERR, "Invalid CR3 value = %lx (mfn=%lx)\n", v->arch.hvm_vcpu.guest_cr[3], mfn); domain_crash(v->domain); @@ -1541,6 +1563,7 @@ int hvm_set_cr0(unsigned long value) HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR3 value = %lx, mfn = %lx", v->arch.hvm_vcpu.guest_cr[3], mfn); + drop_p2m_gfn_domain(v->domain, gfn, mfn); } } else if ( !(value & X86_CR0_PG) && (old_value & X86_CR0_PG) ) @@ -1620,10 +1643,15 @@ int hvm_set_cr3(unsigned long value) mfn = mfn_x(gfn_to_mfn(v->domain, value >> PAGE_SHIFT, &p2mt)); if ( !p2m_is_ram(p2mt) || !mfn_valid(mfn) || !get_page(mfn_to_page(mfn), v->domain) ) + { + drop_p2m_gfn_domain(v->domain, + value >> PAGE_SHIFT, mfn); goto bad_cr3; + } put_page(pagetable_get_page(v->arch.guest_table)); v->arch.guest_table = pagetable_from_pfn(mfn); + drop_p2m_gfn_domain(v->domain, value >> PAGE_SHIFT, mfn); HVM_DBG_LOG(DBG_LEVEL_VMMU, "Update CR3 value = %lx", value); } @@ -1760,6 +1788,8 @@ int hvm_virtual_to_linear_addr( return 0; } +/* We leave this function holding a lock on the p2m entry and a ref + * on the mapped mfn */ static void *__hvm_map_guest_frame(unsigned long gfn, bool_t writable) { unsigned long mfn; @@ -1770,10 +1800,14 @@ static void *__hvm_map_guest_frame(unsig ? gfn_to_mfn_unshare(d, gfn, &p2mt) : gfn_to_mfn(d, gfn, &p2mt)); if ( (p2m_is_shared(p2mt) && writable) || !p2m_is_ram(p2mt) ) + { + drop_p2m_gfn_domain(d, gfn, mfn); return NULL; + } if ( p2m_is_paging(p2mt) ) { p2m_mem_paging_populate(d, gfn); + drop_p2m_gfn_domain(d, gfn, mfn); return NULL; } @@ -1795,10 +1829,39 @@ void *hvm_map_guest_frame_ro(unsigned lo return __hvm_map_guest_frame(gfn, 0); } -void hvm_unmap_guest_frame(void *p) +void hvm_unmap_guest_frame(void *p, unsigned long addr, int is_va) { + /* We enter this function with a map obtained in __hvm_map_guest_frame. + * This map performed a p2m query that locked the gfn entry and got + * a ref on the mfn. 
Must undo */ if ( p ) + { + unsigned long gfn = ~0UL; + + if ( is_va ) + { + if ( addr ) + { + uint32_t pfec = 0; + gfn = paging_gva_to_gfn(current, addr, &pfec); + } else { + gfn = ~0UL; + } + } else { + gfn = addr; + } + + if ( gfn != ~0UL ) + { + /* And we get a recursive lock and second ref */ + p2m_type_t t; + unsigned long mfn = mfn_x(gfn_to_mfn(current->domain, gfn, &t)); + drop_p2m_gfn_domain(current->domain, gfn, mfn); + drop_p2m_gfn_domain(current->domain, gfn, mfn); + } + unmap_domain_page(p); + } } static void *hvm_map_entry(unsigned long va) @@ -1835,9 +1898,9 @@ static void *hvm_map_entry(unsigned long return NULL; } -static void hvm_unmap_entry(void *p) +static void hvm_unmap_entry(void *p, unsigned long va) { - hvm_unmap_guest_frame(p); + hvm_unmap_guest_frame(p, va, 1); } static int hvm_load_segment_selector( @@ -1849,6 +1912,7 @@ static int hvm_load_segment_selector( int fault_type = TRAP_invalid_tss; struct cpu_user_regs *regs = guest_cpu_user_regs(); struct vcpu *v = current; + unsigned long va_desc; if ( regs->eflags & X86_EFLAGS_VM ) { @@ -1882,7 +1946,8 @@ static int hvm_load_segment_selector( if ( ((sel & 0xfff8) + 7) > desctab.limit ) goto fail; - pdesc = hvm_map_entry(desctab.base + (sel & 0xfff8)); + va_desc = desctab.base + (sel & 0xfff8); + pdesc = hvm_map_entry(va_desc); if ( pdesc == NULL ) goto hvm_map_fail; @@ -1942,7 +2007,7 @@ static int hvm_load_segment_selector( desc.b |= 0x100; skip_accessed_flag: - hvm_unmap_entry(pdesc); + hvm_unmap_entry(pdesc, va_desc); segr.base = (((desc.b << 0) & 0xff000000u) | ((desc.b << 16) & 0x00ff0000u) | @@ -1958,7 +2023,7 @@ static int hvm_load_segment_selector( return 0; unmap_and_fail: - hvm_unmap_entry(pdesc); + hvm_unmap_entry(pdesc, va_desc); fail: hvm_inject_exception(fault_type, sel & 0xfffc, 0); hvm_map_fail: @@ -1973,7 +2038,7 @@ void hvm_task_switch( struct cpu_user_regs *regs = guest_cpu_user_regs(); struct segment_register gdt, tr, prev_tr, segr; struct desc_struct *optss_desc = NULL, *nptss_desc = NULL, tss_desc; - unsigned long eflags; + unsigned long eflags, va_optss = 0, va_nptss = 0; int exn_raised, rc; struct { u16 back_link,__blh; @@ -1999,11 +2064,13 @@ void hvm_task_switch( goto out; } - optss_desc = hvm_map_entry(gdt.base + (prev_tr.sel & 0xfff8)); + va_optss = gdt.base + (prev_tr.sel & 0xfff8); + optss_desc = hvm_map_entry(va_optss); if ( optss_desc == NULL ) goto out; - nptss_desc = hvm_map_entry(gdt.base + (tss_sel & 0xfff8)); + va_nptss = gdt.base + (tss_sel & 0xfff8); + nptss_desc = hvm_map_entry(va_nptss); if ( nptss_desc == NULL ) goto out; @@ -2168,8 +2235,8 @@ void hvm_task_switch( } out: - hvm_unmap_entry(optss_desc); - hvm_unmap_entry(nptss_desc); + hvm_unmap_entry(optss_desc, va_optss); + hvm_unmap_entry(nptss_desc, va_nptss); } #define HVMCOPY_from_guest (0u<<0) @@ -2182,7 +2249,7 @@ static enum hvm_copy_result __hvm_copy( void *buf, paddr_t addr, int size, unsigned int flags, uint32_t pfec) { struct vcpu *curr = current; - unsigned long gfn, mfn; + unsigned long gfn = 0, mfn = 0; /* gcc ... 
*/ p2m_type_t p2mt; char *p; int count, todo = size; @@ -2231,14 +2298,24 @@ static enum hvm_copy_result __hvm_copy( if ( p2m_is_paging(p2mt) ) { p2m_mem_paging_populate(curr->domain, gfn); + drop_p2m_gfn_domain(curr->domain, gfn, mfn); return HVMCOPY_gfn_paged_out; } if ( p2m_is_shared(p2mt) ) + { + drop_p2m_gfn_domain(curr->domain, gfn, mfn); return HVMCOPY_gfn_shared; + } if ( p2m_is_grant(p2mt) ) + { + drop_p2m_gfn_domain(curr->domain, gfn, mfn); return HVMCOPY_unhandleable; + } if ( !p2m_is_ram(p2mt) ) + { + drop_p2m_gfn_domain(curr->domain, gfn, mfn); return HVMCOPY_bad_gfn_to_mfn; + } ASSERT(mfn_valid(mfn)); p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK); @@ -2269,6 +2346,7 @@ static enum hvm_copy_result __hvm_copy( addr += count; buf += count; todo -= count; + drop_p2m_gfn_domain(curr->domain, gfn, mfn); } return HVMCOPY_okay; @@ -3688,7 +3766,7 @@ long do_hvm_op(unsigned long op, XEN_GUE if ( p2m_is_paging(t) ) { p2m_mem_paging_populate(d, pfn); - + drop_p2m_gfn_domain(d, pfn, mfn_x(mfn)); rc = -EINVAL; goto param_fail3; } @@ -3703,6 +3781,7 @@ long do_hvm_op(unsigned long op, XEN_GUE /* don''t take a long time and don''t die either */ sh_remove_shadows(d->vcpu[0], mfn, 1, 0); } + drop_p2m_gfn_domain(d, pfn, mfn_x(mfn)); } param_fail3: @@ -3726,7 +3805,7 @@ long do_hvm_op(unsigned long op, XEN_GUE rc = -EINVAL; if ( is_hvm_domain(d) ) { - gfn_to_mfn_unshare(d, a.pfn, &t); + gfn_to_mfn_unshare_unlocked(d, a.pfn, &t); if ( p2m_is_mmio(t) ) a.mem_type = HVMMEM_mmio_dm; else if ( p2m_is_readonly(t) ) @@ -3783,16 +3862,19 @@ long do_hvm_op(unsigned long op, XEN_GUE if ( p2m_is_paging(t) ) { p2m_mem_paging_populate(d, pfn); + drop_p2m_gfn_domain(d, pfn, mfn_x(mfn)); rc = -EINVAL; goto param_fail4; } if ( p2m_is_shared(t) ) { + drop_p2m_gfn_domain(d, pfn, mfn_x(mfn)); rc = -EINVAL; goto param_fail4; } if ( p2m_is_grant(t) ) { + drop_p2m_gfn_domain(d, pfn, mfn_x(mfn)); gdprintk(XENLOG_WARNING, "type for pfn 0x%lx changed to grant while " "we were working?\n", pfn); @@ -3803,6 +3885,7 @@ long do_hvm_op(unsigned long op, XEN_GUE nt = p2m_change_type(d, pfn, t, memtype[a.hvmmem_type]); if ( nt != t ) { + drop_p2m_gfn_domain(d, pfn, mfn_x(mfn)); gdprintk(XENLOG_WARNING, "type of pfn 0x%lx changed from %d to %d while " "we were trying to change it to %d\n", @@ -3810,6 +3893,7 @@ long do_hvm_op(unsigned long op, XEN_GUE goto param_fail4; } } + drop_p2m_gfn_domain(d, pfn, mfn_x(mfn)); } rc = 0; diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/mtrr.c --- a/xen/arch/x86/hvm/mtrr.c +++ b/xen/arch/x86/hvm/mtrr.c @@ -389,7 +389,7 @@ uint32_t get_pat_flags(struct vcpu *v, { struct domain *d = v->domain; p2m_type_t p2mt; - gfn_to_mfn_query(d, paddr_to_pfn(gpaddr), &p2mt); + gfn_to_mfn_query_unlocked(d, paddr_to_pfn(gpaddr), &p2mt); if (p2m_is_ram(p2mt)) gdprintk(XENLOG_WARNING, "Conflict occurs for a given guest l1e flags:%x " diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/nestedhvm.c --- a/xen/arch/x86/hvm/nestedhvm.c +++ b/xen/arch/x86/hvm/nestedhvm.c @@ -56,7 +56,7 @@ nestedhvm_vcpu_reset(struct vcpu *v) nv->nv_ioportED = 0; if (nv->nv_vvmcx) - hvm_unmap_guest_frame(nv->nv_vvmcx); + hvm_unmap_guest_frame(nv->nv_vvmcx, nv->nv_vvmcxaddr >> PAGE_SHIFT, 0); nv->nv_vvmcx = NULL; nv->nv_vvmcxaddr = VMCX_EADDR; nv->nv_flushp2m = 0; diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/stdvga.c --- a/xen/arch/x86/hvm/stdvga.c +++ b/xen/arch/x86/hvm/stdvga.c @@ -482,7 +482,7 @@ static int mmio_move(struct hvm_hw_stdvg if ( hvm_copy_to_guest_phys(data, &tmp, p->size) ! 
HVMCOPY_okay ) { - (void)gfn_to_mfn(d, data >> PAGE_SHIFT, &p2mt); + (void)gfn_to_mfn_unlocked(d, data >> PAGE_SHIFT, &p2mt); /* * The only case we handle is vga_mem <-> vga_mem. * Anything else disables caching and leaves it to qemu-dm. @@ -504,7 +504,7 @@ static int mmio_move(struct hvm_hw_stdvg if ( hvm_copy_from_guest_phys(&tmp, data, p->size) ! HVMCOPY_okay ) { - (void)gfn_to_mfn(d, data >> PAGE_SHIFT, &p2mt); + (void)gfn_to_mfn_unlocked(d, data >> PAGE_SHIFT, &p2mt); if ( (p2mt != p2m_mmio_dm) || (data < VGA_MEM_BASE) || ((data + p->size) > (VGA_MEM_BASE + VGA_MEM_SIZE)) ) return 0; diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/svm/nestedsvm.c --- a/xen/arch/x86/hvm/svm/nestedsvm.c +++ b/xen/arch/x86/hvm/svm/nestedsvm.c @@ -71,7 +71,7 @@ int nestedsvm_vmcb_map(struct vcpu *v, u if (nv->nv_vvmcx != NULL && nv->nv_vvmcxaddr != vmcbaddr) { ASSERT(nv->nv_vvmcx != NULL); ASSERT(nv->nv_vvmcxaddr != VMCX_EADDR); - hvm_unmap_guest_frame(nv->nv_vvmcx); + hvm_unmap_guest_frame(nv->nv_vvmcx, nv->nv_vvmcxaddr >> PAGE_SHIFT, 0); nv->nv_vvmcx = NULL; nv->nv_vvmcxaddr = VMCX_EADDR; } @@ -353,7 +353,7 @@ static int nsvm_vmrun_permissionmap(stru ASSERT(ns_viomap != NULL); ioport_80 = test_bit(0x80, ns_viomap); ioport_ed = test_bit(0xed, ns_viomap); - hvm_unmap_guest_frame(ns_viomap); + hvm_unmap_guest_frame(ns_viomap, svm->ns_iomap_pa >> PAGE_SHIFT, 0); svm->ns_iomap = nestedhvm_vcpu_iomap_get(ioport_80, ioport_ed); @@ -857,23 +857,25 @@ nsvm_vmcb_guest_intercepts_ioio(paddr_t ioio_info_t ioinfo; uint16_t port; bool_t enabled; + unsigned long gfn = 0; /* gcc ... */ ioinfo.bytes = exitinfo1; port = ioinfo.fields.port; switch (port) { case 0 ... 32767: /* first 4KB page */ - io_bitmap = hvm_map_guest_frame_ro(iopm_gfn); + gfn = iopm_gfn; break; case 32768 ... 
65535: /* second 4KB page */ port -= 32768; - io_bitmap = hvm_map_guest_frame_ro(iopm_gfn+1); + gfn = iopm_gfn + 1; break; default: BUG(); break; } + io_bitmap = hvm_map_guest_frame_ro(gfn); if (io_bitmap == NULL) { gdprintk(XENLOG_ERR, "IOIO intercept: mapping of permission map failed\n"); @@ -881,7 +883,7 @@ nsvm_vmcb_guest_intercepts_ioio(paddr_t } enabled = test_bit(port, io_bitmap); - hvm_unmap_guest_frame(io_bitmap); + hvm_unmap_guest_frame(io_bitmap, gfn, 0); if (!enabled) return NESTEDHVM_VMEXIT_HOST; diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/svm/svm.c --- a/xen/arch/x86/hvm/svm/svm.c +++ b/xen/arch/x86/hvm/svm/svm.c @@ -247,6 +247,8 @@ static int svm_vmcb_restore(struct vcpu mfn = mfn_x(gfn_to_mfn(v->domain, c->cr3 >> PAGE_SHIFT, &p2mt)); if ( !p2m_is_ram(p2mt) || !get_page(mfn_to_page(mfn), v->domain) ) { + drop_p2m_gfn_domain(v->domain, + c->cr3 >> PAGE_SHIFT, mfn); gdprintk(XENLOG_ERR, "Invalid CR3 value=0x%"PRIx64"\n", c->cr3); return -EINVAL; @@ -257,6 +259,10 @@ static int svm_vmcb_restore(struct vcpu put_page(pagetable_get_page(v->arch.guest_table)); v->arch.guest_table = pagetable_from_pfn(mfn); + if ( c->cr0 & X86_CR0_PG ) + { + drop_p2m_gfn_domain(v->domain, c->cr3 >> PAGE_SHIFT, mfn); + } } v->arch.hvm_vcpu.guest_cr[0] = c->cr0 | X86_CR0_ET; @@ -1160,7 +1166,9 @@ static void svm_do_nested_pgfault(struct p2m = p2m_get_p2m(v); _d.gpa = gpa; _d.qualification = 0; - _d.mfn = mfn_x(gfn_to_mfn_type_p2m(p2m, gfn, &_d.p2mt, &p2ma, p2m_query, NULL)); + mfn = gfn_to_mfn_type_p2m(p2m, gfn, &_d.p2mt, &p2ma, p2m_query, NULL); + _d.mfn = mfn_x(mfn); + drop_p2m_gfn(p2m, gfn, mfn_x(mfn)); __trace_var(TRC_HVM_NPF, 0, sizeof(_d), &_d); } @@ -1184,6 +1192,7 @@ static void svm_do_nested_pgfault(struct gdprintk(XENLOG_ERR, "SVM violation gpa %#"PRIpaddr", mfn %#lx, type %i\n", gpa, mfn_x(mfn), p2mt); + drop_p2m_gfn(p2m, gfn, mfn_x(mfn)); domain_crash(v->domain); } diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/viridian.c --- a/xen/arch/x86/hvm/viridian.c +++ b/xen/arch/x86/hvm/viridian.c @@ -140,6 +140,7 @@ static void enable_hypercall_page(struct if ( !mfn_valid(mfn) || !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) ) { + drop_p2m_gfn_domain(d, gmfn, mfn); gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn, mfn); return; } @@ -162,6 +163,7 @@ static void enable_hypercall_page(struct unmap_domain_page(p); put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(d, gmfn, mfn); } void initialize_apic_assist(struct vcpu *v) @@ -184,6 +186,7 @@ void initialize_apic_assist(struct vcpu if ( !mfn_valid(mfn) || !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) ) { + drop_p2m_gfn_domain(d, gmfn, mfn); gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx)\n", gmfn, mfn); return; } @@ -195,6 +198,7 @@ void initialize_apic_assist(struct vcpu unmap_domain_page(p); put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(d, gmfn, mfn); } int wrmsr_viridian_regs(uint32_t idx, uint64_t val) diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/vmx/vmx.c --- a/xen/arch/x86/hvm/vmx/vmx.c +++ b/xen/arch/x86/hvm/vmx/vmx.c @@ -490,6 +490,7 @@ static int vmx_restore_cr0_cr3( mfn = mfn_x(gfn_to_mfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt)); if ( !p2m_is_ram(p2mt) || !get_page(mfn_to_page(mfn), v->domain) ) { + drop_p2m_gfn_domain(v->domain, cr3 >> PAGE_SHIFT, mfn); gdprintk(XENLOG_ERR, "Invalid CR3 value=0x%lx\n", cr3); return -EINVAL; } @@ -499,6 +500,10 @@ static int vmx_restore_cr0_cr3( put_page(pagetable_get_page(v->arch.guest_table)); v->arch.guest_table = 
pagetable_from_pfn(mfn); + if ( cr0 & X86_CR0_PG ) + { + drop_p2m_gfn_domain(v->domain, cr3 >> PAGE_SHIFT, mfn); + } } v->arch.hvm_vcpu.guest_cr[0] = cr0 | X86_CR0_ET; @@ -1009,7 +1014,10 @@ static void vmx_load_pdptrs(struct vcpu mfn = mfn_x(gfn_to_mfn(v->domain, cr3 >> PAGE_SHIFT, &p2mt)); if ( !p2m_is_ram(p2mt) ) + { + drop_p2m_gfn_domain(v->domain, cr3 >> PAGE_SHIFT, mfn); goto crash; + } p = map_domain_page(mfn); @@ -1037,6 +1045,7 @@ static void vmx_load_pdptrs(struct vcpu vmx_vmcs_exit(v); unmap_domain_page(p); + drop_p2m_gfn_domain(v->domain, cr3 >> PAGE_SHIFT, mfn); return; crash: @@ -2088,7 +2097,7 @@ static void ept_handle_violation(unsigne _d.gpa = gpa; _d.qualification = qualification; - _d.mfn = mfn_x(gfn_to_mfn_query(d, gfn, &_d.p2mt)); + _d.mfn = mfn_x(gfn_to_mfn_query_unlocked(d, gfn, &_d.p2mt)); __trace_var(TRC_HVM_NPF, 0, sizeof(_d), &_d); } @@ -2104,7 +2113,7 @@ static void ept_handle_violation(unsigne return; /* Everything else is an error. */ - mfn = gfn_to_mfn_guest(d, gfn, &p2mt); + mfn = gfn_to_mfn_guest_unlocked(d, gfn, &p2mt); gdprintk(XENLOG_ERR, "EPT violation %#lx (%c%c%c/%c%c%c), " "gpa %#"PRIpaddr", mfn %#lx, type %i.\n", qualification, diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/hvm/vmx/vvmx.c --- a/xen/arch/x86/hvm/vmx/vvmx.c +++ b/xen/arch/x86/hvm/vmx/vvmx.c @@ -558,8 +558,10 @@ static void __map_io_bitmap(struct vcpu index = vmcs_reg == IO_BITMAP_A ? 0 : 1; if (nvmx->iobitmap[index]) - hvm_unmap_guest_frame (nvmx->iobitmap[index]); + hvm_unmap_guest_frame (nvmx->iobitmap[index], + nvmx->iobitmap_gfn[index], 0); gpa = __get_vvmcs(vcpu_nestedhvm(v).nv_vvmcx, vmcs_reg); + nvmx->iobitmap_gfn[index] = gpa >> PAGE_SHIFT; nvmx->iobitmap[index] = hvm_map_guest_frame_ro (gpa >> PAGE_SHIFT); } @@ -577,13 +579,14 @@ static void nvmx_purge_vvmcs(struct vcpu __clear_current_vvmcs(v); if ( nvcpu->nv_vvmcxaddr != VMCX_EADDR ) - hvm_unmap_guest_frame (nvcpu->nv_vvmcx); + hvm_unmap_guest_frame (nvcpu->nv_vvmcx, nvcpu->nv_vvmcxaddr >> PAGE_SHIFT, 0); nvcpu->nv_vvmcx == NULL; nvcpu->nv_vvmcxaddr = VMCX_EADDR; for (i=0; i<2; i++) { if ( nvmx->iobitmap[i] ) { - hvm_unmap_guest_frame (nvmx->iobitmap[i]); + hvm_unmap_guest_frame (nvmx->iobitmap[i], nvmx->iobitmap_gfn[i], 0); nvmx->iobitmap[i] = NULL; + nvmx->iobitmap_gfn[i] = 0; } } } @@ -1198,7 +1201,7 @@ int nvmx_handle_vmclear(struct cpu_user_ vvmcs = hvm_map_guest_frame_rw(gpa >> PAGE_SHIFT); if ( vvmcs ) __set_vvmcs(vvmcs, NVMX_LAUNCH_STATE, 0); - hvm_unmap_guest_frame(vvmcs); + hvm_unmap_guest_frame(vvmcs, gpa >> PAGE_SHIFT, 0); } vmreturn(regs, VMSUCCEED); diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/mm.c --- a/xen/arch/x86/mm.c +++ b/xen/arch/x86/mm.c @@ -665,11 +665,17 @@ int map_ldt_shadow_page(unsigned int off gmfn = l1e_get_pfn(l1e); mfn = gmfn_to_mfn(d, gmfn); if ( unlikely(!mfn_valid(mfn)) ) + { + drop_p2m_gfn_domain(d, gmfn, mfn); return 0; + } okay = get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page); if ( unlikely(!okay) ) + { + drop_p2m_gfn_domain(d, gmfn, mfn); return 0; + } nl1e = l1e_from_pfn(mfn, l1e_get_flags(l1e) | _PAGE_RW); @@ -678,6 +684,7 @@ int map_ldt_shadow_page(unsigned int off v->arch.pv_vcpu.shadow_ldt_mapcnt++; spin_unlock(&v->arch.pv_vcpu.shadow_ldt_lock); + drop_p2m_gfn_domain(d, gmfn, mfn); return 1; } @@ -1796,7 +1803,6 @@ static int mod_l1_entry(l1_pgentry_t *pl { l1_pgentry_t ol1e; struct domain *pt_dom = pt_vcpu->domain; - unsigned long mfn; p2m_type_t p2mt; int rc = 0; @@ -1813,9 +1819,14 @@ static int mod_l1_entry(l1_pgentry_t *pl if ( l1e_get_flags(nl1e) & 
_PAGE_PRESENT ) { /* Translate foreign guest addresses. */ - mfn = mfn_x(gfn_to_mfn(pg_dom, l1e_get_pfn(nl1e), &p2mt)); + unsigned long mfn, gfn; + gfn = l1e_get_pfn(nl1e); + mfn = mfn_x(gfn_to_mfn(pg_dom, gfn, &p2mt)); if ( !p2m_is_ram(p2mt) || unlikely(mfn == INVALID_MFN) ) + { + drop_p2m_gfn_domain(pg_dom, gfn, mfn); return -EINVAL; + } ASSERT((mfn & ~(PADDR_MASK >> PAGE_SHIFT)) == 0); nl1e = l1e_from_pfn(mfn, l1e_get_flags(nl1e)); @@ -1823,6 +1834,7 @@ static int mod_l1_entry(l1_pgentry_t *pl { MEM_LOG("Bad L1 flags %x", l1e_get_flags(nl1e) & l1_disallow_mask(pt_dom)); + drop_p2m_gfn_domain(pg_dom, gfn, mfn); return -EINVAL; } @@ -1833,12 +1845,14 @@ static int mod_l1_entry(l1_pgentry_t *pl if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu, preserve_ad) ) return 0; + drop_p2m_gfn_domain(pg_dom, gfn, mfn); return -EBUSY; } switch ( rc = get_page_from_l1e(nl1e, pt_dom, pg_dom) ) { default: + drop_p2m_gfn_domain(pg_dom, gfn, mfn); return rc; case 0: break; @@ -1854,6 +1868,7 @@ static int mod_l1_entry(l1_pgentry_t *pl ol1e = nl1e; rc = -EBUSY; } + drop_p2m_gfn_domain(pg_dom, gfn, mfn); } else if ( unlikely(!UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, gl1mfn, pt_vcpu, preserve_ad)) ) @@ -3030,6 +3045,7 @@ int do_mmuext_op( rc = -EAGAIN; else if ( rc != -EAGAIN ) MEM_LOG("Error while pinning mfn %lx", mfn); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); break; } @@ -3038,6 +3054,7 @@ int do_mmuext_op( if ( (rc = xsm_memory_pin_page(d, page)) != 0 ) { put_page_and_type(page); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); okay = 0; break; } @@ -3047,6 +3064,7 @@ int do_mmuext_op( { MEM_LOG("Mfn %lx already pinned", mfn); put_page_and_type(page); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); okay = 0; break; } @@ -3065,6 +3083,7 @@ int do_mmuext_op( spin_unlock(&pg_owner->page_alloc_lock); if ( drop_ref ) put_page_and_type(page); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); } break; @@ -3080,6 +3099,7 @@ int do_mmuext_op( mfn = gmfn_to_mfn(pg_owner, op.arg1.mfn); if ( unlikely(!(okay = get_page_from_pagenr(mfn, pg_owner))) ) { + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); MEM_LOG("Mfn %lx bad domain", mfn); break; } @@ -3090,6 +3110,7 @@ int do_mmuext_op( { okay = 0; put_page(page); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); MEM_LOG("Mfn %lx not pinned", mfn); break; } @@ -3100,12 +3121,16 @@ int do_mmuext_op( /* A page is dirtied when its pin status is cleared. 
*/ paging_mark_dirty(pg_owner, mfn); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); break; } - case MMUEXT_NEW_BASEPTR: - okay = new_guest_cr3(gmfn_to_mfn(d, op.arg1.mfn)); + case MMUEXT_NEW_BASEPTR: { + unsigned long mfn = gmfn_to_mfn(d, op.arg1.mfn); + okay = new_guest_cr3(mfn); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); break; + } #ifdef __x86_64__ case MMUEXT_NEW_USER_BASEPTR: { @@ -3121,6 +3146,7 @@ int do_mmuext_op( mfn, PGT_root_page_table, d, 0, 0); if ( unlikely(!okay) ) { + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); MEM_LOG("Error while installing new mfn %lx", mfn); break; } @@ -3128,6 +3154,7 @@ int do_mmuext_op( old_mfn = pagetable_get_pfn(curr->arch.guest_table_user); curr->arch.guest_table_user = pagetable_from_pfn(mfn); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); if ( old_mfn != 0 ) { @@ -3249,6 +3276,7 @@ int do_mmuext_op( mfn, PGT_writable_page, d, 0, 0); if ( unlikely(!okay) ) { + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); MEM_LOG("Error while clearing mfn %lx", mfn); break; } @@ -3261,6 +3289,7 @@ int do_mmuext_op( fixunmap_domain_page(ptr); put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); break; } @@ -3274,6 +3303,8 @@ int do_mmuext_op( okay = get_page_from_pagenr(src_mfn, d); if ( unlikely(!okay) ) { + drop_p2m_gfn_domain(pg_owner, + op.arg2.src_mfn, src_mfn); MEM_LOG("Error while copying from mfn %lx", src_mfn); break; } @@ -3283,7 +3314,10 @@ int do_mmuext_op( mfn, PGT_writable_page, d, 0, 0); if ( unlikely(!okay) ) { + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); put_page(mfn_to_page(src_mfn)); + drop_p2m_gfn_domain(pg_owner, + op.arg2.src_mfn, src_mfn); MEM_LOG("Error while copying to mfn %lx", mfn); break; } @@ -3297,8 +3331,11 @@ int do_mmuext_op( fixunmap_domain_page(dst); unmap_domain_page(src); + drop_p2m_gfn_domain(pg_owner, op.arg1.mfn, mfn); put_page_and_type(mfn_to_page(mfn)); put_page(mfn_to_page(src_mfn)); + drop_p2m_gfn_domain(pg_owner, + op.arg2.src_mfn, src_mfn); break; } @@ -3488,12 +3525,18 @@ int do_mmu_update( gmfn = req.ptr >> PAGE_SHIFT; mfn = mfn_x(gfn_to_mfn(pt_owner, gmfn, &p2mt)); if ( !p2m_is_valid(p2mt) ) + { + /* In the odd case we ever got a valid mfn with an invalid type, + * we drop the ref obtained in the p2m lookup */ + if (mfn != INVALID_MFN) + put_page(mfn_to_page(mfn)); mfn = INVALID_MFN; + } if ( p2m_is_paged(p2mt) ) { p2m_mem_paging_populate(pg_owner, gmfn); - + drop_p2m_gfn_domain(pt_owner, gmfn, mfn); rc = -ENOENT; break; } @@ -3501,6 +3544,7 @@ int do_mmu_update( if ( unlikely(!get_page_from_pagenr(mfn, pt_owner)) ) { MEM_LOG("Could not get page for normal update"); + drop_p2m_gfn_domain(pt_owner, gmfn, mfn); break; } @@ -3511,6 +3555,7 @@ int do_mmu_update( rc = xsm_mmu_normal_update(d, req.val, page); if ( rc ) { + drop_p2m_gfn_domain(pt_owner, gmfn, mfn); unmap_domain_page_with_cache(va, &mapcache); put_page(page); break; @@ -3524,16 +3569,20 @@ int do_mmu_update( { l1_pgentry_t l1e = l1e_from_intpte(req.val); p2m_type_t l1e_p2mt; - gfn_to_mfn(pg_owner, l1e_get_pfn(l1e), &l1e_p2mt); + unsigned long l1egfn = l1e_get_pfn(l1e), l1emfn; + + l1emfn = mfn_x(gfn_to_mfn(pg_owner, l1egfn, &l1e_p2mt)); if ( p2m_is_paged(l1e_p2mt) ) { p2m_mem_paging_populate(pg_owner, l1e_get_pfn(l1e)); + drop_p2m_gfn_domain(pg_owner, l1egfn, l1emfn); rc = -ENOENT; break; } else if ( p2m_ram_paging_in_start == l1e_p2mt && !mfn_valid(mfn) ) { + drop_p2m_gfn_domain(pg_owner, l1egfn, l1emfn); rc = -ENOENT; break; } @@ -3550,7 +3599,10 @@ int do_mmu_update( l1e_get_pfn(l1e), 0); 
if ( rc ) + { + drop_p2m_gfn_domain(pg_owner, l1egfn, l1emfn); break; + } } } #endif @@ -3558,27 +3610,33 @@ int do_mmu_update( rc = mod_l1_entry(va, l1e, mfn, cmd == MMU_PT_UPDATE_PRESERVE_AD, v, pg_owner); + drop_p2m_gfn_domain(pg_owner, l1egfn, l1emfn); } break; case PGT_l2_page_table: { l2_pgentry_t l2e = l2e_from_intpte(req.val); p2m_type_t l2e_p2mt; - gfn_to_mfn(pg_owner, l2e_get_pfn(l2e), &l2e_p2mt); + unsigned long l2egfn = l2e_get_pfn(l2e), l2emfn; + + l2emfn = mfn_x(gfn_to_mfn(pg_owner, l2egfn, &l2e_p2mt)); if ( p2m_is_paged(l2e_p2mt) ) { + drop_p2m_gfn_domain(pg_owner, l2egfn, l2emfn); p2m_mem_paging_populate(pg_owner, l2e_get_pfn(l2e)); rc = -ENOENT; break; } else if ( p2m_ram_paging_in_start == l2e_p2mt && !mfn_valid(mfn) ) { + drop_p2m_gfn_domain(pg_owner, l2egfn, l2emfn); rc = -ENOENT; break; } else if ( p2m_ram_shared == l2e_p2mt ) { + drop_p2m_gfn_domain(pg_owner, l2egfn, l2emfn); MEM_LOG("Unexpected attempt to map shared page.\n"); break; } @@ -3586,33 +3644,40 @@ int do_mmu_update( rc = mod_l2_entry(va, l2e, mfn, cmd == MMU_PT_UPDATE_PRESERVE_AD, v); + drop_p2m_gfn_domain(pg_owner, l2egfn, l2emfn); } break; case PGT_l3_page_table: { l3_pgentry_t l3e = l3e_from_intpte(req.val); p2m_type_t l3e_p2mt; - gfn_to_mfn(pg_owner, l3e_get_pfn(l3e), &l3e_p2mt); + unsigned long l3egfn = l3e_get_pfn(l3e), l3emfn; + + l3emfn = mfn_x(gfn_to_mfn(pg_owner, l3egfn, &l3e_p2mt)); if ( p2m_is_paged(l3e_p2mt) ) { p2m_mem_paging_populate(pg_owner, l3e_get_pfn(l3e)); + drop_p2m_gfn_domain(pg_owner, l3egfn, l3emfn); rc = -ENOENT; break; } else if ( p2m_ram_paging_in_start == l3e_p2mt && !mfn_valid(mfn) ) { + drop_p2m_gfn_domain(pg_owner, l3egfn, l3emfn); rc = -ENOENT; break; } else if ( p2m_ram_shared == l3e_p2mt ) { + drop_p2m_gfn_domain(pg_owner, l3egfn, l3emfn); MEM_LOG("Unexpected attempt to map shared page.\n"); break; } rc = mod_l3_entry(va, l3e, mfn, cmd == MMU_PT_UPDATE_PRESERVE_AD, 1, v); + drop_p2m_gfn_domain(pg_owner, l3egfn, l3emfn); } break; #if CONFIG_PAGING_LEVELS >= 4 @@ -3620,27 +3685,33 @@ int do_mmu_update( { l4_pgentry_t l4e = l4e_from_intpte(req.val); p2m_type_t l4e_p2mt; - gfn_to_mfn(pg_owner, l4e_get_pfn(l4e), &l4e_p2mt); + unsigned long l4egfn = l4e_get_pfn(l4e), l4emfn; + + l4emfn = mfn_x(gfn_to_mfn(pg_owner, l4egfn, &l4e_p2mt)); if ( p2m_is_paged(l4e_p2mt) ) { p2m_mem_paging_populate(pg_owner, l4e_get_pfn(l4e)); + drop_p2m_gfn_domain(pg_owner, l4egfn, l4emfn); rc = -ENOENT; break; } else if ( p2m_ram_paging_in_start == l4e_p2mt && !mfn_valid(mfn) ) { + drop_p2m_gfn_domain(pg_owner, l4egfn, l4emfn); rc = -ENOENT; break; } else if ( p2m_ram_shared == l4e_p2mt ) { + drop_p2m_gfn_domain(pg_owner, l4egfn, l4emfn); MEM_LOG("Unexpected attempt to map shared page.\n"); break; } rc = mod_l4_entry(va, l4e, mfn, cmd == MMU_PT_UPDATE_PRESERVE_AD, 1, v); + drop_p2m_gfn_domain(pg_owner, l4egfn, l4emfn); } break; #endif @@ -3662,6 +3733,7 @@ int do_mmu_update( put_page_type(page); } + drop_p2m_gfn_domain(pt_owner, gmfn, mfn); unmap_domain_page_with_cache(va, &mapcache); put_page(page); } @@ -3754,6 +3826,7 @@ static int create_grant_pte_mapping( if ( unlikely(!get_page_from_pagenr(mfn, current->domain)) ) { + drop_p2m_gfn_domain(d, gmfn, mfn); MEM_LOG("Could not get page for normal update"); return GNTST_general_error; } @@ -3790,6 +3863,7 @@ static int create_grant_pte_mapping( failed: unmap_domain_page(va); + drop_p2m_gfn_domain(d, gmfn, mfn); put_page(page); return rc; @@ -3809,6 +3883,7 @@ static int destroy_grant_pte_mapping( if ( unlikely(!get_page_from_pagenr(mfn, 
current->domain)) ) { + drop_p2m_gfn_domain(d, gmfn, mfn); MEM_LOG("Could not get page for normal update"); return GNTST_general_error; } @@ -3860,6 +3935,7 @@ static int destroy_grant_pte_mapping( failed: unmap_domain_page(va); put_page(page); + drop_p2m_gfn_domain(d, gmfn, mfn); return rc; } @@ -4051,7 +4127,7 @@ static int replace_grant_p2m_mapping( if ( new_addr != 0 || (flags & GNTMAP_contains_pte) ) return GNTST_general_error; - old_mfn = gfn_to_mfn(d, gfn, &type); + old_mfn = gfn_to_mfn_unlocked(d, gfn, &type); if ( !p2m_is_grant(type) || mfn_x(old_mfn) != frame ) { gdprintk(XENLOG_WARNING, @@ -4441,14 +4517,19 @@ long set_gdt(struct vcpu *v, struct domain *d = v->domain; /* NB. There are 512 8-byte entries per GDT page. */ int i, nr_pages = (entries + 511) / 512; - unsigned long mfn; + unsigned long mfn, *pfns; if ( entries > FIRST_RESERVED_GDT_ENTRY ) return -EINVAL; + pfns = xmalloc_array(unsigned long, nr_pages); + if ( !pfns ) + return -ENOMEM; + /* Check the pages in the new GDT. */ for ( i = 0; i < nr_pages; i++ ) { + pfns[i] = frames[i]; mfn = frames[i] = gmfn_to_mfn(d, frames[i]); if ( !mfn_valid(mfn) || !get_page_and_type(mfn_to_page(mfn), d, PGT_seg_desc_page) ) @@ -4465,13 +4546,19 @@ long set_gdt(struct vcpu *v, v->arch.pv_vcpu.gdt_frames[i] = frames[i]; l1e_write(&v->arch.perdomain_ptes[i], l1e_from_pfn(frames[i], __PAGE_HYPERVISOR)); + drop_p2m_gfn_domain(d, pfns[i], frames[i]); } + xfree(pfns); return 0; fail: while ( i-- > 0 ) + { put_page_and_type(mfn_to_page(frames[i])); + drop_p2m_gfn_domain(d, pfns[i], frames[i]); + } + xfree(pfns); return -EINVAL; } @@ -4519,11 +4606,17 @@ long do_update_descriptor(u64 pa, u64 de if ( (((unsigned int)pa % sizeof(struct desc_struct)) != 0) || !mfn_valid(mfn) || !check_descriptor(dom, &d) ) + { + drop_p2m_gfn_domain(dom, gmfn, mfn); return -EINVAL; + } page = mfn_to_page(mfn); if ( unlikely(!get_page(page, dom)) ) + { + drop_p2m_gfn_domain(dom, gmfn, mfn); return -EINVAL; + } /* Check if the given frame is in use in an unsafe context. */ switch ( page->u.inuse.type_info & PGT_type_mask ) @@ -4551,6 +4644,7 @@ long do_update_descriptor(u64 pa, u64 de out: put_page(page); + drop_p2m_gfn_domain(dom, gmfn, mfn); return ret; } @@ -4592,6 +4686,7 @@ static int handle_iomem_range(unsigned l long arch_memory_op(int op, XEN_GUEST_HANDLE(void) arg) { struct page_info *page = NULL; + unsigned long gfn = 0; /* gcc ... */ int rc; switch ( op ) @@ -4649,11 +4744,13 @@ long arch_memory_op(int op, XEN_GUEST_HA case XENMAPSPACE_gmfn: { p2m_type_t p2mt; + gfn = xatp.idx; xatp.idx = mfn_x(gfn_to_mfn_unshare(d, xatp.idx, &p2mt)); /* If the page is still shared, exit early */ if ( p2m_is_shared(p2mt) ) { + drop_p2m_gfn_domain(d, gfn, xatp.idx); rcu_unlock_domain(d); return -ENOMEM; } @@ -4671,6 +4768,8 @@ long arch_memory_op(int op, XEN_GUEST_HA { if ( page ) put_page(page); + if ( xatp.space == XENMAPSPACE_gmfn ) + drop_p2m_gfn_domain(d, gfn, mfn); rcu_unlock_domain(d); return -EINVAL; } @@ -4691,6 +4790,8 @@ long arch_memory_op(int op, XEN_GUEST_HA /* Normal domain memory is freed, to avoid leaking memory. */ guest_remove_page(d, xatp.gpfn); } + /* In the XENMAPSPACE_gmfn case we still hold a ref on the old page. */ + drop_p2m_gfn_domain(d, xatp.gpfn, prev_mfn); /* Unmap from old location, if any. */ gpfn = get_gpfn_from_mfn(mfn); @@ -4701,6 +4802,9 @@ long arch_memory_op(int op, XEN_GUEST_HA /* Map at new location. 
*/ rc = guest_physmap_add_page(d, xatp.gpfn, mfn, 0); + /* In the XENMAPSPACE_gmfn, we took a ref and locked the p2m at the top */ + if ( xatp.space == XENMAPSPACE_gmfn ) + drop_p2m_gfn_domain(d, gfn, mfn); domain_unlock(d); rcu_unlock_domain(d); diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/mm/guest_walk.c --- a/xen/arch/x86/mm/guest_walk.c +++ b/xen/arch/x86/mm/guest_walk.c @@ -86,6 +86,8 @@ static uint32_t set_ad_bits(void *guest_ return 0; } +/* We leave this function with a lock on the p2m and a ref on the + * mapped page. Regardless of the map, you need to call drop_p2m_gfn. */ static inline void *map_domain_gfn(struct p2m_domain *p2m, gfn_t gfn, mfn_t *mfn, @@ -120,6 +122,9 @@ static inline void *map_domain_gfn(struc /* Walk the guest pagetables, after the manner of a hardware walker. */ +/* Because the walk is essentially random, it can cause a deadlock + * warning in the p2m locking code. Highly unlikely this is an actual + * deadlock, because who would walk page table in the opposite order? */ uint32_t guest_walk_tables(struct vcpu *v, struct p2m_domain *p2m, unsigned long va, walk_t *gw, @@ -348,11 +353,17 @@ set_ad: out: #if GUEST_PAGING_LEVELS == 4 if ( l3p ) unmap_domain_page(l3p); + drop_p2m_gfn(p2m, gfn_x(guest_l4e_get_gfn(gw->l4e)), + mfn_x(gw->l3mfn)); #endif #if GUEST_PAGING_LEVELS >= 3 if ( l2p ) unmap_domain_page(l2p); + drop_p2m_gfn(p2m, gfn_x(guest_l3e_get_gfn(gw->l3e)), + mfn_x(gw->l2mfn)); #endif if ( l1p ) unmap_domain_page(l1p); + drop_p2m_gfn(p2m, gfn_x(guest_l2e_get_gfn(gw->l2e)), + mfn_x(gw->l1mfn)); return rc; } diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/mm/hap/guest_walk.c --- a/xen/arch/x86/mm/hap/guest_walk.c +++ b/xen/arch/x86/mm/hap/guest_walk.c @@ -56,9 +56,11 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA p2m_type_t p2mt; p2m_access_t p2ma; walk_t gw; + unsigned long top_gfn; /* Get the top-level table''s MFN */ - top_mfn = gfn_to_mfn_type_p2m(p2m, cr3 >> PAGE_SHIFT, + top_gfn = cr3 >> PAGE_SHIFT; + top_mfn = gfn_to_mfn_type_p2m(p2m, top_gfn, &p2mt, &p2ma, p2m_unshare, NULL); if ( p2m_is_paging(p2mt) ) { @@ -66,16 +68,19 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA p2m_mem_paging_populate(p2m->domain, cr3 >> PAGE_SHIFT); pfec[0] = PFEC_page_paged; + drop_p2m_gfn(p2m, top_gfn, mfn_x(top_mfn)); return INVALID_GFN; } if ( p2m_is_shared(p2mt) ) { pfec[0] = PFEC_page_shared; + drop_p2m_gfn(p2m, top_gfn, mfn_x(top_mfn)); return INVALID_GFN; } if ( !p2m_is_ram(p2mt) ) { pfec[0] &= ~PFEC_page_present; + drop_p2m_gfn(p2m, top_gfn, mfn_x(top_mfn)); return INVALID_GFN; } @@ -87,26 +92,32 @@ unsigned long hap_p2m_ga_to_gfn(GUEST_PA #endif missing = guest_walk_tables(v, p2m, ga, &gw, pfec[0], top_mfn, top_map); unmap_domain_page(top_map); + drop_p2m_gfn(p2m, top_gfn, mfn_x(top_mfn)); /* Interpret the answer */ if ( missing == 0 ) { gfn_t gfn = guest_l1e_get_gfn(gw.l1e); - gfn_to_mfn_type_p2m(p2m, gfn_x(gfn), &p2mt, &p2ma, p2m_unshare, NULL); + mfn_t eff_l1_mfn = gfn_to_mfn_type_p2m(p2m, gfn_x(gfn), &p2mt, + &p2ma, p2m_unshare, NULL); if ( p2m_is_paging(p2mt) ) { ASSERT(!p2m_is_nestedp2m(p2m)); p2m_mem_paging_populate(p2m->domain, gfn_x(gfn)); pfec[0] = PFEC_page_paged; + drop_p2m_gfn(p2m, gfn_x(gfn), mfn_x(eff_l1_mfn)); return INVALID_GFN; } if ( p2m_is_shared(p2mt) ) { pfec[0] = PFEC_page_shared; + drop_p2m_gfn(p2m, gfn_x(gfn), mfn_x(eff_l1_mfn)); return INVALID_GFN; } + drop_p2m_gfn(p2m, gfn_x(gfn), mfn_x(eff_l1_mfn)); + if ( page_order ) *page_order = guest_walk_to_page_order(&gw); diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/mm/mem_event.c --- 
a/xen/arch/x86/mm/mem_event.c +++ b/xen/arch/x86/mm/mem_event.c @@ -47,7 +47,7 @@ static int mem_event_enable(struct domai unsigned long ring_addr = mec->ring_addr; unsigned long shared_addr = mec->shared_addr; l1_pgentry_t l1e; - unsigned long gfn; + unsigned long shared_gfn = 0, ring_gfn = 0; /* gcc ... */ p2m_type_t p2mt; mfn_t ring_mfn; mfn_t shared_mfn; @@ -60,23 +60,41 @@ static int mem_event_enable(struct domai /* Get MFN of ring page */ guest_get_eff_l1e(v, ring_addr, &l1e); - gfn = l1e_get_pfn(l1e); - ring_mfn = gfn_to_mfn(dom_mem_event, gfn, &p2mt); + ring_gfn = l1e_get_pfn(l1e); + /* We''re grabbing these two in an order that could deadlock + * dom0 if 1. it were an hvm 2. there were two concurrent + * enables 3. the two gfn''s in each enable criss-crossed + * 2MB regions. Duly noted.... */ + ring_mfn = gfn_to_mfn(dom_mem_event, ring_gfn, &p2mt); if ( unlikely(!mfn_valid(mfn_x(ring_mfn))) ) + { + drop_p2m_gfn_domain(dom_mem_event, + ring_gfn, mfn_x(ring_mfn)); return -EINVAL; + } /* Get MFN of shared page */ guest_get_eff_l1e(v, shared_addr, &l1e); - gfn = l1e_get_pfn(l1e); - shared_mfn = gfn_to_mfn(dom_mem_event, gfn, &p2mt); + shared_gfn = l1e_get_pfn(l1e); + shared_mfn = gfn_to_mfn(dom_mem_event, shared_gfn, &p2mt); if ( unlikely(!mfn_valid(mfn_x(shared_mfn))) ) + { + drop_p2m_gfn_domain(dom_mem_event, + ring_gfn, mfn_x(ring_mfn)); + drop_p2m_gfn_domain(dom_mem_event, + shared_gfn, mfn_x(shared_mfn)); return -EINVAL; + } /* Map ring and shared pages */ med->ring_page = map_domain_page(mfn_x(ring_mfn)); med->shared_page = map_domain_page(mfn_x(shared_mfn)); + drop_p2m_gfn_domain(dom_mem_event, ring_gfn, + mfn_x(ring_mfn)); + drop_p2m_gfn_domain(dom_mem_event, shared_gfn, + mfn_x(shared_mfn)); /* Allocate event channel */ rc = alloc_unbound_xen_event_channel(d->vcpu[0], diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/mm/mem_sharing.c --- a/xen/arch/x86/mm/mem_sharing.c +++ b/xen/arch/x86/mm/mem_sharing.c @@ -227,7 +227,7 @@ static void mem_sharing_audit(void) g->domain, g->gfn, mfn_x(e->mfn)); continue; } - mfn = gfn_to_mfn(d, g->gfn, &t); + mfn = gfn_to_mfn_unlocked(d, g->gfn, &t); if(mfn_x(mfn) != mfn_x(e->mfn)) MEM_SHARING_DEBUG("Incorrect P2M for d=%d, PFN=%lx." "Expecting MFN=%ld, got %ld\n", @@ -335,7 +335,7 @@ int mem_sharing_debug_gfn(struct domain p2m_type_t p2mt; mfn_t mfn; - mfn = gfn_to_mfn(d, gfn, &p2mt); + mfn = gfn_to_mfn_unlocked(d, gfn, &p2mt); printk("Debug for domain=%d, gfn=%lx, ", d->domain_id, @@ -524,6 +524,7 @@ int mem_sharing_nominate_page(struct dom ret = 0; out: + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); shr_unlock(); return ret; } @@ -593,14 +594,18 @@ int mem_sharing_unshare_page(struct doma shr_handle_t handle; struct list_head *le; + /* Remove the gfn_info from the list */ + + /* This is one of the reasons why we can''t enforce ordering + * between shr_lock and p2m fine-grained locks in mm-lock. + * Callers may walk in here already holding the lock for this gfn */ shr_lock(); mem_sharing_audit(); - - /* Remove the gfn_info from the list */ mfn = gfn_to_mfn(d, gfn, &p2mt); /* Has someone already unshared it? */ if (!p2m_is_shared(p2mt)) { + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); shr_unlock(); return 0; } @@ -634,6 +639,7 @@ gfn_found: /* Even though we don''t allocate a private page, we have to account * for the MFN that originally backed this PFN. 
*/ atomic_dec(&nr_saved_mfns); + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); shr_unlock(); put_page_and_type(page); if(last_gfn && @@ -653,6 +659,7 @@ gfn_found: /* We''ve failed to obtain memory for private page. Need to re-add the * gfn_info to relevant list */ list_add(&gfn_info->list, &hash_entry->gfns); + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); shr_unlock(); return -ENOMEM; } @@ -665,6 +672,13 @@ gfn_found: BUG_ON(set_shared_p2m_entry(d, gfn, page_to_mfn(page)) == 0); put_page_and_type(old_page); + /* After switching the p2m entry we still hold it locked, and + * we have a ref count to the old page (mfn). Drop the ref + * on the old page, and set mfn to invalid, so the refcount is + * no further decremented. We are the only cpu who knows about + * the new page, so we don''t need additional refs on it. */ + put_page(mfn_to_page(mfn)); + mfn = _mfn(INVALID_MFN); private_page_found: /* We''ve got a private page, we can commit the gfn destruction */ @@ -683,6 +697,7 @@ private_page_found: /* Update m2p entry */ set_gpfn_from_mfn(mfn_x(page_to_mfn(page)), gfn); + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); shr_unlock(); return 0; } diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/mm/shadow/common.c --- a/xen/arch/x86/mm/shadow/common.c +++ b/xen/arch/x86/mm/shadow/common.c @@ -3741,6 +3741,8 @@ int shadow_track_dirty_vram(struct domai } } + drop_p2m_gfn_domain(d, begin_pfn + i, mfn_x(mfn)); + if ( dirty ) { dirty_vram->dirty_bitmap[i / 8] |= 1 << (i % 8); @@ -3761,7 +3763,7 @@ int shadow_track_dirty_vram(struct domai /* was clean for more than two seconds, try to disable guest * write access */ for ( i = begin_pfn; i < end_pfn; i++ ) { - mfn_t mfn = gfn_to_mfn_query(d, i, &t); + mfn_t mfn = gfn_to_mfn_query_unlocked(d, i, &t); if (mfn_x(mfn) != INVALID_MFN) flush_tlb |= sh_remove_write_access(d->vcpu[0], mfn, 1, 0); } diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/mm/shadow/multi.c --- a/xen/arch/x86/mm/shadow/multi.c +++ b/xen/arch/x86/mm/shadow/multi.c @@ -2275,6 +2275,7 @@ static int validate_gl4e(struct vcpu *v, if ( mfn_valid(sl3mfn) ) shadow_resync_all(v); #endif + drop_p2m_gfn_domain(d, gfn_x(gl3gfn), mfn_x(gl3mfn)); } l4e_propagate_from_guest(v, new_gl4e, sl3mfn, &new_sl4e, ft_prefetch); @@ -2332,6 +2333,7 @@ static int validate_gl3e(struct vcpu *v, if ( mfn_valid(sl2mfn) ) shadow_resync_all(v); #endif + drop_p2m_gfn_domain(v->domain, gfn_x(gl2gfn), mfn_x(gl2mfn)); } l3e_propagate_from_guest(v, new_gl3e, sl2mfn, &new_sl3e, ft_prefetch); result |= shadow_set_l3e(v, sl3p, new_sl3e, sl3mfn); @@ -2376,6 +2378,7 @@ static int validate_gl2e(struct vcpu *v, sl1mfn = get_shadow_status(v, gl1mfn, SH_type_l1_shadow); else if ( p2mt != p2m_populate_on_demand ) result |= SHADOW_SET_ERROR; + drop_p2m_gfn_domain(v->domain, gfn_x(gl1gfn), mfn_x(gl1mfn)); } } l2e_propagate_from_guest(v, new_gl2e, sl1mfn, &new_sl2e, ft_prefetch); @@ -2463,6 +2466,7 @@ static int validate_gl1e(struct vcpu *v, } #endif /* OOS */ + drop_p2m_gfn_domain(v->domain, gfn_x(gfn), mfn_x(gmfn)); return result; } @@ -2505,6 +2509,7 @@ void sh_resync_l1(struct vcpu *v, mfn_t l1e_propagate_from_guest(v, gl1e, gmfn, &nsl1e, ft_prefetch, p2mt); rc |= shadow_set_l1e(v, sl1p, nsl1e, p2mt, sl1mfn); + drop_p2m_gfn_domain(v->domain, gfn_x(gfn), mfn_x(gmfn)); *snpl1p = gl1e; } }); @@ -2834,6 +2839,8 @@ static void sh_prefetch(struct vcpu *v, if ( snpl1p != NULL ) snpl1p[i] = gl1e; #endif /* OOS */ + + drop_p2m_gfn_domain(v->domain, gfn_x(gfn), mfn_x(gmfn)); } if ( gl1p != NULL ) sh_unmap_domain_page(gl1p); @@ -3192,6 +3199,7 @@ static 
int sh_page_fault(struct vcpu *v, SHADOW_PRINTK("BAD gfn=%"SH_PRI_gfn" gmfn=%"PRI_mfn"\n", gfn_x(gfn), mfn_x(gmfn)); reset_early_unshadow(v); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); goto propagate; } @@ -3236,6 +3244,7 @@ static int sh_page_fault(struct vcpu *v, if ( rc & GW_RMWR_REWALK ) { paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); goto rewalk; } #endif /* OOS */ @@ -3244,6 +3253,7 @@ static int sh_page_fault(struct vcpu *v, { perfc_incr(shadow_inconsistent_gwalk); paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); goto rewalk; } @@ -3270,6 +3280,7 @@ static int sh_page_fault(struct vcpu *v, ASSERT(d->is_shutting_down); #endif paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); trace_shadow_gen(TRC_SHADOW_DOMF_DYING, va); return 0; } @@ -3287,6 +3298,7 @@ static int sh_page_fault(struct vcpu *v, * failed. We cannot safely continue since some page is still * OOS but not in the hash table anymore. */ paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); return 0; } @@ -3296,6 +3308,7 @@ static int sh_page_fault(struct vcpu *v, { perfc_incr(shadow_inconsistent_gwalk); paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); goto rewalk; } #endif /* OOS */ @@ -3389,6 +3402,7 @@ static int sh_page_fault(struct vcpu *v, SHADOW_PRINTK("fixed\n"); shadow_audit_tables(v); paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); return EXCRET_fault_fixed; emulate: @@ -3457,6 +3471,7 @@ static int sh_page_fault(struct vcpu *v, sh_audit_gw(v, &gw); shadow_audit_tables(v); paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); this_cpu(trace_emulate_write_val) = 0; @@ -3595,6 +3610,7 @@ static int sh_page_fault(struct vcpu *v, shadow_audit_tables(v); reset_early_unshadow(v); paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); trace_shadow_gen(TRC_SHADOW_MMIO, va); return (handle_mmio_with_translation(va, gpa >> PAGE_SHIFT) ? 
EXCRET_fault_fixed : 0); @@ -3605,6 +3621,7 @@ static int sh_page_fault(struct vcpu *v, shadow_audit_tables(v); reset_early_unshadow(v); paging_unlock(d); + drop_p2m_gfn_domain(d, gfn_x(gfn), mfn_x(gmfn)); propagate: trace_not_shadow_fault(gw.l1e, va); @@ -4292,7 +4309,7 @@ sh_update_cr3(struct vcpu *v, int do_loc if ( guest_l3e_get_flags(gl3e[i]) & _PAGE_PRESENT ) { gl2gfn = guest_l3e_get_gfn(gl3e[i]); - gl2mfn = gfn_to_mfn_query(d, gl2gfn, &p2mt); + gl2mfn = gfn_to_mfn_query_unlocked(d, gfn_x(gl2gfn), &p2mt); if ( p2m_is_ram(p2mt) ) flush |= sh_remove_write_access(v, gl2mfn, 2, 0); } @@ -4312,6 +4329,8 @@ sh_update_cr3(struct vcpu *v, int do_loc : SH_type_l2_shadow); else sh_set_toplevel_shadow(v, i, _mfn(INVALID_MFN), 0); + drop_p2m_gfn_domain(d, gfn_x(gl2gfn), + mfn_x(gl2mfn)); } else sh_set_toplevel_shadow(v, i, _mfn(INVALID_MFN), 0); @@ -4689,11 +4708,12 @@ static void sh_pagetable_dying(struct vc int flush = 0; int fast_path = 0; paddr_t gcr3 = 0; - mfn_t smfn, gmfn; p2m_type_t p2mt; char *gl3pa = NULL; guest_l3e_t *gl3e = NULL; paddr_t gl2a = 0; + unsigned long l3gfn; + mfn_t l3mfn; paging_lock(v->domain); @@ -4702,8 +4722,9 @@ static void sh_pagetable_dying(struct vc if ( gcr3 == gpa ) fast_path = 1; - gmfn = gfn_to_mfn_query(v->domain, _gfn(gpa >> PAGE_SHIFT), &p2mt); - if ( !mfn_valid(gmfn) || !p2m_is_ram(p2mt) ) + l3gfn = gpa >> PAGE_SHIFT; + l3mfn = gfn_to_mfn_query(v->domain, _gfn(l3gfn), &p2mt); + if ( !mfn_valid(l3mfn) || !p2m_is_ram(p2mt) ) { printk(XENLOG_DEBUG "sh_pagetable_dying: gpa not valid %"PRIpaddr"\n", gpa); @@ -4711,19 +4732,24 @@ static void sh_pagetable_dying(struct vc } if ( !fast_path ) { - gl3pa = sh_map_domain_page(gmfn); + gl3pa = sh_map_domain_page(l3mfn); gl3e = (guest_l3e_t *)(gl3pa + ((unsigned long)gpa & ~PAGE_MASK)); } for ( i = 0; i < 4; i++ ) { + unsigned long gfn; + mfn_t smfn, gmfn; + if ( fast_path ) smfn = _mfn(pagetable_get_pfn(v->arch.shadow_table[i])); else { /* retrieving the l2s */ gl2a = guest_l3e_get_paddr(gl3e[i]); - gmfn = gfn_to_mfn_query(v->domain, _gfn(gl2a >> PAGE_SHIFT), &p2mt); + gfn = gl2a >> PAGE_SHIFT; + gmfn = gfn_to_mfn_query(v->domain, _gfn(gfn), &p2mt); smfn = shadow_hash_lookup(v, mfn_x(gmfn), SH_type_l2_pae_shadow); + drop_p2m_gfn_domain(v->domain, gfn, mfn_x(gmfn)); } if ( mfn_valid(smfn) ) @@ -4747,6 +4773,7 @@ static void sh_pagetable_dying(struct vc out: if ( !fast_path ) unmap_domain_page(gl3pa); + drop_p2m_gfn_domain(v->domain, l3gfn, mfn_x(l3mfn)); paging_unlock(v->domain); } #else @@ -4763,6 +4790,9 @@ static void sh_pagetable_dying(struct vc #else smfn = shadow_hash_lookup(v, mfn_x(gmfn), SH_type_l4_64_shadow); #endif + drop_p2m_gfn_domain(v->domain, + gpa >> PAGE_SHIFT, mfn_x(gmfn)); + if ( mfn_valid(smfn) ) { mfn_to_page(gmfn)->shadow_flags |= SHF_pagetable_dying; @@ -4814,12 +4844,19 @@ static mfn_t emulate_gva_to_mfn(struct v mfn = gfn_to_mfn_guest(v->domain, _gfn(gfn), &p2mt); if ( p2m_is_readonly(p2mt) ) + { + drop_p2m_gfn_domain(v->domain, gfn, mfn_x(mfn)); return _mfn(READONLY_GFN); + } if ( !p2m_is_ram(p2mt) ) + { + drop_p2m_gfn_domain(v->domain, gfn, mfn_x(mfn)); return _mfn(BAD_GFN_TO_MFN); + } ASSERT(mfn_valid(mfn)); v->arch.paging.last_write_was_pt = !!sh_mfn_is_a_page_table(mfn); + drop_p2m_gfn_domain(v->domain, gfn, mfn_x(mfn)); return mfn; } @@ -5220,7 +5257,7 @@ int sh_audit_l1_table(struct vcpu *v, mf { gfn = guest_l1e_get_gfn(*gl1e); mfn = shadow_l1e_get_mfn(*sl1e); - gmfn = gfn_to_mfn_query(v->domain, gfn, &p2mt); + gmfn = gfn_to_mfn_query_unlocked(v->domain, gfn_x(gfn), &p2mt); if ( 
!p2m_is_grant(p2mt) && mfn_x(gmfn) != mfn_x(mfn) ) AUDIT_FAIL(1, "bad translation: gfn %" SH_PRI_gfn " --> %" PRI_mfn " != mfn %" PRI_mfn, @@ -5291,16 +5328,17 @@ int sh_audit_l2_table(struct vcpu *v, mf mfn = shadow_l2e_get_mfn(*sl2e); gmfn = (guest_l2e_get_flags(*gl2e) & _PAGE_PSE) ? get_fl1_shadow_status(v, gfn) - : get_shadow_status(v, gfn_to_mfn_query(v->domain, gfn, &p2mt), - SH_type_l1_shadow); + : get_shadow_status(v, + gfn_to_mfn_query_unlocked(v->domain, gfn_x(gfn), + &p2mt), SH_type_l1_shadow); if ( mfn_x(gmfn) != mfn_x(mfn) ) AUDIT_FAIL(2, "bad translation: gfn %" SH_PRI_gfn " (--> %" PRI_mfn ")" " --> %" PRI_mfn " != mfn %" PRI_mfn, gfn_x(gfn), (guest_l2e_get_flags(*gl2e) & _PAGE_PSE) ? 0 - : mfn_x(gfn_to_mfn_query(v->domain, - gfn, &p2mt)), mfn_x(gmfn), mfn_x(mfn)); + : mfn_x(gfn_to_mfn_query_unlocked(v->domain, + gfn_x(gfn), &p2mt)), mfn_x(gmfn), mfn_x(mfn)); } }); sh_unmap_domain_page(gp); @@ -5339,7 +5377,8 @@ int sh_audit_l3_table(struct vcpu *v, mf { gfn = guest_l3e_get_gfn(*gl3e); mfn = shadow_l3e_get_mfn(*sl3e); - gmfn = get_shadow_status(v, gfn_to_mfn_query(v->domain, gfn, &p2mt), + gmfn = get_shadow_status(v, gfn_to_mfn_query_unlocked( + v->domain, gfn_x(gfn), &p2mt), ((GUEST_PAGING_LEVELS == 3 || is_pv_32on64_vcpu(v)) && !shadow_mode_external(v->domain) @@ -5387,8 +5426,8 @@ int sh_audit_l4_table(struct vcpu *v, mf { gfn = guest_l4e_get_gfn(*gl4e); mfn = shadow_l4e_get_mfn(*sl4e); - gmfn = get_shadow_status(v, gfn_to_mfn_query(v->domain, - gfn, &p2mt), + gmfn = get_shadow_status(v, gfn_to_mfn_query_unlocked( + v->domain, gfn_x(gfn), &p2mt), SH_type_l3_shadow); if ( mfn_x(gmfn) != mfn_x(mfn) ) AUDIT_FAIL(4, "bad translation: gfn %" SH_PRI_gfn diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/physdev.c --- a/xen/arch/x86/physdev.c +++ b/xen/arch/x86/physdev.c @@ -288,12 +288,18 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H if ( !mfn_valid(mfn) || !get_page_and_type(mfn_to_page(mfn), v->domain, PGT_writable_page) ) + { + drop_p2m_gfn_domain(current->domain, + info.gmfn, mfn); break; + } if ( cmpxchg(&v->domain->arch.pv_domain.pirq_eoi_map_mfn, 0, mfn) != 0 ) { put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(current->domain, + info.gmfn, mfn); ret = -EBUSY; break; } @@ -303,10 +309,13 @@ ret_t do_physdev_op(int cmd, XEN_GUEST_H { v->domain->arch.pv_domain.pirq_eoi_map_mfn = 0; put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(current->domain, + info.gmfn, mfn); ret = -ENOSPC; break; } + drop_p2m_gfn_domain(current->domain, info.gmfn, mfn); ret = 0; break; } diff -r 471d4f2754d6 -r d13f91c2fe18 xen/arch/x86/traps.c --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -678,6 +678,7 @@ int wrmsr_hypervisor_regs(uint32_t idx, if ( !mfn_valid(mfn) || !get_page_and_type(mfn_to_page(mfn), d, PGT_writable_page) ) { + drop_p2m_gfn_domain(d, gmfn, mfn); gdprintk(XENLOG_WARNING, "Bad GMFN %lx (MFN %lx) to MSR %08x\n", gmfn, mfn, base + idx); @@ -689,6 +690,7 @@ int wrmsr_hypervisor_regs(uint32_t idx, unmap_domain_page(hypercall_page); put_page_and_type(mfn_to_page(mfn)); + drop_p2m_gfn_domain(d, gmfn, mfn); break; } @@ -2347,18 +2349,25 @@ static int emulate_privileged_op(struct arch_set_cr2(v, *reg); break; - case 3: /* Write CR3 */ + case 3: {/* Write CR3 */ + unsigned long mfn, gfn; domain_lock(v->domain); if ( !is_pv_32on64_vcpu(v) ) - rc = new_guest_cr3(gmfn_to_mfn(v->domain, xen_cr3_to_pfn(*reg))); + { + gfn = xen_cr3_to_pfn(*reg); #ifdef CONFIG_COMPAT - else - rc = new_guest_cr3(gmfn_to_mfn(v->domain, compat_cr3_to_pfn(*reg))); + } else { + gfn = 
compat_cr3_to_pfn(*reg); #endif + } + mfn = gmfn_to_mfn(v->domain, gfn); + rc = new_guest_cr3(mfn); + drop_p2m_gfn_domain(v->domain, gfn, mfn); domain_unlock(v->domain); if ( rc == 0 ) /* not okay */ goto fail; break; + } case 4: /* Write CR4 */ v->arch.pv_vcpu.ctrlreg[4] = pv_guest_cr4_fixup(v, *reg); diff -r 471d4f2754d6 -r d13f91c2fe18 xen/common/grant_table.c --- a/xen/common/grant_table.c +++ b/xen/common/grant_table.c @@ -164,9 +164,11 @@ static int __get_paged_frame(unsigned lo if ( p2m_is_paging(p2mt) ) { p2m_mem_paging_populate(rd, gfn); + drop_p2m_gfn_domain(rd, gfn, mfn_x(mfn)); rc = GNTST_eagain; } } else { + drop_p2m_gfn_domain(rd, gfn, mfn_x(mfn)); *frame = INVALID_MFN; rc = GNTST_bad_page; } @@ -474,7 +476,7 @@ __gnttab_map_grant_ref( u32 old_pin; u32 act_pin; unsigned int cache_flags; - struct active_grant_entry *act; + struct active_grant_entry *act = NULL; /* gcc ... */ struct grant_mapping *mt; grant_entry_v1_t *sha1; grant_entry_v2_t *sha2; @@ -698,6 +700,7 @@ __gnttab_map_grant_ref( op->handle = handle; op->status = GNTST_okay; + drop_p2m_gfn_domain(rd, act->gfn, act->frame); rcu_unlock_domain(rd); return; @@ -735,6 +738,7 @@ __gnttab_map_grant_ref( gnttab_clear_flag(_GTF_reading, status); unlock_out: + drop_p2m_gfn_domain(rd, act->gfn, act->frame); spin_unlock(&rd->grant_table->lock); op->status = rc; put_maptrack_handle(ld->grant_table, handle); @@ -1454,7 +1458,7 @@ gnttab_transfer( struct page_info *page; int i; struct gnttab_transfer gop; - unsigned long mfn; + unsigned long mfn, drop_mfn; unsigned int max_bitsize; for ( i = 0; i < count; i++ ) @@ -1475,6 +1479,7 @@ gnttab_transfer( /* Check the passed page frame for basic validity. */ if ( unlikely(!mfn_valid(mfn)) ) { + drop_p2m_gfn_domain(d, gop.mfn, mfn); gdprintk(XENLOG_INFO, "gnttab_transfer: out-of-range %lx\n", (unsigned long)gop.mfn); gop.status = GNTST_bad_page; @@ -1484,6 +1489,7 @@ gnttab_transfer( page = mfn_to_page(mfn); if ( unlikely(is_xen_heap_page(page)) ) { + drop_p2m_gfn_domain(d, gop.mfn, mfn); gdprintk(XENLOG_INFO, "gnttab_transfer: xen frame %lx\n", (unsigned long)gop.mfn); gop.status = GNTST_bad_page; @@ -1492,6 +1498,7 @@ gnttab_transfer( if ( steal_page(d, page, 0) < 0 ) { + drop_p2m_gfn_domain(d, gop.mfn, mfn); gop.status = GNTST_bad_page; goto copyback; } @@ -1504,6 +1511,7 @@ gnttab_transfer( /* Find the target domain. */ if ( unlikely((e = rcu_lock_domain_by_id(gop.domid)) == NULL) ) { + drop_p2m_gfn_domain(d, gop.mfn, mfn); gdprintk(XENLOG_INFO, "gnttab_transfer: can''t find domain %d\n", gop.domid); page->count_info &= ~(PGC_count_mask|PGC_allocated); @@ -1514,6 +1522,7 @@ gnttab_transfer( if ( xsm_grant_transfer(d, e) ) { + drop_p2m_gfn_domain(d, gop.mfn, mfn); gop.status = GNTST_permission_denied; unlock_and_copyback: rcu_unlock_domain(e); @@ -1542,9 +1551,15 @@ gnttab_transfer( unmap_domain_page(dp); unmap_domain_page(sp); + /* We took a ref on acquiring the p2m entry. Drop the ref */ + put_page(page); + drop_mfn = INVALID_MFN; /* Further drops of the p2m entry won''t drop anyone''s refcount */ page->count_info &= ~(PGC_count_mask|PGC_allocated); free_domheap_page(page); page = new_page; + /* BY the way, this doesn''t update mfn, which is used later below ... 
*/ + } else { + drop_mfn = mfn; } spin_lock(&e->page_alloc_lock); @@ -1566,6 +1581,7 @@ gnttab_transfer( e->tot_pages, e->max_pages, gop.ref, e->is_dying); spin_unlock(&e->page_alloc_lock); rcu_unlock_domain(e); + drop_p2m_gfn_domain(d, gop.mfn, drop_mfn); page->count_info &= ~(PGC_count_mask|PGC_allocated); free_domheap_page(page); gop.status = GNTST_general_error; @@ -1579,6 +1595,7 @@ gnttab_transfer( page_set_owner(page, e); spin_unlock(&e->page_alloc_lock); + drop_p2m_gfn_domain(d, gop.mfn, drop_mfn); TRACE_1D(TRC_MEM_PAGE_GRANT_TRANSFER, e->domain_id); @@ -1852,6 +1869,8 @@ __acquire_grant_for_copy( rc = __get_paged_frame(gfn, &grant_frame, readonly, rd); if ( rc != GNTST_okay ) goto unlock_out; + /* We drop this immediately per the comments at the top */ + drop_p2m_gfn_domain(rd, gfn, grant_frame); act->gfn = gfn; is_sub_page = 0; trans_page_off = 0; @@ -1864,6 +1883,7 @@ __acquire_grant_for_copy( rc = __get_paged_frame(gfn, &grant_frame, readonly, rd); if ( rc != GNTST_okay ) goto unlock_out; + drop_p2m_gfn_domain(rd, gfn, grant_frame); act->gfn = gfn; is_sub_page = 0; trans_page_off = 0; @@ -1876,6 +1896,7 @@ __acquire_grant_for_copy( rc = __get_paged_frame(gfn, &grant_frame, readonly, rd); if ( rc != GNTST_okay ) goto unlock_out; + drop_p2m_gfn_domain(rd, gfn, grant_frame); act->gfn = gfn; is_sub_page = 1; trans_page_off = sha2->sub_page.page_off; @@ -1973,6 +1994,7 @@ __gnttab_copy( { #ifdef CONFIG_X86 rc = __get_paged_frame(op->source.u.gmfn, &s_frame, 1, sd); + drop_p2m_gfn_domain(sd, op->source.u.gmfn, s_frame); if ( rc != GNTST_okay ) goto error_out; #else @@ -2012,6 +2034,7 @@ __gnttab_copy( { #ifdef CONFIG_X86 rc = __get_paged_frame(op->dest.u.gmfn, &d_frame, 0, dd); + drop_p2m_gfn_domain(dd, op->dest.u.gmfn, d_frame); if ( rc != GNTST_okay ) goto error_out; #else diff -r 471d4f2754d6 -r d13f91c2fe18 xen/common/memory.c --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -167,6 +167,7 @@ int guest_remove_page(struct domain *d, { guest_physmap_remove_page(d, gmfn, mfn, 0); p2m_mem_paging_drop_page(d, gmfn); + drop_p2m_gfn_domain(d, gmfn, mfn); return 1; } #else @@ -174,6 +175,7 @@ int guest_remove_page(struct domain *d, #endif if ( unlikely(!mfn_valid(mfn)) ) { + drop_p2m_gfn_domain(d, gmfn, mfn); gdprintk(XENLOG_INFO, "Domain %u page number %lx invalid\n", d->domain_id, gmfn); return 0; @@ -187,12 +189,14 @@ int guest_remove_page(struct domain *d, { put_page_and_type(page); guest_physmap_remove_page(d, gmfn, mfn, 0); + drop_p2m_gfn_domain(d, gmfn, mfn); return 1; } #endif /* CONFIG_X86 */ if ( unlikely(!get_page(page, d)) ) { + drop_p2m_gfn_domain(d, gmfn, mfn); gdprintk(XENLOG_INFO, "Bad page free for domain %u\n", d->domain_id); return 0; } @@ -204,6 +208,7 @@ int guest_remove_page(struct domain *d, put_page(page); guest_physmap_remove_page(d, gmfn, mfn, 0); + drop_p2m_gfn_domain(d, gmfn, mfn); put_page(page); @@ -366,6 +371,7 @@ static long memory_exchange(XEN_GUEST_HA mfn = mfn_x(gfn_to_mfn_unshare(d, gmfn + k, &p2mt)); if ( p2m_is_shared(p2mt) ) { + drop_p2m_gfn_domain(d, gmfn + k, mfn); rc = -ENOMEM; goto fail; } @@ -374,6 +380,7 @@ static long memory_exchange(XEN_GUEST_HA #endif if ( unlikely(!mfn_valid(mfn)) ) { + drop_p2m_gfn_domain(d, gmfn + k, mfn); rc = -EINVAL; goto fail; } @@ -382,11 +389,13 @@ static long memory_exchange(XEN_GUEST_HA if ( unlikely(steal_page(d, page, MEMF_no_refcount)) ) { + drop_p2m_gfn_domain(d, gmfn + k, mfn); rc = -EINVAL; goto fail; } page_list_add(page, &in_chunk_list); + drop_p2m_gfn_domain(d, gmfn + k, mfn); } } diff -r 
471d4f2754d6 -r d13f91c2fe18 xen/common/tmem_xen.c --- a/xen/common/tmem_xen.c +++ b/xen/common/tmem_xen.c @@ -111,20 +111,28 @@ static inline void *cli_get_page(tmem_cl cli_mfn = mfn_x(gfn_to_mfn(current->domain, cmfn, &t)); if ( t != p2m_ram_rw || !mfn_valid(cli_mfn) ) + { + drop_p2m_gfn_domain(current->domain, + (unsigned long) cmfn, cli_mfn); return NULL; + } page = mfn_to_page(cli_mfn); if ( cli_write ) ret = get_page_and_type(page, current->domain, PGT_writable_page); else ret = get_page(page, current->domain); if ( !ret ) + { + drop_p2m_gfn_domain(current->domain, + (unsigned long) cmfn, cli_mfn); return NULL; + } *pcli_mfn = cli_mfn; *pcli_pfp = (pfp_t *)page; return map_domain_page(cli_mfn); } -static inline void cli_put_page(void *cli_va, pfp_t *cli_pfp, +static inline void cli_put_page(tmem_cli_mfn_t cmfn, void *cli_va, pfp_t *cli_pfp, unsigned long cli_mfn, bool_t mark_dirty) { if ( mark_dirty ) @@ -135,6 +143,7 @@ static inline void cli_put_page(void *cl else put_page((struct page_info *)cli_pfp); unmap_domain_page(cli_va); + drop_p2m_gfn_domain(current->domain, (unsigned long) cmfn, cli_mfn); } #endif @@ -169,7 +178,7 @@ EXPORT int tmh_copy_from_client(pfp_t *p (pfn_offset+len <= PAGE_SIZE) ) memcpy((char *)tmem_va+tmem_offset,(char *)cli_va+pfn_offset,len); if ( !tmemc ) - cli_put_page(cli_va, cli_pfp, cli_mfn, 0); + cli_put_page(cmfn, cli_va, cli_pfp, cli_mfn, 0); unmap_domain_page(tmem_va); return 1; } @@ -197,7 +206,7 @@ EXPORT int tmh_compress_from_client(tmem ASSERT(ret == LZO_E_OK); *out_va = dmem; if ( !tmemc ) - cli_put_page(cli_va, cli_pfp, cli_mfn, 0); + cli_put_page(cmfn, cli_va, cli_pfp, cli_mfn, 0); unmap_domain_page(cli_va); return 1; } @@ -225,7 +234,7 @@ EXPORT int tmh_copy_to_client(tmem_cli_m memcpy((char *)cli_va+pfn_offset,(char *)tmem_va+tmem_offset,len); unmap_domain_page(tmem_va); if ( !tmemc ) - cli_put_page(cli_va, cli_pfp, cli_mfn, 1); + cli_put_page(cmfn, cli_va, cli_pfp, cli_mfn, 1); mb(); return 1; } @@ -249,7 +258,7 @@ EXPORT int tmh_decompress_to_client(tmem ASSERT(ret == LZO_E_OK); ASSERT(out_len == PAGE_SIZE); if ( !tmemc ) - cli_put_page(cli_va, cli_pfp, cli_mfn, 1); + cli_put_page(cmfn, cli_va, cli_pfp, cli_mfn, 1); mb(); return 1; } @@ -271,7 +280,7 @@ EXPORT int tmh_copy_tze_to_client(tmem_c memcpy((char *)cli_va,(char *)tmem_va,len); if ( len < PAGE_SIZE ) memset((char *)cli_va+len,0,PAGE_SIZE-len); - cli_put_page(cli_va, cli_pfp, cli_mfn, 1); + cli_put_page(cmfn, cli_va, cli_pfp, cli_mfn, 1); mb(); return 1; } diff -r 471d4f2754d6 -r d13f91c2fe18 xen/include/asm-x86/hvm/hvm.h --- a/xen/include/asm-x86/hvm/hvm.h +++ b/xen/include/asm-x86/hvm/hvm.h @@ -394,7 +394,10 @@ int hvm_virtual_to_linear_addr( void *hvm_map_guest_frame_rw(unsigned long gfn); void *hvm_map_guest_frame_ro(unsigned long gfn); -void hvm_unmap_guest_frame(void *p); +/* We pass back either the guest virtual or physical frame mapped, + * in order to drop any locks/refcounts we may have had on p2m + * entries or underlying mfn''s while using the map */ +void hvm_unmap_guest_frame(void *p, unsigned long addr, int is_va); static inline void hvm_set_info_guest(struct vcpu *v) { diff -r 471d4f2754d6 -r d13f91c2fe18 xen/include/asm-x86/hvm/vmx/vvmx.h --- a/xen/include/asm-x86/hvm/vmx/vvmx.h +++ b/xen/include/asm-x86/hvm/vmx/vvmx.h @@ -25,6 +25,7 @@ struct nestedvmx { paddr_t vmxon_region_pa; + unsigned long iobitmap_gfn[2]; void *iobitmap[2]; /* map (va) of L1 guest I/O bitmap */ /* deferred nested interrupt */ struct { _______________________________________________ Xen-devel 
mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
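Distilled from the tmem hunks above, the calling discipline the series imposes on every p2m query is: the gfn/mfn pair returned by gfn_to_mfn() stays locked and referenced until it is explicitly dropped, on error paths as much as on success. A minimal sketch of that pattern, reusing drop_p2m_gfn_domain() from the patch; the wrapper function itself is illustrative:

    static void *map_guest_gfn(struct domain *d, unsigned long gfn,
                               unsigned long *held_mfn)
    {
        p2m_type_t t;
        unsigned long mfn = mfn_x(gfn_to_mfn(d, gfn, &t));

        if ( t != p2m_ram_rw || !mfn_valid(mfn) )
        {
            /* Error path: release the locked range before bailing out. */
            drop_p2m_gfn_domain(d, gfn, mfn);
            return NULL;
        }

        /* Success: keep the lock/ref; the caller must later call
         * drop_p2m_gfn_domain(d, gfn, *held_mfn), passing back the mfn
         * obtained here. */
        *held_mfn = mfn;
        return map_domain_page(mfn);
    }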
Tim Deegan
2011-Oct-27 13:36 UTC
Re: [Xen-devel] [PATCH 3 of 9] Enforce ordering constraints for the page alloc lock in the PoD code
Hi, The intent of these first three patches looks good to me, but: - I think it would be better to generate generic spin-lock-with-level and unlock-with-level wrapper functions rather than generating the various checks and having to assemble them into lock_page_alloc() and unlock_page_alloc() by hand. - p2m->pod.page_alloc_unlock_level is wrong, I think; I can see that you need somewhere to store the unlock-level but it shouldn''t live in the p2m state - it''s at most a per-domain variable, so it should live in the struct domain; might as well be beside the lock itself. Tim. At 00:33 -0400 on 27 Oct (1319675628), Andres Lagar-Cavilla wrote:> xen/arch/x86/mm/mm-locks.h | 11 +++++++++++ > xen/arch/x86/mm/p2m-pod.c | 40 +++++++++++++++++++++++++++------------- > xen/include/asm-x86/p2m.h | 5 +++++ > 3 files changed, 43 insertions(+), 13 deletions(-) > > > The page alloc lock is sometimes used in the PoD code, with an > explicit expectation of ordering. Use our ordering constructs in the > mm layer to enforce this. > > Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> > > diff -r c915609e4235 -r 332775f72a30 xen/arch/x86/mm/mm-locks.h > --- a/xen/arch/x86/mm/mm-locks.h > +++ b/xen/arch/x86/mm/mm-locks.h > @@ -155,6 +155,17 @@ declare_mm_lock(p2m) > #define p2m_unlock(p) mm_unlock(&(p)->lock) > #define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) > > +/* Page alloc lock (per-domain) > + * > + * This is an external lock, not represented by an mm_lock_t. However, > + * pod code uses it in conjunction with the p2m lock, and expecting > + * the ordering which we enforce here */ > + > +declare_mm_order_constraint(page_alloc) > +#define page_alloc_mm_pre_lock() mm_enforce_order_lock_pre_page_alloc() > +#define page_alloc_mm_post_lock(l) mm_enforce_order_lock_post_page_alloc(&(l)) > +#define page_alloc_mm_unlock(l) mm_enforce_order_unlock((l)) > + > /* Paging lock (per-domain) > * > * For shadow pagetables, this lock protects > diff -r c915609e4235 -r 332775f72a30 xen/arch/x86/mm/p2m-pod.c > --- a/xen/arch/x86/mm/p2m-pod.c > +++ b/xen/arch/x86/mm/p2m-pod.c > @@ -45,6 +45,20 @@ > > #define superpage_aligned(_x) (((_x)&(SUPERPAGE_PAGES-1))==0) > > +/* Enforce lock ordering when grabbing the "external" page_alloc lock */ > +static inline void lock_page_alloc(struct p2m_domain *p2m) > +{ > + page_alloc_mm_pre_lock(); > + spin_lock(&(p2m->domain->page_alloc_lock)); > + page_alloc_mm_post_lock(p2m->pod.page_alloc_unlock_level); > +} > + > +static inline void unlock_page_alloc(struct p2m_domain *p2m) > +{ > + page_alloc_mm_unlock(p2m->pod.page_alloc_unlock_level); > + spin_unlock(&(p2m->domain->page_alloc_lock)); > +} > + > /* > * Populate-on-demand functionality > */ > @@ -100,7 +114,7 @@ p2m_pod_cache_add(struct p2m_domain *p2m > unmap_domain_page(b); > } > > - spin_lock(&d->page_alloc_lock); > + lock_page_alloc(p2m); > > /* First, take all pages off the domain list */ > for(i=0; i < 1 << order ; i++) > @@ -128,7 +142,7 @@ p2m_pod_cache_add(struct p2m_domain *p2m > * This may cause "zombie domains" since the page will never be freed. */ > BUG_ON( d->arch.relmem != RELMEM_not_started ); > > - spin_unlock(&d->page_alloc_lock); > + unlock_page_alloc(p2m); > > return 0; > } > @@ -245,7 +259,7 @@ p2m_pod_set_cache_target(struct p2m_doma > > /* Grab the lock before checking that pod.super is empty, or the last > * entries may disappear before we grab the lock. 
*/ > - spin_lock(&d->page_alloc_lock); > + lock_page_alloc(p2m); > > if ( (p2m->pod.count - pod_target) > SUPERPAGE_PAGES > && !page_list_empty(&p2m->pod.super) ) > @@ -257,7 +271,7 @@ p2m_pod_set_cache_target(struct p2m_doma > > ASSERT(page != NULL); > > - spin_unlock(&d->page_alloc_lock); > + unlock_page_alloc(p2m); > > /* Then free them */ > for ( i = 0 ; i < (1 << order) ; i++ ) > @@ -378,7 +392,7 @@ p2m_pod_empty_cache(struct domain *d) > BUG_ON(!d->is_dying); > spin_barrier(&p2m->lock.lock); > > - spin_lock(&d->page_alloc_lock); > + lock_page_alloc(p2m); > > while ( (page = page_list_remove_head(&p2m->pod.super)) ) > { > @@ -403,7 +417,7 @@ p2m_pod_empty_cache(struct domain *d) > > BUG_ON(p2m->pod.count != 0); > > - spin_unlock(&d->page_alloc_lock); > + unlock_page_alloc(p2m); > } > > int > @@ -417,7 +431,7 @@ p2m_pod_offline_or_broken_hit(struct pag > if ( !(d = page_get_owner(p)) || !(p2m = p2m_get_hostp2m(d)) ) > return 0; > > - spin_lock(&d->page_alloc_lock); > + lock_page_alloc(p2m); > bmfn = mfn_x(page_to_mfn(p)); > page_list_for_each_safe(q, tmp, &p2m->pod.super) > { > @@ -448,12 +462,12 @@ p2m_pod_offline_or_broken_hit(struct pag > } > } > > - spin_unlock(&d->page_alloc_lock); > + unlock_page_alloc(p2m); > return 0; > > pod_hit: > page_list_add_tail(p, &d->arch.relmem_list); > - spin_unlock(&d->page_alloc_lock); > + unlock_page_alloc(p2m); > return 1; > } > > @@ -994,7 +1008,7 @@ p2m_pod_demand_populate(struct p2m_domai > if ( q == p2m_guest && gfn > p2m->pod.max_guest ) > p2m->pod.max_guest = gfn; > > - spin_lock(&d->page_alloc_lock); > + lock_page_alloc(p2m); > > if ( p2m->pod.count == 0 ) > goto out_of_memory; > @@ -1008,7 +1022,7 @@ p2m_pod_demand_populate(struct p2m_domai > > BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0); > > - spin_unlock(&d->page_alloc_lock); > + unlock_page_alloc(p2m); > > gfn_aligned = (gfn >> order) << order; > > @@ -1040,7 +1054,7 @@ p2m_pod_demand_populate(struct p2m_domai > > return 0; > out_of_memory: > - spin_unlock(&d->page_alloc_lock); > + unlock_page_alloc(p2m); > > printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n", > __func__, d->tot_pages, p2m->pod.entry_count); > @@ -1049,7 +1063,7 @@ out_fail: > return -1; > remap_and_retry: > BUG_ON(order != PAGE_ORDER_2M); > - spin_unlock(&d->page_alloc_lock); > + unlock_page_alloc(p2m); > > /* Remap this 2-meg region in singleton chunks */ > gfn_aligned = (gfn>>order)<<order; > diff -r c915609e4235 -r 332775f72a30 xen/include/asm-x86/p2m.h > --- a/xen/include/asm-x86/p2m.h > +++ b/xen/include/asm-x86/p2m.h > @@ -270,6 +270,10 @@ struct p2m_domain { > * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid > * double-demand-populating of pages, the page_alloc lock to > * protect moving stuff from the PoD cache to the domain page list. > + * > + * We enforce this lock ordering through a construct in mm-locks.h. > + * This demands, however, that we store the previous lock-ordering > + * level in effect before grabbing the page_alloc lock. 
> */ > struct { > struct page_list_head super, /* List of superpages */ > @@ -279,6 +283,7 @@ struct p2m_domain { > unsigned reclaim_super; /* Last gpfn of a scan */ > unsigned reclaim_single; /* Last gpfn of a scan */ > unsigned max_guest; /* gpfn of max guest demand-populate */ > + int page_alloc_unlock_level; /* To enforce lock ordering */ > } pod; > }; > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
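A rough sketch of the first suggestion: fold the pre-lock ordering check and the post-lock level bookkeeping into one generic helper, so that lock_page_alloc()/unlock_page_alloc() become one-line wrappers. The per-CPU level variable and the helper names below stand in for the ordering machinery mm-locks.h already has; they are not the actual interface:

    DEFINE_PER_CPU(int, sketch_lock_level);

    static inline void spin_lock_with_level(int level, spinlock_t *lock,
                                            int *unlock_level)
    {
        /* Refuse to take a lock ordered before one we already hold. */
        BUG_ON(this_cpu(sketch_lock_level) > level);
        spin_lock(lock);
        *unlock_level = this_cpu(sketch_lock_level);  /* remember old level */
        this_cpu(sketch_lock_level) = level;          /* record new level   */
    }

    static inline void spin_unlock_with_level(spinlock_t *lock,
                                              int *unlock_level)
    {
        this_cpu(sketch_lock_level) = *unlock_level;  /* restore old level  */
        spin_unlock(lock);
    }

A wrapper like this would also make the second point easy to address, since the unlock-level slot it takes could simply be a field next to page_alloc_lock in struct domain.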
Tim Deegan
2011-Oct-27 13:55 UTC
Re: [Xen-devel] [PATCH 4 of 9] Rework locking in the PoD layer
At 00:33 -0400 on 27 Oct (1319675629), Andres Lagar-Cavilla wrote:> The PoD layer has a fragile locking discipline. It relies on the > p2m being globally locked, and it also relies on the page alloc > lock to protect some of its data structures. Replace this all by an > explicit pod lock: per p2m, order enforced. > > Two consequences: > - Critical sections in the pod code protected by the page alloc > lock are now reduced to modifications of the domain page list. > - When the p2m lock becomes fine-grained, there are no > assumptions broken in the PoD layer. > > Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>The bulk of this looks OK to me, but will definitely need an Ack from George Dunlap as well. Two comments:> diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/mm-locks.h > --- a/xen/arch/x86/mm/mm-locks.h > +++ b/xen/arch/x86/mm/mm-locks.h > @@ -155,6 +155,15 @@ declare_mm_lock(p2m) > #define p2m_unlock(p) mm_unlock(&(p)->lock) > #define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) > > +/* PoD lock (per-p2m-table) > + * > + * Protects private PoD data structs. */ > + > +declare_mm_lock(pod) > +#define pod_lock(p) mm_lock(pod, &(p)->pod.lock) > +#define pod_unlock(p) mm_unlock(&(p)->pod.lock) > +#define pod_locked_by_me(p) mm_locked_by_me(&(p)->pod.lock)Can the explanatory comment be more explicit about what it covers? It is everything in the struct pod or just the page-lists that were mentioned in the comment you removed from p2m.h?> @@ -841,7 +854,6 @@ p2m_pod_zero_check(struct p2m_domain *p2 > if( *(map[i]+j) != 0 ) > break; > > - unmap_domain_page(map[i]); > > /* See comment in p2m_pod_zero_check_superpage() re gnttab > * check timing. */ > @@ -849,8 +861,15 @@ p2m_pod_zero_check(struct p2m_domain *p2 > { > set_p2m_entry(p2m, gfns[i], mfns[i], PAGE_ORDER_4K, > types[i], p2m->default_access); > + unmap_domain_page(map[i]); > + map[i] = NULL; > } > - else > + } > + > + /* Finally, add to cache */ > + for ( i=0; i < count; i++ ) > + { > + if ( map[i] ) > { > if ( tb_init_done ) > { > @@ -867,6 +886,8 @@ p2m_pod_zero_check(struct p2m_domain *p2 > __trace_var(TRC_MEM_POD_ZERO_RECLAIM, 0, sizeof(t), &t); > } > > + unmap_domain_page(map[i]); > + > /* Add to cache, and account for the new p2m PoD entry */ > p2m_pod_cache_add(p2m, mfn_to_page(mfns[i]), PAGE_ORDER_4K); > p2m->pod.entry_count++;That seems to be reshuffling the running order of this function but I don''t see how it''s related to locking. Is this an unrelated change that snuck in? (Oh, a third thing just occurred to me - might be worth making some of those ''lock foo held on entry'' comments into ASSERT(lock_held_by_me()). ) Cheers, Tim. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
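On the third point, the conversion is mechanical. A sketch, assuming the pod_locked_by_me() macro this patch introduces; the function body is elided:

    /* Was: "PoD lock held on entry" */
    static int
    p2m_pod_cache_add(struct p2m_domain *p2m, struct page_info *page,
                      unsigned long order)
    {
        ASSERT(pod_locked_by_me(p2m));   /* precondition now checked at runtime */
        /* ... rest of the function unchanged ... */
        return 0;
    }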
Tim Deegan
2011-Oct-27 14:43 UTC
Re: [Xen-devel] [PATCH 5 of 9] Fine-grained concurrency control structure for the p2m
At 00:33 -0400 on 27 Oct (1319675630), Andres Lagar-Cavilla wrote:> Introduce a fine-grained concurrency control structure for the p2m. This > allows for locking 2M-aligned chunks of the p2m at a time, exclusively. > Recursive locking is allowed. Global locking of the whole p2m is also > allowed for certain operations. Simple deadlock detection heuristics are > put in place. > > Note the patch creates backwards-compatible shortcuts that will lock the > p2m globally. So it should remain functionally identical to what is currently > in place. > > Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>Wow. What a lot of code. :) I took a look through, but I can''t guarantee to have got all the details. Things I saw: - You use atomic_t for the count but only ever update it under a lock. :) If you just need to be sure of atomic writes, then atomic_set will do that without using a locked increment/decrement. - You allocate the bitmaps from xenheap - they should really be using p2m memory, so as to avoid changing the memory overhead of the domain as it runs. That will involve map_domain_page()ing the bitmaps as you go, but at least on x86_64 that''s very cheap. - panic() on out-of-memory is pretty rude. But stepping back, I''m not sure that we need all this just yet. I think it would be worth doing the interface changes with a single p2m lock and measuring how bad it is before getting stuck in to fine-grained locking (fun though it might be). I suspect that if this is a contention point, allowing multiple readers will become important, especially if there are particular pages that often get emulated access. And also, I''d like to get some sort of plan for handling long-lived foreign mappings, if only to make sure that this phase-1 fix doesn''t conflict wih it. Oh, one more thing:> +/* Some deadlock book-keeping. Say CPU A holds a lock on range A, CPU B holds a > + * lock on range B. Now, CPU A wants to lock range B and vice-versa. Deadlock. > + * We detect this by remembering the start of the current locked range. > + * We keep a fairly small stack of guards (8), because we don''t anticipate > + * a great deal of recursive locking because (a) recursive locking is rare > + * (b) it is evil (c) only PoD seems to do it (is PoD therefore evil?) */If PoD could ba adjusted not to do it, could we get rid of all the recursive locking entirely? That would simplify things a lot. Tim. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
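A small illustration of the atomic_t point: when every writer already holds the lock, a read-modify-write finished with atomic_set() keeps the store atomic for lock-free readers without paying for a locked increment. The structure and field names here are invented for the example; only atomic_read()/atomic_set() are the real primitives:

    struct range_count_sketch {
        spinlock_t lock;
        atomic_t   users;   /* read locklessly by fast paths */
    };

    static void range_count_get(struct range_count_sketch *r)
    {
        spin_lock(&r->lock);
        /* The lock already serializes writers; atomic_set() keeps the
         * store atomic for readers without a LOCK-prefixed RMW. */
        atomic_set(&r->users, atomic_read(&r->users) + 1);
        spin_unlock(&r->lock);
    }

    static void range_count_put(struct range_count_sketch *r)
    {
        spin_lock(&r->lock);
        atomic_set(&r->users, atomic_read(&r->users) - 1);
        spin_unlock(&r->lock);
    }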
Tim Deegan
2011-Oct-27 14:57 UTC
Re: [Xen-devel] [PATCH 8 of 9] Modify all internal p2m functions to use the new fine-grained locking
Hi, At 00:33 -0400 on 27 Oct (1319675633), Andres Lagar-Cavilla wrote:> This patch only modifies code internal to the p2m, adding convenience > macros, etc. It will yield a compiling code base but an incorrect > hypervisor (external callers of queries into the p2m will not unlock). > Next patch takes care of external callers, split done for the benefit > of conciseness.Better to do it the other way round: put the enormous change-all-callers patch first, with noop unlock functions, and then hook in the unlocks. That way you won''t cause chaos when people try to bisect to find when a bug was introduced.> diff -r 8a98179666de -r 471d4f2754d6 xen/include/asm-x86/p2m.h > --- a/xen/include/asm-x86/p2m.h > +++ b/xen/include/asm-x86/p2m.h > @@ -220,7 +220,7 @@ struct p2m_domain { > * tables on every host-p2m change. The setter of this flag > * is responsible for performing the full flush before releasing the > * host p2m''s lock. */ > - int defer_nested_flush; > + atomic_t defer_nested_flush; > > /* Pages used to construct the p2m */ > struct page_list_head pages; > @@ -298,6 +298,15 @@ struct p2m_domain *p2m_get_p2m(struct vc > #define p2m_get_pagetable(p2m) ((p2m)->phys_table) > > > +/* No matter what value you get out of a query, the p2m has been locked for > + * that range. No matter what you do, you need to drop those locks. > + * You need to pass back the mfn obtained when locking, not the new one, > + * as the refcount of the original mfn was bumped. */Surely the caller doesn''t need to remember the old MFN for this? After allm, the whole point of the lock was that nobody else could change the p2m entry under our feet! In any case, I thing there needs to be a big block comment a bit futher up that describes what all this locking and refcounting does, and why.> +void drop_p2m_gfn(struct p2m_domain *p2m, unsigned long gfn, > + unsigned long mfn); > +#define drop_p2m_gfn_domain(d, g, m) \ > + drop_p2m_gfn(p2m_get_hostp2m((d)), (g), (m)) > + > /* Read a particular P2M table, mapping pages as we go. Most callers > * should _not_ call this directly; use the other gfn_to_mfn_* functions > * below unless you know you want to walk a p2m that isn''t a domain''s > @@ -327,6 +336,28 @@ static inline mfn_t gfn_to_mfn_type(stru > #define gfn_to_mfn_guest(d, g, t) gfn_to_mfn_type((d), (g), (t), p2m_guest) > #define gfn_to_mfn_unshare(d, g, t) gfn_to_mfn_type((d), (g), (t), p2m_unshare) > > +/* This one applies to very specific situations in which you''re querying > + * a p2m entry and will be done "immediately" (such as a printk or computing a > + * return value). Use this only if there are no expectations of the p2m entry > + * holding steady. */ > +static inline mfn_t gfn_to_mfn_type_unlocked(struct domain *d, > + unsigned long gfn, p2m_type_t *t, > + p2m_query_t q) > +{ > + mfn_t mfn = gfn_to_mfn_type(d, gfn, t, q); > + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); > + return mfn; > +} > + > +#define gfn_to_mfn_unlocked(d, g, t) \ > + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_alloc) > +#define gfn_to_mfn_query_unlocked(d, g, t) \ > + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_query) > +#define gfn_to_mfn_guest_unlocked(d, g, t) \ > + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_guest) > +#define gfn_to_mfn_unshare_unlocked(d, g, t) \ > + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_unshare) > +Ugh. This could really benefit from having the gfn_to_mfn_* functions take a set of flags instead of an enum. This exponential blowup in interface is going too far. 
:) That oughtn't to stop this interface from going in, of course, but if we're going to tinker with the p2m callers once, we should do it all together. Tim. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
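One possible shape for the flags-based interface; all of the flag names and the wrapper below are hypothetical, while gfn_to_mfn_type() and drop_p2m_gfn_domain() are the functions from the patch:

    #define P2M_Q_ALLOC     (1u << 0)   /* populate PoD entries / page in   */
    #define P2M_Q_UNSHARE   (1u << 1)   /* break sharing on the entry       */
    #define P2M_Q_GUEST     (1u << 2)   /* lookup is on behalf of the guest */
    #define P2M_Q_UNLOCKED  (1u << 3)   /* drop the locked range on return  */

    static inline mfn_t gfn_to_mfn_flags(struct domain *d, unsigned long gfn,
                                         p2m_type_t *t, unsigned int flags)
    {
        p2m_query_t q = (flags & P2M_Q_UNSHARE) ? p2m_unshare :
                        (flags & P2M_Q_GUEST)   ? p2m_guest :
                        (flags & P2M_Q_ALLOC)   ? p2m_alloc : p2m_query;
        mfn_t mfn = gfn_to_mfn_type(d, gfn, t, q);

        if ( flags & P2M_Q_UNLOCKED )
            drop_p2m_gfn_domain(d, gfn, mfn_x(mfn));

        return mfn;
    }

With a single entry point like this, the query type and the locked/unlocked behaviour stay orthogonal, so the number of names does not grow as new combinations appear.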
Tim Deegan
2011-Oct-27 15:02 UTC
Re: [Xen-devel] [PATCH 9 of 9] Modify all call sites of queries into the p2m to use the new fine-grained locking
At 00:33 -0400 on 27 Oct (1319675634), Andres Lagar-Cavilla wrote:> 28 files changed, 519 insertions(+), 101 deletions(-)And I thought patch 5 was big :) I'm not going to read the detail of this this time around - I'd like to only have to review it once. :) I wonder whether it would be worth changing the name/signature of the generic p2m functions in an incompatible way while we're there. It would have three advantages: - allow the lookup/drop pairs to have nice matching names - get rid of the confusingly-named 'gmfn_to_mfn' function - avoid later bugs if patches are forward-ported across this change that add p2m lookups (but not corresponding drops) Tim. _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
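A sketch of what a renamed lookup/drop pair might look like at a call site. The get_gfn()/put_gfn() names are hypothetical here (Andres asks about exactly this naming further down the thread), and the helper is purely illustrative:

    static int gfn_is_ram(struct domain *d, unsigned long gfn)
    {
        p2m_type_t t;
        mfn_t mfn = get_gfn(d, gfn, &t);   /* locks the range, refs the mfn */
        int ret = p2m_is_ram(t) && mfn_x(mfn) != INVALID_MFN;

        put_gfn(d, gfn);                   /* the matching, same-named drop */
        return ret;
    }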
andres@lagarcavilla.com
2011-Nov-02 13:59 UTC
Re: [Xen-devel] [PATCH 3 of 9] Enforce ordering constraints for the page alloc lock in the PoD code
Hi also,> Hi, > > The intent of these first three patches looks good to me, but: > > - I think it would be better to generate generic spin-lock-with-level > and unlock-with-level wrapper functions rather than generating the > various checks and having to assemble them into lock_page_alloc() and > unlock_page_alloc() by hand.The final intent is to have these macros establish ordering constraints for the fine-grained p2m lock, which is not only "grab a spinlock". Granted, we do not know yet whether we''ll need such a fine-grained approach, but I think it''s worth keeping things separate. As a side-note, an earlier version of my patches did enforce ordering, except things got really hairy with mem_sharing_unshare_page (which would jump levels up to grab shr_lock) and pod sweeps. I (think I) have solutions for both, but I''m not ready to push those yet.> - p2m->pod.page_alloc_unlock_level is wrong, I think; I can see that you > need somewhere to store the unlock-level but it shouldn''t live in > the p2m state - it''s at most a per-domain variable, so it should > live in the struct domain; might as well be beside the lock itself.Ok, sure. Although I think I need to make clear that this ordering constraint only applies within the pod code, and that''s why I wanted to keep the book-keeping within the pod struct. Andres> Tim. > > At 00:33 -0400 on 27 Oct (1319675628), Andres Lagar-Cavilla wrote: >> xen/arch/x86/mm/mm-locks.h | 11 +++++++++++ >> xen/arch/x86/mm/p2m-pod.c | 40 >> +++++++++++++++++++++++++++------------- >> xen/include/asm-x86/p2m.h | 5 +++++ >> 3 files changed, 43 insertions(+), 13 deletions(-) >> >> >> The page alloc lock is sometimes used in the PoD code, with an >> explicit expectation of ordering. Use our ordering constructs in the >> mm layer to enforce this. >> >> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> >> >> diff -r c915609e4235 -r 332775f72a30 xen/arch/x86/mm/mm-locks.h >> --- a/xen/arch/x86/mm/mm-locks.h >> +++ b/xen/arch/x86/mm/mm-locks.h >> @@ -155,6 +155,17 @@ declare_mm_lock(p2m) >> #define p2m_unlock(p) mm_unlock(&(p)->lock) >> #define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) >> >> +/* Page alloc lock (per-domain) >> + * >> + * This is an external lock, not represented by an mm_lock_t. 
However, >> + * pod code uses it in conjunction with the p2m lock, and expecting >> + * the ordering which we enforce here */ >> + >> +declare_mm_order_constraint(page_alloc) >> +#define page_alloc_mm_pre_lock() >> mm_enforce_order_lock_pre_page_alloc() >> +#define page_alloc_mm_post_lock(l) >> mm_enforce_order_lock_post_page_alloc(&(l)) >> +#define page_alloc_mm_unlock(l) mm_enforce_order_unlock((l)) >> + >> /* Paging lock (per-domain) >> * >> * For shadow pagetables, this lock protects >> diff -r c915609e4235 -r 332775f72a30 xen/arch/x86/mm/p2m-pod.c >> --- a/xen/arch/x86/mm/p2m-pod.c >> +++ b/xen/arch/x86/mm/p2m-pod.c >> @@ -45,6 +45,20 @@ >> >> #define superpage_aligned(_x) (((_x)&(SUPERPAGE_PAGES-1))==0) >> >> +/* Enforce lock ordering when grabbing the "external" page_alloc lock >> */ >> +static inline void lock_page_alloc(struct p2m_domain *p2m) >> +{ >> + page_alloc_mm_pre_lock(); >> + spin_lock(&(p2m->domain->page_alloc_lock)); >> + page_alloc_mm_post_lock(p2m->pod.page_alloc_unlock_level); >> +} >> + >> +static inline void unlock_page_alloc(struct p2m_domain *p2m) >> +{ >> + page_alloc_mm_unlock(p2m->pod.page_alloc_unlock_level); >> + spin_unlock(&(p2m->domain->page_alloc_lock)); >> +} >> + >> /* >> * Populate-on-demand functionality >> */ >> @@ -100,7 +114,7 @@ p2m_pod_cache_add(struct p2m_domain *p2m >> unmap_domain_page(b); >> } >> >> - spin_lock(&d->page_alloc_lock); >> + lock_page_alloc(p2m); >> >> /* First, take all pages off the domain list */ >> for(i=0; i < 1 << order ; i++) >> @@ -128,7 +142,7 @@ p2m_pod_cache_add(struct p2m_domain *p2m >> * This may cause "zombie domains" since the page will never be >> freed. */ >> BUG_ON( d->arch.relmem != RELMEM_not_started ); >> >> - spin_unlock(&d->page_alloc_lock); >> + unlock_page_alloc(p2m); >> >> return 0; >> } >> @@ -245,7 +259,7 @@ p2m_pod_set_cache_target(struct p2m_doma >> >> /* Grab the lock before checking that pod.super is empty, or >> the last >> * entries may disappear before we grab the lock. 
*/ >> - spin_lock(&d->page_alloc_lock); >> + lock_page_alloc(p2m); >> >> if ( (p2m->pod.count - pod_target) > SUPERPAGE_PAGES >> && !page_list_empty(&p2m->pod.super) ) >> @@ -257,7 +271,7 @@ p2m_pod_set_cache_target(struct p2m_doma >> >> ASSERT(page != NULL); >> >> - spin_unlock(&d->page_alloc_lock); >> + unlock_page_alloc(p2m); >> >> /* Then free them */ >> for ( i = 0 ; i < (1 << order) ; i++ ) >> @@ -378,7 +392,7 @@ p2m_pod_empty_cache(struct domain *d) >> BUG_ON(!d->is_dying); >> spin_barrier(&p2m->lock.lock); >> >> - spin_lock(&d->page_alloc_lock); >> + lock_page_alloc(p2m); >> >> while ( (page = page_list_remove_head(&p2m->pod.super)) ) >> { >> @@ -403,7 +417,7 @@ p2m_pod_empty_cache(struct domain *d) >> >> BUG_ON(p2m->pod.count != 0); >> >> - spin_unlock(&d->page_alloc_lock); >> + unlock_page_alloc(p2m); >> } >> >> int >> @@ -417,7 +431,7 @@ p2m_pod_offline_or_broken_hit(struct pag >> if ( !(d = page_get_owner(p)) || !(p2m = p2m_get_hostp2m(d)) ) >> return 0; >> >> - spin_lock(&d->page_alloc_lock); >> + lock_page_alloc(p2m); >> bmfn = mfn_x(page_to_mfn(p)); >> page_list_for_each_safe(q, tmp, &p2m->pod.super) >> { >> @@ -448,12 +462,12 @@ p2m_pod_offline_or_broken_hit(struct pag >> } >> } >> >> - spin_unlock(&d->page_alloc_lock); >> + unlock_page_alloc(p2m); >> return 0; >> >> pod_hit: >> page_list_add_tail(p, &d->arch.relmem_list); >> - spin_unlock(&d->page_alloc_lock); >> + unlock_page_alloc(p2m); >> return 1; >> } >> >> @@ -994,7 +1008,7 @@ p2m_pod_demand_populate(struct p2m_domai >> if ( q == p2m_guest && gfn > p2m->pod.max_guest ) >> p2m->pod.max_guest = gfn; >> >> - spin_lock(&d->page_alloc_lock); >> + lock_page_alloc(p2m); >> >> if ( p2m->pod.count == 0 ) >> goto out_of_memory; >> @@ -1008,7 +1022,7 @@ p2m_pod_demand_populate(struct p2m_domai >> >> BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0); >> >> - spin_unlock(&d->page_alloc_lock); >> + unlock_page_alloc(p2m); >> >> gfn_aligned = (gfn >> order) << order; >> >> @@ -1040,7 +1054,7 @@ p2m_pod_demand_populate(struct p2m_domai >> >> return 0; >> out_of_memory: >> - spin_unlock(&d->page_alloc_lock); >> + unlock_page_alloc(p2m); >> >> printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " >> pod_entries %" PRIi32 "\n", >> __func__, d->tot_pages, p2m->pod.entry_count); >> @@ -1049,7 +1063,7 @@ out_fail: >> return -1; >> remap_and_retry: >> BUG_ON(order != PAGE_ORDER_2M); >> - spin_unlock(&d->page_alloc_lock); >> + unlock_page_alloc(p2m); >> >> /* Remap this 2-meg region in singleton chunks */ >> gfn_aligned = (gfn>>order)<<order; >> diff -r c915609e4235 -r 332775f72a30 xen/include/asm-x86/p2m.h >> --- a/xen/include/asm-x86/p2m.h >> +++ b/xen/include/asm-x86/p2m.h >> @@ -270,6 +270,10 @@ struct p2m_domain { >> * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid >> * double-demand-populating of pages, the page_alloc lock to >> * protect moving stuff from the PoD cache to the domain page >> list. >> + * >> + * We enforce this lock ordering through a construct in mm-locks.h. >> + * This demands, however, that we store the previous lock-ordering >> + * level in effect before grabbing the page_alloc lock. 
>> */ >> struct { >> struct page_list_head super, /* List of superpages >> */ >> @@ -279,6 +283,7 @@ struct p2m_domain { >> unsigned reclaim_super; /* Last gpfn of a scan */ >> unsigned reclaim_single; /* Last gpfn of a scan */ >> unsigned max_guest; /* gpfn of max guest >> demand-populate */ >> + int page_alloc_unlock_level; /* To enforce lock >> ordering */ >> } pod; >> }; >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xensource.com >> http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
andres@lagarcavilla.com
2011-Nov-02 14:04 UTC
Re: [Xen-devel] [PATCH 4 of 9] Rework locking in the PoD layer
All righty, I''ll include George from now on. The chunk at the bottom was an irrelevant refactor. I''l repost without it Good point on the asserts. I''ll come up with some constructs that are forwards-compatible with the fin-grained case. e.g: (ASSERT(p2m_gfn_locked_by_me(p2m, gfn)) Andres> At 00:33 -0400 on 27 Oct (1319675629), Andres Lagar-Cavilla wrote: >> The PoD layer has a fragile locking discipline. It relies on the >> p2m being globally locked, and it also relies on the page alloc >> lock to protect some of its data structures. Replace this all by an >> explicit pod lock: per p2m, order enforced. >> >> Two consequences: >> - Critical sections in the pod code protected by the page alloc >> lock are now reduced to modifications of the domain page list. >> - When the p2m lock becomes fine-grained, there are no >> assumptions broken in the PoD layer. >> >> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> > > The bulk of this looks OK to me, but will definitely need an Ack from > George Dunlap as well. Two comments: > >> diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/mm-locks.h >> --- a/xen/arch/x86/mm/mm-locks.h >> +++ b/xen/arch/x86/mm/mm-locks.h >> @@ -155,6 +155,15 @@ declare_mm_lock(p2m) >> #define p2m_unlock(p) mm_unlock(&(p)->lock) >> #define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) >> >> +/* PoD lock (per-p2m-table) >> + * >> + * Protects private PoD data structs. */ >> + >> +declare_mm_lock(pod) >> +#define pod_lock(p) mm_lock(pod, &(p)->pod.lock) >> +#define pod_unlock(p) mm_unlock(&(p)->pod.lock) >> +#define pod_locked_by_me(p) mm_locked_by_me(&(p)->pod.lock) > > Can the explanatory comment be more explicit about what it covers? It is > everything in the struct pod or just the page-lists that were mentioned > in the comment you removed from p2m.h? > >> @@ -841,7 +854,6 @@ p2m_pod_zero_check(struct p2m_domain *p2 >> if( *(map[i]+j) != 0 ) >> break; >> >> - unmap_domain_page(map[i]); >> >> /* See comment in p2m_pod_zero_check_superpage() re gnttab >> * check timing. */ >> @@ -849,8 +861,15 @@ p2m_pod_zero_check(struct p2m_domain *p2 >> { >> set_p2m_entry(p2m, gfns[i], mfns[i], PAGE_ORDER_4K, >> types[i], p2m->default_access); >> + unmap_domain_page(map[i]); >> + map[i] = NULL; >> } >> - else >> + } >> + >> + /* Finally, add to cache */ >> + for ( i=0; i < count; i++ ) >> + { >> + if ( map[i] ) >> { >> if ( tb_init_done ) >> { >> @@ -867,6 +886,8 @@ p2m_pod_zero_check(struct p2m_domain *p2 >> __trace_var(TRC_MEM_POD_ZERO_RECLAIM, 0, sizeof(t), >> &t); >> } >> >> + unmap_domain_page(map[i]); >> + >> /* Add to cache, and account for the new p2m PoD entry */ >> p2m_pod_cache_add(p2m, mfn_to_page(mfns[i]), >> PAGE_ORDER_4K); >> p2m->pod.entry_count++; > > That seems to be reshuffling the running order of this function but I > don''t see how it''s related to locking. Is this an unrelated change > that snuck in? > > (Oh, a third thing just occurred to me - might be worth making some of > those ''lock foo held on entry'' comments into ASSERT(lock_held_by_me()). ) > > Cheers, > > Tim. >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
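A sketch of such a forward-compatible construct: while the p2m lock is still global, the per-gfn check can ignore its gfn argument and later grow a real per-range test without touching callers. The macro name comes from Andres' example; the fallback definition is an assumption:

    /* Until the lock is split, "this gfn's range is locked by me" is just
     * "the p2m is locked by me"; the gfn argument is kept so a future
     * fine-grained implementation can check the actual range. */
    #define p2m_gfn_locked_by_me(p2m, gfn) p2m_locked_by_me(p2m)

    /* Callers then assert the precondition instead of stating it in a
     * comment, e.g. at the top of a function that expects the lock:
     *     ASSERT(p2m_gfn_locked_by_me(p2m, gfn));
     */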
andres@lagarcavilla.com
2011-Nov-02 14:20 UTC
Re: [Xen-devel] [PATCH 5 of 9] Fine-grained concurrency control structure for the p2m
Hey there, (many inlines on this one)> At 00:33 -0400 on 27 Oct (1319675630), Andres Lagar-Cavilla wrote: >> Introduce a fine-grained concurrency control structure for the p2m. This >> allows for locking 2M-aligned chunks of the p2m at a time, exclusively. >> Recursive locking is allowed. Global locking of the whole p2m is also >> allowed for certain operations. Simple deadlock detection heuristics are >> put in place. >> >> Note the patch creates backwards-compatible shortcuts that will lock the >> p2m globally. So it should remain functionally identical to what is >> currently >> in place. >> >> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> > > Wow. What a lot of code. :) I took a look through, but I can''t > guarantee to have got all the details. Things I saw: > > - You use atomic_t for the count but only ever update it under a > lock. :) If you just need to be sure of atomic writes, then > atomic_set will do that without using a locked increment/decrement.I''m a bit flaky on my atomics. And paranoid. I''ll be less lenient next time.> > - You allocate the bitmaps from xenheap - they should really be using > p2m memory, so as to avoid changing the memory overhead of the domain > as it runs. That will involve map_domain_page()ing the bitmaps as > you go, but at least on x86_64 that''s very cheap.p2m_alloc_ptp? Sure.> > - panic() on out-of-memory is pretty rude. >Yeah, but unwinding all possible lock callers to handle ENOMEM was over my threshold. Reality is that on your run-of-the-mill 4GB domain you have 4 or 5 single page allocations. You have bigger problems if that fails.> But stepping back, I''m not sure that we need all this just yet. I think > it would be worth doing the interface changes with a single p2m lock and > measuring how bad it is before getting stuck in to fine-grained locking > (fun though it might be).Completely agree. I think this will also ease adoption and bug isolation. It''ll allow me to be more gradual. I''ll rework the order. Thanks, very good.> > I suspect that if this is a contention point, allowing multiple readers > will become important, especially if there are particular pages that > often get emulated access. > > And also, I''d like to get some sort of plan for handling long-lived > foreign mappings, if only to make sure that this phase-1 fix doesn''t > conflict wih it. >If foreign mappings will hold a lock/ref on a p2m subrange, then they''ll disallow global operations, and you''ll get a clash between log-dirty and, say, qemu. Ka-blam live migration. Read-only foreign mappings are only problematic insofar paging happens. With proper p2m update/lookups serialization (global or fine-grained) that problem is gone. Write-able foreign mappings are trickier because of sharing and w^x. Is there a reason left, today, to not type PGT_writable an hvm-domain''s page when a foreign mapping happens? That would solve sharing problems. w^x really can''t be solved short of putting the vcpu on a waitqueue (preferable to me), or destroying the mapping and forcing the foreign OS to remap later. All a few steps ahead, I hope. Who/what''s using w^x by the way? If the refcount is zero, I think I know what I''ll do ;) That is my current "long term plan".> Oh, one more thing: > >> +/* Some deadlock book-keeping. Say CPU A holds a lock on range A, CPU B >> holds a >> + * lock on range B. Now, CPU A wants to lock range B and vice-versa. >> Deadlock. >> + * We detect this by remembering the start of the current locked range. 
>> + * We keep a fairly small stack of guards (8), because we don't >> anticipate >> + * a great deal of recursive locking because (a) recursive locking is >> rare >> + * (b) it is evil (c) only PoD seems to do it (is PoD therefore evil?) >> */ > > If PoD could be adjusted not to do it, could we get rid of all the > recursive locking entirely? That would simplify things a lot. >My comment is an exaggeration. In a fine-grained scenario, recursive locking happens massively throughout the code. We just need to live with it. I was ranting for free on the "evil" adjective. What is a real problem is that pod sweeps can cause deadlocks. There is a simple step to mitigate this: start the sweep from the current gfn and never wrap around -- too bad if the gfn is too high. But this alters the sweeping algorithm. I'll deal with it when its turn comes. Andres> Tim. > >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
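A sketch of the no-wraparound sweep, under the assumption that range locks are taken in increasing gfn order, so scanning only upwards from the gfn already held can never lock a range ordered before it. The direction, the one-page batches and the helper name are simplifications, not the real sweep code:

    static void pod_sweep_no_wrap(struct p2m_domain *p2m, unsigned long start_gfn)
    {
        unsigned long gfn;
        unsigned long limit = start_gfn + POD_SWEEP_LIMIT;

        /* Scan only gfns ordered after the one we already hold, and stop
         * at the limit (or the top of the p2m) instead of wrapping back
         * to gfn 0, which could lock "backwards" and deadlock against
         * another CPU's sweep. */
        for ( gfn = start_gfn + 1;
              gfn < limit && gfn <= p2m->max_mapped_pfn;
              gfn++ )
            p2m_pod_zero_check(p2m, &gfn, 1);
    }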
andres@lagarcavilla.com
2011-Nov-02 14:24 UTC
Re: [Xen-devel] [PATCH 8 of 9] Modify all internal p2m functions to use the new fine-grained locking
Hello,> Hi, > > At 00:33 -0400 on 27 Oct (1319675633), Andres Lagar-Cavilla wrote: >> This patch only modifies code internal to the p2m, adding convenience >> macros, etc. It will yield a compiling code base but an incorrect >> hypervisor (external callers of queries into the p2m will not unlock). >> Next patch takes care of external callers, split done for the benefit >> of conciseness. > > Better to do it the other way round: put the enormous change-all-callers > patch first, with noop unlock functions, and then hook in the unlocks. > That way you won''t cause chaos when people try to bisect to find when a > bug was introduced.Indeed, excellent point.> >> diff -r 8a98179666de -r 471d4f2754d6 xen/include/asm-x86/p2m.h >> --- a/xen/include/asm-x86/p2m.h >> +++ b/xen/include/asm-x86/p2m.h >> @@ -220,7 +220,7 @@ struct p2m_domain { >> * tables on every host-p2m change. The setter of this flag >> * is responsible for performing the full flush before releasing >> the >> * host p2m''s lock. */ >> - int defer_nested_flush; >> + atomic_t defer_nested_flush; >> >> /* Pages used to construct the p2m */ >> struct page_list_head pages; >> @@ -298,6 +298,15 @@ struct p2m_domain *p2m_get_p2m(struct vc >> #define p2m_get_pagetable(p2m) ((p2m)->phys_table) >> >> >> +/* No matter what value you get out of a query, the p2m has been locked >> for >> + * that range. No matter what you do, you need to drop those locks. >> + * You need to pass back the mfn obtained when locking, not the new >> one, >> + * as the refcount of the original mfn was bumped. */ > > Surely the caller doesn''t need to remember the old MFN for this? After > allm, the whole point of the lock was that nobody else could change the > p2m entry under our feet! > > In any case, I thing there needs to be a big block comment a bit futher > up that describes what all this locking and refcounting does, and why.Comment will be added. I was being doubly-paranoid. I can undo the get_page/put_page of the old mfn. I''m not a 100% behind it. Andres> >> +void drop_p2m_gfn(struct p2m_domain *p2m, unsigned long gfn, >> + unsigned long mfn); >> +#define drop_p2m_gfn_domain(d, g, m) \ >> + drop_p2m_gfn(p2m_get_hostp2m((d)), (g), (m)) >> + >> /* Read a particular P2M table, mapping pages as we go. Most callers >> * should _not_ call this directly; use the other gfn_to_mfn_* >> functions >> * below unless you know you want to walk a p2m that isn''t a domain''s >> @@ -327,6 +336,28 @@ static inline mfn_t gfn_to_mfn_type(stru >> #define gfn_to_mfn_guest(d, g, t) gfn_to_mfn_type((d), (g), (t), >> p2m_guest) >> #define gfn_to_mfn_unshare(d, g, t) gfn_to_mfn_type((d), (g), (t), >> p2m_unshare) >> >> +/* This one applies to very specific situations in which you''re >> querying >> + * a p2m entry and will be done "immediately" (such as a printk or >> computing a >> + * return value). Use this only if there are no expectations of the p2m >> entry >> + * holding steady. 
*/ >> +static inline mfn_t gfn_to_mfn_type_unlocked(struct domain *d, >> + unsigned long gfn, p2m_type_t >> *t, >> + p2m_query_t q) >> +{ >> + mfn_t mfn = gfn_to_mfn_type(d, gfn, t, q); >> + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); >> + return mfn; >> +} >> + >> +#define gfn_to_mfn_unlocked(d, g, t) \ >> + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_alloc) >> +#define gfn_to_mfn_query_unlocked(d, g, t) \ >> + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_query) >> +#define gfn_to_mfn_guest_unlocked(d, g, t) \ >> + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_guest) >> +#define gfn_to_mfn_unshare_unlocked(d, g, t) \ >> + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_unshare) >> + > > Ugh. This could really benefit from having the gfn_to_mfn_* functions > take a set of flags instead of an enum. This exponential blowup in > interface is going too far. :)I don''t think these names are the most terrible -- we''ve all seen far worse :) I mean, the naming encodes the arguments, and I don''t see an intrinsic advantage to gfn_to_mfn(d, g, t, p2m_guest, p2m_unlocked) over gfn_to_mfn_guest_unlocked(d,g,t) Andres> > That oughtn''t to stop this interface from going in, of course, but if > we''re going to tinker with the p2m callers once, we should do it all > together. > > Tim. >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
andres@lagarcavilla.com
2011-Nov-02 14:32 UTC
Re: [Xen-devel] [PATCH 9 of 9] Modify all call sites of queries into the p2m to use the new fine-grained locking
I don't know that a massive sed on all these names is a good idea. I guess forcing everyone to compile-fail will also make them realize they need to add a call to drop the p2m locks they got... Can you elaborate on the naming preferences here: would you prefer gfn_to_mfn/put_gfn? get_p2m_gfn/put_p2m_gfn? get_gfn/put_gfn? Andres> At 00:33 -0400 on 27 Oct (1319675634), Andres Lagar-Cavilla wrote: >> 28 files changed, 519 insertions(+), 101 deletions(-) > > And I thought patch 5 was big :) > > I'm not going to read the detail of this this time around - I'd like to > only have to review it once. :) > > I wonder whether it would be worth changing the name/signature of the > generic p2m functions in an incompatible way while we're there. It would > have three advantages: > > - allow the lookup/drop pairs to have nice matching names > - get rid of the confusingly-named 'gmfn_to_mfn' function > - avoid later bugs if patches are forward-ported across this change > that add p2m lookups (but not corresponding drops) > > Tim. >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
George Dunlap
2011-Nov-02 22:30 UTC
Re: [Xen-devel] [PATCH 4 of 9] Rework locking in the PoD layer
On Thu, Oct 27, 2011 at 1:33 PM, Andres Lagar-Cavilla <andres@lagarcavilla.org> wrote:> xen/arch/x86/mm/mm-locks.h | 9 ++ > xen/arch/x86/mm/p2m-pod.c | 145 +++++++++++++++++++++++++++------------------ > xen/arch/x86/mm/p2m-pt.c | 3 + > xen/arch/x86/mm/p2m.c | 7 +- > xen/include/asm-x86/p2m.h | 25 ++----- > 5 files changed, 113 insertions(+), 76 deletions(-) > > > The PoD layer has a fragile locking discipline. It relies on the > p2m being globally locked, and it also relies on the page alloc > lock to protect some of its data structures. Replace this all by an > explicit pod lock: per p2m, order enforced. > > Two consequences: > - Critical sections in the pod code protected by the page alloc > lock are now reduced to modifications of the domain page list. > - When the p2m lock becomes fine-grained, there are no > assumptions broken in the PoD layer. > > Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> > > diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/mm-locks.h > --- a/xen/arch/x86/mm/mm-locks.h > +++ b/xen/arch/x86/mm/mm-locks.h > @@ -155,6 +155,15 @@ declare_mm_lock(p2m) > #define p2m_unlock(p) mm_unlock(&(p)->lock) > #define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) > > +/* PoD lock (per-p2m-table) > + * > + * Protects private PoD data structs. */ > + > +declare_mm_lock(pod) > +#define pod_lock(p) mm_lock(pod, &(p)->pod.lock) > +#define pod_unlock(p) mm_unlock(&(p)->pod.lock) > +#define pod_locked_by_me(p) mm_locked_by_me(&(p)->pod.lock) > + > /* Page alloc lock (per-domain) > * > * This is an external lock, not represented by an mm_lock_t. However, > diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m-pod.c > --- a/xen/arch/x86/mm/p2m-pod.c > +++ b/xen/arch/x86/mm/p2m-pod.c > @@ -63,6 +63,7 @@ static inline void unlock_page_alloc(str > * Populate-on-demand functionality > */ > > +/* PoD lock held on entry */ > static int > p2m_pod_cache_add(struct p2m_domain *p2m, > struct page_info *page, > @@ -114,43 +115,42 @@ p2m_pod_cache_add(struct p2m_domain *p2m > unmap_domain_page(b); > } > > + /* First, take all pages off the domain list */ > lock_page_alloc(p2m); > - > - /* First, take all pages off the domain list */ > for(i=0; i < 1 << order ; i++) > { > p = page + i; > page_list_del(p, &d->page_list); > } > > - /* Then add the first one to the appropriate populate-on-demand list */ > - switch(order) > - { > - case PAGE_ORDER_2M: > - page_list_add_tail(page, &p2m->pod.super); /* lock: page_alloc */ > - p2m->pod.count += 1 << order; > - break; > - case PAGE_ORDER_4K: > - page_list_add_tail(page, &p2m->pod.single); /* lock: page_alloc */ > - p2m->pod.count += 1; > - break; > - default: > - BUG(); > - } > - > /* Ensure that the PoD cache has never been emptied. > * This may cause "zombie domains" since the page will never be freed. */ > BUG_ON( d->arch.relmem != RELMEM_not_started ); > > unlock_page_alloc(p2m); > > + /* Then add the first one to the appropriate populate-on-demand list */ > + switch(order) > + { > + case PAGE_ORDER_2M: > + page_list_add_tail(page, &p2m->pod.super); > + p2m->pod.count += 1 << order; > + break; > + case PAGE_ORDER_4K: > + page_list_add_tail(page, &p2m->pod.single); > + p2m->pod.count += 1; > + break; > + default: > + BUG(); > + } > + > return 0; > } > > /* Get a page of size order from the populate-on-demand cache. Will break > * down 2-meg pages into singleton pages automatically. Returns null if > - * a superpage is requested and no superpages are available. Must be called > - * with the d->page_lock held. 
*/ > + * a superpage is requested and no superpages are available. */ > +/* PoD lock held on entry */ > static struct page_info * p2m_pod_cache_get(struct p2m_domain *p2m, > unsigned long order) > { > @@ -185,7 +185,7 @@ static struct page_info * p2m_pod_cache_ > case PAGE_ORDER_2M: > BUG_ON( page_list_empty(&p2m->pod.super) ); > p = page_list_remove_head(&p2m->pod.super); > - p2m->pod.count -= 1 << order; /* Lock: page_alloc */ > + p2m->pod.count -= 1 << order; > break; > case PAGE_ORDER_4K: > BUG_ON( page_list_empty(&p2m->pod.single) ); > @@ -197,16 +197,19 @@ static struct page_info * p2m_pod_cache_ > } > > /* Put the pages back on the domain page_list */ > + lock_page_alloc(p2m); > for ( i = 0 ; i < (1 << order); i++ ) > { > BUG_ON(page_get_owner(p + i) != p2m->domain); > page_list_add_tail(p + i, &p2m->domain->page_list); > } > + unlock_page_alloc(p2m); > > return p; > } > > /* Set the size of the cache, allocating or freeing as necessary. */ > +/* PoD lock held on entry */ > static int > p2m_pod_set_cache_target(struct p2m_domain *p2m, unsigned long pod_target, int preemptible) > { > @@ -259,8 +262,6 @@ p2m_pod_set_cache_target(struct p2m_doma > > /* Grab the lock before checking that pod.super is empty, or the last > * entries may disappear before we grab the lock. */ > - lock_page_alloc(p2m); > - > if ( (p2m->pod.count - pod_target) > SUPERPAGE_PAGES > && !page_list_empty(&p2m->pod.super) ) > order = PAGE_ORDER_2M; > @@ -271,8 +272,6 @@ p2m_pod_set_cache_target(struct p2m_doma > > ASSERT(page != NULL); > > - unlock_page_alloc(p2m); > - > /* Then free them */ > for ( i = 0 ; i < (1 << order) ; i++ ) > { > @@ -348,7 +347,7 @@ p2m_pod_set_mem_target(struct domain *d, > int ret = 0; > unsigned long populated; > > - p2m_lock(p2m); > + pod_lock(p2m); > > /* P == B: Nothing to do. */ > if ( p2m->pod.entry_count == 0 ) > @@ -377,7 +376,7 @@ p2m_pod_set_mem_target(struct domain *d, > ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/); > > out: > - p2m_unlock(p2m); > + pod_unlock(p2m); > > return ret; > } > @@ -390,7 +389,7 @@ p2m_pod_empty_cache(struct domain *d) > > /* After this barrier no new PoD activities can happen. 
*/ > BUG_ON(!d->is_dying); > - spin_barrier(&p2m->lock.lock); > + spin_barrier(&p2m->pod.lock.lock); > > lock_page_alloc(p2m); > > @@ -431,7 +430,8 @@ p2m_pod_offline_or_broken_hit(struct pag > if ( !(d = page_get_owner(p)) || !(p2m = p2m_get_hostp2m(d)) ) > return 0; > > - lock_page_alloc(p2m); > + pod_lock(p2m); > + > bmfn = mfn_x(page_to_mfn(p)); > page_list_for_each_safe(q, tmp, &p2m->pod.super) > { > @@ -462,12 +462,14 @@ p2m_pod_offline_or_broken_hit(struct pag > } > } > > - unlock_page_alloc(p2m); > + pod_unlock(p2m); > return 0; > > pod_hit: > + lock_page_alloc(p2m); > page_list_add_tail(p, &d->arch.relmem_list); > unlock_page_alloc(p2m); > + pod_unlock(p2m); > return 1; > } > > @@ -486,9 +488,9 @@ p2m_pod_offline_or_broken_replace(struct > if ( unlikely(!p) ) > return; > > - p2m_lock(p2m); > + pod_lock(p2m); > p2m_pod_cache_add(p2m, p, PAGE_ORDER_4K); > - p2m_unlock(p2m); > + pod_unlock(p2m); > return; > } > > @@ -512,6 +514,7 @@ p2m_pod_decrease_reservation(struct doma > int steal_for_cache = 0; > int pod = 0, nonpod = 0, ram = 0; > > + pod_lock(p2m); > > /* If we don''t have any outstanding PoD entries, let things take their > * course */ > @@ -521,11 +524,10 @@ p2m_pod_decrease_reservation(struct doma > /* Figure out if we need to steal some freed memory for our cache */ > steal_for_cache = ( p2m->pod.entry_count > p2m->pod.count ); > > - p2m_lock(p2m); > audit_p2m(p2m, 1); > > if ( unlikely(d->is_dying) ) > - goto out_unlock; > + goto out; > > /* See what''s in here. */ > /* FIXME: Add contiguous; query for PSE entries? */I don''t think this can be quite right. The point of holding the p2m lock here is so that the p2m entries don''t change between the gfn_to_mfn_query() here and the set_p2m_entries() below. The balloon driver racing with other vcpus populating pages is exactly the kind of race we expect to experience. And in any case, this change will cause set_p2m_entry() to ASSERT() because we''re not holding the p2m lock. Or am I missing something? I haven''t yet looked at the rest of the patch series, but it would definitely be better for people in the future looking back and trying to figure out why the code is the way that it is if even transitory changesets don''t introduce "temporary" violations of invariants. :-)> @@ -547,14 +549,14 @@ p2m_pod_decrease_reservation(struct doma > > /* No populate-on-demand? Don''t need to steal anything? Then we''re done!*/ > if(!pod && !steal_for_cache) > - goto out_unlock; > + goto out_audit; > > if ( !nonpod ) > { > /* All PoD: Mark the whole region invalid and tell caller > * we''re done. 
*/ > set_p2m_entry(p2m, gpfn, _mfn(INVALID_MFN), order, p2m_invalid, p2m->default_access); > - p2m->pod.entry_count-=(1<<order); /* Lock: p2m */ > + p2m->pod.entry_count-=(1<<order); > BUG_ON(p2m->pod.entry_count < 0); > ret = 1; > goto out_entry_check; > @@ -577,7 +579,7 @@ p2m_pod_decrease_reservation(struct doma > if ( t == p2m_populate_on_demand ) > { > set_p2m_entry(p2m, gpfn + i, _mfn(INVALID_MFN), 0, p2m_invalid, p2m->default_access); > - p2m->pod.entry_count--; /* Lock: p2m */ > + p2m->pod.entry_count--; > BUG_ON(p2m->pod.entry_count < 0); > pod--; > } > @@ -613,11 +615,11 @@ out_entry_check: > p2m_pod_set_cache_target(p2m, p2m->pod.entry_count, 0/*can''t preempt*/); > } > > -out_unlock: > +out_audit: > audit_p2m(p2m, 1); > - p2m_unlock(p2m); > > out: > + pod_unlock(p2m); > return ret; > } > > @@ -630,20 +632,24 @@ void p2m_pod_dump_data(struct domain *d) > > > /* Search for all-zero superpages to be reclaimed as superpages for the > - * PoD cache. Must be called w/ p2m lock held, page_alloc lock not held. */ > -static int > + * PoD cache. Must be called w/ pod lock held, page_alloc lock not held. */ > +static voidFor the same reason, this must be called with the p2m lock held: it calls gfn_to_mfn_query() and then calls set_p2m_entry(). As it happens, this always *is* called with the p2m lock held at the moment; but the comment still needs to reflect this. Similarly in p2m_pod_zero_check().> p2m_pod_zero_check_superpage(struct p2m_domain *p2m, unsigned long gfn) > { > mfn_t mfn, mfn0 = _mfn(INVALID_MFN); > p2m_type_t type, type0 = 0; > unsigned long * map = NULL; > - int ret=0, reset = 0; > + int success = 0, reset = 0; > int i, j; > int max_ref = 1; > struct domain *d = p2m->domain; > > if ( !superpage_aligned(gfn) ) > - goto out; > + return; > + > + /* If we were enforcing ordering against p2m locks, this is a place > + * to drop the PoD lock and re-acquire it once we''re done mucking with > + * the p2m. */ > > /* Allow an extra refcount for one shadow pt mapping in shadowed domains */ > if ( paging_mode_shadow(d) ) > @@ -751,19 +757,24 @@ p2m_pod_zero_check_superpage(struct p2m_ > __trace_var(TRC_MEM_POD_ZERO_RECLAIM, 0, sizeof(t), &t); > } > > - /* Finally! We''ve passed all the checks, and can add the mfn superpage > - * back on the PoD cache, and account for the new p2m PoD entries */ > - p2m_pod_cache_add(p2m, mfn_to_page(mfn0), PAGE_ORDER_2M); > - p2m->pod.entry_count += SUPERPAGE_PAGES; > + success = 1; > + > > out_reset: > if ( reset ) > set_p2m_entry(p2m, gfn, mfn0, 9, type0, p2m->default_access); > > out: > - return ret; > + if ( success ) > + { > + /* Finally! We''ve passed all the checks, and can add the mfn superpage > + * back on the PoD cache, and account for the new p2m PoD entries */ > + p2m_pod_cache_add(p2m, mfn_to_page(mfn0), PAGE_ORDER_2M); > + p2m->pod.entry_count += SUPERPAGE_PAGES; > + } > } > > +/* On entry, PoD lock is held */ > static void > p2m_pod_zero_check(struct p2m_domain *p2m, unsigned long *gfns, int count) > { > @@ -775,6 +786,8 @@ p2m_pod_zero_check(struct p2m_domain *p2 > int i, j; > int max_ref = 1; > > + /* Also the right time to drop pod_lock if enforcing ordering against p2m_lock */ > + > /* Allow an extra refcount for one shadow pt mapping in shadowed domains */ > if ( paging_mode_shadow(d) ) > max_ref++; > @@ -841,7 +854,6 @@ p2m_pod_zero_check(struct p2m_domain *p2 > if( *(map[i]+j) != 0 ) > break; > > - unmap_domain_page(map[i]); > > /* See comment in p2m_pod_zero_check_superpage() re gnttab > * check timing. 
*/ > @@ -849,8 +861,15 @@ p2m_pod_zero_check(struct p2m_domain *p2 > { > set_p2m_entry(p2m, gfns[i], mfns[i], PAGE_ORDER_4K, > types[i], p2m->default_access); > + unmap_domain_page(map[i]); > + map[i] = NULL; > } > - else > + } > + > + /* Finally, add to cache */ > + for ( i=0; i < count; i++ ) > + { > + if ( map[i] ) > { > if ( tb_init_done ) > { > @@ -867,6 +886,8 @@ p2m_pod_zero_check(struct p2m_domain *p2 > __trace_var(TRC_MEM_POD_ZERO_RECLAIM, 0, sizeof(t), &t); > } > > + unmap_domain_page(map[i]); > + > /* Add to cache, and account for the new p2m PoD entry */ > p2m_pod_cache_add(p2m, mfn_to_page(mfns[i]), PAGE_ORDER_4K); > p2m->pod.entry_count++; > @@ -876,6 +897,7 @@ p2m_pod_zero_check(struct p2m_domain *p2 > } > > #define POD_SWEEP_LIMIT 1024 > +/* Only one CPU at a time is guaranteed to enter a sweep */ > static void > p2m_pod_emergency_sweep_super(struct p2m_domain *p2m) > { > @@ -964,7 +986,8 @@ p2m_pod_demand_populate(struct p2m_domai > > ASSERT(p2m_locked_by_me(p2m)); > > - /* This check is done with the p2m lock held. This will make sure that > + pod_lock(p2m); > + /* This check is done with the pod lock held. This will make sure that > * even if d->is_dying changes under our feet, p2m_pod_empty_cache() > * won''t start until we''re done. */ > if ( unlikely(d->is_dying) ) > @@ -974,6 +997,7 @@ p2m_pod_demand_populate(struct p2m_domai > * 1GB region to 2MB chunks for a retry. */ > if ( order == PAGE_ORDER_1G ) > { > + pod_unlock(p2m); > gfn_aligned = (gfn >> order) << order; > /* Note that we are supposed to call set_p2m_entry() 512 times to > * split 1GB into 512 2MB pages here. But We only do once here because > @@ -983,6 +1007,7 @@ p2m_pod_demand_populate(struct p2m_domai > set_p2m_entry(p2m, gfn_aligned, _mfn(0), PAGE_ORDER_2M, > p2m_populate_on_demand, p2m->default_access); > audit_p2m(p2m, 1); > + /* This is because the ept/pt caller locks the p2m recursively */ > p2m_unlock(p2m); > return 0; > } > @@ -996,11 +1021,15 @@ p2m_pod_demand_populate(struct p2m_domai > > /* If we''re low, start a sweep */ > if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) > + /* Note that sweeps scan other ranges in the p2m. 
In an scenario > + * in which p2m locks are order-enforced wrt pod lock and p2m > + * locks are fine grained, this will result in deadlock */ > p2m_pod_emergency_sweep_super(p2m); > > if ( page_list_empty(&p2m->pod.single) && > ( ( order == PAGE_ORDER_4K ) > || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) ) > + /* Same comment regarding deadlock applies */ > p2m_pod_emergency_sweep(p2m); > } > > @@ -1008,8 +1037,6 @@ p2m_pod_demand_populate(struct p2m_domai > if ( q == p2m_guest && gfn > p2m->pod.max_guest ) > p2m->pod.max_guest = gfn; > > - lock_page_alloc(p2m); > - > if ( p2m->pod.count == 0 ) > goto out_of_memory; > > @@ -1022,8 +1049,6 @@ p2m_pod_demand_populate(struct p2m_domai > > BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0); > > - unlock_page_alloc(p2m); > - > gfn_aligned = (gfn >> order) << order; > > set_p2m_entry(p2m, gfn_aligned, mfn, order, p2m_ram_rw, p2m->default_access); > @@ -1034,8 +1059,9 @@ p2m_pod_demand_populate(struct p2m_domai > paging_mark_dirty(d, mfn_x(mfn) + i); > } > > - p2m->pod.entry_count -= (1 << order); /* Lock: p2m */ > + p2m->pod.entry_count -= (1 << order); > BUG_ON(p2m->pod.entry_count < 0); > + pod_unlock(p2m); > > if ( tb_init_done ) > { > @@ -1054,16 +1080,17 @@ p2m_pod_demand_populate(struct p2m_domai > > return 0; > out_of_memory: > - unlock_page_alloc(p2m); > + pod_unlock(p2m); > > printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n", > __func__, d->tot_pages, p2m->pod.entry_count); > domain_crash(d); > out_fail: > + pod_unlock(p2m); > return -1; > remap_and_retry: > BUG_ON(order != PAGE_ORDER_2M); > - unlock_page_alloc(p2m); > + pod_unlock(p2m); > > /* Remap this 2-meg region in singleton chunks */ > gfn_aligned = (gfn>>order)<<order; > @@ -1133,9 +1160,11 @@ guest_physmap_mark_populate_on_demand(st > rc = -EINVAL; > else > { > - p2m->pod.entry_count += 1 << order; /* Lock: p2m */ > + pod_lock(p2m); > + p2m->pod.entry_count += 1 << order; > p2m->pod.entry_count -= pod_count; > BUG_ON(p2m->pod.entry_count < 0); > + pod_unlock(p2m); > } > > audit_p2m(p2m, 1); > diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m-pt.c > --- a/xen/arch/x86/mm/p2m-pt.c > +++ b/xen/arch/x86/mm/p2m-pt.c > @@ -1001,6 +1001,7 @@ void audit_p2m(struct p2m_domain *p2m, i > if ( !paging_mode_translate(d) ) > return; > > + pod_lock(p2m); > //P2M_PRINTK("p2m audit starts\n"); > > test_linear = ( (d == current->domain) > @@ -1247,6 +1248,8 @@ void audit_p2m(struct p2m_domain *p2m, i > pmbad, mpbad); > WARN(); > } > + > + pod_unlock(p2m); > } > #endif /* P2M_AUDIT */ > > diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m.c > --- a/xen/arch/x86/mm/p2m.c > +++ b/xen/arch/x86/mm/p2m.c > @@ -72,6 +72,7 @@ boolean_param("hap_2mb", opt_hap_2mb); > static void p2m_initialise(struct domain *d, struct p2m_domain *p2m) > { > mm_lock_init(&p2m->lock); > + mm_lock_init(&p2m->pod.lock); > INIT_LIST_HEAD(&p2m->np2m_list); > INIT_PAGE_LIST_HEAD(&p2m->pages); > INIT_PAGE_LIST_HEAD(&p2m->pod.super); > @@ -506,8 +507,10 @@ guest_physmap_add_entry(struct domain *d > rc = -EINVAL; > else > { > - p2m->pod.entry_count -= pod_count; /* Lock: p2m */ > + pod_lock(p2m); > + p2m->pod.entry_count -= pod_count; > BUG_ON(p2m->pod.entry_count < 0); > + pod_unlock(p2m); > } > } > > @@ -1125,8 +1128,10 @@ p2m_flush_table(struct p2m_domain *p2m) > /* "Host" p2m tables can have shared entries &c that need a bit more > * care when discarding them */ > ASSERT(p2m_is_nestedp2m(p2m)); > + pod_lock(p2m); > ASSERT(page_list_empty(&p2m->pod.super)); 
> ASSERT(page_list_empty(&p2m->pod.single)); > + pod_unlock(p2m); > > /* This is no longer a valid nested p2m for any address space */ > p2m->cr3 = CR3_EADDR; > diff -r 332775f72a30 -r 981073d78f7f xen/include/asm-x86/p2m.h > --- a/xen/include/asm-x86/p2m.h > +++ b/xen/include/asm-x86/p2m.h > @@ -257,24 +257,13 @@ struct p2m_domain { > unsigned long max_mapped_pfn; > > /* Populate-on-demand variables > - * NB on locking. {super,single,count} are > - * covered by d->page_alloc_lock, since they''re almost always used in > - * conjunction with that functionality. {entry_count} is covered by > - * the domain p2m lock, since it''s almost always used in conjunction > - * with changing the p2m tables. > * > - * At this point, both locks are held in two places. In both, > - * the order is [p2m,page_alloc]: > - * + p2m_pod_decrease_reservation() calls p2m_pod_cache_add(), > - * which grabs page_alloc > - * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid > - * double-demand-populating of pages, the page_alloc lock to > - * protect moving stuff from the PoD cache to the domain page list. > - * > - * We enforce this lock ordering through a construct in mm-locks.h. > - * This demands, however, that we store the previous lock-ordering > - * level in effect before grabbing the page_alloc lock. > - */ > + * All variables are protected with the pod lock. We cannot rely on > + * the p2m lock if it''s turned into a fine-grained lock. > + * We only use the domain page_alloc lock for additions and > + * deletions to the domain''s page list. Because we use it nested > + * within the PoD lock, we enforce it''s ordering (by remembering > + * the unlock level). */ > struct { > struct page_list_head super, /* List of superpages */ > single; /* Non-super lists */ > @@ -283,6 +272,8 @@ struct p2m_domain { > unsigned reclaim_super; /* Last gpfn of a scan */ > unsigned reclaim_single; /* Last gpfn of a scan */ > unsigned max_guest; /* gpfn of max guest demand-populate */ > + mm_lock_t lock; /* Locking of private pod structs, * > + * not relying on the p2m lock. */ > int page_alloc_unlock_level; /* To enforce lock ordering */ > } pod; > }; > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
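For illustration only (not code from this series): a minimal, self-contained sketch of the ordering rule described in the p2m.h comment above, in which the domain's page_alloc lock is only ever taken nested inside the PoD lock, and the level in effect beforehand is remembered so the unlock can restore it. Pthread mutexes and a thread-local level counter stand in for Xen's mm-locks machinery; every name here is made up.

#include <assert.h>
#include <pthread.h>

enum { LEVEL_NONE = 0, LEVEL_POD = 10, LEVEL_PAGE_ALLOC = 20 };

static _Thread_local int cur_level;      /* deepest lock level currently held */

struct toy_p2m {
    pthread_mutex_t pod_lock;            /* stands in for p2m->pod.lock */
    pthread_mutex_t page_alloc_lock;     /* stands in for d->page_alloc_lock */
    int page_alloc_unlock_level;         /* level to restore on unlock */
};

static void pod_lock(struct toy_p2m *p2m)
{
    assert(cur_level < LEVEL_POD);       /* ordering: nothing deeper held */
    pthread_mutex_lock(&p2m->pod_lock);
    cur_level = LEVEL_POD;
}

static void pod_unlock(struct toy_p2m *p2m)
{
    assert(cur_level == LEVEL_POD);
    cur_level = LEVEL_NONE;
    pthread_mutex_unlock(&p2m->pod_lock);
}

static void lock_page_alloc(struct toy_p2m *p2m)
{
    assert(cur_level < LEVEL_PAGE_ALLOC);
    p2m->page_alloc_unlock_level = cur_level;   /* remember what was held */
    pthread_mutex_lock(&p2m->page_alloc_lock);
    cur_level = LEVEL_PAGE_ALLOC;
}

static void unlock_page_alloc(struct toy_p2m *p2m)
{
    cur_level = p2m->page_alloc_unlock_level;   /* restore the saved level */
    pthread_mutex_unlock(&p2m->page_alloc_lock);
}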
George Dunlap
2011-Nov-02 23:00 UTC
Re: [Xen-devel] [PATCH 8 of 9] Modify all internal p2m functions to use the new fine-grained locking
On Thu, Oct 27, 2011 at 1:33 PM, Andres Lagar-Cavilla <andres@lagarcavilla.org> wrote:> xen/arch/x86/mm/hap/hap.c | 2 +- > xen/arch/x86/mm/hap/nested_hap.c | 21 ++- > xen/arch/x86/mm/p2m-ept.c | 26 +---- > xen/arch/x86/mm/p2m-pod.c | 42 +++++-- > xen/arch/x86/mm/p2m-pt.c | 20 +--- > xen/arch/x86/mm/p2m.c | 185 ++++++++++++++++++++++++-------------- > xen/include/asm-ia64/mm.h | 5 + > xen/include/asm-x86/p2m.h | 45 +++++++++- > 8 files changed, 217 insertions(+), 129 deletions(-) > > > This patch only modifies code internal to the p2m, adding convenience > macros, etc. It will yield a compiling code base but an incorrect > hypervisor (external callers of queries into the p2m will not unlock). > Next patch takes care of external callers, split done for the benefit > of conciseness.It''s not obvious to me where in this patch to find a description of what the new locking regime is. What does the _unlocked() mean? When do I have to call that vs a different one, and when do I have to lock / unlock / whatever? I think that should ideally be both in the commit message (at least a summary), and also in a comment in a header somewhere. Perhaps it is already in the patch somewhere, but a quick glance through didn''t find it...> > Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> > > diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/hap/hap.c > --- a/xen/arch/x86/mm/hap/hap.c > +++ b/xen/arch/x86/mm/hap/hap.c > @@ -861,7 +861,7 @@ hap_write_p2m_entry(struct vcpu *v, unsi > old_flags = l1e_get_flags(*p); > > if ( nestedhvm_enabled(d) && (old_flags & _PAGE_PRESENT) > - && !p2m_get_hostp2m(d)->defer_nested_flush ) { > + && !atomic_read(&(p2m_get_hostp2m(d)->defer_nested_flush)) ) { > /* We are replacing a valid entry so we need to flush nested p2ms, > * unless the only change is an increase in access rights. */ > mfn_t omfn = _mfn(l1e_get_pfn(*p)); > diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/hap/nested_hap.c > --- a/xen/arch/x86/mm/hap/nested_hap.c > +++ b/xen/arch/x86/mm/hap/nested_hap.c > @@ -105,8 +105,6 @@ nestedhap_fix_p2m(struct vcpu *v, struct > ASSERT(p2m); > ASSERT(p2m->set_entry); > > - p2m_lock(p2m); > - > /* If this p2m table has been flushed or recycled under our feet, > * leave it alone. We''ll pick up the right one as we try to > * vmenter the guest. 
*/ > @@ -122,11 +120,13 @@ nestedhap_fix_p2m(struct vcpu *v, struct > gfn = (L2_gpa >> PAGE_SHIFT) & mask; > mfn = _mfn((L0_gpa >> PAGE_SHIFT) & mask); > > + /* Not bumping refcount of pages underneath because we''re getting > + * rid of whatever was there */ > + get_p2m(p2m, gfn, page_order); > rv = set_p2m_entry(p2m, gfn, mfn, page_order, p2mt, p2ma); > + put_p2m(p2m, gfn, page_order); > } > > - p2m_unlock(p2m); > - > if (rv == 0) { > gdprintk(XENLOG_ERR, > "failed to set entry for 0x%"PRIx64" -> 0x%"PRIx64"\n", > @@ -146,19 +146,26 @@ nestedhap_walk_L0_p2m(struct p2m_domain > mfn_t mfn; > p2m_type_t p2mt; > p2m_access_t p2ma; > + int rc; > > /* walk L0 P2M table */ > mfn = gfn_to_mfn_type_p2m(p2m, L1_gpa >> PAGE_SHIFT, &p2mt, &p2ma, > p2m_query, page_order); > > + rc = NESTEDHVM_PAGEFAULT_ERROR; > if ( p2m_is_paging(p2mt) || p2m_is_shared(p2mt) || !p2m_is_ram(p2mt) ) > - return NESTEDHVM_PAGEFAULT_ERROR; > + goto out; > > + rc = NESTEDHVM_PAGEFAULT_ERROR; > if ( !mfn_valid(mfn) ) > - return NESTEDHVM_PAGEFAULT_ERROR; > + goto out; > > *L0_gpa = (mfn_x(mfn) << PAGE_SHIFT) + (L1_gpa & ~PAGE_MASK); > - return NESTEDHVM_PAGEFAULT_DONE; > + rc = NESTEDHVM_PAGEFAULT_DONE; > + > +out: > + drop_p2m_gfn(p2m, L1_gpa >> PAGE_SHIFT, mfn_x(mfn)); > + return rc; > } > > /* This function uses L2_gpa to walk the P2M page table in L1. If the > diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/p2m-ept.c > --- a/xen/arch/x86/mm/p2m-ept.c > +++ b/xen/arch/x86/mm/p2m-ept.c > @@ -43,29 +43,16 @@ > #define is_epte_present(ept_entry) ((ept_entry)->epte & 0x7) > #define is_epte_superpage(ept_entry) ((ept_entry)->sp) > > -/* Non-ept "lock-and-check" wrapper */ > +/* Ept-specific check wrapper */ > static int ept_pod_check_and_populate(struct p2m_domain *p2m, unsigned long gfn, > ept_entry_t *entry, int order, > p2m_query_t q) > { > - int r; > - > - /* This is called from the p2m lookups, which can happen with or > - * without the lock hed. */ > - p2m_lock_recursive(p2m); > - > /* Check to make sure this is still PoD */ > if ( entry->sa_p2mt != p2m_populate_on_demand ) > - { > - p2m_unlock(p2m); > return 0; > - } > > - r = p2m_pod_demand_populate(p2m, gfn, order, q); > - > - p2m_unlock(p2m); > - > - return r; > + return p2m_pod_demand_populate(p2m, gfn, order, q); > } > > static void ept_p2m_type_to_flags(ept_entry_t *entry, p2m_type_t type, p2m_access_t access) > @@ -265,9 +252,9 @@ static int ept_next_level(struct p2m_dom > > ept_entry = (*table) + index; > > - /* ept_next_level() is called (sometimes) without a lock. Read > + /* ept_next_level() is called (never) without a lock. Read > * the entry once, and act on the "cached" entry after that to > - * avoid races. */ > + * avoid races. 
AAA */ > e = atomic_read_ept_entry(ept_entry); > > if ( !is_epte_present(&e) ) > @@ -733,7 +720,8 @@ void ept_change_entry_emt_with_range(str > int order = 0; > struct p2m_domain *p2m = p2m_get_hostp2m(d); > > - p2m_lock(p2m); > + /* This is a global operation, essentially */ > + get_p2m_global(p2m); > for ( gfn = start_gfn; gfn <= end_gfn; gfn++ ) > { > int level = 0; > @@ -773,7 +761,7 @@ void ept_change_entry_emt_with_range(str > ept_set_entry(p2m, gfn, mfn, order, e.sa_p2mt, e.access); > } > } > - p2m_unlock(p2m); > + put_p2m_global(p2m); > } > > /* > diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/p2m-pod.c > --- a/xen/arch/x86/mm/p2m-pod.c > +++ b/xen/arch/x86/mm/p2m-pod.c > @@ -102,8 +102,6 @@ p2m_pod_cache_add(struct p2m_domain *p2m > } > #endif > > - ASSERT(p2m_locked_by_me(p2m)); > - > /* > * Pages from domain_alloc and returned by the balloon driver aren''t > * guaranteed to be zero; but by reclaiming zero pages, we implicitly > @@ -536,7 +534,7 @@ p2m_pod_decrease_reservation(struct doma > { > p2m_type_t t; > > - gfn_to_mfn_query(d, gpfn + i, &t); > + gfn_to_mfn_query_unlocked(d, gpfn + i, &t); > > if ( t == p2m_populate_on_demand ) > pod++;The rest of the code makes it seem like gfn_to_mfn_query() will grab the p2m lock for a range, but the _unlocked() version will not. Is that correct? Shouldn''t this remain as it is then?> @@ -602,6 +600,7 @@ p2m_pod_decrease_reservation(struct doma > nonpod--; > ram--; > } > + drop_p2m_gfn(p2m, gpfn + i, mfn_x(mfn)); > }And how does this fit with the _query() call above?> > /* If there are no more non-PoD entries, tell decrease_reservation() that > @@ -661,12 +660,15 @@ p2m_pod_zero_check_superpage(struct p2m_ > for ( i=0; i<SUPERPAGE_PAGES; i++ ) > { > > - mfn = gfn_to_mfn_query(d, gfn + i, &type); > - > if ( i == 0 ) > { > + /* Only lock the p2m entry the first time, that will lock > + * server for the whole superpage */ > + mfn = gfn_to_mfn_query(d, gfn + i, &type); > mfn0 = mfn; > type0 = type; > + } else { > + mfn = gfn_to_mfn_query_unlocked(d, gfn + i, &type); > } > > /* Conditions that must be met for superpage-superpage: > @@ -773,6 +775,10 @@ out: > p2m_pod_cache_add(p2m, mfn_to_page(mfn0), PAGE_ORDER_2M); > p2m->pod.entry_count += SUPERPAGE_PAGES; > } > + > + /* We got p2m locks once for the whole superpage, with the original > + * mfn0. We drop it here. */ > + drop_p2m_gfn(p2m, gfn, mfn_x(mfn0)); > } > > /* On entry, PoD lock is held */ > @@ -894,6 +900,12 @@ p2m_pod_zero_check(struct p2m_domain *p2 > p2m->pod.entry_count++; > } > } > + > + /* Drop all p2m locks and references */ > + for ( i=0; i<count; i++ ) > + { > + drop_p2m_gfn(p2m, gfns[i], mfn_x(mfns[i])); > + } > > } > > @@ -928,7 +940,9 @@ p2m_pod_emergency_sweep_super(struct p2m > p2m->pod.reclaim_super = i ? i - SUPERPAGE_PAGES : 0; > } > > -#define POD_SWEEP_STRIDE 16 > +/* Note that spinlock recursion counters have 4 bits, so 16 or higher > + * will overflow a single 2M spinlock in a zero check. 
*/ > +#define POD_SWEEP_STRIDE 15 > static void > p2m_pod_emergency_sweep(struct p2m_domain *p2m) > { > @@ -946,7 +960,7 @@ p2m_pod_emergency_sweep(struct p2m_domai > /* FIXME: Figure out how to avoid superpages */ > for ( i=p2m->pod.reclaim_single; i > 0 ; i-- ) > { > - gfn_to_mfn_query(p2m->domain, i, &t ); > + gfn_to_mfn_query_unlocked(p2m->domain, i, &t ); > if ( p2m_is_ram(t) ) > { > gfns[j] = i; > @@ -974,6 +988,7 @@ p2m_pod_emergency_sweep(struct p2m_domai > > } > > +/* The gfn and order need to be locked in the p2m before you walk in here */ > int > p2m_pod_demand_populate(struct p2m_domain *p2m, unsigned long gfn, > unsigned int order, > @@ -985,8 +1000,6 @@ p2m_pod_demand_populate(struct p2m_domai > mfn_t mfn; > int i; > > - ASSERT(p2m_locked_by_me(p2m)); > - > pod_lock(p2m); > /* This check is done with the pod lock held. This will make sure that > * even if d->is_dying changes under our feet, p2m_pod_empty_cache() > @@ -1008,8 +1021,6 @@ p2m_pod_demand_populate(struct p2m_domai > set_p2m_entry(p2m, gfn_aligned, _mfn(0), PAGE_ORDER_2M, > p2m_populate_on_demand, p2m->default_access); > audit_p2m(p2m, 1); > - /* This is because the ept/pt caller locks the p2m recursively */ > - p2m_unlock(p2m); > return 0; > } > > @@ -1132,7 +1143,9 @@ guest_physmap_mark_populate_on_demand(st > if ( rc != 0 ) > return rc; > > - p2m_lock(p2m); > + /* Pre-lock all the p2m entries. We don''t take refs to the > + * pages, because there shouldn''t be any pages underneath. */ > + get_p2m(p2m, gfn, order); > audit_p2m(p2m, 1); > > P2M_DEBUG("mark pod gfn=%#lx\n", gfn); > @@ -1140,7 +1153,8 @@ guest_physmap_mark_populate_on_demand(st > /* Make sure all gpfns are unused */ > for ( i = 0; i < (1UL << order); i++ ) > { > - omfn = gfn_to_mfn_query(d, gfn + i, &ot); > + p2m_access_t a; > + omfn = p2m->get_entry(p2m, gfn + i, &ot, &a, p2m_query, NULL); > if ( p2m_is_ram(ot) ) > { > printk("%s: gfn_to_mfn returned type %d!\n", > @@ -1169,9 +1183,9 @@ guest_physmap_mark_populate_on_demand(st > } > > audit_p2m(p2m, 1); > - p2m_unlock(p2m); > > out: > + put_p2m(p2m, gfn, order); > return rc; > } > > diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/p2m-pt.c > --- a/xen/arch/x86/mm/p2m-pt.c > +++ b/xen/arch/x86/mm/p2m-pt.c > @@ -487,31 +487,16 @@ out: > } > > > -/* Non-ept "lock-and-check" wrapper */ > +/* PT-specific check wrapper */ > static int p2m_pod_check_and_populate(struct p2m_domain *p2m, unsigned long gfn, > l1_pgentry_t *p2m_entry, int order, > p2m_query_t q) > { > - int r; > - > - /* This is called from the p2m lookups, which can happen with or > - * without the lock hed. */ > - p2m_lock_recursive(p2m); > - audit_p2m(p2m, 1); > - > /* Check to make sure this is still PoD */ > if ( p2m_flags_to_type(l1e_get_flags(*p2m_entry)) != p2m_populate_on_demand ) > - { > - p2m_unlock(p2m); > return 0; > - } > > - r = p2m_pod_demand_populate(p2m, gfn, order, q); > - > - audit_p2m(p2m, 1); > - p2m_unlock(p2m); > - > - return r; > + return p2m_pod_demand_populate(p2m, gfn, order, q); > } > > /* Read the current domain''s p2m table (through the linear mapping). 
*/ > @@ -894,6 +879,7 @@ static void p2m_change_type_global(struc > if ( pagetable_get_pfn(p2m_get_pagetable(p2m)) == 0 ) > return; > > + /* Checks for exclusive lock */ > ASSERT(p2m_locked_by_me(p2m)); > > #if CONFIG_PAGING_LEVELS == 4 > diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/p2m.c > --- a/xen/arch/x86/mm/p2m.c > +++ b/xen/arch/x86/mm/p2m.c > @@ -143,9 +143,9 @@ void p2m_change_entry_type_global(struct > p2m_type_t ot, p2m_type_t nt) > { > struct p2m_domain *p2m = p2m_get_hostp2m(d); > - p2m_lock(p2m); > + get_p2m_global(p2m); > p2m->change_entry_type_global(p2m, ot, nt); > - p2m_unlock(p2m); > + put_p2m_global(p2m); > } > > mfn_t gfn_to_mfn_type_p2m(struct p2m_domain *p2m, unsigned long gfn, > @@ -162,12 +162,17 @@ mfn_t gfn_to_mfn_type_p2m(struct p2m_dom > return _mfn(gfn); > } > > + /* We take the lock for this single gfn. The caller has to put this lock */ > + get_p2m_gfn(p2m, gfn); > + > mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order); > > #ifdef __x86_64__ > if ( q == p2m_unshare && p2m_is_shared(*t) ) > { > ASSERT(!p2m_is_nestedp2m(p2m)); > + /* p2m locking is recursive, so we won''t deadlock going > + * into the sharing code */ > mem_sharing_unshare_page(p2m->domain, gfn, 0); > mfn = p2m->get_entry(p2m, gfn, t, a, q, page_order); > } > @@ -179,13 +184,28 @@ mfn_t gfn_to_mfn_type_p2m(struct p2m_dom > /* Return invalid_mfn to avoid caller''s access */ > mfn = _mfn(INVALID_MFN); > if (q == p2m_guest) > + { > + put_p2m_gfn(p2m, gfn); > domain_crash(p2m->domain); > + } > } > #endif > > + /* Take an extra reference to the page. It won''t disappear beneath us */ > + if ( mfn_valid(mfn) ) > + { > + /* Use this because we don''t necessarily know who owns the page */ > + if ( !page_get_owner_and_reference(mfn_to_page(mfn)) ) > + { > + mfn = _mfn(INVALID_MFN); > + } > + } > + > + /* We leave holding the p2m lock for this gfn */ > return mfn; > } > > +/* Appropriate locks held on entry */ > int set_p2m_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, > unsigned int page_order, p2m_type_t p2mt, p2m_access_t p2ma) > { > @@ -194,8 +214,6 @@ int set_p2m_entry(struct p2m_domain *p2m > unsigned int order; > int rc = 1; > > - ASSERT(p2m_locked_by_me(p2m)); > - > while ( todo ) > { > if ( hap_enabled(d) ) > @@ -217,6 +235,18 @@ int set_p2m_entry(struct p2m_domain *p2m > return rc; > } > > +void drop_p2m_gfn(struct p2m_domain *p2m, unsigned long gfn, > + unsigned long frame) > +{ > + mfn_t mfn = _mfn(frame); > + /* For non-translated domains, locks are never taken */ > + if ( !p2m || !paging_mode_translate(p2m->domain) ) > + return; > + if ( mfn_valid(mfn) ) > + put_page(mfn_to_page(mfn)); > + put_p2m_gfn(p2m, gfn); > +} > + > struct page_info *p2m_alloc_ptp(struct p2m_domain *p2m, unsigned long type) > { > struct page_info *pg; > @@ -262,12 +292,12 @@ int p2m_alloc_table(struct p2m_domain *p > unsigned long gfn = -1UL; > struct domain *d = p2m->domain; > > - p2m_lock(p2m); > + get_p2m_global(p2m); > > if ( pagetable_get_pfn(p2m_get_pagetable(p2m)) != 0 ) > { > P2M_ERROR("p2m already allocated for this domain\n"); > - p2m_unlock(p2m); > + put_p2m_global(p2m); > return -EINVAL; > } > > @@ -283,7 +313,7 @@ int p2m_alloc_table(struct p2m_domain *p > > if ( p2m_top == NULL ) > { > - p2m_unlock(p2m); > + put_p2m_global(p2m); > return -ENOMEM; > } > > @@ -295,7 +325,7 @@ int p2m_alloc_table(struct p2m_domain *p > P2M_PRINTK("populating p2m table\n"); > > /* Initialise physmap tables for slot zero. Other code assumes this. 
*/ > - p2m->defer_nested_flush = 1; > + atomic_set(&p2m->defer_nested_flush, 1); > if ( !set_p2m_entry(p2m, 0, _mfn(INVALID_MFN), 0, > p2m_invalid, p2m->default_access) ) > goto error; > @@ -323,10 +353,10 @@ int p2m_alloc_table(struct p2m_domain *p > } > spin_unlock(&p2m->domain->page_alloc_lock); > } > - p2m->defer_nested_flush = 0; > + atomic_set(&p2m->defer_nested_flush, 0); > > P2M_PRINTK("p2m table initialised (%u pages)\n", page_count); > - p2m_unlock(p2m); > + put_p2m_global(p2m); > return 0; > > error_unlock: > @@ -334,7 +364,7 @@ error_unlock: > error: > P2M_PRINTK("failed to initialize p2m table, gfn=%05lx, mfn=%" > PRI_mfn "\n", gfn, mfn_x(mfn)); > - p2m_unlock(p2m); > + put_p2m_global(p2m); > return -ENOMEM; > } > > @@ -354,26 +384,28 @@ void p2m_teardown(struct p2m_domain *p2m > if (p2m == NULL) > return; > > + get_p2m_global(p2m); > + > #ifdef __x86_64__ > for ( gfn=0; gfn < p2m->max_mapped_pfn; gfn++ ) > { > - mfn = gfn_to_mfn_type_p2m(p2m, gfn, &t, &a, p2m_query, NULL); > + mfn = p2m->get_entry(p2m, gfn, &t, &a, p2m_query, NULL); > if ( mfn_valid(mfn) && (t == p2m_ram_shared) ) > { > ASSERT(!p2m_is_nestedp2m(p2m)); > + /* The p2m allows an exclusive global holder to recursively > + * lock sub-ranges. For this. */ > BUG_ON(mem_sharing_unshare_page(d, gfn, MEM_SHARING_DESTROY_GFN)); > } > > } > #endif > > - p2m_lock(p2m); > - > p2m->phys_table = pagetable_null(); > > while ( (pg = page_list_remove_head(&p2m->pages)) ) > d->arch.paging.free_page(d, pg); > - p2m_unlock(p2m); > + put_p2m_global(p2m); > } > > static void p2m_teardown_nestedp2m(struct domain *d) > @@ -401,6 +433,7 @@ void p2m_final_teardown(struct domain *d > } > > > +/* Locks held on entry */ > static void > p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn, unsigned long mfn, > unsigned int page_order) > @@ -438,11 +471,11 @@ guest_physmap_remove_page(struct domain > unsigned long mfn, unsigned int page_order) > { > struct p2m_domain *p2m = p2m_get_hostp2m(d); > - p2m_lock(p2m); > + get_p2m(p2m, gfn, page_order); > audit_p2m(p2m, 1); > p2m_remove_page(p2m, gfn, mfn, page_order); > audit_p2m(p2m, 1); > - p2m_unlock(p2m); > + put_p2m(p2m, gfn, page_order); > } > > int > @@ -480,7 +513,7 @@ guest_physmap_add_entry(struct domain *d > if ( rc != 0 ) > return rc; > > - p2m_lock(p2m); > + get_p2m(p2m, gfn, page_order); > audit_p2m(p2m, 0); > > P2M_DEBUG("adding gfn=%#lx mfn=%#lx\n", gfn, mfn); > @@ -488,12 +521,13 @@ guest_physmap_add_entry(struct domain *d > /* First, remove m->p mappings for existing p->m mappings */ > for ( i = 0; i < (1UL << page_order); i++ ) > { > - omfn = gfn_to_mfn_query(d, gfn + i, &ot); > + p2m_access_t a; > + omfn = p2m->get_entry(p2m, gfn + i, &ot, &a, p2m_query, NULL); > if ( p2m_is_grant(ot) ) > { > /* Really shouldn''t be unmapping grant maps this way */ > + put_p2m(p2m, gfn, page_order); > domain_crash(d); > - p2m_unlock(p2m); > return -EINVAL; > } > else if ( p2m_is_ram(ot) ) > @@ -523,11 +557,12 @@ guest_physmap_add_entry(struct domain *d > && (ogfn != INVALID_M2P_ENTRY) > && (ogfn != gfn + i) ) > { > + p2m_access_t a; > /* This machine frame is already mapped at another physical > * address */ > P2M_DEBUG("aliased! 
mfn=%#lx, old gfn=%#lx, new gfn=%#lx\n", > mfn + i, ogfn, gfn + i); > - omfn = gfn_to_mfn_query(d, ogfn, &ot); > + omfn = p2m->get_entry(p2m, ogfn, &ot, &a, p2m_query, NULL); > if ( p2m_is_ram(ot) ) > { > ASSERT(mfn_valid(omfn)); > @@ -567,7 +602,7 @@ guest_physmap_add_entry(struct domain *d > } > > audit_p2m(p2m, 1); > - p2m_unlock(p2m); > + put_p2m(p2m, gfn, page_order); > > return rc; > } > @@ -579,18 +614,19 @@ p2m_type_t p2m_change_type(struct domain > p2m_type_t ot, p2m_type_t nt) > { > p2m_type_t pt; > + p2m_access_t a; > mfn_t mfn; > struct p2m_domain *p2m = p2m_get_hostp2m(d); > > BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt)); > > - p2m_lock(p2m); > + get_p2m_gfn(p2m, gfn); > > - mfn = gfn_to_mfn_query(d, gfn, &pt); > + mfn = p2m->get_entry(p2m, gfn, &pt, &a, p2m_query, NULL); > if ( pt == ot ) > set_p2m_entry(p2m, gfn, mfn, 0, nt, p2m->default_access); > > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > > return pt; > } > @@ -608,20 +644,23 @@ void p2m_change_type_range(struct domain > > BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt)); > > - p2m_lock(p2m); > - p2m->defer_nested_flush = 1; > + atomic_set(&p2m->defer_nested_flush, 1); > > + /* We''ve been given a number instead of an order, so lock each > + * gfn individually */ > for ( gfn = start; gfn < end; gfn++ ) > { > - mfn = gfn_to_mfn_query(d, gfn, &pt); > + p2m_access_t a; > + get_p2m_gfn(p2m, gfn); > + mfn = p2m->get_entry(p2m, gfn, &pt, &a, p2m_query, NULL); > if ( pt == ot ) > set_p2m_entry(p2m, gfn, mfn, 0, nt, p2m->default_access); > + put_p2m_gfn(p2m, gfn); > } > > - p2m->defer_nested_flush = 0; > + atomic_set(&p2m->defer_nested_flush, 0); > if ( nestedhvm_enabled(d) ) > p2m_flush_nestedp2m(d); > - p2m_unlock(p2m); > } > > > @@ -631,17 +670,18 @@ set_mmio_p2m_entry(struct domain *d, uns > { > int rc = 0; > p2m_type_t ot; > + p2m_access_t a; > mfn_t omfn; > struct p2m_domain *p2m = p2m_get_hostp2m(d); > > if ( !paging_mode_translate(d) ) > return 0; > > - p2m_lock(p2m); > - omfn = gfn_to_mfn_query(d, gfn, &ot); > + get_p2m_gfn(p2m, gfn); > + omfn = p2m->get_entry(p2m, gfn, &ot, &a, p2m_query, NULL); > if ( p2m_is_grant(ot) ) > { > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > domain_crash(d); > return 0; > } > @@ -654,11 +694,11 @@ set_mmio_p2m_entry(struct domain *d, uns > P2M_DEBUG("set mmio %lx %lx\n", gfn, mfn_x(mfn)); > rc = set_p2m_entry(p2m, gfn, mfn, 0, p2m_mmio_direct, p2m->default_access); > audit_p2m(p2m, 1); > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > if ( 0 == rc ) > gdprintk(XENLOG_ERR, > "set_mmio_p2m_entry: set_p2m_entry failed! mfn=%08lx\n", > - mfn_x(gfn_to_mfn_query(d, gfn, &ot))); > + mfn_x(gfn_to_mfn_query_unlocked(d, gfn, &ot))); > return rc; > } > > @@ -668,13 +708,14 @@ clear_mmio_p2m_entry(struct domain *d, u > int rc = 0; > mfn_t mfn; > p2m_type_t t; > + p2m_access_t a; > struct p2m_domain *p2m = p2m_get_hostp2m(d); > > if ( !paging_mode_translate(d) ) > return 0; > > - p2m_lock(p2m); > - mfn = gfn_to_mfn_query(d, gfn, &t); > + get_p2m_gfn(p2m, gfn); > + mfn = p2m->get_entry(p2m, gfn, &t, &a, p2m_query, NULL); > > /* Do not use mfn_valid() here as it will usually fail for MMIO pages. 
*/ > if ( (INVALID_MFN == mfn_x(mfn)) || (t != p2m_mmio_direct) ) > @@ -687,8 +728,7 @@ clear_mmio_p2m_entry(struct domain *d, u > audit_p2m(p2m, 1); > > out: > - p2m_unlock(p2m); > - > + put_p2m_gfn(p2m, gfn); > return rc; > } > > @@ -698,13 +738,14 @@ set_shared_p2m_entry(struct domain *d, u > struct p2m_domain *p2m = p2m_get_hostp2m(d); > int rc = 0; > p2m_type_t ot; > + p2m_access_t a; > mfn_t omfn; > > if ( !paging_mode_translate(p2m->domain) ) > return 0; > > - p2m_lock(p2m); > - omfn = gfn_to_mfn_query(p2m->domain, gfn, &ot); > + get_p2m_gfn(p2m, gfn); > + omfn = p2m->get_entry(p2m, gfn, &ot, &a, p2m_query, NULL); > /* At the moment we only allow p2m change if gfn has already been made > * sharable first */ > ASSERT(p2m_is_shared(ot)); > @@ -714,11 +755,11 @@ set_shared_p2m_entry(struct domain *d, u > > P2M_DEBUG("set shared %lx %lx\n", gfn, mfn_x(mfn)); > rc = set_p2m_entry(p2m, gfn, mfn, 0, p2m_ram_shared, p2m->default_access); > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > if ( 0 == rc ) > gdprintk(XENLOG_ERR, > "set_shared_p2m_entry: set_p2m_entry failed! mfn=%08lx\n", > - mfn_x(gfn_to_mfn_query(d, gfn, &ot))); > + mfn_x(gfn_to_mfn_query_unlocked(d, gfn, &ot))); > return rc; > } > > @@ -732,7 +773,7 @@ int p2m_mem_paging_nominate(struct domai > mfn_t mfn; > int ret; > > - p2m_lock(p2m); > + get_p2m_gfn(p2m, gfn); > > mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL); > > @@ -765,7 +806,7 @@ int p2m_mem_paging_nominate(struct domai > ret = 0; > > out: > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > return ret; > } > > @@ -778,7 +819,7 @@ int p2m_mem_paging_evict(struct domain * > struct p2m_domain *p2m = p2m_get_hostp2m(d); > int ret = -EINVAL; > > - p2m_lock(p2m); > + get_p2m_gfn(p2m, gfn); > > /* Get mfn */ > mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL); > @@ -824,7 +865,7 @@ int p2m_mem_paging_evict(struct domain * > put_page(page); > > out: > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > return ret; > } > > @@ -863,7 +904,7 @@ void p2m_mem_paging_populate(struct doma > req.type = MEM_EVENT_TYPE_PAGING; > > /* Fix p2m mapping */ > - p2m_lock(p2m); > + get_p2m_gfn(p2m, gfn); > mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL); > /* Allow only nominated or evicted pages to enter page-in path */ > if ( p2mt == p2m_ram_paging_out || p2mt == p2m_ram_paged ) > @@ -875,7 +916,7 @@ void p2m_mem_paging_populate(struct doma > set_p2m_entry(p2m, gfn, mfn, 0, p2m_ram_paging_in_start, a); > audit_p2m(p2m, 1); > } > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > > /* Pause domain if request came from guest and gfn has paging type */ > if ( p2m_is_paging(p2mt) && v->domain->domain_id == d->domain_id ) > @@ -908,7 +949,7 @@ int p2m_mem_paging_prep(struct domain *d > struct p2m_domain *p2m = p2m_get_hostp2m(d); > int ret = -ENOMEM; > > - p2m_lock(p2m); > + get_p2m_gfn(p2m, gfn); > > mfn = p2m->get_entry(p2m, gfn, &p2mt, &a, p2m_query, NULL); > > @@ -931,7 +972,7 @@ int p2m_mem_paging_prep(struct domain *d > ret = 0; > > out: > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > return ret; > } > > @@ -949,12 +990,12 @@ void p2m_mem_paging_resume(struct domain > /* Fix p2m entry if the page was not dropped */ > if ( !(rsp.flags & MEM_EVENT_FLAG_DROP_PAGE) ) > { > - p2m_lock(p2m); > + get_p2m_gfn(p2m, rsp.gfn); > mfn = p2m->get_entry(p2m, rsp.gfn, &p2mt, &a, p2m_query, NULL); > set_p2m_entry(p2m, rsp.gfn, mfn, 0, p2m_ram_rw, a); > set_gpfn_from_mfn(mfn_x(mfn), rsp.gfn); > audit_p2m(p2m, 1); > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, rsp.gfn); > } > > /* Unpause 
domain */ > @@ -979,16 +1020,16 @@ void p2m_mem_access_check(unsigned long > p2m_access_t p2ma; > > /* First, handle rx2rw conversion automatically */ > - p2m_lock(p2m); > + get_p2m_gfn(p2m, gfn); > mfn = p2m->get_entry(p2m, gfn, &p2mt, &p2ma, p2m_query, NULL); > > if ( access_w && p2ma == p2m_access_rx2rw ) > { > p2m->set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2mt, p2m_access_rw); > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > return; > } > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > > /* Otherwise, check if there is a memory event listener, and send the message along */ > res = mem_event_check_ring(d, &d->mem_access); > @@ -1006,9 +1047,9 @@ void p2m_mem_access_check(unsigned long > else > { > /* A listener is not required, so clear the access restrictions */ > - p2m_lock(p2m); > + get_p2m_gfn(p2m, gfn); > p2m->set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2mt, p2m_access_rwx); > - p2m_unlock(p2m); > + put_p2m_gfn(p2m, gfn); > } > > return; > @@ -1064,7 +1105,7 @@ int p2m_set_mem_access(struct domain *d, > { > struct p2m_domain *p2m = p2m_get_hostp2m(d); > unsigned long pfn; > - p2m_access_t a; > + p2m_access_t a, _a; > p2m_type_t t; > mfn_t mfn; > int rc = 0; > @@ -1095,17 +1136,20 @@ int p2m_set_mem_access(struct domain *d, > return 0; > } > > - p2m_lock(p2m); > + /* Because we don''t get an order, rather a number, we need to lock each > + * entry individually */ > for ( pfn = start_pfn; pfn < start_pfn + nr; pfn++ ) > { > - mfn = gfn_to_mfn_query(d, pfn, &t); > + get_p2m_gfn(p2m, pfn); > + mfn = p2m->get_entry(p2m, pfn, &t, &_a, p2m_query, NULL); > if ( p2m->set_entry(p2m, pfn, mfn, PAGE_ORDER_4K, t, a) == 0 ) > { > + put_p2m_gfn(p2m, pfn); > rc = -ENOMEM; > break; > } > + put_p2m_gfn(p2m, pfn); > } > - p2m_unlock(p2m); > return rc; > } > > @@ -1138,7 +1182,10 @@ int p2m_get_mem_access(struct domain *d, > return 0; > } > > + get_p2m_gfn(p2m, pfn); > mfn = p2m->get_entry(p2m, pfn, &t, &a, p2m_query, NULL); > + put_p2m_gfn(p2m, pfn); > + > if ( mfn_x(mfn) == INVALID_MFN ) > return -ESRCH; > > @@ -1175,7 +1222,7 @@ p2m_flush_table(struct p2m_domain *p2m) > struct domain *d = p2m->domain; > void *p; > > - p2m_lock(p2m); > + get_p2m_global(p2m); > > /* "Host" p2m tables can have shared entries &c that need a bit more > * care when discarding them */ > @@ -1203,7 +1250,7 @@ p2m_flush_table(struct p2m_domain *p2m) > d->arch.paging.free_page(d, pg); > page_list_add(top, &p2m->pages); > > - p2m_unlock(p2m); > + put_p2m_global(p2m); > } > > void > @@ -1245,7 +1292,7 @@ p2m_get_nestedp2m(struct vcpu *v, uint64 > p2m = nv->nv_p2m; > if ( p2m ) > { > - p2m_lock(p2m); > + get_p2m_global(p2m); > if ( p2m->cr3 == cr3 || p2m->cr3 == CR3_EADDR ) > { > nv->nv_flushp2m = 0; > @@ -1255,24 +1302,24 @@ p2m_get_nestedp2m(struct vcpu *v, uint64 > hvm_asid_flush_vcpu(v); > p2m->cr3 = cr3; > cpu_set(v->processor, p2m->p2m_dirty_cpumask); > - p2m_unlock(p2m); > + put_p2m_global(p2m); > nestedp2m_unlock(d); > return p2m; > } > - p2m_unlock(p2m); > + put_p2m_global(p2m); > } > > /* All p2m''s are or were in use. Take the least recent used one, > * flush it and reuse. 
*/ > p2m = p2m_getlru_nestedp2m(d, NULL); > p2m_flush_table(p2m); > - p2m_lock(p2m); > + get_p2m_global(p2m); > nv->nv_p2m = p2m; > p2m->cr3 = cr3; > nv->nv_flushp2m = 0; > hvm_asid_flush_vcpu(v); > cpu_set(v->processor, p2m->p2m_dirty_cpumask); > - p2m_unlock(p2m); > + put_p2m_global(p2m); > nestedp2m_unlock(d); > > return p2m; > diff -r 8a98179666de -r 471d4f2754d6 xen/include/asm-ia64/mm.h > --- a/xen/include/asm-ia64/mm.h > +++ b/xen/include/asm-ia64/mm.h > @@ -561,6 +561,11 @@ extern u64 translate_domain_pte(u64 ptev > ((get_gpfn_from_mfn((madr) >> PAGE_SHIFT) << PAGE_SHIFT) | \ > ((madr) & ~PAGE_MASK)) > > +/* Because x86-specific p2m fine-grained lock releases are called from common > + * code, we put a dummy placeholder here */ > +#define drop_p2m_gfn(p, g, m) ((void)0) > +#define drop_p2m_gfn_domain(p, g, m) ((void)0) > + > /* Internal use only: returns 0 in case of bad address. */ > extern unsigned long paddr_to_maddr(unsigned long paddr); > > diff -r 8a98179666de -r 471d4f2754d6 xen/include/asm-x86/p2m.h > --- a/xen/include/asm-x86/p2m.h > +++ b/xen/include/asm-x86/p2m.h > @@ -220,7 +220,7 @@ struct p2m_domain { > * tables on every host-p2m change. The setter of this flag > * is responsible for performing the full flush before releasing the > * host p2m''s lock. */ > - int defer_nested_flush; > + atomic_t defer_nested_flush; > > /* Pages used to construct the p2m */ > struct page_list_head pages; > @@ -298,6 +298,15 @@ struct p2m_domain *p2m_get_p2m(struct vc > #define p2m_get_pagetable(p2m) ((p2m)->phys_table) > > > +/* No matter what value you get out of a query, the p2m has been locked for > + * that range. No matter what you do, you need to drop those locks. > + * You need to pass back the mfn obtained when locking, not the new one, > + * as the refcount of the original mfn was bumped. */ > +void drop_p2m_gfn(struct p2m_domain *p2m, unsigned long gfn, > + unsigned long mfn); > +#define drop_p2m_gfn_domain(d, g, m) \ > + drop_p2m_gfn(p2m_get_hostp2m((d)), (g), (m)) > + > /* Read a particular P2M table, mapping pages as we go. Most callers > * should _not_ call this directly; use the other gfn_to_mfn_* functions > * below unless you know you want to walk a p2m that isn''t a domain''s > @@ -327,6 +336,28 @@ static inline mfn_t gfn_to_mfn_type(stru > #define gfn_to_mfn_guest(d, g, t) gfn_to_mfn_type((d), (g), (t), p2m_guest) > #define gfn_to_mfn_unshare(d, g, t) gfn_to_mfn_type((d), (g), (t), p2m_unshare) > > +/* This one applies to very specific situations in which you''re querying > + * a p2m entry and will be done "immediately" (such as a printk or computing a > + * return value). Use this only if there are no expectations of the p2m entry > + * holding steady. 
*/ > +static inline mfn_t gfn_to_mfn_type_unlocked(struct domain *d, > + unsigned long gfn, p2m_type_t *t, > + p2m_query_t q) > +{ > + mfn_t mfn = gfn_to_mfn_type(d, gfn, t, q); > + drop_p2m_gfn_domain(d, gfn, mfn_x(mfn)); > + return mfn; > +} > + > +#define gfn_to_mfn_unlocked(d, g, t) \ > + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_alloc) > +#define gfn_to_mfn_query_unlocked(d, g, t) \ > + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_query) > +#define gfn_to_mfn_guest_unlocked(d, g, t) \ > + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_guest) > +#define gfn_to_mfn_unshare_unlocked(d, g, t) \ > + gfn_to_mfn_type_unlocked((d), (g), (t), p2m_unshare) > + > /* Compatibility function exporting the old untyped interface */ > static inline unsigned long gmfn_to_mfn(struct domain *d, unsigned long gpfn) > { > @@ -338,6 +369,15 @@ static inline unsigned long gmfn_to_mfn( > return INVALID_MFN; > } > > +/* Same comments apply re unlocking */ > +static inline unsigned long gmfn_to_mfn_unlocked(struct domain *d, > + unsigned long gpfn) > +{ > + unsigned long mfn = gmfn_to_mfn(d, gpfn); > + drop_p2m_gfn_domain(d, gpfn, mfn); > + return mfn; > +} > + > /* General conversion function from mfn to gfn */ > static inline unsigned long mfn_to_gfn(struct domain *d, mfn_t mfn) > { > @@ -521,7 +561,8 @@ static inline int p2m_gfn_check_limit( > #define p2m_gfn_check_limit(d, g, o) 0 > #endif > > -/* Directly set a p2m entry: only for use by p2m code */ > +/* Directly set a p2m entry: only for use by p2m code. It expects locks to > + * be held on entry */ > int set_p2m_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn, > unsigned int page_order, p2m_type_t p2mt, p2m_access_t p2ma); > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
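Going by the comments in the patch itself: a query through any of the gfn_to_mfn_* wrappers now locks the range covering the gfn and takes a reference on the underlying page, and the caller must later drop both; the _unlocked() variants are query-then-drop shorthands for transient lookups (a printk, a return value) where the entry is not expected to hold steady. An illustrative caller, using only names defined in the series (do_something_with() is a made-up placeholder, not part of the patches):

/* Illustration only -- not part of the series. */
static void example_caller(struct domain *d, unsigned long gfn)
{
    p2m_type_t t;
    mfn_t mfn;

    /* Locks the range covering gfn and takes a page ref on success. */
    mfn = gfn_to_mfn_query(d, gfn, &t);

    if ( mfn_valid(mfn) && p2m_is_ram(t) )
        do_something_with(mfn);        /* translation can't change under us */

    /* Drop the lock and the ref, passing back the mfn the query returned. */
    drop_p2m_gfn_domain(d, gfn, mfn_x(mfn));
}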
Tim Deegan
2011-Nov-03 13:49 UTC
Re: [Xen-devel] [PATCH 3 of 9] Enforce ordering constraints for the page alloc lock in the PoD code
At 06:59 -0700 on 02 Nov (1320217144), andres@lagarcavilla.com wrote:
> > - I think it would be better to generate generic spin-lock-with-level
> >   and unlock-with-level wrapper functions rather than generating the
> >   various checks and having to assemble them into lock_page_alloc() and
> >   unlock_page_alloc() by hand.
>
> The final intent is to have these macros establish ordering constraints
> for the fine-grained p2m lock, which is not only "grab a spinlock".
> Granted, we do not know yet whether we'll need such a fine-grained
> approach, but I think it's worth keeping things separate.

OK. We can keep it as it is for now and maybe there'll be an opportunity
to tidy up later on.

> As a side-note, an earlier version of my patches did enforce ordering,
> except things got really hairy with mem_sharing_unshare_page (which would
> jump levels up to grab shr_lock) and pod sweeps. I (think I) have
> solutions for both, but I'm not ready to push those yet.

Great!

> > - p2m->pod.page_alloc_unlock_level is wrong, I think; I can see that you
> >   need somewhere to store the unlock-level but it shouldn't live in
> >   the p2m state - it's at most a per-domain variable, so it should
> >   live in the struct domain; might as well be beside the lock itself.
>
> Ok, sure. Although I think I need to make clear that this ordering
> constraint only applies within the pod code, and that's why I wanted to
> keep the book-keeping within the pod struct.

I see. That makes sense, but since there are now multiple p2m structs per
domain, I think it's better to put it beside the lock with a comment
saying that it's only used by pod.

Tim.
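A rough sketch of the kind of generic helper discussed above (illustrative only; pthread mutexes and a caller-supplied level variable stand in for Xen's real spinlocks and per-CPU bookkeeping): take an external lock at a given ordering level and hand back the level previously in effect, so the matching unlock can restore it.

#include <assert.h>
#include <pthread.h>

static inline int spin_lock_with_level(pthread_mutex_t *l, int *cur_level,
                                       int level)
{
    int saved = *cur_level;

    assert(saved < level);     /* anything else is an ordering violation */
    pthread_mutex_lock(l);
    *cur_level = level;
    return saved;              /* caller stashes this for the unlock */
}

static inline void spin_unlock_with_level(pthread_mutex_t *l, int *cur_level,
                                          int saved)
{
    *cur_level = saved;
    pthread_mutex_unlock(l);
}

With something like this, lock_page_alloc()/unlock_page_alloc() reduce to one
call each, with the saved level stored wherever the unlock-level bookkeeping
ends up living.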
Tim Deegan
2011-Nov-03 14:29 UTC
Re: [Xen-devel] [PATCH 5 of 9] Fine-grained concurrency control structure for the p2m
Hi,

At 07:20 -0700 on 02 Nov (1320218409), andres@lagarcavilla.com wrote:
> > I suspect that if this is a contention point, allowing multiple readers
> > will become important, especially if there are particular pages that
> > often get emulated access.
> >
> > And also, I'd like to get some sort of plan for handling long-lived
> > foreign mappings, if only to make sure that this phase-1 fix doesn't
> > conflict with it.
> >
>
> If foreign mappings will hold a lock/ref on a p2m subrange, then they'll
> disallow global operations, and you'll get a clash between log-dirty and,
> say, qemu. Ka-blam live migration.

Yep. That's a tricky one. Log-dirty could be special-cased but I guess
we'll have the same problem with paging, mem-event &c. :(

> Read-only foreign mappings are only problematic insofar as paging happens.
> With proper p2m update/lookup serialization (global or fine-grained) that
> problem is gone.
>
> Write-able foreign mappings are trickier because of sharing and w^x. Is
> there a reason left, today, to not type PGT_writable an hvm-domain's page
> when a foreign mapping happens?

Unfortunately, yes. The shadow pagetable code uses the typecount to
detect whether the guest has any writeable mappings of the page; without
that it would have to brute-force search all the L1 shadows in order to
be sure that it had write-protected a page.

> That would solve sharing problems. w^x
> really can't be solved short of putting the vcpu on a waitqueue
> (preferable to me), or destroying the mapping and forcing the foreign OS
> to remap later. All a few steps ahead, I hope.

OK, so if I understand correctly your plan is to add this mutual
exclusion for all other users of the p2m (emulation &c) but leave
foreign mappings alone for now, with the general plan of fixing that up
using waitqueues. That's OK by me.

> Who/what's using w^x by the way? If the refcount is zero, I think I know
> what I'll do ;)

I think the original authors are using it in their product. I haven't
heard of any other users but there might be some.

> What is a real problem is that pod sweeps can cause deadlocks. There is a
> simple step to mitigate this: start the sweep from the current gfn and
> never wrap around -- too bad if the gfn is too high. But this alters the
> sweeping algorithm. I'll deal with it when it's its turn.

OK. If there's some chance that Olaf can make PoD a special case of
paging maybe we can get rid of the sweeps altogether (i.e., have the
domain pause when it runs out of PoD and let the pager fix it up). But
I know George has spent a fair amount of time tuning the performance of
PoD so that may not be acceptable.

Cheers,

Tim.
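One possible reading of the non-wrapping sweep mentioned above, as a toy sketch (none of these names are from the series, and the actual deadlock argument is left to the future patch): scan in a single direction starting just below the faulting gfn, whose range the caller already holds, and simply give up at the end instead of wrapping around.

#include <stdbool.h>

typedef bool (*reclaim_fn)(unsigned long gfn);   /* stand-in for a zero-check */

static unsigned long sweep_without_wrapping(unsigned long start_gfn,
                                            unsigned long budget,
                                            reclaim_fn try_reclaim)
{
    unsigned long gfn = start_gfn, reclaimed = 0;

    /* Downward only, as the existing sweep scans; stop at gfn 0 rather
     * than restarting from the top of the physmap. */
    while ( gfn > 0 && budget-- > 0 )
        if ( try_reclaim(--gfn) )
            reclaimed++;

    return reclaimed;
}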
Tim Deegan
2011-Nov-03 14:33 UTC
Re: [Xen-devel] [PATCH 8 of 9] Modify all internal p2m functions to use the new fine-grained locking
Hi,

At 07:24 -0700 on 02 Nov (1320218674), andres@lagarcavilla.com wrote:
> >> +/* No matter what value you get out of a query, the p2m has been locked for
> >> + * that range. No matter what you do, you need to drop those locks.
> >> + * You need to pass back the mfn obtained when locking, not the new one,
> >> + * as the refcount of the original mfn was bumped. */
> >
> > Surely the caller doesn't need to remember the old MFN for this? After
> > all, the whole point of the lock was that nobody else could change the
> > p2m entry under our feet!
> >
> > In any case, I think there needs to be a big block comment a bit further
> > up that describes what all this locking and refcounting does, and why.
>
> Comment will be added. I was being doubly-paranoid. I can undo the
> get_page/put_page of the old mfn. I'm not 100% behind it.

I meant to suggest that the p2m code should be able to do the
get_page/put_page without the caller remembering the mfn, since by
definition it should be able to look it up in the unlock, knowing no-one
else can have changed it.

> I don't think these names are the most terrible -- we've all seen far
> worse :) I mean, the naming encodes the arguments, and I don't see an
> intrinsic advantage to
> gfn_to_mfn(d, g, t, p2m_guest, p2m_unlocked)
> over
> gfn_to_mfn_guest_unlocked(d,g,t)

Yep, it's definitely not the worst. :) It's really just a question of
verbosity in the headers.

Tim.
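For illustration, what the suggestion above could look like (this is not code from the series; it reuses only helpers that already appear in it): because the range is still locked at put time, the put side can re-read the entry itself and drop the page reference, so callers only ever pass the gfn.

static void put_gfn(struct p2m_domain *p2m, unsigned long gfn)
{
    p2m_type_t t;
    p2m_access_t a;
    /* Safe to re-read: nobody can have changed the entry while we held
     * the range lock taken at query time. */
    mfn_t mfn = p2m->get_entry(p2m, gfn, &t, &a, p2m_query, NULL);

    if ( mfn_valid(mfn) )
        put_page(mfn_to_page(mfn));    /* drop the ref taken by the query */

    put_p2m_gfn(p2m, gfn);             /* then release the range lock */
}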
Tim Deegan
2011-Nov-03 14:38 UTC
Re: [Xen-devel] [PATCH 9 of 9] Modify all call sites of queries into the p2m to use the new fine-grained locking
At 07:32 -0700 on 02 Nov (1320219175), andres@lagarcavilla.com wrote:
> I don't know that a massive sed on all these names is a good idea. I guess
> forcing everyone to compile-fail will also make them realize they need to
> add a call to drop the p2m locks they got...
>
> Can you elaborate on the naming preferences here: would you prefer
> gfn_to_mfn/put_gfn? get_p2m_gfn/put_p2m_gfn? get_gfn/put_gfn?

I think I'd prefer get_gfn/put_gfn. And maybe set_gfn for writes too?
But I'm willing to be persuaded otherwise if anyone feels strongly
about it.

Tim.
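One possible shape for that naming, purely as a sketch on top of what the series already defines (none of these macros exist in the patches as posted):

#define get_gfn(d, g, t)        gfn_to_mfn_type((d), (g), (t), p2m_alloc)
#define get_gfn_query(d, g, t)  gfn_to_mfn_type((d), (g), (t), p2m_query)
#define put_gfn(d, g, m)        drop_p2m_gfn_domain((d), (g), mfn_x(m))
/* A set_gfn() for writes would wrap set_p2m_entry() on the host p2m,
 * taken under the same range lock as the matching get_gfn(). */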
Andres Lagar-Cavilla
2011-Nov-03 14:46 UTC
Re: [Xen-devel] [PATCH 5 of 9] Fine-grained concurrency control structure for the p2m
I view PoD as a special case of paging, with sweeps implemented by the user-space pager, etc. But the question that lingers in my mind is what do you do in modes that don''t support ept+paging today (software shadow, amd npt). Also, paging needs to get waitqueues before being a palatable replacement for PoD, imho. We''re aiming in that direction too. Andres> Hi, > > At 07:20 -0700 on 02 Nov (1320218409), andres@lagarcavilla.com wrote: >> > I suspect that if this is a contention point, allowing multiple >> readers >> > will become important, especially if there are particular pages that >> > often get emulated access. >> > >> > And also, I''d like to get some sort of plan for handling long-lived >> > foreign mappings, if only to make sure that this phase-1 fix doesn''t >> > conflict wih it. >> > >> >> If foreign mappings will hold a lock/ref on a p2m subrange, then they''ll >> disallow global operations, and you''ll get a clash between log-dirty >> and, >> say, qemu. Ka-blam live migration. > > Yep. That''s a tricky one. Log-dirty could be special-cased but I guess > we''ll have the same problem with paging, mem-event &c. :( > >> Read-only foreign mappings are only problematic insofar paging happens. >> With proper p2m update/lookups serialization (global or fine-grained) >> that >> problem is gone. >> >> Write-able foreign mappings are trickier because of sharing and w^x. Is >> there a reason left, today, to not type PGT_writable an hvm-domain''s >> page >> when a foreign mapping happens? > > Unfortunately, yes. The shadow pagetable code uses the typecount to > detect whether the guest has any writeable mappings of the page; without > that it would have to brute-force search all the L1 shadows in order to > be sure that it had write-protected a page. > >> That would solve sharing problems. w^x >> really can''t be solved short of putting the vcpu on a waitqueue >> (preferable to me), or destroying the mapping and forcing the foreign OS >> to remap later. All a few steps ahead, I hope. > > OK, so if I understand correctly your plan is to add this mutual > exclusion for all other users of the p2m (emulation &c) but leave > foreign mappings alone for now, with the general plan of fixing that up > using waitqueues. That''s OK by me. > >> Who/what''s using w^x by the way? If the refcount is zero, I think I know >> what I''ll do ;) > > I think the original authors are using it in their product. I haven''t > heard of any other users but there might be some. > >> What is a real problem is that pod sweeps can cause deadlocks. There is >> a >> simple step to mitigate this: start the sweep from the current gfn and >> never wrap around -- too bad if the gfn is too high. But this alters the >> sweeping algorithm. I''ll deal with it when its it''s turn. > > OK. If there''s some chance that Olaf can make PoD a special case of > paging maybe we can get rid of the sweeps altogether (i.e., have the > domain pause when it runs out of PoD and let the pager fix it up). But > I know George has spent a fair amount of time tuning the performance of > PoD so that may not be acceptable. > > Cheers, > > Tim. >_______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Andres Lagar-Cavilla
2011-Nov-03 14:57 UTC
Re: [Xen-devel] [PATCH 4 of 9] Rework locking in the PoD layer
I''ll leave this alone for the moment, but I''ll try to explain here the end-goal: 1. we need to protect p2m entries on lookups, any lookups 2. If performance becomes prohibitive, then we need to break-up that lock 3. pod locking breaks, so pod will need its own lock 4. hence this patch Agree with you it''s ahead of the curve by removing p2m_lock''s before p2mm_lock''s become fine-grained. So, I''ll leave this on the side for now. Andres> On Thu, Oct 27, 2011 at 1:33 PM, Andres Lagar-Cavilla > <andres@lagarcavilla.org> wrote: >> xen/arch/x86/mm/mm-locks.h | 9 ++ >> xen/arch/x86/mm/p2m-pod.c | 145 >> +++++++++++++++++++++++++++------------------ >> xen/arch/x86/mm/p2m-pt.c | 3 + >> xen/arch/x86/mm/p2m.c | 7 +- >> xen/include/asm-x86/p2m.h | 25 ++----- >> 5 files changed, 113 insertions(+), 76 deletions(-) >> >> >> The PoD layer has a fragile locking discipline. It relies on the >> p2m being globally locked, and it also relies on the page alloc >> lock to protect some of its data structures. Replace this all by an >> explicit pod lock: per p2m, order enforced. >> >> Two consequences: >> - Critical sections in the pod code protected by the page alloc >> lock are now reduced to modifications of the domain page list. >> - When the p2m lock becomes fine-grained, there are no >> assumptions broken in the PoD layer. >> >> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org> >> >> diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/mm-locks.h >> --- a/xen/arch/x86/mm/mm-locks.h >> +++ b/xen/arch/x86/mm/mm-locks.h >> @@ -155,6 +155,15 @@ declare_mm_lock(p2m) >> #define p2m_unlock(p) mm_unlock(&(p)->lock) >> #define p2m_locked_by_me(p) mm_locked_by_me(&(p)->lock) >> >> +/* PoD lock (per-p2m-table) >> + * >> + * Protects private PoD data structs. */ >> + >> +declare_mm_lock(pod) >> +#define pod_lock(p) mm_lock(pod, &(p)->pod.lock) >> +#define pod_unlock(p) mm_unlock(&(p)->pod.lock) >> +#define pod_locked_by_me(p) mm_locked_by_me(&(p)->pod.lock) >> + >> /* Page alloc lock (per-domain) >> * >> * This is an external lock, not represented by an mm_lock_t. However, >> diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m-pod.c >> --- a/xen/arch/x86/mm/p2m-pod.c >> +++ b/xen/arch/x86/mm/p2m-pod.c >> @@ -63,6 +63,7 @@ static inline void unlock_page_alloc(str >> * Populate-on-demand functionality >> */ >> >> +/* PoD lock held on entry */ >> static int >> p2m_pod_cache_add(struct p2m_domain *p2m, >> struct page_info *page, >> @@ -114,43 +115,42 @@ p2m_pod_cache_add(struct p2m_domain *p2m >> unmap_domain_page(b); >> } >> >> + /* First, take all pages off the domain list */ >> lock_page_alloc(p2m); >> - >> - /* First, take all pages off the domain list */ >> for(i=0; i < 1 << order ; i++) >> { >> p = page + i; >> page_list_del(p, &d->page_list); >> } >> >> - /* Then add the first one to the appropriate populate-on-demand >> list */ >> - switch(order) >> - { >> - case PAGE_ORDER_2M: >> - page_list_add_tail(page, &p2m->pod.super); /* lock: page_alloc >> */ >> - p2m->pod.count += 1 << order; >> - break; >> - case PAGE_ORDER_4K: >> - page_list_add_tail(page, &p2m->pod.single); /* lock: page_alloc >> */ >> - p2m->pod.count += 1; >> - break; >> - default: >> - BUG(); >> - } >> - >> /* Ensure that the PoD cache has never been emptied. >> * This may cause "zombie domains" since the page will never be >> freed. 
*/ >> BUG_ON( d->arch.relmem != RELMEM_not_started ); >> >> unlock_page_alloc(p2m); >> >> + /* Then add the first one to the appropriate populate-on-demand >> list */ >> + switch(order) >> + { >> + case PAGE_ORDER_2M: >> + page_list_add_tail(page, &p2m->pod.super); >> + p2m->pod.count += 1 << order; >> + break; >> + case PAGE_ORDER_4K: >> + page_list_add_tail(page, &p2m->pod.single); >> + p2m->pod.count += 1; >> + break; >> + default: >> + BUG(); >> + } >> + >> return 0; >> } >> >> /* Get a page of size order from the populate-on-demand cache. Will >> break >> * down 2-meg pages into singleton pages automatically. Returns null if >> - * a superpage is requested and no superpages are available. Must be >> called >> - * with the d->page_lock held. */ >> + * a superpage is requested and no superpages are available. */ >> +/* PoD lock held on entry */ >> static struct page_info * p2m_pod_cache_get(struct p2m_domain *p2m, >> unsigned long order) >> { >> @@ -185,7 +185,7 @@ static struct page_info * p2m_pod_cache_ >> case PAGE_ORDER_2M: >> BUG_ON( page_list_empty(&p2m->pod.super) ); >> p = page_list_remove_head(&p2m->pod.super); >> - p2m->pod.count -= 1 << order; /* Lock: page_alloc */ >> + p2m->pod.count -= 1 << order; >> break; >> case PAGE_ORDER_4K: >> BUG_ON( page_list_empty(&p2m->pod.single) ); >> @@ -197,16 +197,19 @@ static struct page_info * p2m_pod_cache_ >> } >> >> /* Put the pages back on the domain page_list */ >> + lock_page_alloc(p2m); >> for ( i = 0 ; i < (1 << order); i++ ) >> { >> BUG_ON(page_get_owner(p + i) != p2m->domain); >> page_list_add_tail(p + i, &p2m->domain->page_list); >> } >> + unlock_page_alloc(p2m); >> >> return p; >> } >> >> /* Set the size of the cache, allocating or freeing as necessary. */ >> +/* PoD lock held on entry */ >> static int >> p2m_pod_set_cache_target(struct p2m_domain *p2m, unsigned long >> pod_target, int preemptible) >> { >> @@ -259,8 +262,6 @@ p2m_pod_set_cache_target(struct p2m_doma >> >> /* Grab the lock before checking that pod.super is empty, or the >> last >> * entries may disappear before we grab the lock. */ >> - lock_page_alloc(p2m); >> - >> if ( (p2m->pod.count - pod_target) > SUPERPAGE_PAGES >> && !page_list_empty(&p2m->pod.super) ) >> order = PAGE_ORDER_2M; >> @@ -271,8 +272,6 @@ p2m_pod_set_cache_target(struct p2m_doma >> >> ASSERT(page != NULL); >> >> - unlock_page_alloc(p2m); >> - >> /* Then free them */ >> for ( i = 0 ; i < (1 << order) ; i++ ) >> { >> @@ -348,7 +347,7 @@ p2m_pod_set_mem_target(struct domain *d, >> int ret = 0; >> unsigned long populated; >> >> - p2m_lock(p2m); >> + pod_lock(p2m); >> >> /* P == B: Nothing to do. */ >> if ( p2m->pod.entry_count == 0 ) >> @@ -377,7 +376,7 @@ p2m_pod_set_mem_target(struct domain *d, >> ret = p2m_pod_set_cache_target(p2m, pod_target, 1/*preemptible*/); >> >> out: >> - p2m_unlock(p2m); >> + pod_unlock(p2m); >> >> return ret; >> } >> @@ -390,7 +389,7 @@ p2m_pod_empty_cache(struct domain *d) >> >> /* After this barrier no new PoD activities can happen. 
>> */
>> BUG_ON(!d->is_dying);
>> - spin_barrier(&p2m->lock.lock);
>> + spin_barrier(&p2m->pod.lock.lock);
>>
>> lock_page_alloc(p2m);
>>
>> @@ -431,7 +430,8 @@ p2m_pod_offline_or_broken_hit(struct pag
>> if ( !(d = page_get_owner(p)) || !(p2m = p2m_get_hostp2m(d)) )
>> return 0;
>>
>> - lock_page_alloc(p2m);
>> + pod_lock(p2m);
>> +
>> bmfn = mfn_x(page_to_mfn(p));
>> page_list_for_each_safe(q, tmp, &p2m->pod.super)
>> {
>> @@ -462,12 +462,14 @@ p2m_pod_offline_or_broken_hit(struct pag
>> }
>> }
>>
>> - unlock_page_alloc(p2m);
>> + pod_unlock(p2m);
>> return 0;
>>
>> pod_hit:
>> + lock_page_alloc(p2m);
>> page_list_add_tail(p, &d->arch.relmem_list);
>> unlock_page_alloc(p2m);
>> + pod_unlock(p2m);
>> return 1;
>> }
>>
>> @@ -486,9 +488,9 @@ p2m_pod_offline_or_broken_replace(struct
>> if ( unlikely(!p) )
>> return;
>>
>> - p2m_lock(p2m);
>> + pod_lock(p2m);
>> p2m_pod_cache_add(p2m, p, PAGE_ORDER_4K);
>> - p2m_unlock(p2m);
>> + pod_unlock(p2m);
>> return;
>> }
>>
>> @@ -512,6 +514,7 @@ p2m_pod_decrease_reservation(struct doma
>> int steal_for_cache = 0;
>> int pod = 0, nonpod = 0, ram = 0;
>>
>> + pod_lock(p2m);
>>
>> /* If we don't have any outstanding PoD entries, let things take their
>> * course */
>> @@ -521,11 +524,10 @@ p2m_pod_decrease_reservation(struct doma
>> /* Figure out if we need to steal some freed memory for our cache */
>> steal_for_cache = ( p2m->pod.entry_count > p2m->pod.count );
>>
>> - p2m_lock(p2m);
>> audit_p2m(p2m, 1);
>>
>> if ( unlikely(d->is_dying) )
>> - goto out_unlock;
>> + goto out;
>>
>> /* See what's in here. */
>> /* FIXME: Add contiguous; query for PSE entries? */
>
> I don't think this can be quite right.
>
> The point of holding the p2m lock here is so that the p2m entries
> don't change between the gfn_to_mfn_query() here and the
> set_p2m_entries() below. The balloon driver racing with other vcpus
> populating pages is exactly the kind of race we expect to experience.
> And in any case, this change will cause set_p2m_entry() to ASSERT()
> because we're not holding the p2m lock.
>
> Or am I missing something?
>
> I haven't yet looked at the rest of the patch series, but it would
> definitely be better for people in the future looking back and trying
> to figure out why the code is the way that it is if even transitory
> changesets don't introduce "temporary" violations of invariants. :-)
>
>> @@ -547,14 +549,14 @@ p2m_pod_decrease_reservation(struct doma
>>
>> /* No populate-on-demand? Don't need to steal anything? Then we're done!*/
>> if(!pod && !steal_for_cache)
>> - goto out_unlock;
>> + goto out_audit;
>>
>> if ( !nonpod )
>> {
>> /* All PoD: Mark the whole region invalid and tell caller
>> * we're done. */
>> set_p2m_entry(p2m, gpfn, _mfn(INVALID_MFN), order, p2m_invalid, p2m->default_access);
>> - p2m->pod.entry_count-=(1<<order); /* Lock: p2m */
>> + p2m->pod.entry_count-=(1<<order);
>> BUG_ON(p2m->pod.entry_count < 0);
>> ret = 1;
>> goto out_entry_check;
>> @@ -577,7 +579,7 @@ p2m_pod_decrease_reservation(struct doma
>> if ( t == p2m_populate_on_demand )
>> {
>> set_p2m_entry(p2m, gpfn + i, _mfn(INVALID_MFN), 0, p2m_invalid, p2m->default_access);
>> - p2m->pod.entry_count--; /* Lock: p2m */
>> + p2m->pod.entry_count--;
>> BUG_ON(p2m->pod.entry_count < 0);
>> pod--;
>> }
>> @@ -613,11 +615,11 @@ out_entry_check:
>> p2m_pod_set_cache_target(p2m, p2m->pod.entry_count, 0/*can't preempt*/);
>> }
>>
>> -out_unlock:
>> +out_audit:
>> audit_p2m(p2m, 1);
>> - p2m_unlock(p2m);
>>
>> out:
>> + pod_unlock(p2m);
>> return ret;
>> }
>>
>> @@ -630,20 +632,24 @@ void p2m_pod_dump_data(struct domain *d)
>>
>>
>> /* Search for all-zero superpages to be reclaimed as superpages for the
>> - * PoD cache. Must be called w/ p2m lock held, page_alloc lock not held. */
>> -static int
>> + * PoD cache. Must be called w/ pod lock held, page_alloc lock not held. */
>> +static void
>
> For the same reason, this must be called with the p2m lock held: it
> calls gfn_to_mfn_query() and then calls set_p2m_entry(). As it
> happens, this always *is* called with the p2m lock held at the moment;
> but the comment still needs to reflect this. Similarly in
> p2m_pod_zero_check().
>
>> p2m_pod_zero_check_superpage(struct p2m_domain *p2m, unsigned long gfn)
>> {
>> mfn_t mfn, mfn0 = _mfn(INVALID_MFN);
>> p2m_type_t type, type0 = 0;
>> unsigned long * map = NULL;
>> - int ret=0, reset = 0;
>> + int success = 0, reset = 0;
>> int i, j;
>> int max_ref = 1;
>> struct domain *d = p2m->domain;
>>
>> if ( !superpage_aligned(gfn) )
>> - goto out;
>> + return;
>> +
>> + /* If we were enforcing ordering against p2m locks, this is a place
>> + * to drop the PoD lock and re-acquire it once we're done mucking with
>> + * the p2m. */
>>
>> /* Allow an extra refcount for one shadow pt mapping in shadowed domains */
>> if ( paging_mode_shadow(d) )
>> @@ -751,19 +757,24 @@ p2m_pod_zero_check_superpage(struct p2m_
>> __trace_var(TRC_MEM_POD_ZERO_RECLAIM, 0, sizeof(t), &t);
>> }
>>
>> - /* Finally! We've passed all the checks, and can add the mfn superpage
>> - * back on the PoD cache, and account for the new p2m PoD entries */
>> - p2m_pod_cache_add(p2m, mfn_to_page(mfn0), PAGE_ORDER_2M);
>> - p2m->pod.entry_count += SUPERPAGE_PAGES;
>> + success = 1;
>> +
>>
>> out_reset:
>> if ( reset )
>> set_p2m_entry(p2m, gfn, mfn0, 9, type0, p2m->default_access);
>>
>> out:
>> - return ret;
>> + if ( success )
>> + {
>> + /* Finally! We've passed all the checks, and can add the mfn superpage
>> + * back on the PoD cache, and account for the new p2m PoD entries */
>> + p2m_pod_cache_add(p2m, mfn_to_page(mfn0), PAGE_ORDER_2M);
>> + p2m->pod.entry_count += SUPERPAGE_PAGES;
>> + }
>> }
>>
>> +/* On entry, PoD lock is held */
>> static void
>> p2m_pod_zero_check(struct p2m_domain *p2m, unsigned long *gfns, int count)
>> {
>> @@ -775,6 +786,8 @@ p2m_pod_zero_check(struct p2m_domain *p2
>> int i, j;
>> int max_ref = 1;
>>
>> + /* Also the right time to drop pod_lock if enforcing ordering against p2m_lock */
>> +
>> /* Allow an extra refcount for one shadow pt mapping in shadowed domains */
>> if ( paging_mode_shadow(d) )
>> max_ref++;
>> @@ -841,7 +854,6 @@ p2m_pod_zero_check(struct p2m_domain *p2
>> if( *(map[i]+j) != 0 )
>> break;
>>
>> - unmap_domain_page(map[i]);
>>
>> /* See comment in p2m_pod_zero_check_superpage() re gnttab
>> * check timing. */
>> @@ -849,8 +861,15 @@ p2m_pod_zero_check(struct p2m_domain *p2
>> {
>> set_p2m_entry(p2m, gfns[i], mfns[i], PAGE_ORDER_4K, types[i], p2m->default_access);
>> + unmap_domain_page(map[i]);
>> + map[i] = NULL;
>> }
>> - else
>> + }
>> +
>> + /* Finally, add to cache */
>> + for ( i=0; i < count; i++ )
>> + {
>> + if ( map[i] )
>> {
>> if ( tb_init_done )
>> {
>> @@ -867,6 +886,8 @@ p2m_pod_zero_check(struct p2m_domain *p2
>> __trace_var(TRC_MEM_POD_ZERO_RECLAIM, 0, sizeof(t), &t);
>> }
>>
>> + unmap_domain_page(map[i]);
>> +
>> /* Add to cache, and account for the new p2m PoD entry */
>> p2m_pod_cache_add(p2m, mfn_to_page(mfns[i]), PAGE_ORDER_4K);
>> p2m->pod.entry_count++;
>> @@ -876,6 +897,7 @@ p2m_pod_zero_check(struct p2m_domain *p2
>> }
>>
>> #define POD_SWEEP_LIMIT 1024
>> +/* Only one CPU at a time is guaranteed to enter a sweep */
>> static void
>> p2m_pod_emergency_sweep_super(struct p2m_domain *p2m)
>> {
>> @@ -964,7 +986,8 @@ p2m_pod_demand_populate(struct p2m_domai
>>
>> ASSERT(p2m_locked_by_me(p2m));
>>
>> - /* This check is done with the p2m lock held. This will make sure that
>> + pod_lock(p2m);
>> + /* This check is done with the pod lock held. This will make sure that
>> * even if d->is_dying changes under our feet, p2m_pod_empty_cache()
>> * won't start until we're done. */
>> if ( unlikely(d->is_dying) )
>> @@ -974,6 +997,7 @@ p2m_pod_demand_populate(struct p2m_domai
>> * 1GB region to 2MB chunks for a retry. */
>> if ( order == PAGE_ORDER_1G )
>> {
>> + pod_unlock(p2m);
>> gfn_aligned = (gfn >> order) << order;
>> /* Note that we are supposed to call set_p2m_entry() 512 times to
>> * split 1GB into 512 2MB pages here. But We only do once here because
>> @@ -983,6 +1007,7 @@ p2m_pod_demand_populate(struct p2m_domai
>> set_p2m_entry(p2m, gfn_aligned, _mfn(0), PAGE_ORDER_2M, p2m_populate_on_demand, p2m->default_access);
>> audit_p2m(p2m, 1);
>> + /* This is because the ept/pt caller locks the p2m recursively */
>> p2m_unlock(p2m);
>> return 0;
>> }
>> @@ -996,11 +1021,15 @@ p2m_pod_demand_populate(struct p2m_domai
>>
>> /* If we're low, start a sweep */
>> if ( order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) )
>> + /* Note that sweeps scan other ranges in the p2m. In a scenario
>> + * in which p2m locks are order-enforced wrt pod lock and p2m
>> + * locks are fine grained, this will result in deadlock */
>> p2m_pod_emergency_sweep_super(p2m);
>>
>> if ( page_list_empty(&p2m->pod.single) &&
>> ( ( order == PAGE_ORDER_4K )
>> || (order == PAGE_ORDER_2M && page_list_empty(&p2m->pod.super) ) ) )
>> + /* Same comment regarding deadlock applies */
>> p2m_pod_emergency_sweep(p2m);
>> }
>>
>> @@ -1008,8 +1037,6 @@ p2m_pod_demand_populate(struct p2m_domai
>> if ( q == p2m_guest && gfn > p2m->pod.max_guest )
>> p2m->pod.max_guest = gfn;
>>
>> - lock_page_alloc(p2m);
>> -
>> if ( p2m->pod.count == 0 )
>> goto out_of_memory;
>>
>> @@ -1022,8 +1049,6 @@ p2m_pod_demand_populate(struct p2m_domai
>>
>> BUG_ON((mfn_x(mfn) & ((1 << order)-1)) != 0);
>>
>> - unlock_page_alloc(p2m);
>> -
>> gfn_aligned = (gfn >> order) << order;
>>
>> set_p2m_entry(p2m, gfn_aligned, mfn, order, p2m_ram_rw, p2m->default_access);
>> @@ -1034,8 +1059,9 @@ p2m_pod_demand_populate(struct p2m_domai
>> paging_mark_dirty(d, mfn_x(mfn) + i);
>> }
>>
>> - p2m->pod.entry_count -= (1 << order); /* Lock: p2m */
>> + p2m->pod.entry_count -= (1 << order);
>> BUG_ON(p2m->pod.entry_count < 0);
>> + pod_unlock(p2m);
>>
>> if ( tb_init_done )
>> {
>> @@ -1054,16 +1080,17 @@ p2m_pod_demand_populate(struct p2m_domai
>>
>> return 0;
>> out_of_memory:
>> - unlock_page_alloc(p2m);
>> + pod_unlock(p2m);
>>
>> printk("%s: Out of populate-on-demand memory! tot_pages %" PRIu32 " pod_entries %" PRIi32 "\n",
>> __func__, d->tot_pages, p2m->pod.entry_count);
>> domain_crash(d);
>> out_fail:
>> + pod_unlock(p2m);
>> return -1;
>> remap_and_retry:
>> BUG_ON(order != PAGE_ORDER_2M);
>> - unlock_page_alloc(p2m);
>> + pod_unlock(p2m);
>>
>> /* Remap this 2-meg region in singleton chunks */
>> gfn_aligned = (gfn>>order)<<order;
>> @@ -1133,9 +1160,11 @@ guest_physmap_mark_populate_on_demand(st
>> rc = -EINVAL;
>> else
>> {
>> - p2m->pod.entry_count += 1 << order; /* Lock: p2m */
>> + pod_lock(p2m);
>> + p2m->pod.entry_count += 1 << order;
>> p2m->pod.entry_count -= pod_count;
>> BUG_ON(p2m->pod.entry_count < 0);
>> + pod_unlock(p2m);
>> }
>>
>> audit_p2m(p2m, 1);
>> diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m-pt.c
>> --- a/xen/arch/x86/mm/p2m-pt.c
>> +++ b/xen/arch/x86/mm/p2m-pt.c
>> @@ -1001,6 +1001,7 @@ void audit_p2m(struct p2m_domain *p2m, i
>> if ( !paging_mode_translate(d) )
>> return;
>>
>> + pod_lock(p2m);
>> //P2M_PRINTK("p2m audit starts\n");
>>
>> test_linear = ( (d == current->domain)
>> @@ -1247,6 +1248,8 @@ void audit_p2m(struct p2m_domain *p2m, i
>> pmbad, mpbad);
>> WARN();
>> }
>> +
>> + pod_unlock(p2m);
>> }
>> #endif /* P2M_AUDIT */
>>
>> diff -r 332775f72a30 -r 981073d78f7f xen/arch/x86/mm/p2m.c
>> --- a/xen/arch/x86/mm/p2m.c
>> +++ b/xen/arch/x86/mm/p2m.c
>> @@ -72,6 +72,7 @@ boolean_param("hap_2mb", opt_hap_2mb);
>> static void p2m_initialise(struct domain *d, struct p2m_domain *p2m)
>> {
>> mm_lock_init(&p2m->lock);
>> + mm_lock_init(&p2m->pod.lock);
>> INIT_LIST_HEAD(&p2m->np2m_list);
>> INIT_PAGE_LIST_HEAD(&p2m->pages);
>> INIT_PAGE_LIST_HEAD(&p2m->pod.super);
>> @@ -506,8 +507,10 @@ guest_physmap_add_entry(struct domain *d
>> rc = -EINVAL;
>> else
>> {
>> - p2m->pod.entry_count -= pod_count; /* Lock: p2m */
>> + pod_lock(p2m);
>> + p2m->pod.entry_count -= pod_count;
>> BUG_ON(p2m->pod.entry_count < 0);
>> + pod_unlock(p2m);
>> }
>> }
>>
>> @@ -1125,8 +1128,10 @@ p2m_flush_table(struct p2m_domain *p2m)
>> /* "Host" p2m tables can have shared entries &c that need a bit more
>> * care when discarding them */
>> ASSERT(p2m_is_nestedp2m(p2m));
>> + pod_lock(p2m);
>> ASSERT(page_list_empty(&p2m->pod.super));
>> ASSERT(page_list_empty(&p2m->pod.single));
>> + pod_unlock(p2m);
>>
>> /* This is no longer a valid nested p2m for any address space */
>> p2m->cr3 = CR3_EADDR;
>> diff -r 332775f72a30 -r 981073d78f7f xen/include/asm-x86/p2m.h
>> --- a/xen/include/asm-x86/p2m.h
>> +++ b/xen/include/asm-x86/p2m.h
>> @@ -257,24 +257,13 @@ struct p2m_domain {
>> unsigned long max_mapped_pfn;
>>
>> /* Populate-on-demand variables
>> - * NB on locking. {super,single,count} are
>> - * covered by d->page_alloc_lock, since they're almost always used in
>> - * conjunction with that functionality. {entry_count} is covered by
>> - * the domain p2m lock, since it's almost always used in conjunction
>> - * with changing the p2m tables.
>> *
>> - * At this point, both locks are held in two places. In both,
>> - * the order is [p2m,page_alloc]:
>> - * + p2m_pod_decrease_reservation() calls p2m_pod_cache_add(),
>> - * which grabs page_alloc
>> - * + p2m_pod_demand_populate() grabs both; the p2m lock to avoid
>> - * double-demand-populating of pages, the page_alloc lock to
>> - * protect moving stuff from the PoD cache to the domain page list.
>> - *
>> - * We enforce this lock ordering through a construct in mm-locks.h.
>> - * This demands, however, that we store the previous lock-ordering
>> - * level in effect before grabbing the page_alloc lock.
>> - */
>> + * All variables are protected with the pod lock. We cannot rely on
>> + * the p2m lock if it's turned into a fine-grained lock.
>> + * We only use the domain page_alloc lock for additions and
>> + * deletions to the domain's page list. Because we use it nested
>> + * within the PoD lock, we enforce its ordering (by remembering
>> + * the unlock level). */
>> struct {
>> struct page_list_head super, /* List of superpages */
>> single; /* Non-super lists */
>> @@ -283,6 +272,8 @@ struct p2m_domain {
>> unsigned reclaim_super; /* Last gpfn of a scan */
>> unsigned reclaim_single; /* Last gpfn of a scan */
>> unsigned max_guest; /* gpfn of max guest demand-populate */
>> + mm_lock_t lock; /* Locking of private pod structs, *
>> + * not relying on the p2m lock. */
>> int page_alloc_unlock_level; /* To enforce lock ordering */
>> } pod;
>> };
>>
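The "remember the unlock level" trick that last p2m.h comment refers to can be pictured with a small standalone model. This is not Xen code -- the lock names, the level values and the pthread plumbing below are all invented for illustration; the real enforcement lives in Xen's mm-locks.h:

    /* toy_lock_order.c -- build with: cc -pthread toy_lock_order.c */
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Lower level = must be taken first. */
    enum { LVL_NONE = 0, LVL_POD = 10, LVL_PAGE_ALLOC = 20 };

    static _Thread_local int cur_level = LVL_NONE;

    struct ordered_lock {
        pthread_mutex_t mtx;
        int level;
        int unlock_level;   /* level in effect before this lock was taken */
    };

    static struct ordered_lock pod_lck        = { PTHREAD_MUTEX_INITIALIZER, LVL_POD, 0 };
    static struct ordered_lock page_alloc_lck = { PTHREAD_MUTEX_INITIALIZER, LVL_PAGE_ALLOC, 0 };

    static void olock(struct ordered_lock *l)
    {
        assert(cur_level < l->level);  /* out-of-order acquisition risks deadlock */
        pthread_mutex_lock(&l->mtx);
        l->unlock_level = cur_level;   /* remember the level to restore on unlock */
        cur_level = l->level;
    }

    static void ounlock(struct ordered_lock *l)
    {
        cur_level = l->unlock_level;
        pthread_mutex_unlock(&l->mtx);
    }

    int main(void)
    {
        olock(&pod_lck);               /* PoD lock first... */
        olock(&page_alloc_lck);        /* ...page_alloc nests inside it */
        ounlock(&page_alloc_lck);
        ounlock(&pod_lck);
        puts("ordered locking OK");
        return 0;
    }

Swapping the two olock() calls in main() trips the assertion, which is the kind of ordering violation the mm-locks.h machinery is meant to catch at run time.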
Andres Lagar-Cavilla
2011-Nov-03 15:14 UTC
Re: [Xen-devel] [PATCH 8 of 9] Modify all internal p2m functions to use the new fine-grained locking
The _unlocked variants are a shorthand. I might consider removing them if
they cause too much confusion. In the case you're worried about, below, note
there is a regular gfn_to_mfn_query (which therefore locks the p2m entry)
before the drop_p2m. The previous scan uses unlocked -- we can forgo unlocked
there, and keep the p2m entry locked the whole way. But then we need to make
sure we don't overflow recursive counters, since a lock covers a 2MB range.

Andres

> On Thu, Oct 27, 2011 at 1:33 PM, Andres Lagar-Cavilla
> <andres@lagarcavilla.org> wrote:
>> xen/arch/x86/mm/hap/hap.c | 2 +-
>> xen/arch/x86/mm/hap/nested_hap.c | 21 ++-
>> xen/arch/x86/mm/p2m-ept.c | 26 +----
>> xen/arch/x86/mm/p2m-pod.c | 42 +++++--
>> xen/arch/x86/mm/p2m-pt.c | 20 +---
>> xen/arch/x86/mm/p2m.c | 185 ++++++++++++++++++++++++--------------
>> xen/include/asm-ia64/mm.h | 5 +
>> xen/include/asm-x86/p2m.h | 45 +++++++++-
>> 8 files changed, 217 insertions(+), 129 deletions(-)
>>
>>
>> This patch only modifies code internal to the p2m, adding convenience
>> macros, etc. It will yield a compiling code base but an incorrect
>> hypervisor (external callers of queries into the p2m will not unlock).
>> Next patch takes care of external callers, split done for the benefit
>> of conciseness.
>
> It's not obvious to me where in this patch to find a description of
> what the new locking regime is. What does the _unlocked() mean? When
> do I have to call that vs a different one, and when do I have to lock
> / unlock / whatever?
>
> I think that should ideally be both in the commit message (at least a
> summary), and also in a comment in a header somewhere. Perhaps it is
> already in the patch somewhere, but a quick glance through didn't find
> it...
>
>>
>> Signed-off-by: Andres Lagar-Cavilla <andres@lagarcavilla.org>
>>
>> diff -r 8a98179666de -r 471d4f2754d6 xen/arch/x86/mm/hap/hap.c
>> --- a/xen/arch/x86/mm/hap/hap.c
>> +++ b/xen/arch/x86/mm/hap/hap.c
>> @@ -861,7 +861,7 @@ hap_write_p2m_entry(struct vcpu *v, unsi
>> old_flags = l1e_get_flags(*p);
>>
>> if ( nestedhvm_enabled(d) && (old_flags & _PAGE_PRESENT)
>> - && !p2m_get_hostp2m(d)->defer_nested_flush ) {
>> + && !atomic_read(&(p2m_get_hostp2m(d)->defer_nested_flush)) ) {
>> /* We are replacing a valid entry so we need to flush nested p2ms,
>> * unless the only chan
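The recursive-counter point is easier to see with a standalone toy. Again this is not the Xen implementation: the range table, the function names and the use of a recursive pthread mutex are assumptions made purely for illustration. Each 2MB-aligned range gets one lock, a repeat lookup in the same range simply re-enters it, and the depth field is the counter that must not overflow if an entry stays locked across many queries:

    /* toy_range_lock.c -- build with: cc -pthread toy_range_lock.c */
    #include <assert.h>
    #include <pthread.h>
    #include <stdio.h>

    #define RANGE_SHIFT 9          /* 512 4k entries = one 2MB-aligned range */
    #define NR_RANGES   64         /* toy p2m covering 64 such ranges */

    struct range_lock {
        pthread_mutex_t mtx;       /* recursive, so nested queries re-enter */
        int depth;                 /* the recursion counter that must not overflow */
    };

    static struct range_lock range[NR_RANGES];

    static void ranges_init(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
        for (int i = 0; i < NR_RANGES; i++)
            pthread_mutex_init(&range[i].mtx, &attr);
    }

    static struct range_lock *range_of(unsigned long gfn)
    {
        return &range[(gfn >> RANGE_SHIFT) % NR_RANGES];
    }

    static void lock_gfn(unsigned long gfn)
    {
        struct range_lock *l = range_of(gfn);
        pthread_mutex_lock(&l->mtx);   /* re-entry by the owner just succeeds */
        l->depth++;
    }

    static void unlock_gfn(unsigned long gfn)
    {
        struct range_lock *l = range_of(gfn);
        assert(l->depth > 0);
        l->depth--;
        pthread_mutex_unlock(&l->mtx);
    }

    int main(void)
    {
        ranges_init();
        lock_gfn(0x1234);     /* gfns 0x1200-0x13ff share this range lock */
        lock_gfn(0x1300);     /* second query in the same range: depth == 2 */
        unlock_gfn(0x1300);
        unlock_gfn(0x1234);
        printf("recursive range locking OK\n");
        return 0;
    }

Keeping a whole 2MB range locked for the lifetime of a long scan is exactly where a depth counter like this could grow without bound.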
Andres Lagar-Cavilla
2011-Nov-03 15:16 UTC
Re: [Xen-devel] [PATCH 8 of 9] Modify all internal p2m functions to use the new fine-grained locking
Hey,

> Hi,
>
> At 07:24 -0700 on 02 Nov (1320218674), andres@lagarcavilla.com wrote:
>> >> +/* No matter what value you get out of a query, the p2m has been locked for
>> >> + * that range. No matter what you do, you need to drop those locks.
>> >> + * You need to pass back the mfn obtained when locking, not the new one,
>> >> + * as the refcount of the original mfn was bumped. */
>> >
>> > Surely the caller doesn't need to remember the old MFN for this? After
>> > all, the whole point of the lock was that nobody else could change the
>> > p2m entry under our feet!
>> >
>> > In any case, I think there needs to be a big block comment a bit further
>> > up that describes what all this locking and refcounting does, and why.
>>
>> Comment will be added. I was being doubly-paranoid. I can undo the
>> get_page/put_page of the old mfn. I'm not 100% behind it.
>
> I meant to suggest that the p2m code should be able to do the
> get_page/put_page without the caller remembering the mfn, since by
> definition it should be able to look it up in the unlock, knowing no-one
> else can have changed it.

Not necessarily. How about sharing? Or paging out? Tricky, tricky. I guess
the easy fix is to tell drop_p2m to not put_page in those cases.

Andres

>
>> I don't think these names are the most terrible -- we've all seen far
>> worse :) I mean, the naming encodes the arguments, and I don't see an
>> intrinsic advantage to
>> gfn_to_mfn(d, g, t, p2m_guest, p2m_unlocked)
>> over
>> gfn_to_mfn_guest_unlocked(d,g,t)
>
> Yep, it's definitely not the worst. :) It's really just a question of
> verbosity in the headers.
>
> Tim.
>
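One way to picture the point of contention above, as a standalone toy rather than anything Xen actually does: if the table records which frame it referenced at get time, the put side can drop the right reference even when paging or sharing has changed the translation in the meantime, and the caller never has to pass the mfn back. Whether the p2m should work this way is precisely what is being debated here; every name and structure below is invented.

    /* toy_get_put_ref.c -- a sketch only; not the Xen p2m */
    #include <assert.h>
    #include <stdio.h>

    #define NR_ENTRIES 16
    #define NR_FRAMES  64

    struct toy_entry {
        unsigned long mfn;        /* current translation */
        unsigned long locked_mfn; /* frame referenced at get time */
    };

    static struct toy_entry p2m[NR_ENTRIES];
    static int frame_ref[NR_FRAMES];  /* toy per-frame refcounts */

    static unsigned long get_gfn_toy(unsigned long gfn)
    {
        struct toy_entry *e = &p2m[gfn % NR_ENTRIES];
        e->locked_mfn = e->mfn;       /* remember what we referenced */
        frame_ref[e->locked_mfn]++;
        return e->locked_mfn;
    }

    static void put_gfn_toy(unsigned long gfn)
    {
        struct toy_entry *e = &p2m[gfn % NR_ENTRIES];
        /* Drop the ref on the recorded frame, not on e->mfn: the
         * translation may have been replaced while we held it. */
        frame_ref[e->locked_mfn]--;
        assert(frame_ref[e->locked_mfn] >= 0);
    }

    int main(void)
    {
        p2m[3].mfn = 42;
        unsigned long mfn = get_gfn_toy(3);
        p2m[3].mfn = 17;              /* e.g. the frame was swapped by sharing */
        put_gfn_toy(3);               /* still drops the ref on frame 42 */
        printf("looked up mfn %lu; ref on frame 42 is now %d\n", mfn, frame_ref[42]);
        return 0;
    }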
Andres Lagar-Cavilla
2011-Nov-03 15:20 UTC
Re: [Xen-devel] [PATCH 9 of 9] Modify all call sites of queries into the p2m to use the new fine-grained locking
get_gfn/put_gfn it is. I'll figure out set_gfn (or an alternative) on the way.
Next patch coming with that naming. It will make everyone cringe :)

Andres

> At 07:32 -0700 on 02 Nov (1320219175), andres@lagarcavilla.com wrote:
>> I don't know that a massive sed on all these names is a good idea. I guess
>> forcing everyone to compile-fail will also make them realize they need to
>> add a call to drop the p2m locks they got...
>>
>> Can you elaborate on the naming preferences here: would you prefer
>> gfn_to_mfn/put_gfn? get_p2m_gfn/put_p2m_gfn? get_gfn/put_gfn
>
> I think I'd prefer get_gfn/put_gfn. And maybe set_gfn for writes too?
> But I'm willing to be persuaded otherwise if anyone feels strongly about it.
>
> Tim.
>
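As a closing illustration of the call-site discipline the get_gfn/put_gfn naming is meant to force -- every lookup paired with a put on every exit path, error paths included -- here is a standalone toy. The types and bodies are stand-ins, not the interface that eventually grew out of this series.

    /* toy_get_put_pairing.c -- stand-ins only, not the Xen API */
    #include <stdio.h>

    typedef enum { T_INVALID, T_RAM } toy_type_t;

    static int outstanding;                 /* stands in for held range locks/refs */

    static unsigned long get_gfn(unsigned long gfn, toy_type_t *t)
    {
        outstanding++;                      /* lock the range, ref the frame */
        *t = (gfn < 8) ? T_RAM : T_INVALID;
        return gfn + 100;                   /* fake translation */
    }

    static void put_gfn(unsigned long gfn)
    {
        (void)gfn;
        outstanding--;                      /* drop the ref, unlock the range */
    }

    static int toy_handler(unsigned long gfn)
    {
        toy_type_t t;
        unsigned long mfn = get_gfn(gfn, &t);

        if (t != T_RAM) {
            put_gfn(gfn);                   /* the error path must drop it too */
            return -1;
        }
        printf("gfn %lu -> mfn %lu\n", gfn, mfn);
        put_gfn(gfn);
        return 0;
    }

    int main(void)
    {
        toy_handler(3);
        toy_handler(12);
        printf("outstanding after both calls: %d (should be 0)\n", outstanding);
        return 0;
    }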