thr3ads.net - Xen devel - [RFC PATCH 0/2] gnttab: refactor locking for better scalability [Nov 2013]

If this information is useful, please help other people find it:
Share via:

Matt Wilson

2013-Nov-12 02:03 UTC

[RFC PATCH 0/2] gnttab: refactor locking for better scalability

From: Matt Wilson <msw@amazon.com>

As discussed in the Xen Developer Summit Storage Performance BoF,
there is a lot of room for improvement in grant table locking. Anthony
and I have been working on refactoring the locking over the past few
weeks. The performance improvement is considerable and I''d like to
hear from others if this approach is fundamentally wrong for some
reason.

The previous single spinlock per grant table is split into multiple
locks. The heavily modified components of the grant table (the
maptrack state and the active entries) are now protected by their own
spinlocks. The remaining elements of the grant table are read-mostly,
so I modified the main grant table lock to be a rwlock to improve
concurrency.

On the performance improvement: Without persistent grants, a domU with
24 VBDs plummbed to local HDDs in a streaming 2M write workload
achieved 1,400 MB/sec before this change. Performance more than
doubles with this patch, reaching 3,000 MB/sec before tuning and 3,600
MB/sec after adjusting event channel vCPU bindings.

I included the previously posted patch to __gnttab_unmap_common() in
the series since it makes a bit more sense in this context, and the
follow on refactor patch is on top of it.

DISCLAIMER: I ported this patch from a different Xen version earlier
today, and I''ve only compile tested so far. In the original state
we''ve pushed a lot of concurrent I/O through dom0 and haven''t
seen any
stability issues.

Matt Wilson (2):
  gnttab: lock the local grant table earlier in __gnttab_unmap_common()
  gnttab: refactor locking for better scalability

 docs/misc/grant-tables.txt    |   56 +++++++-
 xen/arch/x86/mm.c             |    4 +-
 xen/common/grant_table.c      |  308 ++++++++++++++++++++++++++---------------
 xen/include/xen/grant_table.h |    9 +-
 4 files changed, 261 insertions(+), 116 deletions(-)

-- 
1.7.9.5

Matt Wilson

2013-Nov-12 02:03 UTC

head link

[RFC PATCH 1/2] gnttab: lock the local grant table earlier in __gnttab_unmap_common()

From: Matt Wilson <msw@amazon.com>

Luckily today maptrack_limit never shrinks. But if at some point in
the future this were to change, checking maptrack_limit without
holding the grant table spinlock would no longer be safe.

Cc: xen-devel@lists.xenproject.org
Cc: Keir Fraser <keir@xen.org>
Cc: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Anthony Liguori <aliguori@amazon.com>
Signed-off-by: Matt Wilson <msw@amazon.com>

---
v1->v2:
* updated summary to use "local" instead of "left"
---
 xen/common/grant_table.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index 21c6a14..ef10ff4 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -842,15 +842,16 @@ __gnttab_unmap_common(
 
     op->frame = (unsigned long)(op->dev_bus_addr >> PAGE_SHIFT);
 
+    spin_lock(&lgt->lock);
     if ( unlikely(op->handle >= lgt->maptrack_limit) )
     {
+        spin_unlock(&lgt->lock);
         gdprintk(XENLOG_INFO, "Bad handle (%d).\n", op->handle);
         op->status = GNTST_bad_handle;
         return;
     }
 
     op->map = &maptrack_entry(lgt, op->handle);
-    spin_lock(&lgt->lock);
 
     if ( unlikely(!op->map->flags) )
     {
-- 
1.7.9.5

Matt Wilson

2013-Nov-12 02:03 UTC

head link

[RFC PATCH 2/2] gnttab: refactor locking for better scalability

From: Matt Wilson <msw@amazon.com>

This patch refactors grant table locking. It splits the previous
single spinlock per grant table into multiple locks. The heavily
modified components of the grant table (the maptrack state and the
active entries) are now protected by their own spinlocks. The
remaining elements of the grant table are read-mostly, so the main
grant table lock is modified to be a rwlock to improve concurrency.

Workloads with high grant table operation rates, especially map/unmap
operations as used by blkback/blkfront when persistent grants are not
supported, show marked improvement with these changes. A domU with 24
VBDs in a streaming 2M write workload achieved 1,400 MB/sec before
this change. Performance more than doubles with this patch, reaching
3,000 MB/sec before tuning and 3,600 MB/sec after adjusting event
channel vCPU bindings.

Signed-off-by: Matt Wilson <msw@amazon.com>
---
 docs/misc/grant-tables.txt    |   56 +++++++-
 xen/arch/x86/mm.c             |    4 +-
 xen/common/grant_table.c      |  309 ++++++++++++++++++++++++++---------------
 xen/include/xen/grant_table.h |    9 +-
 4 files changed, 261 insertions(+), 117 deletions(-)

diff --git a/docs/misc/grant-tables.txt b/docs/misc/grant-tables.txt
index 19db4ec..54bcba5 100644
--- a/docs/misc/grant-tables.txt
+++ b/docs/misc/grant-tables.txt
@@ -63,6 +63,7 @@ is complete.
   act->domid : remote domain being granted rights
   act->frame : machine frame being granted
   act->pin   : used to hold reference counts
+  act->lock  : spinlock used to serialize access to active entry state
 
  Map tracking
  ~~~~~~~~~~~~
@@ -74,7 +75,60 @@ is complete.
  matching map track entry is then removed, as if unmap had been invoked.
  These are not used by the transfer mechanism.
   map->domid         : owner of the mapped frame
-  map->ref_and_flags : grant reference, ro/rw, mapped for host or device
access
+  map->ref           : grant reference
+  map->flags         : ro/rw, mapped for host or device access
+
+********************************************************************************
+ Locking
+ ~~~~~~~
+ Xen uses several locks to serialize access to the internal grant table state.
+
+  grant_table->lock          : rwlock used to prevent readers from accessing
+                               inconsistent grant table state such as current
+                               version, partially initialized active table
pages,
+                               etc.
+  grant_table->maptrack_lock : spinlock used to protect the maptrack state
+  active_grant_entry->lock   : spinlock used to serialize modifications to
+                               active entries
+
+ The primary lock for the grant table is a read/write spinlock. All
+ functions that access members of struct grant_table must acquire a
+ read lock around critical sections. Any modification to the members
+ of struct grant_table (e.g., nr_status_frames, nr_grant_frames,
+ active frames, etc.) must only be made if the write lock is
+ held. These elements are read-mostly, and read critical sections can
+ be large, which makes a rwlock a good choice.
+
+ The maptrack state is protected by its own spinlock. Any access (read
+ or write) of struct grant_table members that have a "maptrack_"
+ prefix must be made while holding the maptrack lock. The maptrack
+ state can be rapidly modified under some workloads, and the critical
+ sections are very small, thus we use a spinlock to protect them.
+
+ Active entries are obtained by calling active_entry_acquire(gt, ref).
+ This function returns a pointer to the active entry after locking its
+ spinlock. The caller must hold the rwlock for the grant table in
+ question before calling active_entry_acquire(). This is because the
+ grant table can be dynamically extended via gnttab_grow_table() while
+ a domain is running and each entry''s spinlock must initialized before
+ it is used. Additionally, when the local domain is running
+ paravirtualized under iommu=strict mode, the mapcount() function
+ needs the active entries of the remote domain to be unchanged. Thus
+ the mapcount() function must be called with the remote table write
+ lock held, which prevents modifications to active entries when these
+ rules are observed.
+
+ Once all access to the active entry is complete, release the lock by
+ calling active_entry_release(act).
+
+ Summary of rules for locking:
+  active_entry_acquire() and active_entry_release() can only be
+  called when holding the relevant grant table''s lock. I.e.:
+    read_lock(&gt->lock);
+    act = active_entry_acquire(gt, ref);
+    ...
+    active_entry_release(act);
+    read_unlock(&gt->lock);
 
 ********************************************************************************
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 43aaceb..bd7f331 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4537,7 +4537,7 @@ static int xenmem_add_to_physmap_once(
                 mfn = virt_to_mfn(d->shared_info);
             break;
         case XENMAPSPACE_grant_table:
-            spin_lock(&d->grant_table->lock);
+            write_lock(&d->grant_table->lock);
 
             if ( d->grant_table->gt_version == 0 )
                 d->grant_table->gt_version = 1;
@@ -4560,7 +4560,7 @@ static int xenmem_add_to_physmap_once(
                     mfn = virt_to_mfn(d->grant_table->shared_raw[idx]);
             }
 
-            spin_unlock(&d->grant_table->lock);
+            write_unlock(&d->grant_table->lock);
             break;
         case XENMAPSPACE_gmfn_range:
         case XENMAPSPACE_gmfn:
diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c
index ef10ff4..7eb8839 100644
--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -141,10 +141,13 @@ struct active_grant_entry {
                                in the page.                           */
     unsigned      length:16; /* For sub-page grants, the length of the
                                 grant.                                */
+    spinlock_t    lock;      /* lock to protect access of this entry.
+                                see docs/misc/grant-tables.txt for
+                                locking protocol                      */
 };
 
 #define ACGNT_PER_PAGE (PAGE_SIZE / sizeof(struct active_grant_entry))
-#define active_entry(t, e) \
+#define active_entry_unlocked(t, e) \
     ((t)->active[(e)/ACGNT_PER_PAGE][(e)%ACGNT_PER_PAGE])
 
 static inline unsigned int
@@ -166,6 +169,29 @@ nr_active_grant_frames(struct grant_table *gt)
     return num_act_frames_from_sha_frames(nr_grant_frames(gt));
 }
 
+static inline struct active_grant_entry *
+active_entry_acquire(struct grant_table *t, grant_ref_t e)
+{
+    struct active_grant_entry *act;
+
+#ifndef NDEBUG
+    /* not perfect, but better than nothing for a debug build
+     * sanity check
+     */
+    BUG_ON(!rw_is_locked(&t->lock));
+#endif
+
+    act = &active_entry_unlocked(t, e);
+    spin_lock(&act->lock);
+
+    return act;
+}
+
+static inline void active_entry_release(struct active_grant_entry *act)
+{
+    spin_unlock(&act->lock);
+}
+    
 /* Check if the page has been paged out, or needs unsharing. 
    If rc == GNTST_okay, *page contains the page struct with a ref taken.
    Caller must do put_page(*page).
@@ -206,30 +232,6 @@ static int __get_paged_frame(unsigned long gfn, unsigned
long *frame, struct pag
     return rc;
 }
 
-static inline void
-double_gt_lock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    if ( lgt < rgt )
-    {
-        spin_lock(&lgt->lock);
-        spin_lock(&rgt->lock);
-    }
-    else
-    {
-        if ( lgt != rgt )
-            spin_lock(&rgt->lock);
-        spin_lock(&lgt->lock);
-    }
-}
-
-static inline void
-double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt)
-{
-    spin_unlock(&lgt->lock);
-    if ( lgt != rgt )
-        spin_unlock(&rgt->lock);
-}
-
 static inline int
 __get_maptrack_handle(
     struct grant_table *t)
@@ -245,10 +247,10 @@ static inline void
 put_maptrack_handle(
     struct grant_table *t, int handle)
 {
-    spin_lock(&t->lock);
+    spin_lock(&t->maptrack_lock);
     maptrack_entry(t, handle).ref = t->maptrack_head;
     t->maptrack_head = handle;
-    spin_unlock(&t->lock);
+    spin_unlock(&t->maptrack_lock);
 }
 
 static inline int
@@ -260,7 +262,7 @@ get_maptrack_handle(
     struct grant_mapping *new_mt;
     unsigned int          new_mt_limit, nr_frames;
 
-    spin_lock(&lgt->lock);
+    spin_lock(&lgt->maptrack_lock);
 
     while ( unlikely((handle = __get_maptrack_handle(lgt)) == -1) )
     {
@@ -289,12 +291,13 @@ get_maptrack_handle(
                  nr_frames + 1);
     }
 
-    spin_unlock(&lgt->lock);
+    spin_unlock(&lgt->maptrack_lock);
 
     return handle;
 }
 
-/* Number of grant table entries. Caller must hold d''s grant table
lock. */
+/* Number of grant table entries. Caller must hold d''s grant table
+   read lock. */
 static unsigned int nr_grant_entries(struct grant_table *gt)
 {
     ASSERT(gt->gt_version != 0);
@@ -478,6 +481,12 @@ static int _set_status(unsigned gt_version,
         return _set_status_v2(domid, readonly, mapflag, shah, act, status);
 }
 
+/* Count the number of mapped ro or rw entries tracked in the local
+ * grant table given a mfn from a remote domain. 
+ *
+ * This function takes the maptrack lock from the local grant table and
+ * must be called with the remote grant table''s write lock held.
+ */
 static void mapcount(
     struct grant_table *lgt, struct domain *rd, unsigned long mfn,
     unsigned int *wrc, unsigned int *rdc)
@@ -487,15 +496,19 @@ static void mapcount(
 
     *wrc = *rdc = 0;
 
+    spin_lock(&lgt->maptrack_lock);
+
     for ( handle = 0; handle < lgt->maptrack_limit; handle++ )
     {
         map = &maptrack_entry(lgt, handle);
         if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ||
              map->domid != rd->domain_id )
             continue;
-        if ( active_entry(rd->grant_table, map->ref).frame == mfn )
+        if ( active_entry_unlocked(rd->grant_table, map->ref).frame ==
mfn )
             (map->flags & GNTMAP_readonly) ? (*rdc)++ : (*wrc)++;
     }
+
+    spin_unlock(&lgt->maptrack_lock);
 }
 
 /*
@@ -517,7 +530,6 @@ __gnttab_map_grant_ref(
     struct page_info *pg = NULL;
     int            rc = GNTST_okay;
     u32            old_pin;
-    u32            act_pin;
     unsigned int   cache_flags;
     struct active_grant_entry *act = NULL;
     struct grant_mapping *mt;
@@ -525,6 +537,7 @@ __gnttab_map_grant_ref(
     grant_entry_v2_t *sha2;
     grant_entry_header_t *shah;
     uint16_t *status;
+    bool_t write_lock_needed = 0;
 
     led = current;
     ld = led->domain;
@@ -570,7 +583,14 @@ __gnttab_map_grant_ref(
     }
 
     rgt = rd->grant_table;
-    spin_lock(&rgt->lock);
+
+    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+        write_lock_needed = 1;
+
+    if ( write_lock_needed )
+        write_lock(&rgt->lock);
+    else
+        read_lock(&rgt->lock);
 
     if ( rgt->gt_version == 0 )
         PIN_FAIL(unlock_out, GNTST_general_error,
@@ -580,7 +600,7 @@ __gnttab_map_grant_ref(
     if ( unlikely(op->ref >= nr_grant_entries(rgt)))
         PIN_FAIL(unlock_out, GNTST_bad_gntref, "Bad ref (%d).\n",
op->ref);
 
-    act = &active_entry(rgt, op->ref);
+    act = active_entry_acquire(rgt, op->ref);
     shah = shared_entry_header(rgt, op->ref);
     if (rgt->gt_version == 1) {
         sha1 = &shared_entry_v1(rgt, op->ref);
@@ -597,7 +617,7 @@ __gnttab_map_grant_ref(
          ((act->domid != ld->domain_id) ||
           (act->pin & 0x80808080U) != 0 ||
           (act->is_sub_page)) )
-        PIN_FAIL(unlock_out, GNTST_general_error,
+        PIN_FAIL(act_release_out, GNTST_general_error,
                  "Bad domain (%d != %d), or risk of counter overflow %08x,
or subpage %d\n",
                  act->domid, ld->domain_id, act->pin,
act->is_sub_page);
 
@@ -608,7 +628,7 @@ __gnttab_map_grant_ref(
         if ( (rc = _set_status(rgt->gt_version, ld->domain_id,
                                op->flags & GNTMAP_readonly,
                                1, shah, act, status) ) != GNTST_okay )
-             goto unlock_out;
+             goto act_release_out;
 
         if ( !act->pin )
         {
@@ -639,12 +659,9 @@ __gnttab_map_grant_ref(
             GNTPIN_hstr_inc : GNTPIN_hstw_inc;
 
     frame = act->frame;
-    act_pin = act->pin;
 
     cache_flags = (shah->flags & (GTF_PAT | GTF_PWT | GTF_PCD) );
 
-    spin_unlock(&rgt->lock);
-
     /* pg may be set, with a refcount included, from __get_paged_frame */
     if ( !pg )
     {
@@ -719,8 +736,6 @@ __gnttab_map_grant_ref(
         goto undo_out;
     }
 
-    double_gt_lock(lgt, rgt);
-
     if ( !is_hvm_domain(ld) && need_iommu(ld) )
     {
         unsigned int wrc, rdc;
@@ -730,21 +745,20 @@ __gnttab_map_grant_ref(
         /* We''re not translated, so we know that gmfns and mfns are
            the same things, so the IOMMU entry is always 1-to-1. */
         mapcount(lgt, rd, frame, &wrc, &rdc);
-        if ( (act_pin & (GNTPIN_hstw_mask|GNTPIN_devw_mask)) &&
+        if ( (act->pin & (GNTPIN_hstw_mask|GNTPIN_devw_mask)) &&
              !(old_pin & (GNTPIN_hstw_mask|GNTPIN_devw_mask)) )
         {
             if ( wrc == 0 )
                 err = iommu_map_page(ld, frame, frame,
                                      IOMMUF_readable|IOMMUF_writable);
         }
-        else if ( act_pin && !old_pin )
+        else if ( act->pin && !old_pin )
         {
             if ( (wrc + rdc) == 0 )
                 err = iommu_map_page(ld, frame, frame, IOMMUF_readable);
         }
         if ( err )
         {
-            double_gt_unlock(lgt, rgt);
             rc = GNTST_general_error;
             goto undo_out;
         }
@@ -757,7 +771,11 @@ __gnttab_map_grant_ref(
     mt->ref   = op->ref;
     mt->flags = op->flags;
 
-    double_gt_unlock(lgt, rgt);
+    active_entry_release(act);
+    if ( write_lock_needed )
+        write_unlock(&rgt->lock);
+    else
+        read_unlock(&rgt->lock);
 
     op->dev_bus_addr = (u64)frame << PAGE_SHIFT;
     op->handle       = handle;
@@ -780,10 +798,6 @@ __gnttab_map_grant_ref(
         put_page(pg);
     }
 
-    spin_lock(&rgt->lock);
-
-    act = &active_entry(rgt, op->ref);
-
     if ( op->flags & GNTMAP_device_map )
         act->pin -= (op->flags & GNTMAP_readonly) ?
             GNTPIN_devr_inc : GNTPIN_devw_inc;
@@ -799,8 +813,14 @@ __gnttab_map_grant_ref(
     if ( !act->pin )
         gnttab_clear_flag(_GTF_reading, status);
 
+ act_release_out:
+    active_entry_release(act);
+
  unlock_out:
-    spin_unlock(&rgt->lock);
+    if ( write_lock_needed )
+        write_unlock(&rgt->lock);
+    else
+        read_unlock(&rgt->lock);
     op->status = rc;
     put_maptrack_handle(lgt, handle);
     rcu_unlock_domain(rd);
@@ -836,33 +856,33 @@ __gnttab_unmap_common(
     struct grant_table *lgt, *rgt;
     struct active_grant_entry *act;
     s16              rc = 0;
+    bool_t write_lock_needed = 0;
 
     ld = current->domain;
     lgt = ld->grant_table;
 
     op->frame = (unsigned long)(op->dev_bus_addr >> PAGE_SHIFT);
 
-    spin_lock(&lgt->lock);
+    read_lock(&lgt->lock);
     if ( unlikely(op->handle >= lgt->maptrack_limit) )
     {
-        spin_unlock(&lgt->lock);
+        read_unlock(&lgt->lock);
         gdprintk(XENLOG_INFO, "Bad handle (%d).\n", op->handle);
         op->status = GNTST_bad_handle;
         return;
     }
 
+    read_unlock(&lgt->lock);
     op->map = &maptrack_entry(lgt, op->handle);
 
     if ( unlikely(!op->map->flags) )
     {
-        spin_unlock(&lgt->lock);
         gdprintk(XENLOG_INFO, "Zero flags for handle (%d).\n",
op->handle);
         op->status = GNTST_bad_handle;
         return;
     }
 
     dom = op->map->domid;
-    spin_unlock(&lgt->lock);
 
     if ( unlikely((rd = rcu_lock_domain_by_id(dom)) == NULL) )
     {
@@ -883,7 +903,6 @@ __gnttab_unmap_common(
     TRACE_1D(TRC_MEM_PAGE_GRANT_UNMAP, dom);
 
     rgt = rd->grant_table;
-    double_gt_lock(lgt, rgt);
 
     op->flags = op->map->flags;
     if ( unlikely(!op->flags) || unlikely(op->map->domid != dom) )
@@ -894,7 +913,16 @@ __gnttab_unmap_common(
     }
 
     op->rd = rd;
-    act = &active_entry(rgt, op->map->ref);
+
+    if ( !is_hvm_domain(ld) && need_iommu(ld) )
+        write_lock_needed = 1;
+
+    if ( write_lock_needed )
+        write_lock(&rgt->lock);
+    else
+        read_lock(&rgt->lock);
+
+    act = active_entry_acquire(rgt, op->map->ref);
 
     if ( op->frame == 0 )
     {
@@ -903,7 +931,7 @@ __gnttab_unmap_common(
     else
     {
         if ( unlikely(op->frame != act->frame) )
-            PIN_FAIL(unmap_out, GNTST_general_error,
+            PIN_FAIL(act_release_out, GNTST_general_error,
                      "Bad frame number doesn''t match gntref. (%lx
!= %lx)\n",
                      op->frame, act->frame);
         if ( op->flags & GNTMAP_device_map )
@@ -922,7 +950,7 @@ __gnttab_unmap_common(
         if ( (rc = replace_grant_host_mapping(op->host_addr,
                                               op->frame, op->new_addr, 
                                               op->flags)) < 0 )
-            goto unmap_out;
+            goto act_release_out;
 
         ASSERT(act->pin & (GNTPIN_hstw_mask | GNTPIN_hstr_mask));
         op->map->flags &= ~GNTMAP_host_map;
@@ -945,7 +973,7 @@ __gnttab_unmap_common(
         if ( err )
         {
             rc = GNTST_general_error;
-            goto unmap_out;
+            goto act_release_out;
         }
     }
 
@@ -953,8 +981,15 @@ __gnttab_unmap_common(
     if ( !(op->flags & GNTMAP_readonly) )
          gnttab_mark_dirty(rd, op->frame);
 
+ act_release_out:
+    active_entry_release(act);
+
+    if ( write_lock_needed )
+        write_unlock(&rgt->lock);
+    else
+        read_unlock(&rgt->lock);
+
  unmap_out:
-    double_gt_unlock(lgt, rgt);
     op->status = rc;
     rcu_unlock_domain(rd);
 }
@@ -984,12 +1019,12 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common
*op)
 
     rcu_lock_domain(rd);
     rgt = rd->grant_table;
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
     if ( rgt->gt_version == 0 )
         goto unmap_out;
 
-    act = &active_entry(rgt, op->map->ref);
+    act = active_entry_acquire(rgt, op->map->ref);
     sha = shared_entry_header(rgt, op->map->ref);
 
     if ( rgt->gt_version == 1 )
@@ -1003,7 +1038,7 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common
*op)
          * Suggests that __gntab_unmap_common failed early and so
          * nothing further to do
          */
-        goto unmap_out;
+        goto act_release_out;
     }
 
     pg = mfn_to_page(op->frame);
@@ -1027,7 +1062,7 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common
*op)
              * Suggests that __gntab_unmap_common failed in
              * replace_grant_host_mapping() so nothing further to do
              */
-            goto unmap_out;
+            goto act_release_out;
         }
 
         if ( !is_iomem_page(op->frame) ) 
@@ -1048,8 +1083,11 @@ __gnttab_unmap_common_complete(struct gnttab_unmap_common
*op)
     if ( act->pin == 0 )
         gnttab_clear_flag(_GTF_reading, status);
 
+ act_release_out:
+    active_entry_release(act);
+    
  unmap_out:
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     if ( put_handle )
     {
         op->map->flags = 0;
@@ -1235,13 +1273,16 @@ gnttab_unpopulate_status_frames(struct domain *d, struct
grant_table *gt)
     gt->nr_status_frames = 0;
 }
 
+/* Grow the grant table. The caller must hold the grant table''s
+ * write lock before calling this function.
+ */
 int
 gnttab_grow_table(struct domain *d, unsigned int req_nr_frames)
 {
-    /* d''s grant table lock must be held by the caller */
+    /* d''s grant table write lock must be held by the caller */
 
     struct grant_table *gt = d->grant_table;
-    unsigned int i;
+    unsigned int i, j;
 
     ASSERT(req_nr_frames <= max_nr_grant_frames);
 
@@ -1256,6 +1297,10 @@ gnttab_grow_table(struct domain *d, unsigned int
req_nr_frames)
         if ( (gt->active[i] = alloc_xenheap_page()) == NULL )
             goto active_alloc_failed;
         clear_page(gt->active[i]);
+        for ( j = 0; j < ACGNT_PER_PAGE; j++ )
+        {
+            spin_lock_init(&gt->active[i][j].lock);
+        }
     }
 
     /* Shared */
@@ -1343,7 +1388,7 @@ gnttab_setup_table(
     }
 
     gt = d->grant_table;
-    spin_lock(&gt->lock);
+    write_lock(&gt->lock);
 
     if ( gt->gt_version == 0 )
         gt->gt_version = 1;
@@ -1371,7 +1416,7 @@ gnttab_setup_table(
     }
 
  out3:
-    spin_unlock(&gt->lock);
+    write_unlock(&gt->lock);
  out2:
     rcu_unlock_domain(d);
  out1:
@@ -1413,13 +1458,13 @@ gnttab_query_size(
         goto query_out_unlock;
     }
 
-    spin_lock(&d->grant_table->lock);
+    read_lock(&d->grant_table->lock);
 
     op.nr_frames     = nr_grant_frames(d->grant_table);
     op.max_nr_frames = max_nr_grant_frames;
     op.status        = GNTST_okay;
 
-    spin_unlock(&d->grant_table->lock);
+    read_unlock(&d->grant_table->lock);
 
  
  query_out_unlock:
@@ -1445,7 +1490,7 @@ gnttab_prepare_for_transfer(
     union grant_combo   scombo, prev_scombo, new_scombo;
     int                 retries = 0;
 
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
     if ( rgt->gt_version == 0 )
     {
@@ -1496,11 +1541,11 @@ gnttab_prepare_for_transfer(
         scombo = prev_scombo;
     }
 
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     return 1;
 
  fail:
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     return 0;
 }
 
@@ -1692,7 +1737,7 @@ gnttab_transfer(
         TRACE_1D(TRC_MEM_PAGE_GRANT_TRANSFER, e->domain_id);
 
         /* Tell the guest about its new page frame. */
-        spin_lock(&e->grant_table->lock);
+        read_lock(&e->grant_table->lock);
 
         if ( e->grant_table->gt_version == 1 )
         {
@@ -1710,7 +1755,7 @@ gnttab_transfer(
         shared_entry_header(e->grant_table, gop.ref)->flags |            
GTF_transfer_completed;
 
-        spin_unlock(&e->grant_table->lock);
+        read_unlock(&e->grant_table->lock);
 
         rcu_unlock_domain(e);
 
@@ -1748,9 +1793,9 @@ __release_grant_for_copy(
     released_read = 0;
     released_write = 0;
 
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
-    act = &active_entry(rgt, gref);
+    act = active_entry_acquire(rgt, gref);
     sha = shared_entry_header(rgt, gref);
     r_frame = act->frame;
 
@@ -1789,7 +1834,8 @@ __release_grant_for_copy(
         released_read = 1;
     }
 
-    spin_unlock(&rgt->lock);
+    active_entry_release(act);
+    read_unlock(&rgt->lock);
 
     if ( td != rd )
     {
@@ -1847,7 +1893,7 @@ __acquire_grant_for_copy(
 
     *page = NULL;
 
-    spin_lock(&rgt->lock);
+    read_lock(&rgt->lock);
 
     if ( rgt->gt_version == 0 )
         PIN_FAIL(unlock_out, GNTST_general_error,
@@ -1857,7 +1903,7 @@ __acquire_grant_for_copy(
         PIN_FAIL(unlock_out, GNTST_bad_gntref,
                  "Bad grant reference %ld\n", gref);
 
-    act = &active_entry(rgt, gref);
+    act = active_entry_acquire(rgt, gref);
     shah = shared_entry_header(rgt, gref);
     if ( rgt->gt_version == 1 )
     {
@@ -1885,7 +1931,7 @@ __acquire_grant_for_copy(
         if ( (rc = _set_status(rgt->gt_version, ldom,
                                readonly, 0, shah, act,
                                status) ) != GNTST_okay )
-             goto unlock_out;
+             goto act_release_out;
 
         td = rd;
         trans_gref = gref;
@@ -1916,17 +1962,25 @@ __acquire_grant_for_copy(
                 PIN_FAIL(unlock_out_clear, GNTST_general_error,
                          "transitive grant referenced bad domain
%d\n",
                          trans_domid);
-            spin_unlock(&rgt->lock);
+
+            /* __acquire_grant_for_copy() could take the read lock on
+               the remote grant table (if rd == td), so we have to
+               drop the lock here and reacquire */
+            active_entry_release(act);
+            read_unlock(&rgt->lock);
 
             rc = __acquire_grant_for_copy(td, trans_gref, rd->domain_id,
                                           readonly, &grant_frame, page,
                                           &trans_page_off,
&trans_length, 0);
 
-            spin_lock(&rgt->lock);
+            read_lock(&rgt->lock);
+            act = active_entry_acquire(rgt, gref);
+
             if ( rc != GNTST_okay ) {
                 __fixup_status_for_copy_pin(act, status);
+                active_entry_release(act);
+                read_unlock(&rgt->lock);
                 rcu_unlock_domain(td);
-                spin_unlock(&rgt->lock);
                 return rc;
             }
 
@@ -1938,7 +1992,8 @@ __acquire_grant_for_copy(
             {
                 __fixup_status_for_copy_pin(act, status);
                 rcu_unlock_domain(td);
-                spin_unlock(&rgt->lock);
+                active_entry_release(act);
+                read_unlock(&rgt->lock);
                 put_page(*page);
                 return __acquire_grant_for_copy(rd, gref, ldom, readonly,
                                                 frame, page, page_off, length,
@@ -2006,7 +2061,8 @@ __acquire_grant_for_copy(
     *length = act->length;
     *frame = act->frame;
 
-    spin_unlock(&rgt->lock);
+    active_entry_release(act);
+    read_unlock(&rgt->lock);
     return rc;
  
  unlock_out_clear:
@@ -2017,8 +2073,11 @@ __acquire_grant_for_copy(
     if ( !act->pin )
         gnttab_clear_flag(_GTF_reading, status);
 
+ act_release_out:
+    active_entry_release(act);
+
  unlock_out:
-    spin_unlock(&rgt->lock);
+    read_unlock(&rgt->lock);
     return rc;
 }
 
@@ -2192,7 +2251,7 @@
gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
     if ( gt->gt_version == op.version )
         goto out;
 
-    spin_lock(&gt->lock);
+    write_lock(&gt->lock);
     /* Make sure that the grant table isn''t currently in use when we
        change the version number, except for the first 8 entries which
        are allowed to be in use (xenstore/xenconsole keeps them mapped).
@@ -2201,7 +2260,7 @@
gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
     {
         for ( i = GNTTAB_NR_RESERVED_ENTRIES; i < nr_grant_entries(gt); i++
)
         {
-            act = &active_entry(gt, i);
+            act = &active_entry_unlocked(gt, i);
             if ( act->pin != 0 )
             {
                 gdprintk(XENLOG_WARNING,
@@ -2278,7 +2337,7 @@
gnttab_set_version(XEN_GUEST_HANDLE_PARAM(gnttab_set_version_t) uop)
     gt->gt_version = op.version;
 
 out_unlock:
-    spin_unlock(&gt->lock);
+    write_unlock(&gt->lock);
 
 out:
     op.version = gt->gt_version;
@@ -2334,7 +2393,7 @@
gnttab_get_status_frames(XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_t) uop,
 
     op.status = GNTST_okay;
 
-    spin_lock(&gt->lock);
+    read_lock(&gt->lock);
 
     for ( i = 0; i < op.nr_frames; i++ )
     {
@@ -2343,7 +2402,7 @@
gnttab_get_status_frames(XEN_GUEST_HANDLE_PARAM(gnttab_get_status_frames_t) uop,
             op.status = GNTST_bad_virt_addr;
     }
 
-    spin_unlock(&gt->lock);
+    read_unlock(&gt->lock);
 out2:
     rcu_unlock_domain(d);
 out1:
@@ -2389,10 +2448,11 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t
ref_b)
 {
     struct domain *d = rcu_lock_current_domain();
     struct grant_table *gt = d->grant_table;
-    struct active_grant_entry *act;
+    struct active_grant_entry *act_a = NULL;
+    struct active_grant_entry *act_b = NULL;
     s16 rc = GNTST_okay;
 
-    spin_lock(&gt->lock);
+    read_lock(&gt->lock);
 
     /* Bounds check on the grant refs */
     if ( unlikely(ref_a >= nr_grant_entries(d->grant_table)))
@@ -2400,12 +2460,12 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t
ref_b)
     if ( unlikely(ref_b >= nr_grant_entries(d->grant_table)))
         PIN_FAIL(out, GNTST_bad_gntref, "Bad ref-b (%d).\n", ref_b);
 
-    act = &active_entry(gt, ref_a);
-    if ( act->pin )
+    act_a = active_entry_acquire(gt, ref_a);
+    if ( act_a->pin )
         PIN_FAIL(out, GNTST_eagain, "ref a %ld busy\n", (long)ref_a);
 
-    act = &active_entry(gt, ref_b);
-    if ( act->pin )
+    act_b = active_entry_acquire(gt, ref_b);
+    if ( act_b->pin )
         PIN_FAIL(out, GNTST_eagain, "ref b %ld busy\n", (long)ref_b);
 
     if ( gt->gt_version == 1 )
@@ -2432,8 +2492,11 @@ __gnttab_swap_grant_ref(grant_ref_t ref_a, grant_ref_t
ref_b)
     }
 
 out:
-    spin_unlock(&gt->lock);
-
+    if ( act_b != NULL )
+        active_entry_release(act_b);
+    if ( act_a != NULL )
+        active_entry_release(act_a);
+    read_unlock(&gt->lock);
     rcu_unlock_domain(d);
 
     return rc;
@@ -2452,7 +2515,18 @@
gnttab_swap_grant_ref(XEN_GUEST_HANDLE_PARAM(gnttab_swap_grant_ref_t) uop,
             return i;
         if ( unlikely(__copy_from_guest(&op, uop, 1)) )
             return -EFAULT;
-        op.status = __gnttab_swap_grant_ref(op.ref_a, op.ref_b);
+        if ( unlikely(op.ref_a == op.ref_b) )
+        {
+            /* noop, so avoid acquiring the same active entry
+             * twice in __gnttab_swap_grant_ref(), which would
+             * case a deadlock.
+             */
+            op.status = GNTST_okay;
+        }
+        else
+        {
+            op.status = __gnttab_swap_grant_ref(op.ref_a, op.ref_b);
+        }
         if ( unlikely(__copy_field_to_guest(uop, &op, status)) )
             return -EFAULT;
         guest_handle_add_offset(uop, 1);
@@ -2614,13 +2688,14 @@ grant_table_create(
     struct domain *d)
 {
     struct grant_table *t;
-    int                 i;
+    int                 i, j;
 
     if ( (t = xzalloc(struct grant_table)) == NULL )
         goto no_mem_0;
 
     /* Simple stuff. */
-    spin_lock_init(&t->lock);
+    rwlock_init(&t->lock);
+    spin_lock_init(&t->maptrack_lock);
     t->nr_grant_frames = INITIAL_NR_GRANT_FRAMES;
 
     /* Active grant table. */
@@ -2633,6 +2708,10 @@ grant_table_create(
         if ( (t->active[i] = alloc_xenheap_page()) == NULL )
             goto no_mem_2;
         clear_page(t->active[i]);
+        for ( j = 0; j < ACGNT_PER_PAGE; j++ )
+        {
+            spin_lock_init(&t->active[i][j].lock);
+        }
     }
 
     /* Tracking of mapped foreign frames table */
@@ -2704,6 +2783,8 @@ gnttab_release_mappings(
     uint16_t             *status;
     struct page_info     *pg;
 
+    /* we don''t take the table write lock because the domain
+       is dying */
     BUG_ON(!d->is_dying);
 
     for ( handle = 0; handle < gt->maptrack_limit; handle++ )
@@ -2727,9 +2808,9 @@ gnttab_release_mappings(
         }
 
         rgt = rd->grant_table;
-        spin_lock(&rgt->lock);
+        read_lock(&rgt->lock);
 
-        act = &active_entry(rgt, ref);
+        act = active_entry_acquire(rgt, ref);
         sha = shared_entry_header(rgt, ref);
         if (rgt->gt_version == 1)
             status = &sha->flags;
@@ -2787,7 +2868,8 @@ gnttab_release_mappings(
         if ( act->pin == 0 )
             gnttab_clear_flag(_GTF_reading, status);
 
-        spin_unlock(&rgt->lock);
+        active_entry_release(act);
+        read_unlock(&rgt->lock);
 
         rcu_unlock_domain(rd);
 
@@ -2805,7 +2887,12 @@ grant_table_destroy(
 
     if ( t == NULL )
         return;
-    
+
+    /* we don''t take the table write lock because the
+       domain is dying.
+    */
+    BUG_ON(!d->is_dying);
+
     for ( i = 0; i < nr_grant_frames(t); i++ )
         free_xenheap_page(t->shared_raw[i]);
     xfree(t->shared_raw);
@@ -2835,7 +2922,7 @@ static void gnttab_usage_print(struct domain *rd)
     printk("      -------- active --------       -------- shared
--------\n");
     printk("[ref] localdom mfn      pin          localdom gmfn    
flags\n");
 
-    spin_lock(&gt->lock);
+    write_lock(&gt->lock);
 
     if ( gt->gt_version == 0 )
         goto out;
@@ -2849,7 +2936,7 @@ static void gnttab_usage_print(struct domain *rd)
         uint16_t status;
         uint64_t frame;
 
-        act = &active_entry(gt, ref);
+        act = &active_entry_unlocked(gt, ref);
         if ( !act->pin )
             continue;
 
@@ -2884,7 +2971,7 @@ static void gnttab_usage_print(struct domain *rd)
     }
 
  out:
-    spin_unlock(&gt->lock);
+    write_unlock(&gt->lock);
 
     if ( first )
         printk("grant-table for remote domain:%5d ... "
diff --git a/xen/include/xen/grant_table.h b/xen/include/xen/grant_table.h
index 5941191..bb2828b 100644
--- a/xen/include/xen/grant_table.h
+++ b/xen/include/xen/grant_table.h
@@ -84,8 +84,11 @@ struct grant_table {
     struct grant_mapping **maptrack;
     unsigned int          maptrack_head;
     unsigned int          maptrack_limit;
-    /* Lock protecting updates to active and shared grant tables. */
-    spinlock_t            lock;
+    /* Lock protecting the maptrack page list, head, and limit */
+    spinlock_t            maptrack_lock;
+    /* Lock protecting updates to grant table state (version, active
+       entry list, etc.) */
+    rwlock_t              lock;
     /* The defined versions are 1 and 2.  Set to 0 if we don''t know
        what version to use yet. */
     unsigned              gt_version;
@@ -103,7 +106,7 @@ gnttab_release_mappings(
     struct domain *d);
 
 /* Increase the size of a domain''s grant table.
- * Caller must hold d''s grant table lock.
+ * Caller must hold d''s grant table write lock.
  */
 int
 gnttab_grow_table(struct domain *d, unsigned int req_nr_frames);
-- 
1.7.9.5

Keir Fraser

2013-Nov-12 05:37 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On 12/11/2013 02:03, "Matt Wilson" <msw@linux.com> wrote:
> + Locking
> + ~~~~~~~
> + Xen uses several locks to serialize access to the internal grant table
> state.
> +
> +  grant_table->lock          : rwlock used to prevent readers from
accessing
> +                               inconsistent grant table state such as
current
> +                               version, partially initialized active table
> pages,
> +                               etc.
> +  grant_table->maptrack_lock : spinlock used to protect the maptrack
state
> +  active_grant_entry->lock   : spinlock used to serialize modifications
to
> +                               active entries
> +
> + The primary lock for the grant table is a read/write spinlock. All
> + functions that access members of struct grant_table must acquire a
> + read lock around critical sections. Any modification to the members
> + of struct grant_table (e.g., nr_status_frames, nr_grant_frames,
> + active frames, etc.) must only be made if the write lock is
> + held. These elements are read-mostly, and read critical sections can
> + be large, which makes a rwlock a good choice.
Is there any concern about writer starvation here? I know our spinlocks
aren''t ''fair'' but our rwlocks are guaranteed to
starve out writers if there
is a steady continuous stream of readers. Perhaps we should write-bias our
rwlock, or at least make that an option. We could get fancier but would
probably hurt performance.

Looks like great work however!

 -- Keir

Matt Wilson

2013-Nov-12 07:18 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On Tue, Nov 12, 2013 at 05:37:08AM +0000, Keir Fraser
wrote:> On 12/11/2013 02:03, "Matt Wilson" <msw@linux.com> wrote:
> 
> > + Locking
> > + ~~~~~~~
> > + Xen uses several locks to serialize access to the internal grant
table
> > state.
> > +
> > +  grant_table->lock          : rwlock used to prevent readers from
accessing
> > +                               inconsistent grant table state such as
current
> > +                               version, partially initialized active
table
> > pages,
> > +                               etc.
> > +  grant_table->maptrack_lock : spinlock used to protect the
maptrack state
> > +  active_grant_entry->lock   : spinlock used to serialize
modifications to
> > +                               active entries
> > +
> > + The primary lock for the grant table is a read/write spinlock. All
> > + functions that access members of struct grant_table must acquire a
> > + read lock around critical sections. Any modification to the members
> > + of struct grant_table (e.g., nr_status_frames, nr_grant_frames,
> > + active frames, etc.) must only be made if the write lock is
> > + held. These elements are read-mostly, and read critical sections can
> > + be large, which makes a rwlock a good choice.
> 
> Is there any concern about writer starvation here? I know our spinlocks
> aren''t ''fair'' but our rwlocks are guaranteed to
starve out writers if there
> is a steady continuous stream of readers. Perhaps we should write-bias our
> rwlock, or at least make that an option. We could get fancier but would
> probably hurt performance.
Yes, I''m a little concerned about writer starvation. But so far even
in the presence of very frequent readers it seems like the infrequent
writers are able to get the lock when they need to. However, I''ve not
tested the iommu=strict path yet. I''m thinking that in that case
there''s just going to be frequent writers, so there''s less
risk of
readers starving writers. For what it''s worth, when mapcount() gets in
the picture with persistent grants, I''d expect to see some pretty
significant performance degradation for map/unmap operations. This was
also observed in [1] under different circumstances.

But right now I''m more curious about cache line bouncing between all
the readers. I''ve not done any study of inter-arrival times for
typical workloads (much less some more extreme workloads like we''ve
been testing), but lock profiling of grant table operations when a
spinlock was used showed some pretty long hold times, which should
translate fairly well to decent rwlock performance. I''m by no means an
expert in this area so I''m eager to hear the thoughts of others.

I should also mention that some of the improvement I mentioned from
3,000 MB/s to 3,600 MB/s was due to avoiding the m2p override spinlock
in dom0.
> Looks like great work however!
Thanks! I''m definitely interested to see some others''
experience with
these changes. I think that the double_gt_lock() caused a fair bit of
contention on the local domain''s grant table lock, which explains some
of the poor performance with large numbers of vCPUs as seen in [2].

--msw

[1] http://scholarship.rice.edu/bitstream/handle/1911/71683/RAM-THESIS.pdf
[2]
http://xenbits.xen.org/people/royger/persistent_grants/persistent_read24vcpus.png

Keir Fraser

2013-Nov-12 08:07 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On 12/11/2013 07:18, "Matt Wilson" <msw@linux.com> wrote:
>> Is there any concern about writer starvation here? I know our spinlocks
>> aren''t ''fair'' but our rwlocks are guaranteed
to starve out writers if there
>> is a steady continuous stream of readers. Perhaps we should write-bias
our
>> rwlock, or at least make that an option. We could get fancier but would
>> probably hurt performance.
> 
> Yes, I''m a little concerned about writer starvation. But so far
even
> in the presence of very frequent readers it seems like the infrequent
> writers are able to get the lock when they need to. However, I''ve
not
> tested the iommu=strict path yet. I''m thinking that in that case
> there''s just going to be frequent writers, so there''s
less risk of
> readers starving writers. For what it''s worth, when mapcount()
gets in
> the picture with persistent grants, I''d expect to see some pretty
> significant performance degradation for map/unmap operations. This was
> also observed in [1] under different circumstances.
The average case isn''t the only concern here, but also the worst case,
which
could maybe tie up a CPU for unbounded time. Could a malicious guest set up
such a workload? I''m just thinking we don''t want to end up
with a DoS XSA on
this down the line. :)
> But right now I''m more curious about cache line bouncing between
all
> the readers. I''ve not done any study of inter-arrival times for
> typical workloads (much less some more extreme workloads like
we''ve
> been testing), but lock profiling of grant table operations when a
> spinlock was used showed some pretty long hold times, which should
> translate fairly well to decent rwlock performance. I''m by no
means an
> expert in this area so I''m eager to hear the thoughts of others.
In the read-heavy case the only improvement would be with the old
Linux-style biglock (spinlock per CPU; writers must take all spinlocks), or
working out a lock-free scheme for readers (perhaps making use of RCU).

 -- Keir
> I should also mention that some of the improvement I mentioned from
> 3,000 MB/s to 3,600 MB/s was due to avoiding the m2p override spinlock
> in dom0.

Jan Beulich

2013-Nov-12 09:18 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

>>> On 12.11.13 at 09:07, Keir Fraser <keir.xen@gmail.com> wrote:
> On 12/11/2013 07:18, "Matt Wilson" <msw@linux.com> wrote:
> 
>>> Is there any concern about writer starvation here? I know our
spinlocks
>>> aren''t ''fair'' but our rwlocks are
guaranteed to starve out writers if there
>>> is a steady continuous stream of readers. Perhaps we should
write-bias our
>>> rwlock, or at least make that an option. We could get fancier but
would
>>> probably hurt performance.
>> 
>> Yes, I''m a little concerned about writer starvation. But so
far even
>> in the presence of very frequent readers it seems like the infrequent
>> writers are able to get the lock when they need to. However,
I''ve not
>> tested the iommu=strict path yet. I''m thinking that in that
case
>> there''s just going to be frequent writers, so there''s
less risk of
>> readers starving writers. For what it''s worth, when mapcount()
gets in
>> the picture with persistent grants, I''d expect to see some
pretty
>> significant performance degradation for map/unmap operations. This was
>> also observed in [1] under different circumstances.
> 
> The average case isn''t the only concern here, but also the worst
case, which
> could maybe tie up a CPU for unbounded time. Could a malicious guest set up
> such a workload? I''m just thinking we don''t want to end
up with a DoS XSA on
> this down the line. :)
And indeed I think we should be making our rwlocks fair for writers
before pushing in the change here; I''ve been meaning to get to this
for a while, but other stuff continues to require attention. I''m also
of the opinion that we should switch to ticket spinlocks.

But of course, fairness for writers means that performance may
drop again on the read paths, unless the write lock use is strictly
limited to code paths not (normally) involved in I/O.

Jan

Jan Beulich

2013-Nov-12 09:26 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

>>> On 12.11.13 at 03:03, Matt Wilson <msw@linux.com> wrote:
> From: Matt Wilson <msw@amazon.com>
> 
> This patch refactors grant table locking. It splits the previous
> single spinlock per grant table into multiple locks. The heavily
> modified components of the grant table (the maptrack state and the
> active entries) are now protected by their own spinlocks. The
> remaining elements of the grant table are read-mostly, so the main
> grant table lock is modified to be a rwlock to improve concurrency.
> 
> Workloads with high grant table operation rates, especially map/unmap
> operations as used by blkback/blkfront when persistent grants are not
> supported, show marked improvement with these changes. A domU with 24
> VBDs in a streaming 2M write workload achieved 1,400 MB/sec before
> this change. Performance more than doubles with this patch, reaching
> 3,000 MB/sec before tuning and 3,600 MB/sec after adjusting event
> channel vCPU bindings.
> 
> Signed-off-by: Matt Wilson <msw@amazon.com>
I will have to take a thorough second look at this, but considering
its size - would it be possible to split out the conversion of the main
lock to an rw one into a third patch? Correctness of the second one
wouldn''t suffer from this.

And I''d also suggest converting all the maptrack_ fields into a
separate sub-structure named maptrack. That''d not only guarantee
things to be kept together, but may also allow passing the address
of the sub-structure as argument to one or another helper function
instead of the full structure''s, thus making it more clear what such a
function actually acts upon (and in particular what it does _not_
touch).

Jan

Keir Fraser

2013-Nov-12 13:42 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On 12/11/2013 09:18, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 12.11.13 at 09:07, Keir Fraser <keir.xen@gmail.com>
wrote:
>> On 12/11/2013 07:18, "Matt Wilson" <msw@linux.com>
wrote:
>> 
>>> Yes, I''m a little concerned about writer starvation. But
so far even
>>> in the presence of very frequent readers it seems like the
infrequent
>>> writers are able to get the lock when they need to. However,
I''ve not
>>> tested the iommu=strict path yet. I''m thinking that in
that case
>>> there''s just going to be frequent writers, so
there''s less risk of
>>> readers starving writers. For what it''s worth, when
mapcount() gets in
>>> the picture with persistent grants, I''d expect to see some
pretty
>>> significant performance degradation for map/unmap operations. This
was
>>> also observed in [1] under different circumstances.
>> 
>> The average case isn''t the only concern here, but also the
worst case, which
>> could maybe tie up a CPU for unbounded time. Could a malicious guest
set up
>> such a workload? I''m just thinking we don''t want to
end up with a DoS XSA on
>> this down the line. :)
> 
> And indeed I think we should be making our rwlocks fair for writers
> before pushing in the change here; I''ve been meaning to get to
this
> for a while, but other stuff continues to require attention. I''m
also
> of the opinion that we should switch to ticket spinlocks.
Would queuing spinlocks (e.g. MCS locks) be even more preferable? Two atomic
ops (cmpxchg) per critical region in the uncontended case. Each CPU spins on
its own location so there''s no cacheline carnage in the highly
contended
case (a problem with simple ticket spinlocks). And it builds on cmpxchg so
the spinlock implementation has no arch-specific component (apart from
cmpxchg, which we already have).

I have a queue-based rwlock design too, does require a spinlock lock/unlock
per rwlock op though (i.e., 4 atomic ops per critical region in the
uncontended case).
> But of course, fairness for writers means that performance may
> drop again on the read paths, unless the write lock use is strictly
> limited to code paths not (normally) involved in I/O.
Yes indeed. Hence depending on other users of rwlock, just letting a writer
set its flag and wait for in-flight readers to drain might be sufficient for
now.

 -- Keir
> Jan
>

Keir Fraser

2013-Nov-12 13:58 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On 12/11/2013 13:42, "Keir Fraser" <keir.xen@gmail.com> wrote:
>> And indeed I think we should be making our rwlocks fair for writers
>> before pushing in the change here; I''ve been meaning to get to
this
>> for a while, but other stuff continues to require attention.
I''m also
>> of the opinion that we should switch to ticket spinlocks.
> 
> Would queuing spinlocks (e.g. MCS locks) be even more preferable? Two
atomic
> ops (cmpxchg) per critical region in the uncontended case. Each CPU spins
on
> its own location so there''s no cacheline carnage in the highly
contended
> case (a problem with simple ticket spinlocks). And it builds on cmpxchg so
> the spinlock implementation has no arch-specific component (apart from
> cmpxchg, which we already have).
> 
> I have a queue-based rwlock design too, does require a spinlock lock/unlock
> per rwlock op though (i.e., 4 atomic ops per critical region in the
> uncontended case).
Actually MCS has a multi-reader extension we could use, or there is another
alternative by Krieger et al. My own design was intended to build on pthread
primitives and wouldn''t be as good as the existing solutions in the
literature for purely spinning waiters.

 -- Keir

Jan Beulich

2013-Nov-12 14:11 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

>>> On 12.11.13 at 14:58, Keir Fraser <keir.xen@gmail.com> wrote:
> On 12/11/2013 13:42, "Keir Fraser" <keir.xen@gmail.com>
wrote:
> 
>>> And indeed I think we should be making our rwlocks fair for writers
>>> before pushing in the change here; I''ve been meaning to
get to this
>>> for a while, but other stuff continues to require attention.
I''m also
>>> of the opinion that we should switch to ticket spinlocks.
>> 
>> Would queuing spinlocks (e.g. MCS locks) be even more preferable? Two
atomic
>> ops (cmpxchg) per critical region in the uncontended case. Each CPU
spins on
>> its own location so there''s no cacheline carnage in the highly
contended
>> case (a problem with simple ticket spinlocks). And it builds on cmpxchg
so
>> the spinlock implementation has no arch-specific component (apart from
>> cmpxchg, which we already have).
>> 
>> I have a queue-based rwlock design too, does require a spinlock
lock/unlock
>> per rwlock op though (i.e., 4 atomic ops per critical region in the
>> uncontended case).
> 
> Actually MCS has a multi-reader extension we could use, or there is another
> alternative by Krieger et al. My own design was intended to build on
pthread
> primitives and wouldn''t be as good as the existing solutions in
the
> literature for purely spinning waiters.
Sounds nice - are you going to spend time on implementing this then?

Jan

Keir Fraser

2013-Nov-12 14:24 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On 12/11/2013 14:11, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>> On 12.11.13 at 14:58, Keir Fraser <keir.xen@gmail.com>
wrote:
>> On 12/11/2013 13:42, "Keir Fraser" <keir.xen@gmail.com>
wrote:
>> 
>>>> And indeed I think we should be making our rwlocks fair for
writers
>>>> before pushing in the change here; I''ve been meaning
to get to this
>>>> for a while, but other stuff continues to require attention.
I''m also
>>>> of the opinion that we should switch to ticket spinlocks.
>>> 
>>> Would queuing spinlocks (e.g. MCS locks) be even more preferable?
Two atomic
>>> ops (cmpxchg) per critical region in the uncontended case. Each CPU
spins on
>>> its own location so there''s no cacheline carnage in the
highly contended
>>> case (a problem with simple ticket spinlocks). And it builds on
cmpxchg so
>>> the spinlock implementation has no arch-specific component (apart
from
>>> cmpxchg, which we already have).
>>> 
>>> I have a queue-based rwlock design too, does require a spinlock
lock/unlock
>>> per rwlock op though (i.e., 4 atomic ops per critical region in the
>>> uncontended case).
>> 
>> Actually MCS has a multi-reader extension we could use, or there is
another
>> alternative by Krieger et al. My own design was intended to build on
pthread
>> primitives and wouldn''t be as good as the existing solutions
in the
>> literature for purely spinning waiters.
> 
> Sounds nice - are you going to spend time on implementing this then?
Yes I''ll look into it. Amazon''s benchmarking of grant-table
throughput will
be a good testbed for performance of a different lock implementation.

 -- Keir
> Jan
>

Konrad Rzeszutek Wilk

2013-Nov-12 14:52 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

> I should also mention that some of the improvement I mentioned from
> 3,000 MB/s to 3,600 MB/s was due to avoiding the m2p override spinlock
> in dom0.
Could you post said patch please? I presume its pretty simple but
I would like the credit to go to you guys.

David Vrabel

2013-Nov-12 15:04 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On 12/11/13 14:52, Konrad Rzeszutek Wilk wrote:>> I should also mention that some of the improvement I mentioned from
>> 3,000 MB/s to 3,600 MB/s was due to avoiding the m2p override spinlock
>> in dom0.
> 
> Could you post said patch please? I presume its pretty simple but
> I would like the credit to go to you guys.
Roger already posted it.

https://lkml.org/lkml/2013/11/5/121

I think we''re just waiting for a repost with a more detailed commit
message.

David

Felipe Franciosi

2013-Nov-12 15:09 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

We are very excited to see progress in this area following up from the
discussions we had at the BoF.
I will be happy to benchmark these patches in our environment as well.

Thanks,
Felipe

-----Original Message-----
From: Keir Fraser [mailto:keir.xen@gmail.com] 
Sent: 12 November 2013 14:25
To: Jan Beulich
Cc: Anthony Liguori; Matt Wilson; Andrew Cooper; David Vrabel; Felipe Franciosi;
Roger Pau Monne; Matt Wilson; xen-devel@lists.xenproject.org; Konrad Rzeszutek
Wilk
Subject: Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

<...>
> Sounds nice - are you going to spend time on implementing this then?
Yes I''ll look into it. Amazon''s benchmarking of grant-table
throughput will be a good testbed for performance of a different lock
implementation.

 -- Keir
> Jan
>

Anthony Liguori

2013-Nov-12 16:53 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On Tue, Nov 12, 2013 at 7:04 AM, David Vrabel <david.vrabel@citrix.com>
wrote:> On 12/11/13 14:52, Konrad Rzeszutek Wilk wrote:
>>> I should also mention that some of the improvement I mentioned from
>>> 3,000 MB/s to 3,600 MB/s was due to avoiding the m2p override
spinlock
>>> in dom0.
>>
>> Could you post said patch please? I presume its pretty simple but
>> I would like the credit to go to you guys.
>
> Roger already posted it.
>
> https://lkml.org/lkml/2013/11/5/121
>
> I think we''re just waiting for a repost with a more detailed
commit message.
Roger''s patch isn''t correct.  It doesn''t actually
restore the previous
mfn because it''s storing the new mfn in page private not the old one.

I have a patch that I''ll post this afternoon.

Regards,

Anthony Liguori
> David
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Matt Wilson

2013-Nov-12 17:58 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On Tue, Nov 12, 2013 at 09:26:28AM +0000, Jan Beulich
wrote:> >>> On 12.11.13 at 03:03, Matt Wilson <msw@linux.com> wrote:
> > From: Matt Wilson <msw@amazon.com>
> > 
> > This patch refactors grant table locking. It splits the previous
> > single spinlock per grant table into multiple locks. The heavily
> > modified components of the grant table (the maptrack state and the
> > active entries) are now protected by their own spinlocks. The
> > remaining elements of the grant table are read-mostly, so the main
> > grant table lock is modified to be a rwlock to improve concurrency.
> > 
> > Workloads with high grant table operation rates, especially map/unmap
> > operations as used by blkback/blkfront when persistent grants are not
> > supported, show marked improvement with these changes. A domU with 24
> > VBDs in a streaming 2M write workload achieved 1,400 MB/sec before
> > this change. Performance more than doubles with this patch, reaching
> > 3,000 MB/sec before tuning and 3,600 MB/sec after adjusting event
> > channel vCPU bindings.
> > 
> > Signed-off-by: Matt Wilson <msw@amazon.com>
> 
> I will have to take a thorough second look at this, but considering
> its size - would it be possible to split out the conversion of the main
> lock to an rw one into a third patch? Correctness of the second one
> wouldn''t suffer from this.
Yes, I figured that there would be a request to break this up. I''ll
have to think hard for each change to make sure that nothing breaks
(e.g., deadlock scenarios) for each step along the way. But this is a
useful exercise.

It should be possible to first add the separate locking for maptrack,
followed by fine grained locking for active entries, then finally
moving the main grant table lock to a rwlock (or some other locking
primitive should one become available).
> And I''d also suggest converting all the maptrack_ fields into a
> separate sub-structure named maptrack. That''d not only guarantee
> things to be kept together, but may also allow passing the address
> of the sub-structure as argument to one or another helper function
> instead of the full structure''s, thus making it more clear what
such a
> function actually acts upon (and in particular what it does _not_
> touch).
I agree. Do you have any thoughts on padding the spinlock out to a
cache line to avoid false sharing of the other elements? The maptrack
critical sections are extremely short, so there might not be any real
need for this. I''ve not done any measurement of maptrack spinlock
contention with this change yet. I vaguely remember Anthony doing some
measurement on an early PoC version that didn''t show contention.
It''s
probably a premature optimization...

--msw

Matt Wilson

2013-Nov-12 19:06 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

On Tue, Nov 12, 2013 at 08:07:03AM +0000, Keir Fraser
wrote:> On 12/11/2013 07:18, "Matt Wilson" <msw@linux.com> wrote:
> 
> >> Is there any concern about writer starvation here? I know our
spinlocks
> >> aren''t ''fair'' but our rwlocks are
guaranteed to starve out writers if there
> >> is a steady continuous stream of readers. Perhaps we should
write-bias our
> >> rwlock, or at least make that an option. We could get fancier but
would
> >> probably hurt performance.
> > 
> > Yes, I''m a little concerned about writer starvation. But so
far even
> > in the presence of very frequent readers it seems like the infrequent
> > writers are able to get the lock when they need to. However,
I''ve not
> > tested the iommu=strict path yet. I''m thinking that in that
case
> > there''s just going to be frequent writers, so
there''s less risk of
> > readers starving writers. For what it''s worth, when
mapcount() gets in
> > the picture with persistent grants, I''d expect to see some
pretty
> > significant performance degradation for map/unmap operations. This was
> > also observed in [1] under different circumstances.
> 
> The average case isn''t the only concern here, but also the worst
case, which
> could maybe tie up a CPU for unbounded time. Could a malicious guest set up
> such a workload? I''m just thinking we don''t want to end
up with a DoS XSA on
> this down the line. :)
Indeed.
> > But right now I''m more curious about cache line bouncing
between all
> > the readers. I''ve not done any study of inter-arrival times
for
> > typical workloads (much less some more extreme workloads like
we''ve
> > been testing), but lock profiling of grant table operations when a
> > spinlock was used showed some pretty long hold times, which should
> > translate fairly well to decent rwlock performance. I''m by no
means an
> > expert in this area so I''m eager to hear the thoughts of
others.
> 
> In the read-heavy case the only improvement would be with the old
> Linux-style biglock (spinlock per CPU; writers must take all spinlocks), or
> working out a lock-free scheme for readers (perhaps making use of RCU).
> 
>  -- Keir
> 
> > I should also mention that some of the improvement I mentioned from
> > 3,000 MB/s to 3,600 MB/s was due to avoiding the m2p override spinlock
> > in dom0.
> 
>

Jan Beulich

2013-Nov-13 07:39 UTC

head link

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

>>> On 12.11.13 at 18:58, Matt Wilson <msw@linux.com> wrote:
> Do you have any thoughts on padding the spinlock out to a
> cache line to avoid false sharing of the other elements?
Wouldn''t that be counter productive? Once you need to acquire the
lock, you need ownership on the line anyway, and hence it seems
desirable to have the other elements in the same cache line. Unless
of course there are significant parts here that are read-mostly (but
I concluded from your use of a spin lock rather than an rw one that
there generally are modifications done when holding the lock).

What we may want to push out of it are elements outside the
maptrack sub-structure.

Jan

Xen devel - Nov 2013 - [RFC PATCH 0/2] gnttab: refactor locking for better scalability

[RFC PATCH 0/2] gnttab: refactor locking for better scalability

[RFC PATCH 1/2] gnttab: lock the local grant table earlier in __gnttab_unmap_common()

[RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability

Re: [RFC PATCH 2/2] gnttab: refactor locking for better scalability