Richard W.M. Jones
2023-Apr-27 14:19 UTC
[Libguestfs] [PATCH nbdkit 2/2] allocators: sparse: Split the highly contended mutex
We've long known that nbdkit-memory-plugin with the default sparse
array allocator suffers because a global lock must be taken whenever
any read or write operation is performed.  This commit aims to safely
improve performance by converting the lock into a read-write lock.

The shared (read) lock still must be acquired for any access to the
sparse array.  However, if you wish to make any changes to the L1/L2
directory, which includes allocating or deallocating pages, you must
upgrade to the exclusive (write) lock.

Unfortunately POSIX read-write locks are not upgradable, and previous
attempts to work around this led to many potential safety issues
and/or deadlocks.  For example, if you unlock the shared lock at the
point where you want to update the metadata and then attempt to
acquire the exclusive lock, other threads can jump in and modify the
metadata underneath you.  It is very difficult to work around this
limitation safely.

This patch takes a different approach: we attempt write operations
while holding the shared lock, but if we detect that we need to
update the metadata during the operation then we retry the entire
operation from the start while holding the exclusive lock.  This
should be safe, albeit slower.

Either the test program (contrib/sparseloadtest.c) or fio can be used
to test performance.  The test program is better because it exercises
page deallocation, which is not exercised by fio testing.

For unpatched nbdkit 1.34:

- 4 threads, mix of 80% read, 10% write, 10% trim:

    READ:   77669.5 ops/s  5087519644.9 bytes/s
    WRITE:   9707.1 ops/s   635528046.7 bytes/s
    TRIM:    9712.0 ops/s   636737940.6 bytes/s
    TOTAL:  97088.6 ops/s  6359785632.1 bytes/s

- 4 threads, mix of 20% read, 40% write, 40% trim:

    READ:   20135.1 ops/s  1318832390.2 bytes/s
    WRITE:  40293.6 ops/s  2640703848.7 bytes/s
    TRIM:   40240.7 ops/s  2636021905.5 bytes/s
    TOTAL: 100669.4 ops/s  6595558144.3 bytes/s

For patched nbdkit:

- 4 threads, mix of 80% read, 10% write, 10% trim:

    READ:   75265.4 ops/s  4931584759.3 bytes/s
    WRITE:   9392.0 ops/s   614956850.5 bytes/s
    TRIM:    9426.0 ops/s   618151433.1 bytes/s
    TOTAL:  94083.4 ops/s  6164693042.9 bytes/s

  Note that performance is about 3% lower, possibly because
  read-write locks are less efficient than plain mutexes.

- 4 threads, mix of 20% read, 40% write, 40% trim:

    READ:   23008.4 ops/s  1508451376.5 bytes/s
    WRITE:  46011.8 ops/s  3014531474.9 bytes/s
    TRIM:   46015.2 ops/s  3014830518.1 bytes/s
    TOTAL: 115035.3 ops/s  7537813369.6 bytes/s

  Performance is about 15% better.

With fio, unpatched nbdkit 1.34:

   READ: bw=1188MiB/s (1245MB/s), 1188MiB/s-1188MiB/s (1245MB/s-1245MB/s), io=69.6GiB (74.7GB), run=60001-60001msec
  WRITE: bw=1187MiB/s (1245MB/s), 1187MiB/s-1187MiB/s (1245MB/s-1245MB/s), io=69.6GiB (74.7GB), run=60001-60001msec

With fio, patched nbdkit:

   READ: bw=1309MiB/s (1372MB/s), 1309MiB/s-1309MiB/s (1372MB/s-1372MB/s), io=76.7GiB (82.3GB), run=60001-60001msec
  WRITE: bw=1308MiB/s (1372MB/s), 1308MiB/s-1308MiB/s (1372MB/s-1372MB/s), io=76.7GiB (82.3GB), run=60001-60001msec

fio test command:

  nbdkit -U - memory size=256M --run 'export uri; fio examples/nbd.fio'

with examples/nbd.fio from the fio git repository.
---
 common/allocators/sparse.c | 126 ++++++++++++++++++++++++++++---------
 1 file changed, 98 insertions(+), 28 deletions(-)

diff --git a/common/allocators/sparse.c b/common/allocators/sparse.c
index 0f1e9aa81..ce89715dc 100644
--- a/common/allocators/sparse.c
+++ b/common/allocators/sparse.c
@@ -129,11 +129,20 @@ DEFINE_VECTOR_TYPE (l1_dir, struct l1_entry);
 
 struct sparse_array {
   struct allocator a;           /* Must come first. */
 
-  /* This lock is highly contended.  When hammering the memory plugin
-   * with 8 fio threads, about 30% of the total system time was taken
-   * just waiting for this lock.  Fixing this is quite hard.
+  /* The shared (read) lock must be held if you just want to access
+   * the data without modifying any of the L1/L2 metadata or
+   * allocating or freeing any page.
+   *
+   * To modify the L1/L2 metadata including allocating or freeing any
+   * page you must hold the exclusive (write) lock.
+   *
+   * Because POSIX rwlocks are not upgradable this presents a problem.
+   * We solve it below by speculatively performing the request while
+   * holding the shared lock, but if we run into an operation that
+   * needs to update the metadata, we restart the entire request
+   * holding the exclusive lock.
    */
-  pthread_mutex_t lock;
+  pthread_rwlock_t lock;
   l1_dir l1_dir;                /* L1 directory. */
 };
 
@@ -158,7 +167,7 @@ sparse_array_free (struct allocator *a)
     for (i = 0; i < sa->l1_dir.len; ++i)
       free_l2_dir (sa->l1_dir.ptr[i].l2_dir);
     free (sa->l1_dir.ptr);
-    pthread_mutex_destroy (&sa->lock);
+    pthread_rwlock_destroy (&sa->lock);
     free (sa);
   }
 }
@@ -305,7 +314,10 @@ sparse_array_read (struct allocator *a,
                    void *buf, uint64_t count, uint64_t offset)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Because reads never modify any metadata, it is always safe to
+   * only hold the shared (read) lock.
+   */
+  ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   void *p;
 
@@ -327,30 +339,60 @@ sparse_array_read (struct allocator *a,
   return 0;
 }
 
+#define RESTART_EXCLUSIVE -2
+
+static int
+do_write (bool exclusive, struct sparse_array *sa,
+          const void *buf, uint64_t count, uint64_t offset)
+{
+  uint64_t n;
+  void *p;
+
+  while (count > 0) {
+    if (!exclusive) {
+      /* If we only hold the shared lock, try it without allocating. */
+      p = lookup (sa, offset, false, &n, NULL);
+      if (p == NULL)
+        return RESTART_EXCLUSIVE;
+    }
+    else {
+      p = lookup (sa, offset, true, &n, NULL);
+      if (p == NULL)
+        return -1;
+    }
+
+    if (n > count)
+      n = count;
+    memcpy (p, buf, n);
+
+    buf += n;
+    count -= n;
+    offset += n;
+  }
+
+  return 0;
+}
+
 static int
 sparse_array_write (struct allocator *a,
                     const void *buf, uint64_t count, uint64_t offset)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
-  uint64_t n;
-  void *p;
+  int r;
 
-  while (count > 0) {
-    p = lookup (sa, offset, true, &n, NULL);
-    if (p == NULL)
-      return -1;
-
-    if (n > count)
-      n = count;
-    memcpy (p, buf, n);
+  /* First try the write with the shared (read) lock held. */
+  {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_write (false, sa, buf, count, offset);
+  }
 
-    buf += n;
-    count -= n;
-    offset += n;
+  /* If that failed because we need the exclusive lock, restart. */
+  if (r == RESTART_EXCLUSIVE) {
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_write (true, sa, buf, count, offset);
   }
 
-  return 0;
+  return r;
 }
 
 static int sparse_array_zero (struct allocator *a,
@@ -367,7 +409,8 @@ sparse_array_fill (struct allocator *a, char c,
   if (c == 0)
     return sparse_array_zero (a, count, offset);
 
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Since fill is never called on a hot path, use the exclusive lock. */
+  ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
 
   while (count > 0) {
     p = lookup (sa, offset, true, &n, NULL);
@@ -386,10 +429,9 @@ sparse_array_fill (struct allocator *a, char c,
 }
 
 static int
-sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
+do_zero (bool exclusive, struct sparse_array *sa,
+         uint64_t count, uint64_t offset)
 {
-  struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   void *p;
   struct l2_entry *l2_entry;
@@ -407,6 +449,9 @@ sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
 
       /* If the whole page is now zero, free it. */
       if (n >= PAGE_SIZE || is_zero (l2_entry->page, PAGE_SIZE)) {
+        if (!exclusive)
+          return RESTART_EXCLUSIVE;
+
         if (sa->a.debug)
           nbdkit_debug ("%s: freeing zero page at offset %" PRIu64,
                         __func__, offset);
@@ -422,6 +467,27 @@ sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
   return 0;
 }
 
+static int
+sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
+{
+  struct sparse_array *sa = (struct sparse_array *) a;
+  int r;
+
+  /* First try the zero with the shared (read) lock held. */
+  {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_zero (false, sa, count, offset);
+  }
+
+  /* If that failed because we need the exclusive lock, restart. */
+  if (r == RESTART_EXCLUSIVE) {
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_zero (true, sa, count, offset);
+  }
+
+  return r;
+}
+
 static int
 sparse_array_blit (struct allocator *a1,
                    struct allocator *a2,
@@ -429,7 +495,8 @@ sparse_array_blit (struct allocator *a1,
                    uint64_t offset1, uint64_t offset2)
 {
   struct sparse_array *sa2 = (struct sparse_array *) a2;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa2->lock);
+  /* Since blit is never called on a hot path, use the exclusive lock. */
+  ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa2->lock);
   uint64_t n;
   void *p;
   struct l2_entry *l2_entry;
@@ -474,7 +541,10 @@ sparse_array_extents (struct allocator *a,
                       struct nbdkit_extents *extents)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Reading extents never modifies any metadata, so it is always safe
+   * to only hold the shared (read) lock.
+   */
+  ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   uint32_t type;
   void *p;
@@ -523,7 +593,7 @@ sparse_array_create (const void *paramsv)
     nbdkit_error ("calloc: %m");
     return NULL;
   }
-  pthread_mutex_init (&sa->lock, NULL);
+  pthread_rwlock_init (&sa->lock, NULL);
 
   return (struct allocator *) sa;
 }
-- 
2.39.2
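
The ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE and ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE
macros used in the diff are not defined by this patch; they presumably come
from nbdkit's scope-based cleanup helpers added in patch 1/2 of this series
(not shown here).  The following is only a minimal sketch of how such macros
can be built on the GCC/Clang cleanup attribute; the names and details are
illustrative, not the actual nbdkit definitions.

  #include <pthread.h>

  /* Sketch only: unlock helper invoked automatically when the guard
   * variable goes out of scope.
   */
  static inline void
  sketch_rwlock_unlock (pthread_rwlock_t **ptr)
  {
    if (*ptr)
      pthread_rwlock_unlock (*ptr);
  }

  #define SKETCH_CLEANUP_RWLOCK_UNLOCK \
    __attribute__ ((cleanup (sketch_rwlock_unlock)))

  /* Take the shared (read) lock and release it at end of scope. */
  #define ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE(rwlock)                      \
    SKETCH_CLEANUP_RWLOCK_UNLOCK pthread_rwlock_t *_rwlock = (rwlock);  \
    pthread_rwlock_rdlock (_rwlock)

  /* Take the exclusive (write) lock and release it at end of scope. */
  #define ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE(rwlock)                      \
    SKETCH_CLEANUP_RWLOCK_UNLOCK pthread_rwlock_t *_rwlock = (rwlock);  \
    pthread_rwlock_wrlock (_rwlock)

With macros along these lines, the speculative do_write / do_zero helpers
can return RESTART_EXCLUSIVE from anywhere: the shared lock is dropped when
the inner block in sparse_array_write / sparse_array_zero exits, and the
operation is then re-run under the exclusive lock, with no explicit unlock
calls needed.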
Eric Blake
2023-Apr-27 15:35 UTC
[Libguestfs] [PATCH nbdkit 2/2] allocators: sparse: Split the highly contended mutex
On Thu, Apr 27, 2023 at 03:19:12PM +0100, Richard W.M. Jones wrote:

> We've long known that nbdkit-memory-plugin with the default sparse
> array allocator suffers because a global lock must be taken whenever
> any read or write operation is performed.  This commit aims to safely
> improve performance by converting the lock into a read-write lock.

I searched for other approaches, and found this to be an interesting
read:

https://stackoverflow.com/questions/2407558/pthreads-reader-writer-locks-upgrading-read-lock-to-write-lock

One of the suggestions there is using fcntl() locking (which does
support upgrade attempts) instead of pthread_rwlock*.  Might be
interesting to code up and compare an approach along those lines.

> +++ b/common/allocators/sparse.c
> @@ -129,11 +129,20 @@ DEFINE_VECTOR_TYPE (l1_dir, struct l1_entry);
>  struct sparse_array {
>    struct allocator a;           /* Must come first. */
> 
> -  /* This lock is highly contended.  When hammering the memory plugin
> -   * with 8 fio threads, about 30% of the total system time was taken
> -   * just waiting for this lock.  Fixing this is quite hard.
> +  /* The shared (read) lock must be held if you just want to access
> +   * the data without modifying any of the L1/L2 metadata or
> +   * allocating or freeing any page.
> +   *
> +   * To modify the L1/L2 metadata including allocating or freeing any
> +   * page you must hold the exclusive (write) lock.
> +   *
> +   * Because POSIX rwlocks are not upgradable this presents a problem.
> +   * We solve it below by speculatively performing the request while
> +   * holding the shared lock, but if we run into an operation that
> +   * needs to update the metadata, we restart the entire request
> +   * holding the exclusive lock.
>     */
> -  pthread_mutex_t lock;
> +  pthread_rwlock_t lock;

But as written, this does look like a nice division of labor that
reduces serialization in the common case.  And yes, it does look like
posix_rwlock is a bit heavier-weight than a bare mutex, which explains
the slowdown on the read-heavy workload even while improving the
write-heavy workload.

Reviewed-by: Eric Blake <eblake at redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
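
One caveat for anyone experimenting with the fcntl() suggestion: classic
POSIX record locks (F_SETLK) are owned by the process, so they cannot
arbitrate between threads of a single process.  Linux open file description
(OFD) locks (F_OFD_SETLK) can, provided each thread opens its own file
descriptor on a shared lock file.  The following is a rough, untested sketch
of what an upgrade attempt might look like; the lock-file scheme and all
names here are hypothetical and are not anything from nbdkit.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>
  #include <stdbool.h>

  /* Each thread opens the shared lock file itself so that it gets its
   * own open file description (OFD locks conflict between different
   * OFDs, even inside one process).
   */
  static int
  open_lock_fd (const char *lockfile)
  {
    return open (lockfile, O_RDWR | O_CREAT, 0600);
  }

  /* Take the shared lock over byte 0 (blocking). */
  static bool
  lock_shared (int fd)
  {
    struct flock fl = { .l_type = F_RDLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    return fcntl (fd, F_OFD_SETLKW, &fl) == 0;
  }

  /* Attempt to upgrade to the exclusive lock without blocking.  On
   * failure (EAGAIN) the existing shared lock is retained, and the
   * caller would have to drop it and redo the operation, much like
   * the RESTART_EXCLUSIVE path in this patch.
   */
  static bool
  try_upgrade_exclusive (int fd)
  {
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    return fcntl (fd, F_OFD_SETLK, &fl) == 0;
  }

  /* Release whatever lock this fd holds over byte 0. */
  static void
  unlock_fd (int fd)
  {
    struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    fcntl (fd, F_OFD_SETLK, &fl);
  }

A blocking upgrade (F_OFD_SETLKW with F_WRLCK) is still deadlock-prone when
two readers try to upgrade at the same time, so in practice this would
probably end up with the same try-then-restart shape as the rwlock version,
while adding a syscall to every lock and unlock.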