Richard W.M. Jones
2023-Apr-27 14:19 UTC
[Libguestfs] [PATCH nbdkit 2/2] allocators: sparse: Split the highly contended mutex
We've long known that nbdkit-memory-plugin with the default sparse
array allocator suffers because a global lock must be taken whenever
any read or write operation is performed.  This commit aims to safely
improve performance by converting the lock into a read-write lock.

The shared (read) lock still must be acquired for any access to the
sparse array.  However, if you wish to make any changes to the L1/L2
directory, which includes allocating or deallocating pages, you must
upgrade to the exclusive (write) lock.

Unfortunately POSIX read-write locks are not upgradable, and previous
attempts to work around this led to many potential safety issues
and/or deadlocks.  For example, if you unlock the shared lock at the
point where you want to update the metadata and then attempt to
acquire the exclusive lock, other threads can jump in and modify the
metadata underneath you.  It is very difficult to work around this
limitation safely.

This patch takes a different approach: we attempt write operations
while holding the shared lock, but if we detect that we need to
update the metadata during the operation then we retry the entire
operation from the start while holding the exclusive lock.  This
should be safe, albeit slower.

Either the test program (contrib/sparseloadtest.c) or fio can be used
to test performance.  The test program is better because it exercises
page deallocation, which is not exercised by fio testing.

For unpatched nbdkit 1.34:

- 4 threads, mix of 80% read, 10% write, 10% trim:

    READ:   77669.5 ops/s  5087519644.9 bytes/s
    WRITE:   9707.1 ops/s   635528046.7 bytes/s
    TRIM:    9712.0 ops/s   636737940.6 bytes/s
    TOTAL:  97088.6 ops/s  6359785632.1 bytes/s

- 4 threads, mix of 20% read, 40% write, 40% trim:

    READ:   20135.1 ops/s  1318832390.2 bytes/s
    WRITE:  40293.6 ops/s  2640703848.7 bytes/s
    TRIM:   40240.7 ops/s  2636021905.5 bytes/s
    TOTAL: 100669.4 ops/s  6595558144.3 bytes/s

For patched nbdkit:

- 4 threads, mix of 80% read, 10% write, 10% trim:

    READ:   75265.4 ops/s  4931584759.3 bytes/s
    WRITE:   9392.0 ops/s   614956850.5 bytes/s
    TRIM:    9426.0 ops/s   618151433.1 bytes/s
    TOTAL:  94083.4 ops/s  6164693042.9 bytes/s

  Note that performance is about 3% lower, possibly because
  read-write locks are less efficient than plain mutexes.

- 4 threads, mix of 20% read, 40% write, 40% trim:

    READ:   23008.4 ops/s  1508451376.5 bytes/s
    WRITE:  46011.8 ops/s  3014531474.9 bytes/s
    TRIM:   46015.2 ops/s  3014830518.1 bytes/s
    TOTAL: 115035.3 ops/s  7537813369.6 bytes/s

  Performance is about 15% better.

With fio, unpatched nbdkit 1.34:

   READ: bw=1188MiB/s (1245MB/s), 1188MiB/s-1188MiB/s (1245MB/s-1245MB/s), io=69.6GiB (74.7GB), run=60001-60001msec
  WRITE: bw=1187MiB/s (1245MB/s), 1187MiB/s-1187MiB/s (1245MB/s-1245MB/s), io=69.6GiB (74.7GB), run=60001-60001msec

With fio, patched nbdkit:

   READ: bw=1309MiB/s (1372MB/s), 1309MiB/s-1309MiB/s (1372MB/s-1372MB/s), io=76.7GiB (82.3GB), run=60001-60001msec
  WRITE: bw=1308MiB/s (1372MB/s), 1308MiB/s-1308MiB/s (1372MB/s-1372MB/s), io=76.7GiB (82.3GB), run=60001-60001msec

fio test command:

  nbdkit -U - memory size=256M --run 'export uri; fio examples/nbd.fio'

with examples/nbd.fio from the fio git repository.
---
 common/allocators/sparse.c | 126 ++++++++++++++++++++++++++++---------
 1 file changed, 98 insertions(+), 28 deletions(-)

diff --git a/common/allocators/sparse.c b/common/allocators/sparse.c
index 0f1e9aa81..ce89715dc 100644
--- a/common/allocators/sparse.c
+++ b/common/allocators/sparse.c
@@ -129,11 +129,20 @@ DEFINE_VECTOR_TYPE (l1_dir, struct l1_entry);
 
 struct sparse_array {
   struct allocator a;           /* Must come first. */
 
-  /* This lock is highly contended.  When hammering the memory plugin
-   * with 8 fio threads, about 30% of the total system time was taken
-   * just waiting for this lock.  Fixing this is quite hard.
+  /* The shared (read) lock must be held if you just want to access
+   * the data without modifying any of the L1/L2 metadata or
+   * allocating or freeing any page.
+   *
+   * To modify the L1/L2 metadata including allocating or freeing any
+   * page you must hold the exclusive (write) lock.
+   *
+   * Because POSIX rwlocks are not upgradable this presents a problem.
+   * We solve it below by speculatively performing the request while
+   * holding the shared lock, but if we run into an operation that
+   * needs to update the metadata, we restart the entire request
+   * holding the exclusive lock.
    */
-  pthread_mutex_t lock;
+  pthread_rwlock_t lock;
   l1_dir l1_dir;                /* L1 directory. */
 };
 
@@ -158,7 +167,7 @@ sparse_array_free (struct allocator *a)
     for (i = 0; i < sa->l1_dir.len; ++i)
       free_l2_dir (sa->l1_dir.ptr[i].l2_dir);
     free (sa->l1_dir.ptr);
-    pthread_mutex_destroy (&sa->lock);
+    pthread_rwlock_destroy (&sa->lock);
     free (sa);
   }
 }
@@ -305,7 +314,10 @@ sparse_array_read (struct allocator *a,
                    void *buf, uint64_t count, uint64_t offset)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Because reads never modify any metadata, it is always safe to
+   * only hold the shared (read) lock.
+   */
+  ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   void *p;
 
@@ -327,30 +339,60 @@ sparse_array_read (struct allocator *a,
   return 0;
 }
 
+#define RESTART_EXCLUSIVE -2
+
+static int
+do_write (bool exclusive, struct sparse_array *sa,
+          const void *buf, uint64_t count, uint64_t offset)
+{
+  uint64_t n;
+  void *p;
+
+  while (count > 0) {
+    if (!exclusive) {
+      /* If we only hold the shared lock, try it without allocating. */
+      p = lookup (sa, offset, false, &n, NULL);
+      if (p == NULL)
+        return RESTART_EXCLUSIVE;
+    }
+    else {
+      p = lookup (sa, offset, true, &n, NULL);
+      if (p == NULL)
+        return -1;
+    }
+
+    if (n > count)
+      n = count;
+    memcpy (p, buf, n);
+
+    buf += n;
+    count -= n;
+    offset += n;
+  }
+
+  return 0;
+}
+
 static int
 sparse_array_write (struct allocator *a,
                     const void *buf, uint64_t count, uint64_t offset)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
-  uint64_t n;
-  void *p;
+  int r;
 
-  while (count > 0) {
-    p = lookup (sa, offset, true, &n, NULL);
-    if (p == NULL)
-      return -1;
-
-    if (n > count)
-      n = count;
-    memcpy (p, buf, n);
+  /* First try the write with the shared (read) lock held. */
+  {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_write (false, sa, buf, count, offset);
+  }
 
-    buf += n;
-    count -= n;
-    offset += n;
+  /* If that failed because we need the exclusive lock, restart. */
+  if (r == RESTART_EXCLUSIVE) {
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_write (true, sa, buf, count, offset);
   }
 
-  return 0;
+  return r;
 }
 
 static int sparse_array_zero (struct allocator *a,
@@ -367,7 +409,8 @@ sparse_array_fill (struct allocator *a, char c,
   if (c == 0)
     return sparse_array_zero (a, count, offset);
 
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Since fill is never called on a hot path, use the exclusive lock. */
+  ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
 
   while (count > 0) {
     p = lookup (sa, offset, true, &n, NULL);
@@ -386,10 +429,9 @@ sparse_array_fill (struct allocator *a, char c,
 }
 
 static int
-sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
+do_zero (bool exclusive, struct sparse_array *sa,
+         uint64_t count, uint64_t offset)
 {
-  struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   void *p;
   struct l2_entry *l2_entry;
@@ -407,6 +449,9 @@ sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
 
       /* If the whole page is now zero, free it. */
       if (n >= PAGE_SIZE || is_zero (l2_entry->page, PAGE_SIZE)) {
+        if (!exclusive)
+          return RESTART_EXCLUSIVE;
+
         if (sa->a.debug)
           nbdkit_debug ("%s: freeing zero page at offset %" PRIu64,
                         __func__, offset);
@@ -422,6 +467,27 @@ sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
   return 0;
 }
 
+static int
+sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
+{
+  struct sparse_array *sa = (struct sparse_array *) a;
+  int r;
+
+  /* First try the zero with the shared (read) lock held. */
+  {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_zero (false, sa, count, offset);
+  }
+
+  /* If that failed because we need the exclusive lock, restart. */
+  if (r == RESTART_EXCLUSIVE) {
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_zero (true, sa, count, offset);
+  }
+
+  return r;
+}
+
 static int
 sparse_array_blit (struct allocator *a1,
                    struct allocator *a2,
@@ -429,7 +495,8 @@ sparse_array_blit (struct allocator *a1,
                    uint64_t offset1, uint64_t offset2)
 {
   struct sparse_array *sa2 = (struct sparse_array *) a2;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa2->lock);
+  /* Since blit is never called on a hot path, use the exclusive lock. */
+  ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa2->lock);
   uint64_t n;
   void *p;
   struct l2_entry *l2_entry;
@@ -474,7 +541,10 @@ sparse_array_extents (struct allocator *a,
                       struct nbdkit_extents *extents)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Reading extents never modifies any metadata, so it is always safe
+   * to only hold the shared (read) lock.
+   */
+  ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   uint32_t type;
   void *p;
@@ -523,7 +593,7 @@ sparse_array_create (const void *paramsv)
     nbdkit_error ("calloc: %m");
     return NULL;
   }
-  pthread_mutex_init (&sa->lock, NULL);
+  pthread_rwlock_init (&sa->lock, NULL);
 
   return (struct allocator *) sa;
 }
-- 
2.39.2
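
The ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE and ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE
macros used in the diff are not defined by this patch; they presumably come
from nbdkit's scope-based cleanup helpers added in patch 1/2 of this series
(not shown here).  The following is only a minimal sketch of how such macros
can be built on the GCC/Clang cleanup attribute; the names and details are
illustrative, not the actual nbdkit definitions.

  #include <pthread.h>

  /* Sketch only: unlock helper invoked automatically when the guard
   * variable goes out of scope.
   */
  static inline void
  sketch_rwlock_unlock (pthread_rwlock_t **ptr)
  {
    if (*ptr)
      pthread_rwlock_unlock (*ptr);
  }

  #define SKETCH_CLEANUP_RWLOCK_UNLOCK \
    __attribute__ ((cleanup (sketch_rwlock_unlock)))

  /* Take the shared (read) lock and release it at end of scope. */
  #define ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE(rwlock)                      \
    SKETCH_CLEANUP_RWLOCK_UNLOCK pthread_rwlock_t *_rwlock = (rwlock);  \
    pthread_rwlock_rdlock (_rwlock)

  /* Take the exclusive (write) lock and release it at end of scope. */
  #define ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE(rwlock)                      \
    SKETCH_CLEANUP_RWLOCK_UNLOCK pthread_rwlock_t *_rwlock = (rwlock);  \
    pthread_rwlock_wrlock (_rwlock)

With macros along these lines, the speculative do_write / do_zero helpers
can return RESTART_EXCLUSIVE from anywhere: the shared lock is dropped when
the inner block in sparse_array_write / sparse_array_zero exits, and the
operation is then re-run under the exclusive lock, with no explicit unlock
calls needed.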
Eric Blake
2023-Apr-27 15:35 UTC
[Libguestfs] [PATCH nbdkit 2/2] allocators: sparse: Split the highly contended mutex
On Thu, Apr 27, 2023 at 03:19:12PM +0100, Richard W.M. Jones wrote:

> We've long known that nbdkit-memory-plugin with the default sparse
> array allocator suffers because a global lock must be taken whenever
> any read or write operation is performed.  This commit aims to safely
> improve performance by converting the lock into a read-write lock.

I searched for other approaches, and found this to be an interesting
read:

https://stackoverflow.com/questions/2407558/pthreads-reader-writer-locks-upgrading-read-lock-to-write-lock

One of the suggestions there is using fcntl() locking (which does
support upgrade attempts) instead of pthread_rwlock*.  Might be
interesting to code up and compare an approach along those lines.

> +++ b/common/allocators/sparse.c
> @@ -129,11 +129,20 @@ DEFINE_VECTOR_TYPE (l1_dir, struct l1_entry);
>  struct sparse_array {
>    struct allocator a;           /* Must come first. */
> 
> -  /* This lock is highly contended.  When hammering the memory plugin
> -   * with 8 fio threads, about 30% of the total system time was taken
> -   * just waiting for this lock.  Fixing this is quite hard.
> +  /* The shared (read) lock must be held if you just want to access
> +   * the data without modifying any of the L1/L2 metadata or
> +   * allocating or freeing any page.
> +   *
> +   * To modify the L1/L2 metadata including allocating or freeing any
> +   * page you must hold the exclusive (write) lock.
> +   *
> +   * Because POSIX rwlocks are not upgradable this presents a problem.
> +   * We solve it below by speculatively performing the request while
> +   * holding the shared lock, but if we run into an operation that
> +   * needs to update the metadata, we restart the entire request
> +   * holding the exclusive lock.
>     */
> -  pthread_mutex_t lock;
> +  pthread_rwlock_t lock;

But as written, this does look like a nice division of labor that
reduces serialization in the common case.  And yes, it does look like
posix_rwlock is a bit heavier-weight than a bare mutex, which explains
the slowdown on the read-heavy workload even while improving the
write-heavy workload.

Reviewed-by: Eric Blake <eblake at redhat.com>

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.  +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
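
One caveat for anyone experimenting with the fcntl() suggestion: classic
POSIX record locks (F_SETLK) are owned by the process, so they cannot
arbitrate between threads of a single process.  Linux open file description
(OFD) locks (F_OFD_SETLK) can, provided each thread opens its own file
descriptor on a shared lock file.  The following is a rough, untested sketch
of what an upgrade attempt might look like; the lock-file scheme and all
names here are hypothetical and are not anything from nbdkit.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>
  #include <stdbool.h>

  /* Each thread opens the shared lock file itself so that it gets its
   * own open file description (OFD locks conflict between different
   * OFDs, even inside one process).
   */
  static int
  open_lock_fd (const char *lockfile)
  {
    return open (lockfile, O_RDWR | O_CREAT, 0600);
  }

  /* Take the shared lock over byte 0 (blocking). */
  static bool
  lock_shared (int fd)
  {
    struct flock fl = { .l_type = F_RDLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    return fcntl (fd, F_OFD_SETLKW, &fl) == 0;
  }

  /* Attempt to upgrade to the exclusive lock without blocking.  On
   * failure (EAGAIN) the existing shared lock is retained, and the
   * caller would have to drop it and redo the operation, much like
   * the RESTART_EXCLUSIVE path in this patch.
   */
  static bool
  try_upgrade_exclusive (int fd)
  {
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    return fcntl (fd, F_OFD_SETLK, &fl) == 0;
  }

  /* Release whatever lock this fd holds over byte 0. */
  static void
  unlock_fd (int fd)
  {
    struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 1 };
    fcntl (fd, F_OFD_SETLK, &fl);
  }

A blocking upgrade (F_OFD_SETLKW with F_WRLCK) is still deadlock-prone when
two readers try to upgrade at the same time, so in practice this would
probably end up with the same try-then-restart shape as the rwlock version,
while adding a syscall to every lock and unlock.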