Richard W.M. Jones
2023-Apr-27 14:19 UTC
[Libguestfs] [PATCH nbdkit 0/2] allocators: sparse: Split the highly contended mutex
nbdkit-memory-plugin is a RAM disk, but it is quite a slow one because of the lock we must acquire on the sparse array. This series tries to improve performance by replacing that mutex with a read-write lock. The results are somewhat mixed.

Note that in the real world, workloads which constantly update the metadata are assumed to be fairly rare. The test program added in the first patch exercises metadata updates far more aggressively than would be expected in practice.

Rich.
Richard W.M. Jones
2023-Apr-27 14:19 UTC
[Libguestfs] [PATCH nbdkit 1/2] contrib: Add a test/benchmarking tool for the sparse memory plugin
Add a contrib directory which we can use to store random tests and other contributions that we don't necessarily want to support long term or incorporate into nbdkit. Into this directory place a test program which tests nbdkit-memory-plugin with allocator=sparse.
---
 configure.ac             |   1 +
 Makefile.am              |   1 +
 contrib/Makefile.am      |  44 +++++
 contrib/sparseloadtest.c | 399 +++++++++++++++++++++++++++++++++++++++
 .gitignore               |   1 +
 5 files changed, 446 insertions(+)

diff --git a/configure.ac b/configure.ac
index fc4928b5d..d8a93ac14 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1470,6 +1470,7 @@ AC_CONFIG_FILES([Makefile
 common/replacements/Makefile
 common/replacements/win32/Makefile
 common/utils/Makefile
+contrib/Makefile
 docs/Makefile
 include/Makefile
 include/nbdkit-version.h
diff --git a/Makefile.am b/Makefile.am
index b1fcc815a..3fe149909 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -84,6 +84,7 @@ SUBDIRS = \
 	common/replacements \
 	common/utils \
 	server \
+	contrib \
 	$(NULL)
 
 if HAVE_PLUGINS
diff --git a/contrib/Makefile.am b/contrib/Makefile.am
new file mode 100644
index 000000000..b9866e47d
--- /dev/null
+++ b/contrib/Makefile.am
@@ -0,0 +1,44 @@
+# nbdkit
+# Copyright Red Hat
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+include $(top_srcdir)/common-rules.mk
+
+if HAVE_LIBNBD
+
+noinst_PROGRAMS = sparseloadtest
+
+sparseloadtest_SOURCES = sparseloadtest.c
+sparseloadtest_CPPFLAGS = -I$(top_srcdir)/common/include
+sparseloadtest_CFLAGS = $(PTHREAD_CFLAGS) $(WARNINGS_CFLAGS) $(LIBNBD_CFLAGS)
+sparseloadtest_LDADD = $(LIBNBD_LIBS)
+sparseloadtest_LDFLAGS = $(PTHREAD_LIBS)
+
+endif HAVE_LIBNBD
diff --git a/contrib/sparseloadtest.c b/contrib/sparseloadtest.c
new file mode 100644
index 000000000..4aa3f5104
--- /dev/null
+++ b/contrib/sparseloadtest.c
@@ -0,0 +1,399 @@
+/* nbdkit
+ * Copyright Red Hat
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/* Load nbdkit-memory-plugin + allocator=sparse
+ *
+ * This is better than fio for this plugin since it exercises
+ * allocation and deallocation of pages and locking.
+ *
+ * To test a mainly read workload (90% reads, 5% writes, 5% trims):
+ *   ./contrib/sparseloadtest 4 90
+ *
+ * To test a write-heavy workload (20% reads, 40% writes, 40% trims):
+ *   ./contrib/sparseloadtest 4 20
+ *
+ * nbdkit is run from the current $PATH environment variable, so to
+ * run the locally built nbdkit you should do:
+ *
+ *   PATH=.:$PATH ./contrib/sparseloadtest [...]
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <errno.h>
+#include <time.h>
+#include <signal.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <pthread.h>
+
+#include <libnbd.h>
+
+#include "random.h"
+
+#define PAGE_SIZE 32768         /* See common/allocators/sparse.c */
+#define L2_SIZE 4096
+#define DISK_SIZE (4*L2_SIZE*PAGE_SIZE)
+#define MAX_THREADS 16
+#define DURATION 60             /* seconds */
+#define MAX_IN_FLIGHT 64
+#define MAX_REQUEST (128*1024)  /* Should be larger than PAGE_SIZE. */
+
+static pid_t pid;
+static char sockfile[] = "/tmp/sockXXXXXX";
+static char pidfile[] = "/tmp/pidXXXXXX";
+static unsigned nr_threads;
+static double pc_read;          /* % read operations. */
+
+struct stats {
+  size_t ops;
+  size_t bytes;
+};
+
+struct thread_data {
+  pthread_t thread;
+  int status;                   /* returned status from the thread */
+  struct nbd_handle *nbd;       /* per-thread handle */
+  struct stats read_stats, write_stats, trim_stats;
+  struct random_state state;
+};
+static struct thread_data thread[MAX_THREADS];
+
+static time_t start_t;
+
+struct command_data {
+  struct stats *stats;
+  size_t count;
+};
+
+/* We don't care about the data that is read, so this is just a sink
+ * buffer shared across all threads.
+ */
+static char sink[MAX_REQUEST];
+
+static char wrbuf[MAX_REQUEST];
+
+static void start_nbdkit (void);
+static void create_random_name (char *template);
+static void cleanup (void);
+static void *start_thread (void *);
+
+int
+main (int argc, char *argv[])
+{
+  unsigned i;
+  int err;
+  struct stats read_total = { 0 };
+  struct stats write_total = { 0 };
+  struct stats trim_total = { 0 };
+
+  if (argc != 3) {
+    fprintf (stderr, "%s nr_threads percent_reads\n", argv[0]);
+    exit (EXIT_FAILURE);
+  }
+  if (sscanf (argv[1], "%u", &nr_threads) != 1 ||
+      nr_threads == 0 || nr_threads > MAX_THREADS ||
+      sscanf (argv[2], "%lg", &pc_read) != 1 ||
+      pc_read <= 0 || pc_read > 100) {
+    fprintf (stderr, "%s: incorrect parameters, read the source!\n", argv[0]);
+    exit (EXIT_FAILURE);
+  }
+
+  start_nbdkit ();
+  atexit (cleanup);
+
+  /* Connect to nbdkit. */
+  for (i = 0; i < nr_threads; ++i) {
+    struct nbd_handle *nbd = nbd_create ();
+    if (nbd == NULL) {
+    got_nbd_error:
+      fprintf (stderr, "%s: thread %u: %s\n", argv[0], i, nbd_get_error ());
+      exit (EXIT_FAILURE);
+    }
+    if (nbd_connect_unix (nbd, sockfile) == -1)
+      goto got_nbd_error;
+    thread[i].nbd = nbd;
+  }
+
+  time (&start_t);
+
+  /* Start threads. */
+  for (i = 0; i < nr_threads; ++i) {
+    xsrandom (i+1, &thread[i].state);
+
+    if (i == 0) {
+      size_t j;
+
+      for (j = 0; j < sizeof wrbuf; ++j)
+        wrbuf[j] = xrandom (&thread[i].state);
+    }
+
+    err = pthread_create (&thread[i].thread, NULL, start_thread, &thread[i]);
+    if (err != 0) {
+      errno = err;
+      perror ("pthread_create");
+      exit (EXIT_FAILURE);
+    }
+  }
+
+  /* Wait for the threads to exit. */
+  for (i = 0; i < nr_threads; ++i) {
+    err = pthread_join (thread[i].thread, NULL);
+    if (err != 0) {
+      errno = err;
+      perror ("pthread_join");
+      exit (EXIT_FAILURE);
+    }
+    if (thread[i].status != 0) {
+      fprintf (stderr, "%s: thread %u failed, see earlier errors\n",
+               argv[0], i);
+      exit (EXIT_FAILURE);
+    }
+
+    read_total.ops += thread[i].read_stats.ops;
+    read_total.bytes += thread[i].read_stats.bytes;
+    write_total.ops += thread[i].write_stats.ops;
+    write_total.bytes += thread[i].write_stats.bytes;
+    trim_total.ops += thread[i].trim_stats.ops;
+    trim_total.bytes += thread[i].trim_stats.bytes;
+
+    /* Drain the command queue just to avoid errors.  These requests
+     * don't count in the final totals.
+     */
+    while (nbd_aio_in_flight (thread[i].nbd) > 0) {
+      if (nbd_poll (thread[i].nbd, -1) == -1) {
+        fprintf (stderr, "%s\n", nbd_get_error ());
+        exit (EXIT_FAILURE);
+      }
+    }
+
+    nbd_close (thread[i].nbd);
+  }
+
+  /* Print the throughput. */
+  printf ("READ:  %.1f ops/s  %.1f bytes/s\n",
+          (double) read_total.ops / DURATION,
+          (double) read_total.bytes / DURATION);
+  printf ("WRITE: %.1f ops/s  %.1f bytes/s\n",
+          (double) write_total.ops / DURATION,
+          (double) write_total.bytes / DURATION);
+  printf ("TRIM:  %.1f ops/s  %.1f bytes/s\n",
+          (double) trim_total.ops / DURATION,
+          (double) trim_total.bytes / DURATION);
+  printf ("TOTAL: %.1f ops/s  %.1f bytes/s\n",
+          (double) (read_total.ops + write_total.ops + trim_total.ops)
+          / DURATION,
+          (double) (read_total.bytes + write_total.bytes + trim_total.bytes)
+          / DURATION);
+  printf ("--\n");
+  printf ("%%read operations requested: %.1f%%, achieved: %.1f%%\n",
+          pc_read,
+          100.0 * read_total.ops /
+          (read_total.ops + write_total.ops + trim_total.ops));
+  printf ("%%write operations requested: %.1f%%, achieved: %.1f%%\n",
+          (100 - pc_read) / 2,
+          100.0 * write_total.ops /
+          (read_total.ops + write_total.ops + trim_total.ops));
+  printf ("%%trim operations requested: %.1f%%, achieved: %.1f%%\n",
+          (100 - pc_read) / 2,
+          100.0 * trim_total.ops /
+          (read_total.ops + write_total.ops + trim_total.ops));
+
+  exit (EXIT_SUCCESS);
+}
+
+static void
+cleanup (void)
+{
+  if (pid > 0) {
+    kill (pid, SIGTERM);
+    unlink (sockfile);
+    pid = 0;
+  }
+}
+
+/* Start nbdkit.
+ *
+ * We cannot use systemd socket activation because we want to use
+ * multi-conn.
+ */
+static void
+start_nbdkit (void)
+{
+  char size[64];
+  int i;
+
+  snprintf (size, sizeof size, "%d", DISK_SIZE);
+
+  create_random_name (sockfile);
+  create_random_name (pidfile);
+
+  /* Start nbdkit. */
+  pid = fork ();
+  if (pid == -1) {
+    perror ("fork");
+    exit (EXIT_FAILURE);
+  }
+  if (pid == 0) {               /* Child - nbdkit */
+    execlp ("nbdkit",
+            "nbdkit", "--exit-with-parent", "-f",
+            "-U", sockfile, "-P", pidfile,
+            "memory", size, "allocator=sparse",
+            NULL);
+    perror ("execlp");
+    _exit (EXIT_FAILURE);
+  }
+
+  /* Wait for the pidfile to appear, indicating that nbdkit is ready. */
+  for (i = 0; i < 60; ++i) {
+    if (access (pidfile, F_OK) == 0)
+      break;
+    sleep (1);
+  }
+  if (i == 60) {
+    fprintf (stderr, "nbdkit did not start up, look for errors above\n");
+    exit (EXIT_FAILURE);
+  }
+
+  unlink (pidfile);
+}
+
+/* This is racy but it doesn't matter for a test. */
+static void
+create_random_name (char *template)
+{
+  int fd;
+
+  fd = mkstemp (template);
+  if (fd == -1) {
+    perror (template);
+    exit (EXIT_FAILURE);
+  }
+  close (fd);
+  unlink (template);
+}
+
+static int
+cb (void *user_data, int *error)
+{
+  struct command_data *data = user_data;
+
+  if (*error != 0) {
+    fprintf (stderr, "unexpected error in completion callback\n");
+    exit (EXIT_FAILURE);
+  }
+
+  data->stats->ops++;
+  data->stats->bytes += data->count;
+
+  free (data);
+
+  return 1;                     /* retire the command */
+}
+
+static void *
+start_thread (void *thread_data_vp)
+{
+  struct thread_data *thread_data = thread_data_vp;
+  struct nbd_handle *nbd = thread_data->nbd;
+  time_t end_t;
+  size_t total_ops;
+  double pc_read_actual;
+  uint64_t offset;
+  size_t count;
+  struct command_data *data;
+  int64_t r;
+
+  while (time (&end_t), end_t - start_t < DURATION) {
+    /* Run the poll loop if there are too many requests in flight. */
+    while (nbd_aio_in_flight (nbd) >= MAX_IN_FLIGHT) {
+      if (nbd_poll (nbd, -1) == -1) {
+        fprintf (stderr, "%s\n", nbd_get_error ());
+        exit (EXIT_FAILURE);
+      }
+    }
+
+    /* Aim to send about pc_read% read operations, and an equal random
+     * distribution of writes and trims.
+     */
+    total_ops = thread_data->read_stats.ops + thread_data->write_stats.ops +
+      thread_data->trim_stats.ops;
+    if (total_ops == 0)
+      pc_read_actual = 100;
+    else
+      pc_read_actual = 100 * (thread_data->read_stats.ops / (double) total_ops);
+
+    offset = xrandom (&thread_data->state) & (DISK_SIZE - MAX_REQUEST);
+    count = xrandom (&thread_data->state) & (MAX_REQUEST - 1);
+    if (count == 0) count = 1;
+
+    data = malloc (sizeof *data);       /* freed in callback */
+    data->count = count;
+
+    if (pc_read_actual < pc_read) {     /* read op */
+      data->stats = &thread_data->read_stats;
+      r = nbd_aio_pread (nbd, sink, count, offset,
+                         (nbd_completion_callback) {
+                           .callback = cb,
+                           .user_data = data,
+                         }, 0);
+    }
+    else {
+      if (xrandom (&thread_data->state) & 1) {  /* write op */
+        data->stats = &thread_data->write_stats;
+        r = nbd_aio_pwrite (nbd, wrbuf, count, offset,
+                            (nbd_completion_callback) {
+                              .callback = cb,
+                              .user_data = data,
+                            }, 0);
+      }
+      else {                    /* trim op */
+        data->stats = &thread_data->trim_stats;
+        r = nbd_aio_trim (nbd, count, offset,
+                          (nbd_completion_callback) {
+                            .callback = cb,
+                            .user_data = data,
+                          }, 0);
+      }
+    }
+    if (r == -1) {
+      fprintf (stderr, "%s\n", nbd_get_error ());
+      exit (EXIT_FAILURE);
+    }
+  }
+
+  return NULL;
+}
diff --git a/.gitignore b/.gitignore
index 74b4e8361..e62c23d8b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -66,6 +66,7 @@ plugins/*/*.3
 /config.status
 /config.sub
 /configure
+/contrib/sparseloadtest
 /depcomp
 /docs/filter-links.pod
 /docs/lang-plugin-links.pod
-- 
2.39.2
Richard W.M. Jones
2023-Apr-27 14:19 UTC
[Libguestfs] [PATCH nbdkit 2/2] allocators: sparse: Split the highly contended mutex
We've long known that nbdkit-memory-plugin with the default sparse array allocator suffers because a global lock must be taken whenever any read or write operation is performed. This commit aims to safely improve performance by converting the lock into a read-write lock.

The shared (read) lock must still be acquired for any access to the sparse array. However if you wish to make any changes to the L1/L2 directory, which includes allocating or deallocating pages, then you must upgrade to the exclusive (write) lock.

Unfortunately POSIX read-write locks are not upgradable, and previous attempts to work around this led to many potential safety issues and/or deadlocks. For example, if at the point where you want to update the metadata you unlock the shared lock and attempt to acquire the exclusive lock, other threads can jump in and modify the metadata underneath you. It is very difficult to work around this limitation safely.

This patch takes a different approach: we attempt write operations while holding the shared lock, but if we detect that we need to update the metadata during the operation then we retry the entire operation from the start with the exclusive lock. This should be safe, albeit slower.

Either the test program (contrib/sparseloadtest.c) or fio can be used to test performance. The test program is better because it exercises page deallocation, which is not exercised by fio testing.
For unpatched nbdkit 1.34:

- 4 threads, mix of 80% read, 10% write, 10% trim:

  READ:  77669.5 ops/s  5087519644.9 bytes/s
  WRITE: 9707.1 ops/s  635528046.7 bytes/s
  TRIM:  9712.0 ops/s  636737940.6 bytes/s
  TOTAL: 97088.6 ops/s  6359785632.1 bytes/s

- 4 threads, mix of 20% read, 40% write, 40% trim:

  READ:  20135.1 ops/s  1318832390.2 bytes/s
  WRITE: 40293.6 ops/s  2640703848.7 bytes/s
  TRIM:  40240.7 ops/s  2636021905.5 bytes/s
  TOTAL: 100669.4 ops/s  6595558144.3 bytes/s

For patched nbdkit:

- 4 threads, mix of 80% read, 10% write, 10% trim:

  READ:  75265.4 ops/s  4931584759.3 bytes/s
  WRITE: 9392.0 ops/s  614956850.5 bytes/s
  TRIM:  9426.0 ops/s  618151433.1 bytes/s
  TOTAL: 94083.4 ops/s  6164693042.9 bytes/s

  Note that performance is about 3% lower, possibly because read-write
  locks are less efficient than mutexes?

- 4 threads, mix of 20% read, 40% write, 40% trim:

  READ:  23008.4 ops/s  1508451376.5 bytes/s
  WRITE: 46011.8 ops/s  3014531474.9 bytes/s
  TRIM:  46015.2 ops/s  3014830518.1 bytes/s
  TOTAL: 115035.3 ops/s  7537813369.6 bytes/s

  Performance is about 15% better.

With fio, unpatched nbdkit 1.34:

  READ:  bw=1188MiB/s (1245MB/s), 1188MiB/s-1188MiB/s (1245MB/s-1245MB/s), io=69.6GiB (74.7GB), run=60001-60001msec
  WRITE: bw=1187MiB/s (1245MB/s), 1187MiB/s-1187MiB/s (1245MB/s-1245MB/s), io=69.6GiB (74.7GB), run=60001-60001msec

With fio, patched nbdkit:

  READ:  bw=1309MiB/s (1372MB/s), 1309MiB/s-1309MiB/s (1372MB/s-1372MB/s), io=76.7GiB (82.3GB), run=60001-60001msec
  WRITE: bw=1308MiB/s (1372MB/s), 1308MiB/s-1308MiB/s (1372MB/s-1372MB/s), io=76.7GiB (82.3GB), run=60001-60001msec

fio test command:

  nbdkit -U - memory size=256M --run 'export uri; fio examples/nbd.fio'

with examples/nbd.fio from the fio git repository.
---
 common/allocators/sparse.c | 126 ++++++++++++++++++++++++++++---------
 1 file changed, 98 insertions(+), 28 deletions(-)

diff --git a/common/allocators/sparse.c b/common/allocators/sparse.c
index 0f1e9aa81..ce89715dc 100644
--- a/common/allocators/sparse.c
+++ b/common/allocators/sparse.c
@@ -129,11 +129,20 @@ DEFINE_VECTOR_TYPE (l1_dir, struct l1_entry);
 
 struct sparse_array {
   struct allocator a;           /* Must come first. */
 
-  /* This lock is highly contended.  When hammering the memory plugin
-   * with 8 fio threads, about 30% of the total system time was taken
-   * just waiting for this lock.  Fixing this is quite hard.
+  /* The shared (read) lock must be held if you just want to access
+   * the data without modifying any of the L1/L2 metadata or
+   * allocating or freeing any page.
+   *
+   * To modify the L1/L2 metadata including allocating or freeing any
+   * page you must hold the exclusive (write) lock.
+   *
+   * Because POSIX rwlocks are not upgradable this presents a problem.
+   * We solve it below by speculatively performing the request while
+   * holding the shared lock, but if we run into an operation that
+   * needs to update the metadata, we restart the entire request
+   * holding the exclusive lock.
    */
-  pthread_mutex_t lock;
+  pthread_rwlock_t lock;
   l1_dir l1_dir;                /* L1 directory. */
 };
@@ -158,7 +167,7 @@ sparse_array_free (struct allocator *a)
     for (i = 0; i < sa->l1_dir.len; ++i)
       free_l2_dir (sa->l1_dir.ptr[i].l2_dir);
     free (sa->l1_dir.ptr);
-    pthread_mutex_destroy (&sa->lock);
+    pthread_rwlock_destroy (&sa->lock);
     free (sa);
   }
 }
@@ -305,7 +314,10 @@ sparse_array_read (struct allocator *a,
                    void *buf, uint64_t count, uint64_t offset)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Because reads never modify any metadata, it is always safe to
+   * only hold the shared (read) lock.
+   */
+  ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   void *p;
 
@@ -327,30 +339,60 @@ sparse_array_read (struct allocator *a,
   return 0;
 }
 
+#define RESTART_EXCLUSIVE -2
+
+static int
+do_write (bool exclusive, struct sparse_array *sa,
+          const void *buf, uint64_t count, uint64_t offset)
+{
+  uint64_t n;
+  void *p;
+
+  while (count > 0) {
+    if (!exclusive) {
+      /* If we only hold the shared lock, try it without allocating. */
+      p = lookup (sa, offset, false, &n, NULL);
+      if (p == NULL)
+        return RESTART_EXCLUSIVE;
+    }
+    else {
+      p = lookup (sa, offset, true, &n, NULL);
+      if (p == NULL)
+        return -1;
+    }
+
+    if (n > count)
+      n = count;
+    memcpy (p, buf, n);
+
+    buf += n;
+    count -= n;
+    offset += n;
+  }
+
+  return 0;
+}
+
 static int
 sparse_array_write (struct allocator *a,
                     const void *buf, uint64_t count, uint64_t offset)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
-  uint64_t n;
-  void *p;
+  int r;
 
-  while (count > 0) {
-    p = lookup (sa, offset, true, &n, NULL);
-    if (p == NULL)
-      return -1;
-
-    if (n > count)
-      n = count;
-    memcpy (p, buf, n);
+  /* First try the write with the shared (read) lock held. */
+  {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_write (false, sa, buf, count, offset);
+  }
 
-    buf += n;
-    count -= n;
-    offset += n;
+  /* If that failed because we need the exclusive lock, restart. */
+  if (r == RESTART_EXCLUSIVE) {
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_write (true, sa, buf, count, offset);
   }
 
-  return 0;
+  return r;
 }
 
 static int sparse_array_zero (struct allocator *a,
@@ -367,7 +409,8 @@ sparse_array_fill (struct allocator *a, char c,
   if (c == 0)
     return sparse_array_zero (a, count, offset);
 
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Since fill is never called on a hot path, use the exclusive lock. */
+  ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
 
   while (count > 0) {
     p = lookup (sa, offset, true, &n, NULL);
@@ -386,10 +429,9 @@ sparse_array_fill (struct allocator *a, char c,
 }
 
 static int
-sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
+do_zero (bool exclusive, struct sparse_array *sa,
+         uint64_t count, uint64_t offset)
 {
-  struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   void *p;
   struct l2_entry *l2_entry;
@@ -407,6 +449,9 @@ sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
 
     /* If the whole page is now zero, free it. */
     if (n >= PAGE_SIZE || is_zero (l2_entry->page, PAGE_SIZE)) {
+      if (!exclusive)
+        return RESTART_EXCLUSIVE;
+
       if (sa->a.debug)
         nbdkit_debug ("%s: freeing zero page at offset %" PRIu64,
                       __func__, offset);
@@ -422,6 +467,27 @@ sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
   return 0;
 }
 
+static int
+sparse_array_zero (struct allocator *a, uint64_t count, uint64_t offset)
+{
+  struct sparse_array *sa = (struct sparse_array *) a;
+  int r;
+
+  /* First try the zero with the shared (read) lock held. */
+  {
+    ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_zero (false, sa, count, offset);
+  }
+
+  /* If that failed because we need the exclusive lock, restart. */
+  if (r == RESTART_EXCLUSIVE) {
+    ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa->lock);
+    r = do_zero (true, sa, count, offset);
+  }
+
+  return r;
+}
+
 static int
 sparse_array_blit (struct allocator *a1,
                    struct allocator *a2,
@@ -429,7 +495,8 @@ sparse_array_blit (struct allocator *a1,
                    uint64_t offset1, uint64_t offset2)
 {
   struct sparse_array *sa2 = (struct sparse_array *) a2;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa2->lock);
+  /* Since blit is never called on a hot path, use the exclusive lock. */
+  ACQUIRE_WRLOCK_FOR_CURRENT_SCOPE (&sa2->lock);
   uint64_t n;
   void *p;
   struct l2_entry *l2_entry;
@@ -474,7 +541,10 @@ sparse_array_extents (struct allocator *a,
                       struct nbdkit_extents *extents)
 {
   struct sparse_array *sa = (struct sparse_array *) a;
-  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&sa->lock);
+  /* Reading extents never modifies any metadata, so it is always safe
+   * to only hold the shared (read) lock.
+   */
+  ACQUIRE_RDLOCK_FOR_CURRENT_SCOPE (&sa->lock);
   uint64_t n;
   uint32_t type;
   void *p;
@@ -523,7 +593,7 @@ sparse_array_create (const void *paramsv)
     nbdkit_error ("calloc: %m");
     return NULL;
   }
-  pthread_mutex_init (&sa->lock, NULL);
+  pthread_rwlock_init (&sa->lock, NULL);
 
   return (struct allocator *) sa;
 }
-- 
2.39.2