This is a subset of the last half of the larger 9-patch series. The uncontroversial first half of that series is pushed, but here, I tried to reduce the size of the patches by splitting out some of the more complex changes, so that the rest of the changes remaining in the series are more mechanical. In turn, it forced me to write timing tests, which let me spot another spot where we are wasting time, so patch 2 is completely new. As we discussed on IRC, we're early enough in the 1.16 release cycle that I'll probably just push these when I'm happy with them, whether or not a review turns up something (we can always do followup patches). But forcing me to reread the patches for legibility has been a good exercise, because I really like how it forced me to spend a good hour on the commit message of patch 1 explaining things, and how patch 2 then fell out in 10 minutes of hacking to fix the problem identified in passing in the patch 1 commit message. Eric Blake (2): server: Cache per-connection can_write server: Remember .open(readonly) status server/internal.h | 5 ++- server/backend.c | 43 ++++++++++++++++++++- server/connections.c | 5 ++- server/filters.c | 6 +-- server/plugins.c | 4 +- server/protocol-handshake.c | 74 ++++++++++++++++++------------------- 6 files changed, 88 insertions(+), 49 deletions(-) -- 2.21.0
Eric Blake
2019-Aug-30 22:55 UTC
[Libguestfs] [nbdkit PATCH v2 1/2] server: Cache per-connection can_write
The calculation of eflags skips calling .can_zero, .can_trim, and .can_fua if .can_write returns false, because 1) time spent calling .can_zero would be wasted, and 2) plugin writers can be lazy and let .can_zero return constant true, regardless of what they return for .can_write, and are relying on nbdkit to not advertise or call .zero. We don't want to regress on this (that is, it is easy to write a sh plugin script that would be noticeably slower if we call into .can_zero in spite of .can_write returning false). But at the same time, just as plugins can blindly return .can_zero==true in spite of .can_write==false, it is also possible for a plugin to blindly return .can_write==true in spite of the readonly=true argument to .open (that is, when the command line option -r is in effect), yet we were NOT skipping that call into .can_write (meaning we can write a sh plugin to observe a wasted call to .can_write [1]). Next, consider that although we don't have such a filter today, it is feasible that someone could write a filter that exposes .can_write==0 to the client but which wants to open the underlying plugin read-write and uses next_ops->pwrite and next_ops->zero (for example, a scenario of exposing read-only contents of a single file extracted from an underlying file system, where the underlying device containing the file system must be opened read-write even though its file system will only be mounted read-only, because the act of mounting triggers a journal replay that modifies the underlying device, but not the file being extracted). This implies that -r must only affect the outermost backend, and not every backend along the chain. Then, to date, we've documented that such a filter merely need check next_ops->can_zero before calling next_ops->zero, but do not actually enforce that; furthermore, we are back to the case of what to do when a plugin hard-codes can_zero to return true even when can_write is false. All filters are in-tree, so we could merely change our documentation to require the filters to also check next_ops->can_write, but it is nicer to just do the extra checking as part of backend.c. At any rate, adding more calls to can_write implies that it is important enough to be cached; and that's easy to do using the same ideas as already used in commit b9d4875b for .get_size. (Other things like can_zero and can_fua are also worth caching, but this patch is already going to be big enough to save that one for later.) The next consideration is when we should check can_write and can_zero. If the first time we actually call into the plugin is during next_ops->zero itself, we have to deal with any errors reported by next_ops->can_zero - except that the backend interface has no *err parameter for can_zero, and we are not ensured that there will be a sane errno to report for the failed .zero. It would be a lot nicer to just assert during .zero that the cache for .can_write is already populated from the handshake phase (eflags computation for the outermost backend, and during a filter's .prepare for the next backend), to prove that the caller is not using .zero except when the backend is writable. Thankfully, all filters which have both an override of .can_zero and which call next_ops->can_zero do check next_ops->can_zero during .prepare, so as long as we fix can_zero to trigger the call to can_write under the hood, we don't have to modify those filters. Finally, with all these changes in place, we can drop the gating logic during eflags computation as redundant with the gating logic in backend.c; eflags computations can just blindly call all of the underlying functions (and when the next patch adds even more can_FOO caching, we'll never risk those values being uncached during request handling). Likewise, plugins.c no longer has to fall back to .can_write during its computation of .can_zero (because we have ensured that .can_zero won't be reached). [1] Demo time: $ cat script case "$1" in get_size) echo 1m;; can_write) echo can_write >/dev/tty; sleep 1; ${DIE} ;; can_zero) echo can_zero >/dev/tty; sleep 2;; *) exit 2 ;; esac Pre-patch: $ /bin/time -f %e ./nbdkit -U - -r sh script \ --run 'qemu-nbd --list -k $unixsocket >/dev/null' can_write 1.10 Wasted call to .can_write, but at least .can_zero was skipped $ /bin/time -f %e ./nbdkit -U - sh script \ --run 'qemu-nbd --list -k $unixsocket >/dev/null' can_write can_zero can_write 4.10 Without -r, eflags computation calls .can_write, then .can_zero; but then plugin_can_zero calls .can_write again. $ DIE='exit 3' /bin/time -f %e ./nbdkit -U - sh script \ --run 'qemu-nbd --list -k $unixsocket >/dev/null' can_write 1.06 But when .can_write fails, we see how .can_zero was skipped. $ /bin/time -f %e ./nbdkit -U - -r --filter=nozero sh script zeromode=notrim \ --run 'qemu-nbd --list -k $unixsocket >/dev/null' can_zero can_write can_write 4.11 The nozero filter probes next_ops->can_zero unconditionally during .prepare (here, if we had the theoretical filter on top of nozero that advertises no write support but uses writes under the hood, then this is actually good - although perhaps it is worth a followup patch to filters to be smarter during .prepare if .open was passed readonly=true), then plugin_can_zero calls .can_write. The second .can_write comes from eflags computation (since the nozero filter did not intercept that one). $ DIE='exit 3' /bin/time -f %e ./nbdkit -U - --filter=nozero sh script \ zeromode=notrim --run 'qemu-nbd --list -k $unixsocket >/dev/null' can_zero can_write nbdkit: sh[1]: error: zeromode 'notrim' requires plugin zero support qemu-nbd: Failed to read initial magic: Unexpected end-of-file before all bytes were read Command exited with non-zero status 1 3.07 A bit faster (we didn't get to eflags computation because of the nozero failure), but still long. With this applied: $ /bin/time -f %e ./nbdkit -U - -r sh script \ --run 'qemu-nbd --list -k $unixsocket >/dev/null' 0.07 Yay - no time wasted! $ /bin/time -f %e ./nbdkit -U - -r --filter=nozero sh script zeromode=notrim \ --run 'qemu-nbd --list -k $unixsocket >/dev/null' can_write can_zero 3.10 Note the act of calling .can_zero from nozero's .prepare forces .can_write to happen first, and because it was cached, we don't see it again even during eflags computation. $ DIE='exit 3' /bin/time -f %e ./nbdkit -U - --filter=nozero sh script \ zeromode=notrim --run 'qemu-nbd --list -k $unixsocket >/dev/null' can_write nbdkit: sh[1]: error: zeromode 'notrim' requires plugin zero support qemu-nbd: Failed to read initial magic: Unexpected end-of-file before all bytes were read Command exited with non-zero status 1 1.06 We still get the nozero failure, but this time much faster (without wasting time on the .can_zero call, since the .can_write failure already constrains us). Signed-off-by: Eric Blake <eblake@redhat.com> --- server/internal.h | 2 +- server/backend.c | 33 ++++++++++++++++- server/connections.c | 3 +- server/plugins.c | 2 +- server/protocol-handshake.c | 74 ++++++++++++++++++------------------- 5 files changed, 72 insertions(+), 42 deletions(-) diff --git a/server/internal.h b/server/internal.h index 4db55315..a55d6406 100644 --- a/server/internal.h +++ b/server/internal.h @@ -155,6 +155,7 @@ struct b_conn_handle { void *handle; uint64_t exportsize; + int can_write; }; struct connection { @@ -172,7 +173,6 @@ struct connection { uint32_t cflags; uint16_t eflags; - bool readonly; bool can_flush; bool is_rotational; bool can_trim; diff --git a/server/backend.c b/server/backend.c index 52d53f80..749b1f15 100644 --- a/server/backend.c +++ b/server/backend.c @@ -174,6 +174,8 @@ backend_set_handle (struct backend *b, struct connection *conn, void *handle) conn->handles[b->i].handle = handle; } +/* Wrappers for all callbacks in a filter's struct nbdkit_next_ops. */ + int64_t backend_get_size (struct backend *b, struct connection *conn) { @@ -189,9 +191,17 @@ backend_get_size (struct backend *b, struct connection *conn) int backend_can_write (struct backend *b, struct connection *conn) { + struct b_conn_handle *h = &conn->handles[b->i]; + debug ("%s: can_write", b->name); - return b->can_write (b, conn); + if (h->can_write == -1) { + /* Special case for outermost backend when -r is in effect. */ + if (readonly && b == backend) + return h->can_write = 0; + h->can_write = b->can_write (b, conn); + } + return h->can_write; } int @@ -213,16 +223,26 @@ backend_is_rotational (struct backend *b, struct connection *conn) int backend_can_trim (struct backend *b, struct connection *conn) { + int r; + debug ("%s: can_trim", b->name); + r = backend_can_write (b, conn); + if (r != 1) + return r; return b->can_trim (b, conn); } int backend_can_zero (struct backend *b, struct connection *conn) { + int r; + debug ("%s: can_zero", b->name); + r = backend_can_write (b, conn); + if (r != 1) + return r; return b->can_zero (b, conn); } @@ -237,8 +257,13 @@ backend_can_extents (struct backend *b, struct connection *conn) int backend_can_fua (struct backend *b, struct connection *conn) { + int r; + debug ("%s: can_fua", b->name); + r = backend_can_write (b, conn); + if (r != 1) + return r; /* Relies on 0 == NBDKIT_FUA_NONE */ return b->can_fua (b, conn); } @@ -280,8 +305,10 @@ backend_pwrite (struct backend *b, struct connection *conn, const void *buf, uint32_t count, uint64_t offset, uint32_t flags, int *err) { + struct b_conn_handle *h = &conn->handles[b->i]; int r; + assert (h->can_write == 1); assert (!(flags & ~NBDKIT_FLAG_FUA)); debug ("%s: pwrite count=%" PRIu32 " offset=%" PRIu64 " fua=%d", b->name, count, offset, !!(flags & NBDKIT_FLAG_FUA)); @@ -312,8 +339,10 @@ backend_trim (struct backend *b, struct connection *conn, uint32_t count, uint64_t offset, uint32_t flags, int *err) { + struct b_conn_handle *h = &conn->handles[b->i]; int r; + assert (h->can_write == 1); assert (flags == 0); debug ("%s: trim count=%" PRIu32 " offset=%" PRIu64 " fua=%d", b->name, count, offset, !!(flags & NBDKIT_FLAG_FUA)); @@ -329,8 +358,10 @@ backend_zero (struct backend *b, struct connection *conn, uint32_t count, uint64_t offset, uint32_t flags, int *err) { + struct b_conn_handle *h = &conn->handles[b->i]; int r; + assert (h->can_write == 1); assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA))); debug ("%s: zero count=%" PRIu32 " offset=%" PRIu64 " may_trim=%d fua=%d", b->name, count, offset, !!(flags & NBDKIT_FLAG_MAY_TRIM), diff --git a/server/connections.c b/server/connections.c index 2a5150cd..7609e9a7 100644 --- a/server/connections.c +++ b/server/connections.c @@ -274,8 +274,9 @@ new_connection (int sockin, int sockout, int nworkers) return NULL; } conn->nr_handles = backend->i + 1; + memset (conn->handles, -1, conn->nr_handles * sizeof *conn->handles); for_each_backend (b) - conn->handles[b->i].exportsize = -1; + conn->handles[b->i].handle = NULL; conn->status = 1; conn->nworkers = nworkers; diff --git a/server/plugins.c b/server/plugins.c index 828b0dee..1dec4008 100644 --- a/server/plugins.c +++ b/server/plugins.c @@ -361,7 +361,7 @@ plugin_can_zero (struct backend *b, struct connection *conn) if (p->plugin.can_zero && p->plugin.can_zero (connection_get_handle (conn, 0)) == -1) return -1; - return plugin_can_write (b, conn); + return 1; } static int diff --git a/server/protocol-handshake.c b/server/protocol-handshake.c index 4d12b3dc..6d44deb5 100644 --- a/server/protocol-handshake.c +++ b/server/protocol-handshake.c @@ -52,37 +52,37 @@ protocol_compute_eflags (struct connection *conn, uint16_t *flags) uint16_t eflags = NBD_FLAG_HAS_FLAGS; int fl; + /* Check all flags even if they won't be advertised, to prime the + * cache and make later request validation easier. + */ fl = backend_can_write (backend, conn); if (fl == -1) return -1; - if (readonly || !fl) { + if (!fl) eflags |= NBD_FLAG_READ_ONLY; - conn->readonly = true; + + fl = backend_can_zero (backend, conn); + if (fl == -1) + return -1; + if (fl) { + eflags |= NBD_FLAG_SEND_WRITE_ZEROES; + conn->can_zero = true; } - if (!conn->readonly) { - fl = backend_can_zero (backend, conn); - if (fl == -1) - return -1; - if (fl) { - eflags |= NBD_FLAG_SEND_WRITE_ZEROES; - conn->can_zero = true; - } - fl = backend_can_trim (backend, conn); - if (fl == -1) - return -1; - if (fl) { - eflags |= NBD_FLAG_SEND_TRIM; - conn->can_trim = true; - } + fl = backend_can_trim (backend, conn); + if (fl == -1) + return -1; + if (fl) { + eflags |= NBD_FLAG_SEND_TRIM; + conn->can_trim = true; + } - fl = backend_can_fua (backend, conn); - if (fl == -1) - return -1; - if (fl) { - eflags |= NBD_FLAG_SEND_FUA; - conn->can_fua = true; - } + fl = backend_can_fua (backend, conn); + if (fl == -1) + return -1; + if (fl) { + eflags |= NBD_FLAG_SEND_FUA; + conn->can_fua = true; } fl = backend_can_flush (backend, conn); @@ -101,16 +101,14 @@ protocol_compute_eflags (struct connection *conn, uint16_t *flags) conn->is_rotational = true; } - /* multi-conn is useless if parallel connections are not allowed */ - if (backend->thread_model (backend) > - NBDKIT_THREAD_MODEL_SERIALIZE_CONNECTIONS) { - fl = backend_can_multi_conn (backend, conn); - if (fl == -1) - return -1; - if (fl) { - eflags |= NBD_FLAG_CAN_MULTI_CONN; - conn->can_multi_conn = true; - } + /* multi-conn is useless if parallel connections are not allowed. */ + fl = backend_can_multi_conn (backend, conn); + if (fl == -1) + return -1; + if (fl && (backend->thread_model (backend) > + NBDKIT_THREAD_MODEL_SERIALIZE_CONNECTIONS)) { + eflags |= NBD_FLAG_CAN_MULTI_CONN; + conn->can_multi_conn = true; } fl = backend_can_cache (backend, conn); @@ -122,10 +120,10 @@ protocol_compute_eflags (struct connection *conn, uint16_t *flags) conn->emulate_cache = fl == NBDKIT_CACHE_EMULATE; } - /* The result of this is not returned to callers here (or at any - * time during the handshake). However it makes sense to do it once - * per connection and store the result in the handle anyway. This - * protocol_compute_eflags function is a bit misnamed XXX. + /* The result of this is not directly advertised as part of the + * handshake, but priming the cache here makes BLOCK_STATUS handling + * not have to worry about errors, and makes test-layers easier to + * write. */ fl = backend_can_extents (backend, conn); if (fl == -1) -- 2.21.0
Eric Blake
2019-Aug-30 22:55 UTC
[Libguestfs] [nbdkit PATCH v2 2/2] server: Remember .open(readonly) status
The previous patch argued that globally affecting .can_write based on '-r' (the global 'readonly') prevents the ability for a filter to call next_open(nxdata, true) to purposefully write to the plugin, even while advertising .can_write=0 to the client. But it also demonstrated that when a filter passes false to next_open, we were still making needless probes into the plugin, for something the filter won't be using. This patch improves things to prime the .can_write cache according to .open. The same script from the previous patch now fails a lot faster under -r (since nozero is not a filter that changes readonly=false to true, the plugin now fails .can_zero quickly): $ /bin/time -f %e ./nbdkit -U - -r --filter=nozero sh script zeromode=notrim \ --run 'qemu-nbd --list -k $unixsocket >/dev/null' nbdkit: sh[1]: error: zeromode 'notrim' requires plugin zero support qemu-nbd: Failed to read initial magic: Unexpected end-of-file before all bytes were read Command exited with non-zero status 1 0.06 Signed-off-by: Eric Blake <eblake@redhat.com> --- server/internal.h | 3 +++ server/backend.c | 20 +++++++++++++++----- server/connections.c | 2 +- server/filters.c | 6 ++---- server/plugins.c | 2 -- 5 files changed, 21 insertions(+), 12 deletions(-) diff --git a/server/internal.h b/server/internal.h index a55d6406..fb197b9e 100644 --- a/server/internal.h +++ b/server/internal.h @@ -325,6 +325,9 @@ extern void backend_load (struct backend *b, const char *name, extern void backend_unload (struct backend *b, void (*unload) (void)) __attribute__((__nonnull__ (1))); +extern int backend_open (struct backend *b, struct connection *conn, + int readonly) + __attribute__((__nonnull__ (1, 2))); extern void backend_set_handle (struct backend *b, struct connection *conn, void *handle) __attribute__((__nonnull__ (1, 2 /* not 3 */))); diff --git a/server/backend.c b/server/backend.c index 749b1f15..ebdef63a 100644 --- a/server/backend.c +++ b/server/backend.c @@ -167,6 +167,20 @@ backend_unload (struct backend *b, void (*unload) (void)) free (b->name); } +int +backend_open (struct backend *b, struct connection *conn, int readonly) +{ + struct b_conn_handle *h = &conn->handles[b->i]; + + debug ("%s: open readonly=%d", b->name, readonly); + + assert (h->handle == NULL); + assert (h->can_write == -1); + if (readonly) + h->can_write = 0; + return b->open (b, conn, readonly); +} + void backend_set_handle (struct backend *b, struct connection *conn, void *handle) { @@ -195,12 +209,8 @@ backend_can_write (struct backend *b, struct connection *conn) debug ("%s: can_write", b->name); - if (h->can_write == -1) { - /* Special case for outermost backend when -r is in effect. */ - if (readonly && b == backend) - return h->can_write = 0; + if (h->can_write == -1) h->can_write = b->can_write (b, conn); - } return h->can_write; } diff --git a/server/connections.c b/server/connections.c index 7609e9a7..0c1f2413 100644 --- a/server/connections.c +++ b/server/connections.c @@ -149,7 +149,7 @@ _handle_single_connection (int sockin, int sockout) goto done; lock_request (conn); - r = backend->open (backend, conn, readonly); + r = backend_open (backend, conn, readonly); unlock_request (conn); if (r == -1) goto done; diff --git a/server/filters.c b/server/filters.c index 0e10816f..a8d7dd7c 100644 --- a/server/filters.c +++ b/server/filters.c @@ -195,7 +195,7 @@ next_open (void *nxdata, int readonly) { struct b_conn *b_conn = nxdata; - return b_conn->b->open (b_conn->b, b_conn->conn, readonly); + return backend_open (b_conn->b, b_conn->conn, readonly); } static int @@ -205,8 +205,6 @@ filter_open (struct backend *b, struct connection *conn, int readonly) struct b_conn nxdata = { .b = b->next, .conn = conn }; void *handle; - debug ("%s: open readonly=%d", b->name, readonly); - if (f->filter.open) { handle = f->filter.open (next_open, &nxdata, readonly); if (handle == NULL) @@ -215,7 +213,7 @@ filter_open (struct backend *b, struct connection *conn, int readonly) return 0; } else - return b->next->open (b->next, conn, readonly); + return backend_open (b->next, conn, readonly); } static void diff --git a/server/plugins.c b/server/plugins.c index 1dec4008..1c4b5f50 100644 --- a/server/plugins.c +++ b/server/plugins.c @@ -241,8 +241,6 @@ plugin_open (struct backend *b, struct connection *conn, int readonly) assert (connection_get_handle (conn, 0) == NULL); assert (p->plugin.open != NULL); - debug ("%s: open readonly=%d", b->name, readonly); - handle = p->plugin.open (readonly); if (!handle) return -1; -- 2.21.0
Richard W.M. Jones
2019-Aug-31 06:39 UTC
Re: [Libguestfs] [nbdkit PATCH v2 2/2] server: Remember .open(readonly) status
Yes I guess we should push these and test them extensively for 1.16. As for a filter which sets can_write = 0 but opens the underlying plugin for writes (the opposite of the cow filter) I wonder what such a filter would be doing? One could certainly write a readonly filter, but that would needlessly dupicate the -r option. Anything which makes reads have a side effect of writing sounds dangerous. Maybe there is some form of copy-on-read which would do this though? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Maybe Matching Threads
- [PATCH nbdkit 3/3] server: Remove explicit connection parameter, use TLS instead.
- [nbdkit PATCH v2 0/2] caching .can_write
- [nbdkit PATCH] filters: Make nxdata persistent
- [nbdkit PATCH 2/9] server: Consolidate common backend tasks into new backend.c
- [nbdkit PATCH 1/4] server: Fix regression for NBD_OPT_INFO before NBD_OPT_GO