This is a subset of the last half of the larger 9-patch series. The uncontroversial first half of that series is pushed, but here, I tried to reduce the size of the patches by splitting out some of the more complex changes, so that the rest of the changes remaining in the series are more mechanical. In turn, it forced me to write timing tests, which let me spot another spot where we are wasting time, so patch 2 is completely new. As we discussed on IRC, we're early enough in the 1.16 release cycle that I'll probably just push these when I'm happy with them, whether or not a review turns up something (we can always do followup patches). But forcing me to reread the patches for legibility has been a good exercise, because I really like how it forced me to spend a good hour on the commit message of patch 1 explaining things, and how patch 2 then fell out in 10 minutes of hacking to fix the problem identified in passing in the patch 1 commit message. Eric Blake (2): server: Cache per-connection can_write server: Remember .open(readonly) status server/internal.h | 5 ++- server/backend.c | 43 ++++++++++++++++++++- server/connections.c | 5 ++- server/filters.c | 6 +-- server/plugins.c | 4 +- server/protocol-handshake.c | 74 ++++++++++++++++++------------------- 6 files changed, 88 insertions(+), 49 deletions(-) -- 2.21.0
Eric Blake
2019-Aug-30  22:55 UTC
[Libguestfs] [nbdkit PATCH v2 1/2] server: Cache per-connection can_write
The calculation of eflags skips calling .can_zero, .can_trim, and
.can_fua if .can_write returns false, because 1) time spent calling
.can_zero would be wasted, and 2) plugin writers can be lazy and let
.can_zero return constant true, regardless of what they return for
.can_write, and are relying on nbdkit to not advertise or call .zero.
We don't want to regress on this (that is, it is easy to write a sh
plugin script that would be noticeably slower if we call into
.can_zero in spite of .can_write returning false).  But at the same
time, just as plugins can blindly return .can_zero==true in spite of
.can_write==false, it is also possible for a plugin to blindly return
.can_write==true in spite of the readonly=true argument to .open (that
is, when the command line option -r is in effect), yet we were NOT
skipping that call into .can_write (meaning we can write a sh plugin
to observe a wasted call to .can_write [1]).
Next, consider that although we don't have such a filter today, it is
feasible that someone could write a filter that exposes .can_write==0
to the client but which wants to open the underlying plugin read-write
and uses next_ops->pwrite and next_ops->zero (for example, a scenario
of exposing read-only contents of a single file extracted from an
underlying file system, where the underlying device containing the
file system must be opened read-write even though its file system will
only be mounted read-only, because the act of mounting triggers a
journal replay that modifies the underlying device, but not the file
being extracted).  This implies that -r must only affect the outermost
backend, and not every backend along the chain.
Then, to date, we've documented that such a filter merely need check
next_ops->can_zero before calling next_ops->zero, but do not actually
enforce that; furthermore, we are back to the case of what to do when
a plugin hard-codes can_zero to return true even when can_write is
false.  All filters are in-tree, so we could merely change our
documentation to require the filters to also check
next_ops->can_write, but it is nicer to just do the extra checking as
part of backend.c.
At any rate, adding more calls to can_write implies that it is
important enough to be cached; and that's easy to do using the same
ideas as already used in commit b9d4875b for .get_size.  (Other things
like can_zero and can_fua are also worth caching, but this patch is
already going to be big enough to save that one for later.)
The next consideration is when we should check can_write and can_zero.
If the first time we actually call into the plugin is during
next_ops->zero itself, we have to deal with any errors reported by
next_ops->can_zero - except that the backend interface has no *err
parameter for can_zero, and we are not ensured that there will be a
sane errno to report for the failed .zero.  It would be a lot nicer to
just assert during .zero that the cache for .can_write is already
populated from the handshake phase (eflags computation for the
outermost backend, and during a filter's .prepare for the next
backend), to prove that the caller is not using .zero except when the
backend is writable.  Thankfully, all filters which have both an
override of .can_zero and which call next_ops->can_zero do check
next_ops->can_zero during .prepare, so as long as we fix can_zero to
trigger the call to can_write under the hood, we don't have to modify
those filters.
Finally, with all these changes in place, we can drop the gating logic
during eflags computation as redundant with the gating logic in
backend.c; eflags computations can just blindly call all of the
underlying functions (and when the next patch adds even more can_FOO
caching, we'll never risk those values being uncached during request
handling).  Likewise, plugins.c no longer has to fall back to
.can_write during its computation of .can_zero (because we have
ensured that .can_zero won't be reached).
[1] Demo time:
$ cat script
case "$1" in
    get_size) echo 1m;;
    can_write) echo can_write >/dev/tty; sleep 1; ${DIE} ;;
    can_zero) echo can_zero >/dev/tty; sleep 2;;
    *) exit 2 ;;
esac
Pre-patch:
$ /bin/time -f %e ./nbdkit -U - -r sh script \
  --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
1.10
Wasted call to .can_write, but at least .can_zero was skipped
$ /bin/time -f %e ./nbdkit -U - sh script \
  --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
can_zero
can_write
4.10
Without -r, eflags computation calls .can_write, then .can_zero; but
then plugin_can_zero calls .can_write again.
$ DIE='exit 3' /bin/time -f %e ./nbdkit -U - sh script \
  --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
1.06
But when .can_write fails, we see how .can_zero was skipped.
$ /bin/time -f %e ./nbdkit -U - -r --filter=nozero sh script zeromode=notrim \
  --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_zero
can_write
can_write
4.11
The nozero filter probes next_ops->can_zero unconditionally during
.prepare (here, if we had the theoretical filter on top of nozero that
advertises no write support but uses writes under the hood, then this
is actually good - although perhaps it is worth a followup patch to
filters to be smarter during .prepare if .open was passed
readonly=true), then plugin_can_zero calls .can_write.  The second
.can_write comes from eflags computation (since the nozero filter did
not intercept that one).
$ DIE='exit 3' /bin/time -f %e ./nbdkit -U - --filter=nozero sh script \
  zeromode=notrim --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_zero
can_write
nbdkit: sh[1]: error: zeromode 'notrim' requires plugin zero support
qemu-nbd: Failed to read initial magic: Unexpected end-of-file before all bytes
were read
Command exited with non-zero status 1
3.07
A bit faster (we didn't get to eflags computation because of the
nozero failure), but still long.
With this applied:
$ /bin/time -f %e ./nbdkit -U - -r sh script \
  --run 'qemu-nbd --list -k $unixsocket >/dev/null'
0.07
Yay - no time wasted!
$ /bin/time -f %e ./nbdkit -U - -r --filter=nozero sh script zeromode=notrim \
  --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
can_zero
3.10
Note the act of calling .can_zero from nozero's .prepare forces
.can_write to happen first, and because it was cached, we don't see it
again even during eflags computation.
$ DIE='exit 3' /bin/time -f %e ./nbdkit -U - --filter=nozero sh script \
  zeromode=notrim --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
nbdkit: sh[1]: error: zeromode 'notrim' requires plugin zero support
qemu-nbd: Failed to read initial magic: Unexpected end-of-file before all bytes
were read
Command exited with non-zero status 1
1.06
We still get the nozero failure, but this time much faster (without
wasting time on the .can_zero call, since the .can_write failure
already constrains us).
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 server/internal.h           |  2 +-
 server/backend.c            | 33 ++++++++++++++++-
 server/connections.c        |  3 +-
 server/plugins.c            |  2 +-
 server/protocol-handshake.c | 74 ++++++++++++++++++-------------------
 5 files changed, 72 insertions(+), 42 deletions(-)
diff --git a/server/internal.h b/server/internal.h
index 4db55315..a55d6406 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -155,6 +155,7 @@ struct b_conn_handle {
   void *handle;
   uint64_t exportsize;
+  int can_write;
 };
 struct connection {
@@ -172,7 +173,6 @@ struct connection {
   uint32_t cflags;
   uint16_t eflags;
-  bool readonly;
   bool can_flush;
   bool is_rotational;
   bool can_trim;
diff --git a/server/backend.c b/server/backend.c
index 52d53f80..749b1f15 100644
--- a/server/backend.c
+++ b/server/backend.c
@@ -174,6 +174,8 @@ backend_set_handle (struct backend *b, struct connection
*conn, void *handle)
   conn->handles[b->i].handle = handle;
 }
+/* Wrappers for all callbacks in a filter's struct nbdkit_next_ops. */
+
 int64_t
 backend_get_size (struct backend *b, struct connection *conn)
 {
@@ -189,9 +191,17 @@ backend_get_size (struct backend *b, struct connection
*conn)
 int
 backend_can_write (struct backend *b, struct connection *conn)
 {
+  struct b_conn_handle *h = &conn->handles[b->i];
+
   debug ("%s: can_write", b->name);
-  return b->can_write (b, conn);
+  if (h->can_write == -1) {
+    /* Special case for outermost backend when -r is in effect. */
+    if (readonly && b == backend)
+      return h->can_write = 0;
+    h->can_write = b->can_write (b, conn);
+  }
+  return h->can_write;
 }
 int
@@ -213,16 +223,26 @@ backend_is_rotational (struct backend *b, struct
connection *conn)
 int
 backend_can_trim (struct backend *b, struct connection *conn)
 {
+  int r;
+
   debug ("%s: can_trim", b->name);
+  r = backend_can_write (b, conn);
+  if (r != 1)
+    return r;
   return b->can_trim (b, conn);
 }
 int
 backend_can_zero (struct backend *b, struct connection *conn)
 {
+  int r;
+
   debug ("%s: can_zero", b->name);
+  r = backend_can_write (b, conn);
+  if (r != 1)
+    return r;
   return b->can_zero (b, conn);
 }
@@ -237,8 +257,13 @@ backend_can_extents (struct backend *b, struct connection
*conn)
 int
 backend_can_fua (struct backend *b, struct connection *conn)
 {
+  int r;
+
   debug ("%s: can_fua", b->name);
+  r = backend_can_write (b, conn);
+  if (r != 1)
+    return r; /* Relies on 0 == NBDKIT_FUA_NONE */
   return b->can_fua (b, conn);
 }
@@ -280,8 +305,10 @@ backend_pwrite (struct backend *b, struct connection *conn,
                 const void *buf, uint32_t count, uint64_t offset,
                 uint32_t flags, int *err)
 {
+  struct b_conn_handle *h = &conn->handles[b->i];
   int r;
+  assert (h->can_write == 1);
   assert (!(flags & ~NBDKIT_FLAG_FUA));
   debug ("%s: pwrite count=%" PRIu32 " offset=%" PRIu64
" fua=%d",
          b->name, count, offset, !!(flags & NBDKIT_FLAG_FUA));
@@ -312,8 +339,10 @@ backend_trim (struct backend *b, struct connection *conn,
               uint32_t count, uint64_t offset, uint32_t flags,
               int *err)
 {
+  struct b_conn_handle *h = &conn->handles[b->i];
   int r;
+  assert (h->can_write == 1);
   assert (flags == 0);
   debug ("%s: trim count=%" PRIu32 " offset=%" PRIu64
" fua=%d",
          b->name, count, offset, !!(flags & NBDKIT_FLAG_FUA));
@@ -329,8 +358,10 @@ backend_zero (struct backend *b, struct connection *conn,
               uint32_t count, uint64_t offset, uint32_t flags,
               int *err)
 {
+  struct b_conn_handle *h = &conn->handles[b->i];
   int r;
+  assert (h->can_write == 1);
   assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA)));
   debug ("%s: zero count=%" PRIu32 " offset=%" PRIu64
" may_trim=%d fua=%d",
          b->name, count, offset, !!(flags & NBDKIT_FLAG_MAY_TRIM),
diff --git a/server/connections.c b/server/connections.c
index 2a5150cd..7609e9a7 100644
--- a/server/connections.c
+++ b/server/connections.c
@@ -274,8 +274,9 @@ new_connection (int sockin, int sockout, int nworkers)
     return NULL;
   }
   conn->nr_handles = backend->i + 1;
+  memset (conn->handles, -1, conn->nr_handles * sizeof
*conn->handles);
   for_each_backend (b)
-    conn->handles[b->i].exportsize = -1;
+    conn->handles[b->i].handle = NULL;
   conn->status = 1;
   conn->nworkers = nworkers;
diff --git a/server/plugins.c b/server/plugins.c
index 828b0dee..1dec4008 100644
--- a/server/plugins.c
+++ b/server/plugins.c
@@ -361,7 +361,7 @@ plugin_can_zero (struct backend *b, struct connection *conn)
   if (p->plugin.can_zero &&
       p->plugin.can_zero (connection_get_handle (conn, 0)) == -1)
     return -1;
-  return plugin_can_write (b, conn);
+  return 1;
 }
 static int
diff --git a/server/protocol-handshake.c b/server/protocol-handshake.c
index 4d12b3dc..6d44deb5 100644
--- a/server/protocol-handshake.c
+++ b/server/protocol-handshake.c
@@ -52,37 +52,37 @@ protocol_compute_eflags (struct connection *conn, uint16_t
*flags)
   uint16_t eflags = NBD_FLAG_HAS_FLAGS;
   int fl;
+  /* Check all flags even if they won't be advertised, to prime the
+   * cache and make later request validation easier.
+   */
   fl = backend_can_write (backend, conn);
   if (fl == -1)
     return -1;
-  if (readonly || !fl) {
+  if (!fl)
     eflags |= NBD_FLAG_READ_ONLY;
-    conn->readonly = true;
+
+  fl = backend_can_zero (backend, conn);
+  if (fl == -1)
+    return -1;
+  if (fl) {
+    eflags |= NBD_FLAG_SEND_WRITE_ZEROES;
+    conn->can_zero = true;
   }
-  if (!conn->readonly) {
-    fl = backend_can_zero (backend, conn);
-    if (fl == -1)
-      return -1;
-    if (fl) {
-      eflags |= NBD_FLAG_SEND_WRITE_ZEROES;
-      conn->can_zero = true;
-    }
-    fl = backend_can_trim (backend, conn);
-    if (fl == -1)
-      return -1;
-    if (fl) {
-      eflags |= NBD_FLAG_SEND_TRIM;
-      conn->can_trim = true;
-    }
+  fl = backend_can_trim (backend, conn);
+  if (fl == -1)
+    return -1;
+  if (fl) {
+    eflags |= NBD_FLAG_SEND_TRIM;
+    conn->can_trim = true;
+  }
-    fl = backend_can_fua (backend, conn);
-    if (fl == -1)
-      return -1;
-    if (fl) {
-      eflags |= NBD_FLAG_SEND_FUA;
-      conn->can_fua = true;
-    }
+  fl = backend_can_fua (backend, conn);
+  if (fl == -1)
+    return -1;
+  if (fl) {
+    eflags |= NBD_FLAG_SEND_FUA;
+    conn->can_fua = true;
   }
   fl = backend_can_flush (backend, conn);
@@ -101,16 +101,14 @@ protocol_compute_eflags (struct connection *conn, uint16_t
*flags)
     conn->is_rotational = true;
   }
-  /* multi-conn is useless if parallel connections are not allowed */
-  if (backend->thread_model (backend) >
-      NBDKIT_THREAD_MODEL_SERIALIZE_CONNECTIONS) {
-    fl = backend_can_multi_conn (backend, conn);
-    if (fl == -1)
-      return -1;
-    if (fl) {
-      eflags |= NBD_FLAG_CAN_MULTI_CONN;
-      conn->can_multi_conn = true;
-    }
+  /* multi-conn is useless if parallel connections are not allowed. */
+  fl = backend_can_multi_conn (backend, conn);
+  if (fl == -1)
+    return -1;
+  if (fl && (backend->thread_model (backend) >
+             NBDKIT_THREAD_MODEL_SERIALIZE_CONNECTIONS)) {
+    eflags |= NBD_FLAG_CAN_MULTI_CONN;
+    conn->can_multi_conn = true;
   }
   fl = backend_can_cache (backend, conn);
@@ -122,10 +120,10 @@ protocol_compute_eflags (struct connection *conn, uint16_t
*flags)
     conn->emulate_cache = fl == NBDKIT_CACHE_EMULATE;
   }
-  /* The result of this is not returned to callers here (or at any
-   * time during the handshake).  However it makes sense to do it once
-   * per connection and store the result in the handle anyway.  This
-   * protocol_compute_eflags function is a bit misnamed XXX.
+  /* The result of this is not directly advertised as part of the
+   * handshake, but priming the cache here makes BLOCK_STATUS handling
+   * not have to worry about errors, and makes test-layers easier to
+   * write.
    */
   fl = backend_can_extents (backend, conn);
   if (fl == -1)
-- 
2.21.0
Eric Blake
2019-Aug-30  22:55 UTC
[Libguestfs] [nbdkit PATCH v2 2/2] server: Remember .open(readonly) status
The previous patch argued that globally affecting .can_write based on
'-r' (the global 'readonly') prevents the ability for a filter
to call
next_open(nxdata, true) to purposefully write to the plugin, even
while advertising .can_write=0 to the client.  But it also
demonstrated that when a filter passes false to next_open, we were
still making needless probes into the plugin, for something the filter
won't be using.
This patch improves things to prime the .can_write cache according to
.open.  The same script from the previous patch now fails a lot faster
under -r (since nozero is not a filter that changes readonly=false to
true, the plugin now fails .can_zero quickly):
$ /bin/time -f %e ./nbdkit -U - -r --filter=nozero sh script zeromode=notrim \
  --run 'qemu-nbd --list -k $unixsocket >/dev/null'
nbdkit: sh[1]: error: zeromode 'notrim' requires plugin zero support
qemu-nbd: Failed to read initial magic: Unexpected end-of-file before all bytes
were read
Command exited with non-zero status 1
0.06
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 server/internal.h    |  3 +++
 server/backend.c     | 20 +++++++++++++++-----
 server/connections.c |  2 +-
 server/filters.c     |  6 ++----
 server/plugins.c     |  2 --
 5 files changed, 21 insertions(+), 12 deletions(-)
diff --git a/server/internal.h b/server/internal.h
index a55d6406..fb197b9e 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -325,6 +325,9 @@ extern void backend_load (struct backend *b, const char
*name,
 extern void backend_unload (struct backend *b, void (*unload) (void))
   __attribute__((__nonnull__ (1)));
+extern int backend_open (struct backend *b, struct connection *conn,
+                         int readonly)
+  __attribute__((__nonnull__ (1, 2)));
 extern void backend_set_handle (struct backend *b, struct connection *conn,
                                 void *handle)
   __attribute__((__nonnull__ (1, 2 /* not 3 */)));
diff --git a/server/backend.c b/server/backend.c
index 749b1f15..ebdef63a 100644
--- a/server/backend.c
+++ b/server/backend.c
@@ -167,6 +167,20 @@ backend_unload (struct backend *b, void (*unload) (void))
   free (b->name);
 }
+int
+backend_open (struct backend *b, struct connection *conn, int readonly)
+{
+  struct b_conn_handle *h = &conn->handles[b->i];
+
+  debug ("%s: open readonly=%d", b->name, readonly);
+
+  assert (h->handle == NULL);
+  assert (h->can_write == -1);
+  if (readonly)
+    h->can_write = 0;
+  return b->open (b, conn, readonly);
+}
+
 void
 backend_set_handle (struct backend *b, struct connection *conn, void *handle)
 {
@@ -195,12 +209,8 @@ backend_can_write (struct backend *b, struct connection
*conn)
   debug ("%s: can_write", b->name);
-  if (h->can_write == -1) {
-    /* Special case for outermost backend when -r is in effect. */
-    if (readonly && b == backend)
-      return h->can_write = 0;
+  if (h->can_write == -1)
     h->can_write = b->can_write (b, conn);
-  }
   return h->can_write;
 }
diff --git a/server/connections.c b/server/connections.c
index 7609e9a7..0c1f2413 100644
--- a/server/connections.c
+++ b/server/connections.c
@@ -149,7 +149,7 @@ _handle_single_connection (int sockin, int sockout)
     goto done;
   lock_request (conn);
-  r = backend->open (backend, conn, readonly);
+  r = backend_open (backend, conn, readonly);
   unlock_request (conn);
   if (r == -1)
     goto done;
diff --git a/server/filters.c b/server/filters.c
index 0e10816f..a8d7dd7c 100644
--- a/server/filters.c
+++ b/server/filters.c
@@ -195,7 +195,7 @@ next_open (void *nxdata, int readonly)
 {
   struct b_conn *b_conn = nxdata;
-  return b_conn->b->open (b_conn->b, b_conn->conn, readonly);
+  return backend_open (b_conn->b, b_conn->conn, readonly);
 }
 static int
@@ -205,8 +205,6 @@ filter_open (struct backend *b, struct connection *conn, int
readonly)
   struct b_conn nxdata = { .b = b->next, .conn = conn };
   void *handle;
-  debug ("%s: open readonly=%d", b->name, readonly);
-
   if (f->filter.open) {
     handle = f->filter.open (next_open, &nxdata, readonly);
     if (handle == NULL)
@@ -215,7 +213,7 @@ filter_open (struct backend *b, struct connection *conn, int
readonly)
     return 0;
   }
   else
-    return b->next->open (b->next, conn, readonly);
+    return backend_open (b->next, conn, readonly);
 }
 static void
diff --git a/server/plugins.c b/server/plugins.c
index 1dec4008..1c4b5f50 100644
--- a/server/plugins.c
+++ b/server/plugins.c
@@ -241,8 +241,6 @@ plugin_open (struct backend *b, struct connection *conn, int
readonly)
   assert (connection_get_handle (conn, 0) == NULL);
   assert (p->plugin.open != NULL);
-  debug ("%s: open readonly=%d", b->name, readonly);
-
   handle = p->plugin.open (readonly);
   if (!handle)
     return -1;
-- 
2.21.0
Richard W.M. Jones
2019-Aug-31  06:39 UTC
Re: [Libguestfs] [nbdkit PATCH v2 2/2] server: Remember .open(readonly) status
Yes I guess we should push these and test them extensively for 1.16. As for a filter which sets can_write = 0 but opens the underlying plugin for writes (the opposite of the cow filter) I wonder what such a filter would be doing? One could certainly write a readonly filter, but that would needlessly dupicate the -r option. Anything which makes reads have a side effect of writing sounds dangerous. Maybe there is some form of copy-on-read which would do this though? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Maybe Matching Threads
- [nbdkit PATCH 0/9] can_FOO caching, more filter validation
- cross-project patches: Add NBD Fast Zero support
- [PATCH nbdkit v2 0/3] server: Remove explicit connection parameter.
- [PATCH nbdkit 0/3] server: Remove explicit connection parameter.
- [nbdkit PATCH] nozero: Add notrim mode