thr3ads.net - Libguestfs - [Libguestfs] [nbdkit PATCH v3 00/15] Add FUA support to nbdkit [Mar 2018]

If this information is useful, please help other people find it:
Share via:

Eric Blake

2018-Mar-08 23:02 UTC

[Libguestfs] [nbdkit PATCH v3 00/15] Add FUA support to nbdkit

After more than a month since v2 [1], I've finally got my FUA
support series polished.  This is all of my outstanding patches,
even though some of them were originally posted in separate
threads from the original FUA post [2], [3]

[1] https://www.redhat.com/archives/libguestfs/2018-January/msg00113.html
[2] https://www.redhat.com/archives/libguestfs/2018-January/msg00219.html
[3] https://www.redhat.com/archives/libguestfs/2018-February/msg00000.html

Still to go: figure out how we want to expose flags through the
language bindings (there, we can break API if needed, but hopefully
we can instead exploit languages with function-overloading and/or
optional parameters to make it feel like a natural extension).

This exercise has been good; I've found a couple of tweaks needed
in qemu for corner cases explored while writing these nbdkit
patches.  Also, the qemu list reminded me that even though the
NBD spec says FUA on trim is required to wait until the trim
effects have hit the disk, no sane client will ever send trim+FUA
because trim is advisory, and the client has no sane way to tell
if trim had an effect in the first place.

It feels pretty much like a rewrite, according to:

$ git backport-diff -u fua-v2 -r origin..
Key:
[----] : patches are identical
[####] : number of functional differences between upstream/downstream patch
[down] : patch is downstream-only
The flags [FC] indicate (F)unctional and (C)ontextual differences, respectively

001/15:[down] 'src: Let internal.h handle common includes'
002/15:[down] 'backend: Rework internal error return semantics'
003/15:[down] 'filters: Adjust callback API for flags/errors'
004/15:[0133] [FC] 'filters: Add log filter'
005/15:[0157] [FC] 'filters: Add blocksize filter'
006/15:[down] 'backend: Add .can_zero/.can_fua helpers'
007/15:[down] 'filters: Expose new .can_zero callback'
008/15:[0125] [FC] 'filters: Add nozero filter'
009/15:[down] 'filters: Expose new .can_fua callback'
010/15:[down] 'filters: Add fua filter'
011/15:[down] 'plugins: Expose new FUA callbacks'
012/15:[0043] [FC] 'nbd: Wire up FUA flag passthrough'
013/15:[down] 'null: Wire up FUA flag support'
014/15:[down] 'todo: Mention possibility of caching .can_FOO callbacks'
015/15:[0130] [FC] 'RFC: plugins: Add back-compat for new plugin with old
nbdkit'

Eric Blake (15):
  src: Let internal.h handle common includes
  backend: Rework internal error return semantics
  filters: Adjust callback API for flags/errors
  filters: Add log filter
  filters: Add blocksize filter
  backend: Add .can_zero/.can_fua helpers
  filters: Expose new .can_zero callback
  filters: Add nozero filter
  filters: Expose new .can_fua callback
  filters: Add fua filter
  plugins: Expose new FUA callbacks
  nbd: Wire up FUA flag passthrough
  null: Wire up FUA flag support
  todo: Mention possibility of caching .can_FOO callbacks
  RFC: plugins: Add back-compat for new plugin with old nbdkit

 TODO                                          |  22 +-
 docs/nbdkit-filter.pod                        | 173 ++++++++++--
 docs/nbdkit-plugin.pod                        | 151 +++++++++--
 docs/nbdkit.pod                               |   9 +-
 filters/blocksize/nbdkit-blocksize-filter.pod | 141 ++++++++++
 filters/fua/nbdkit-fua-filter.pod             | 119 +++++++++
 filters/log/nbdkit-log-filter.pod             | 115 ++++++++
 filters/nozero/nbdkit-nozero-filter.pod       |  99 +++++++
 configure.ac                                  |   6 +-
 include/nbdkit-common.h                       |   7 +
 include/nbdkit-filter.h                       |  36 ++-
 include/nbdkit-plugin.h                       |  89 ++++++-
 src/internal.h                                |  24 +-
 src/cleanup.c                                 |   1 -
 src/connections.c                             |  71 +++--
 src/errors.c                                  |   1 -
 src/filters.c                                 | 137 ++++++----
 src/main.c                                    |   1 -
 src/plugins.c                                 | 216 ++++++++++-----
 src/sockets.c                                 |   1 -
 src/threadlocal.c                             |   1 -
 src/utils.c                                   |   1 -
 plugins/nbd/nbd.c                             |  42 ++-
 plugins/null/null.c                           |  42 ++-
 filters/blocksize/blocksize.c                 | 370 ++++++++++++++++++++++++++
 filters/cache/cache.c                         |  87 ++++--
 filters/cow/cow.c                             |  55 +++-
 filters/delay/delay.c                         |  15 +-
 filters/fua/fua.c                             | 251 +++++++++++++++++
 filters/log/log.c                             | 366 +++++++++++++++++++++++++
 filters/nozero/nozero.c                       | 106 ++++++++
 filters/offset/offset.c                       |  20 +-
 filters/partition/partition.c                 |  26 +-
 filters/Makefile.am                           |   4 +
 filters/blocksize/Makefile.am                 |  62 +++++
 filters/fua/Makefile.am                       |  62 +++++
 filters/log/Makefile.am                       |  62 +++++
 filters/nozero/Makefile.am                    |  62 +++++
 tests/Makefile.am                             |  16 ++
 tests/test-blocksize.sh                       | 156 +++++++++++
 tests/test-fua.sh                             | 153 +++++++++++
 tests/test-log.sh                             |  88 ++++++
 tests/test-nozero.sh                          | 145 ++++++++++
 43 files changed, 3318 insertions(+), 293 deletions(-)
 create mode 100644 filters/blocksize/nbdkit-blocksize-filter.pod
 create mode 100644 filters/fua/nbdkit-fua-filter.pod
 create mode 100644 filters/log/nbdkit-log-filter.pod
 create mode 100644 filters/nozero/nbdkit-nozero-filter.pod
 create mode 100644 filters/blocksize/blocksize.c
 create mode 100644 filters/fua/fua.c
 create mode 100644 filters/log/log.c
 create mode 100644 filters/nozero/nozero.c
 create mode 100644 filters/blocksize/Makefile.am
 create mode 100644 filters/fua/Makefile.am
 create mode 100644 filters/log/Makefile.am
 create mode 100644 filters/nozero/Makefile.am
 create mode 100755 tests/test-blocksize.sh
 create mode 100755 tests/test-fua.sh
 create mode 100755 tests/test-log.sh
 create mode 100755 tests/test-nozero.sh

-- 
2.14.3

Eric Blake

2018-Mar-08 23:02 UTC

head link

[Libguestfs] [nbdkit PATCH v3 01/15] src: Let internal.h handle common includes

There is no need to have all of the .c files under src/ include
nbdkit-plugin.h, since internal.h already handles this.  Also,
this will make it easier when an upcoming patch updates the
public header to respond to a user-defined selection between
API levels; we want all of the files in src/ to see the latest
API, but it is easier to do this from the central location of
internal.h instead of repeated in each .c file.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 src/cleanup.c     | 1 -
 src/connections.c | 1 -
 src/errors.c      | 1 -
 src/filters.c     | 1 -
 src/main.c        | 1 -
 src/plugins.c     | 1 -
 src/sockets.c     | 1 -
 src/threadlocal.c | 1 -
 src/utils.c       | 1 -
 9 files changed, 9 deletions(-)

diff --git a/src/cleanup.c b/src/cleanup.c
index 3f7f8af..c68233d 100644
--- a/src/cleanup.c
+++ b/src/cleanup.c
@@ -38,7 +38,6 @@
 #include <stdarg.h>
 #include <string.h>

-#include "nbdkit-plugin.h"
 #include "internal.h"

 void
diff --git a/src/connections.c b/src/connections.c
index 75c2c2d..c25d316 100644
--- a/src/connections.c
+++ b/src/connections.c
@@ -46,7 +46,6 @@

 #include <pthread.h>

-#include "nbdkit-plugin.h"
 #include "internal.h"
 #include "protocol.h"

diff --git a/src/errors.c b/src/errors.c
index dd3096c..afc5008 100644
--- a/src/errors.c
+++ b/src/errors.c
@@ -39,7 +39,6 @@
 #include <string.h>
 #include <errno.h>

-#include "nbdkit-plugin.h"
 #include "internal.h"

 /* Called with flockfile (stderr) taken. */
diff --git a/src/filters.c b/src/filters.c
index 40c4913..6949548 100644
--- a/src/filters.c
+++ b/src/filters.c
@@ -43,7 +43,6 @@

 #include <dlfcn.h>

-#include "nbdkit-filter.h"
 #include "internal.h"

 /* We extend the generic backend struct with extra fields relating
diff --git a/src/main.c b/src/main.c
index 67cbbde..d2e5674 100644
--- a/src/main.c
+++ b/src/main.c
@@ -57,7 +57,6 @@

 #include <dlfcn.h>

-#include "nbdkit-plugin.h"
 #include "internal.h"

 #define FIRST_SOCKET_ACTIVATION_FD 3 /* defined by systemd ABI */
diff --git a/src/plugins.c b/src/plugins.c
index 1b0816c..280887d 100644
--- a/src/plugins.c
+++ b/src/plugins.c
@@ -43,7 +43,6 @@

 #include <dlfcn.h>

-#include "nbdkit-plugin.h"
 #include "internal.h"

 /* Maximum read or write request that we will handle. */
diff --git a/src/sockets.c b/src/sockets.c
index 0ea40a1..e3fa73d 100644
--- a/src/sockets.c
+++ b/src/sockets.c
@@ -52,7 +52,6 @@

 #include <pthread.h>

-#include "nbdkit-plugin.h"
 #include "internal.h"

 static void
diff --git a/src/threadlocal.c b/src/threadlocal.c
index 267f348..be34d39 100644
--- a/src/threadlocal.c
+++ b/src/threadlocal.c
@@ -42,7 +42,6 @@

 #include <pthread.h>

-#include "nbdkit-plugin.h"
 #include "internal.h"

 /* Note that most thread-local storage data is informational, used for
diff --git a/src/utils.c b/src/utils.c
index 0083370..63adbf1 100644
--- a/src/utils.c
+++ b/src/utils.c
@@ -43,7 +43,6 @@
 #include <termios.h>
 #include <errno.h>

-#include "nbdkit-plugin.h"
 #include "internal.h"

 char *
-- 
2.14.3

Eric Blake

2018-Mar-08 23:02 UTC

head link

[Libguestfs] [nbdkit PATCH v3 02/15] backend: Rework internal error return semantics

Previously, we let a plugin set an error in either thread-local
storage (nbdkit_set_error()) or errno, then connections.c would
decode which error to use.  But with filters in the mix, it is
very difficult for a filter to know what error was set by the
plugin (particularly since nbdkit_set_error() has no public
counterpart for reading the thread-local storage).  What's more,
if a filter does any non-trivial processing after calling into
next_ops, it is very probable that errno might be corrupted,
which would affect the error returned by a plugin that relied
on errno but not the error stored in thread-local storage.

Furthermore, in connections.c, handle_request() already has to
return a positive error value, as recv_request_send_reply()
then uses that value for determining what to send over the wire
to the client; so getting to the positive value sooner rather
than later can simplify the call stack.

We make the change in two steps: this patch changes the internal
interface to return the error value in a separate parameter,
and guarantees that filters see a valid value in errno; then a
later patch adjusts the public filter API to be more useful and
to more closely mirror the internal API.  The err parameter must
be set if the return is -1, and is ignored otherwise; for now,
it is only set by plugins.c (as the filter code needs to expose
it to the public filter callback API before it can be usefully
manipulated).  With the change in decoding location, the backend
interface no longer needs an .errno_is_preserved() callback.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v3: split into two parts, with error value sent by parameter
rather than by return value; more commit message details; fix
pod rendering errors
v2: more wording tweaks to documentation
---
 src/internal.h    |  16 ++++---
 src/connections.c |  39 +++++++----------
 src/filters.c     | 128 ++++++++++++++++++++++++++++++++++--------------------
 src/plugins.c     |  83 +++++++++++++++++++++--------------
 4 files changed, 157 insertions(+), 109 deletions(-)

diff --git a/src/internal.h b/src/internal.h
index 3cbfde5..be79e10 100644
--- a/src/internal.h
+++ b/src/internal.h
@@ -176,7 +176,6 @@ struct backend {
   void (*dump_fields) (struct backend *);
   void (*config) (struct backend *, const char *key, const char *value);
   void (*config_complete) (struct backend *);
-  int (*errno_is_preserved) (struct backend *);
   int (*open) (struct backend *, struct connection *conn, int readonly);
   int (*prepare) (struct backend *, struct connection *conn);
   int (*finalize) (struct backend *, struct connection *conn);
@@ -186,11 +185,16 @@ struct backend {
   int (*can_flush) (struct backend *, struct connection *conn);
   int (*is_rotational) (struct backend *, struct connection *conn);
   int (*can_trim) (struct backend *, struct connection *conn);
-  int (*pread) (struct backend *, struct connection *conn, void *buf, uint32_t
count, uint64_t offset, uint32_t flags);
-  int (*pwrite) (struct backend *, struct connection *conn, const void *buf,
uint32_t count, uint64_t offset, uint32_t flags);
-  int (*flush) (struct backend *, struct connection *conn, uint32_t flags);
-  int (*trim) (struct backend *, struct connection *conn, uint32_t count,
uint64_t offset, uint32_t flags);
-  int (*zero) (struct backend *, struct connection *conn, uint32_t count,
uint64_t offset, uint32_t flags);
+  int (*pread) (struct backend *, struct connection *conn, void *buf,
+                uint32_t count, uint64_t offset, uint32_t flags, int *err);
+  int (*pwrite) (struct backend *, struct connection *conn, const void *buf,
+                 uint32_t count, uint64_t offset, uint32_t flags, int *err);
+  int (*flush) (struct backend *, struct connection *conn, uint32_t flags,
+                int *err);
+  int (*trim) (struct backend *, struct connection *conn, uint32_t count,
+               uint64_t offset, uint32_t flags, int *err);
+  int (*zero) (struct backend *, struct connection *conn, uint32_t count,
+               uint64_t offset, uint32_t flags, int *err);
 };

 /* plugins.c */
diff --git a/src/connections.c b/src/connections.c
index c25d316..03f59f7 100644
--- a/src/connections.c
+++ b/src/connections.c
@@ -43,6 +43,7 @@
 #include <endian.h>
 #include <sys/types.h>
 #include <stddef.h>
+#include <assert.h>

 #include <pthread.h>

@@ -911,18 +912,6 @@ validate_request (struct connection *conn,
   return true;                     /* Command validates. */
 }

-/* Grab the appropriate error value.
- */
-static int
-get_error (struct connection *conn)
-{
-  int ret = threadlocal_get_error ();
-
-  if (!ret && backend->errno_is_preserved (backend))
-    ret = errno;
-  return ret ? ret : EIO;
-}
-
 /* This is called with the request lock held to actually execute the
  * request (by calling the plugin).  Note that the request fields have
  * been validated already in 'validate_request' so we don't have to
@@ -940,34 +929,35 @@ handle_request (struct connection *conn,
 {
   uint32_t f = 0;
   bool fua = conn->can_flush && (flags & NBD_CMD_FLAG_FUA);
+  int err = 0;

-  /* The plugin should call nbdkit_set_error() to request a particular
-     error, otherwise we fallback to errno or EIO. */
+  /* Clear the error, so that we know if the plugin calls
+   * nbdkit_set_error() or relied on errno.  */
   threadlocal_set_error (0);

   switch (cmd) {
   case NBD_CMD_READ:
-    if (backend->pread (backend, conn, buf, count, offset, 0) == -1)
-      return get_error (conn);
+    if (backend->pread (backend, conn, buf, count, offset, 0, &err) ==
-1)
+      return err;
     break;

   case NBD_CMD_WRITE:
     if (fua)
       f |= NBDKIT_FLAG_FUA;
-    if (backend->pwrite (backend, conn, buf, count, offset, f) == -1)
-      return get_error (conn);
+    if (backend->pwrite (backend, conn, buf, count, offset, f, &err) ==
-1)
+      return err;
     break;

   case NBD_CMD_FLUSH:
-    if (backend->flush (backend, conn, 0) == -1)
-      return get_error (conn);
+    if (backend->flush (backend, conn, 0, &err) == -1)
+      return err;
     break;

   case NBD_CMD_TRIM:
     if (fua)
       f |= NBDKIT_FLAG_FUA;
-    if (backend->trim (backend, conn, count, offset, f) == -1)
-      return get_error (conn);
+    if (backend->trim (backend, conn, count, offset, f, &err) == -1)
+      return err;
     break;

   case NBD_CMD_WRITE_ZEROES:
@@ -975,8 +965,8 @@ handle_request (struct connection *conn,
       f |= NBDKIT_FLAG_MAY_TRIM;
     if (fua)
       f |= NBDKIT_FLAG_FUA;
-    if (backend->zero (backend, conn, count, offset, f) == -1)
-      return get_error (conn);
+    if (backend->zero (backend, conn, count, offset, f, &err) == -1)
+      return err;
     break;

   default:
@@ -1129,6 +1119,7 @@ recv_request_send_reply (struct connection *conn)
   else {
     lock_request (conn);
     error = handle_request (conn, cmd, flags, offset, count, buf);
+    assert ((int) error >= 0);
     unlock_request (conn);
   }

diff --git a/src/filters.c b/src/filters.c
index 6949548..ae5edfb 100644
--- a/src/filters.c
+++ b/src/filters.c
@@ -106,14 +106,6 @@ filter_thread_model (struct backend *b)
 /* These are actually passing through to the final plugin, hence
  * the function names.
  */
-static int
-plugin_errno_is_preserved (struct backend *b)
-{
-  struct backend_filter *f = container_of (b, struct backend_filter, backend);
-
-  return f->backend.next->errno_is_preserved (f->backend.next);
-}
-
 static const char *
 plugin_name (struct backend *b)
 {
@@ -300,28 +292,46 @@ static int
 next_pread (void *nxdata, void *buf, uint32_t count, uint64_t offset)
 {
   struct b_conn *b_conn = nxdata;
-  return b_conn->b->pread (b_conn->b, b_conn->conn, buf, count,
offset, 0);
+  int err = 0;
+  int r = b_conn->b->pread (b_conn->b, b_conn->conn, buf, count,
offset, 0,
+                            &err);
+  if (r == -1)
+    errno = err;
+  return r;
 }

 static int
 next_pwrite (void *nxdata, const void *buf, uint32_t count, uint64_t offset)
 {
   struct b_conn *b_conn = nxdata;
-  return b_conn->b->pwrite (b_conn->b, b_conn->conn, buf, count,
offset, 0);
+  int err = 0;
+  int r = b_conn->b->pwrite (b_conn->b, b_conn->conn, buf, count,
offset, 0,
+                             &err);
+  if (r == -1)
+    errno = err;
+  return r;
 }

 static int
 next_flush (void *nxdata)
 {
   struct b_conn *b_conn = nxdata;
-  return b_conn->b->flush (b_conn->b, b_conn->conn, 0);
+  int err = 0;
+  int r = b_conn->b->flush (b_conn->b, b_conn->conn, 0, &err);
+  if (r == -1)
+    errno = err;
+  return r;
 }

 static int
 next_trim (void *nxdata, uint32_t count, uint64_t offset)
 {
   struct b_conn *b_conn = nxdata;
-  return b_conn->b->trim (b_conn->b, b_conn->conn, count, offset,
0);
+  int err = 0;
+  int r = b_conn->b->trim (b_conn->b, b_conn->conn, count, offset,
0, &err);
+  if (r == -1)
+    errno = err;
+  return r;
 }

 static int
@@ -329,11 +339,16 @@ next_zero (void *nxdata, uint32_t count, uint64_t offset,
int may_trim)
 {
   struct b_conn *b_conn = nxdata;
   uint32_t f = 0;
+  int err = 0;
+  int r;

   if (may_trim)
     f |= NBDKIT_FLAG_MAY_TRIM;

-  return b_conn->b->zero (b_conn->b, b_conn->conn, count, offset,
f);
+  r = b_conn->b->zero (b_conn->b, b_conn->conn, count, offset, f,
&err);
+  if (r == -1)
+    errno = err;
+  return r;
 }

 static struct nbdkit_next_ops next_ops = {
@@ -457,7 +472,7 @@ filter_can_trim (struct backend *b, struct connection *conn)
 static int
 filter_pread (struct backend *b, struct connection *conn,
               void *buf, uint32_t count, uint64_t offset,
-              uint32_t flags)
+              uint32_t flags, int *err)
 {
   struct backend_filter *f = container_of (b, struct backend_filter, backend);
   void *handle = connection_get_handle (conn, f->backend.i);
@@ -465,41 +480,50 @@ filter_pread (struct backend *b, struct connection *conn,

   assert (flags == 0);

-  debug ("pread count=%" PRIu32 " offset=%" PRIu64, count,
offset);
+  debug ("pread count=%" PRIu32 " offset=%" PRIu64 "
flags=0x%" PRIx32,
+         count, offset, flags);

-  if (f->filter.pread)
-    return f->filter.pread (&next_ops, &nxdata, handle,
-                            buf, count, offset);
+  if (f->filter.pread) {
+    int r = f->filter.pread (&next_ops, &nxdata, handle,
+                             buf, count, offset);
+    if (r == -1)
+      *err = errno;
+    return r;
+  }
   else
     return f->backend.next->pread (f->backend.next, conn,
-                                   buf, count, offset, flags);
+                                   buf, count, offset, flags, err);
 }

 static int
 filter_pwrite (struct backend *b, struct connection *conn,
                const void *buf, uint32_t count, uint64_t offset,
-               uint32_t flags)
+               uint32_t flags, int *err)
 {
   struct backend_filter *f = container_of (b, struct backend_filter, backend);
   void *handle = connection_get_handle (conn, f->backend.i);
   struct b_conn nxdata = { .b = f->backend.next, .conn = conn };
-  bool fua = flags & NBDKIT_FLAG_FUA;

   assert (!(flags & ~NBDKIT_FLAG_FUA));

-  debug ("pwrite count=%" PRIu32 " offset=%" PRIu64 "
fua=%d",
-         count, offset, fua);
+  debug ("pwrite count=%" PRIu32 " offset=%" PRIu64 "
flags=0x%" PRIx32,
+         count, offset, flags);

-  if (f->filter.pwrite)
-    return f->filter.pwrite (&next_ops, &nxdata, handle,
-                             buf, count, offset);
+  if (f->filter.pwrite) {
+    int r = f->filter.pwrite (&next_ops, &nxdata, handle,
+                              buf, count, offset);
+    if (r == -1)
+      *err = errno;
+    return r;
+  }
   else
     return f->backend.next->pwrite (f->backend.next, conn,
-                                    buf, count, offset, flags);
+                                    buf, count, offset, flags, err);
 }

 static int
-filter_flush (struct backend *b, struct connection *conn, uint32_t flags)
+filter_flush (struct backend *b, struct connection *conn, uint32_t flags,
+              int *err)
 {
   struct backend_filter *f = container_of (b, struct backend_filter, backend);
   void *handle = connection_get_handle (conn, f->backend.i);
@@ -507,18 +531,22 @@ filter_flush (struct backend *b, struct connection *conn,
uint32_t flags)

   assert (flags == 0);

-  debug ("flush");
+  debug ("flush flags=0x%" PRIx32, flags);

-  if (f->filter.flush)
-    return f->filter.flush (&next_ops, &nxdata, handle);
+  if (f->filter.flush) {
+    int r = f->filter.flush (&next_ops, &nxdata, handle);
+    if (r == -1)
+      *err = errno;
+    return r;
+  }
   else
-    return f->backend.next->flush (f->backend.next, conn, flags);
+    return f->backend.next->flush (f->backend.next, conn, flags, err);
 }

 static int
 filter_trim (struct backend *b, struct connection *conn,
              uint32_t count, uint64_t offset,
-             uint32_t flags)
+             uint32_t flags, int *err)
 {
   struct backend_filter *f = container_of (b, struct backend_filter, backend);
   void *handle = connection_get_handle (conn, f->backend.i);
@@ -526,34 +554,43 @@ filter_trim (struct backend *b, struct connection *conn,

   assert (flags == 0);

-  debug ("trim count=%" PRIu32 " offset=%" PRIu64, count,
offset);
+  debug ("trim count=%" PRIu32 " offset=%" PRIu64 "
flags=0x%" PRIx32,
+         count, offset, flags);

-  if (f->filter.trim)
-    return f->filter.trim (&next_ops, &nxdata, handle, count,
offset);
+  if (f->filter.trim) {
+    int r = f->filter.trim (&next_ops, &nxdata, handle, count,
offset);
+    if (r == -1)
+      *err = errno;
+    return r;
+  }
   else
-    return f->backend.next->trim (f->backend.next, conn, count,
offset, flags);
+    return f->backend.next->trim (f->backend.next, conn, count,
offset, flags,
+                                  err);
 }

 static int
 filter_zero (struct backend *b, struct connection *conn,
-             uint32_t count, uint64_t offset, uint32_t flags)
+             uint32_t count, uint64_t offset, uint32_t flags, int *err)
 {
   struct backend_filter *f = container_of (b, struct backend_filter, backend);
   void *handle = connection_get_handle (conn, f->backend.i);
   struct b_conn nxdata = { .b = f->backend.next, .conn = conn };
-  int may_trim = (flags & NBDKIT_FLAG_MAY_TRIM) != 0;

   assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA)));

-  debug ("zero count=%" PRIu32 " offset=%" PRIu64 "
may_trim=%d",
-         count, offset, may_trim);
+  debug ("zero count=%" PRIu32 " offset=%" PRIu64 "
flags=0x%" PRIx32,
+         count, offset, flags);

-  if (f->filter.zero)
-    return f->filter.zero (&next_ops, &nxdata, handle,
-                           count, offset, may_trim);
+  if (f->filter.zero) {
+    int r = f->filter.zero (&next_ops, &nxdata, handle,
+                            count, offset, !!(flags &
NBDKIT_FLAG_MAY_TRIM));
+    if (r == -1)
+      *err = errno;
+    return r;
+  }
   else
     return f->backend.next->zero (f->backend.next, conn,
-                                  count, offset, flags);
+                                  count, offset, flags, err);
 }

 static struct backend filter_functions = {
@@ -566,7 +603,6 @@ static struct backend filter_functions = {
   .dump_fields = filter_dump_fields,
   .config = filter_config,
   .config_complete = filter_config_complete,
-  .errno_is_preserved = plugin_errno_is_preserved,
   .open = filter_open,
   .prepare = filter_prepare,
   .finalize = filter_finalize,
diff --git a/src/plugins.c b/src/plugins.c
index 280887d..d5a2a65 100644
--- a/src/plugins.c
+++ b/src/plugins.c
@@ -226,14 +226,6 @@ plugin_config_complete (struct backend *b)
     exit (EXIT_FAILURE);
 }

-static int
-plugin_errno_is_preserved (struct backend *b)
-{
-  struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
-
-  return p->plugin.errno_is_preserved;
-}
-
 static int
 plugin_open (struct backend *b, struct connection *conn, int readonly)
 {
@@ -357,11 +349,25 @@ plugin_can_trim (struct backend *b, struct connection
*conn)
     return p->plugin.trim != NULL;
 }

+/* Grab the appropriate error value.
+ */
+static int
+get_error (struct backend_plugin *p)
+{
+  int ret = threadlocal_get_error ();
+
+  if (!ret && p->plugin.errno_is_preserved)
+    ret = errno;
+  return ret ? ret : EIO;
+}
+
 static int
 plugin_pread (struct backend *b, struct connection *conn,
-              void *buf, uint32_t count, uint64_t offset, uint32_t flags)
+              void *buf, uint32_t count, uint64_t offset, uint32_t flags,
+              int *err)
 {
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
+  int r;

   assert (connection_get_handle (conn, 0));
   assert (p->plugin.pread != NULL);
@@ -369,30 +375,40 @@ plugin_pread (struct backend *b, struct connection *conn,

   debug ("pread count=%" PRIu32 " offset=%" PRIu64, count,
offset);

-  return p->plugin.pread (connection_get_handle (conn, 0), buf, count,
offset);
+  r = p->plugin.pread (connection_get_handle (conn, 0), buf, count, offset);
+  if (r == -1)
+    *err = get_error (p);
+  return r;
 }

 static int
-plugin_flush (struct backend *b, struct connection *conn, uint32_t flags)
+plugin_flush (struct backend *b, struct connection *conn, uint32_t flags,
+              int *err)
 {
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
+  int r;

   assert (connection_get_handle (conn, 0));
   assert (!flags);

   debug ("flush");

-  if (p->plugin.flush != NULL)
-    return p->plugin.flush (connection_get_handle (conn, 0));
+  if (p->plugin.flush != NULL) {
+    r = p->plugin.flush (connection_get_handle (conn, 0));
+    if (r == -1)
+      *err = get_error (p);
+    return r;
+  }
   else {
-    errno = EINVAL;
+    *err = EINVAL;
     return -1;
   }
 }

 static int
 plugin_pwrite (struct backend *b, struct connection *conn,
-               const void *buf, uint32_t count, uint64_t offset, uint32_t
flags)
+               const void *buf, uint32_t count, uint64_t offset, uint32_t
flags,
+               int *err)
 {
   int r;
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
@@ -408,19 +424,21 @@ plugin_pwrite (struct backend *b, struct connection *conn,
     r = p->plugin.pwrite (connection_get_handle (conn, 0),
                           buf, count, offset);
   else {
-    errno = EROFS;
+    *err = EROFS;
     return -1;
   }
   if (r != -1 && fua) {
     assert (p->plugin.flush);
-    r = plugin_flush (b, conn, 0);
+    r = p->plugin.flush (connection_get_handle (conn, 0));
   }
+  if (r == -1)
+    *err = get_error (p);
   return r;
 }

 static int
 plugin_trim (struct backend *b, struct connection *conn,
-             uint32_t count, uint64_t offset, uint32_t flags)
+             uint32_t count, uint64_t offset, uint32_t flags, int *err)
 {
   int r;
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
@@ -435,25 +453,26 @@ plugin_trim (struct backend *b, struct connection *conn,
   if (p->plugin.trim != NULL)
     r = p->plugin.trim (connection_get_handle (conn, 0), count, offset);
   else {
-    errno = EINVAL;
+    *err = EINVAL;
     return -1;
   }
   if (r != -1 && fua) {
     assert (p->plugin.flush);
-    r = plugin_flush (b, conn, 0);
+    r = p->plugin.flush (connection_get_handle (conn, 0));
   }
+  if (r == -1)
+    *err = get_error (p);
   return r;
 }

 static int
 plugin_zero (struct backend *b, struct connection *conn,
-             uint32_t count, uint64_t offset, uint32_t flags)
+             uint32_t count, uint64_t offset, uint32_t flags, int *err)
 {
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
   char *buf;
   uint32_t limit;
   int result;
-  int err = 0;
   int may_trim = (flags & NBDKIT_FLAG_MAY_TRIM) != 0;
   bool fua = flags & NBDKIT_FLAG_FUA;

@@ -469,12 +488,9 @@ plugin_zero (struct backend *b, struct connection *conn,
     errno = 0;
     result = p->plugin.zero (connection_get_handle (conn, 0),
                              count, offset, may_trim);
-    if (result == -1) {
-      err = threadlocal_get_error ();
-      if (!err && plugin_errno_is_preserved (b))
-        err = errno;
-    }
-    if (result == 0 || err != EOPNOTSUPP)
+    if (result == -1)
+      *err = get_error (p);
+    if (result == 0 || *err != EOPNOTSUPP)
       goto done;
   }

@@ -483,7 +499,7 @@ plugin_zero (struct backend *b, struct connection *conn,
   limit = count < MAX_REQUEST_SIZE ? count : MAX_REQUEST_SIZE;
   buf = calloc (limit, 1);
   if (!buf) {
-    errno = ENOMEM;
+    *err = ENOMEM;
     return -1;
   }

@@ -497,15 +513,17 @@ plugin_zero (struct backend *b, struct connection *conn,
       limit = count;
   }

-  err = errno;
+  *err = errno;
   free (buf);
-  errno = err;
+  errno = *err;

  done:
   if (result != -1 && fua) {
     assert (p->plugin.flush);
-    result = plugin_flush (b, conn, 0);
+    result = p->plugin.flush (connection_get_handle (conn, 0));
   }
+  if (result == -1)
+    *err = get_error (p);
   return result;
 }

@@ -519,7 +537,6 @@ static struct backend plugin_functions = {
   .dump_fields = plugin_dump_fields,
   .config = plugin_config,
   .config_complete = plugin_config_complete,
-  .errno_is_preserved = plugin_errno_is_preserved,
   .open = plugin_open,
   .prepare = plugin_prepare,
   .finalize = plugin_finalize,
-- 
2.14.3

Eric Blake

2018-Mar-08 23:02 UTC

head link

[Libguestfs] [nbdkit PATCH v3 03/15] filters: Adjust callback API for flags/errors

It's time to expose the full semantics already in use by the backend
to our filters, by exposing flags and an explicit parameter for
tracking the error value to return to the client.  This is an
incompatible API change, and all existing filters are updated to
match the new semantics.

Of note: access to flags means the cache filter can now behave
correctly in the face of FUA semantics (a FUA write must write
through, even when the cache mode is writeback, rather than
waiting for a subsequent flush).  However, more work to more
fully support FUA through the stack will come in later patches.

I considerered a possible boilerplate reduction option of having
a filter that returns -1 but does not set *err then cause
filters.c to populate *err with the current value of errno; but
for now, manually setting *err at all sites makes it obvious
how much this API change affects.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v: split off backend change; fold in part of FUA flags change;
retitle the patch; more commit message details
v2: more wording tweaks to documentation
---
 docs/nbdkit-filter.pod        | 115 ++++++++++++++++++++++++++++++++++++------
 include/nbdkit-common.h       |   3 ++
 include/nbdkit-filter.h       |  30 +++++++----
 src/internal.h                |   3 --
 src/filters.c                 | 102 +++++++++++--------------------------
 filters/cache/cache.c         |  62 +++++++++++++++--------
 filters/cow/cow.c             |  34 ++++++++-----
 filters/delay/delay.c         |  15 +++---
 filters/offset/offset.c       |  20 +++++---
 filters/partition/partition.c |  26 ++++++----
 10 files changed, 253 insertions(+), 157 deletions(-)

diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index eb72dae..8ef26e3 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -163,10 +163,16 @@ short-circuited.

 The filter’s other methods like C<.prepare>, C<.get_size>,
C<.pread>
 etc ― always called in the context of a connection ― are passed a
-pointer to C<struct nbdkit_next_ops> which contains a subset of the
-plugin methods that can be called during a connection.  It is possible
-for a filter to issue (for example) extra read calls in response to a
-single C<.pwrite> call.
+pointer to C<struct nbdkit_next_ops> which contains a comparable set
+of accessors to plugin methods that can be called during a connection.
+It is possible for a filter to issue (for example) extra read calls in
+response to a single C<.pwrite> call.  Note that the semantics of the
+functions in C<struct nbdkit_next_ops> are slightly different from
+what a plugin implements: for example, when a plugin's C<.pread>
+returns -1 on error, the error value to advertise to the client is
+implicit (via the plugin calling C<nbdkit_set_error> or setting
+C<errno>), whereas C<next_ops-E<gt>pread> exposes this via an
explicit
+parameter, allowing a filter to learn or modify this error if desired.

 You can modify parameters when you call the C<next> function.  However
 be careful when modifying strings because for some methods
@@ -324,6 +330,10 @@ will see.

 The returned size must be E<ge> 0.  If there is an error,
C<.get_size>
 should call C<nbdkit_error> with an error message and return C<-1>.
+If this function is called more than once for the same connection, it
+should return the same value; similarly, the filter may cache
+C<next_ops-E<gt>get_size> for a given connection rather than
repeating
+calls.

 =head2 C<.can_write>

@@ -346,65 +356,140 @@ should call C<nbdkit_error> with an error message
and return C<-1>.
 These intercept the corresponding plugin methods.

 If there is an error, the callback should call C<nbdkit_error> with an
-error message and return C<-1>.
+error message and return C<-1>.  If these functions are called more
+than once for the same connection, they should return the same value;
+similarly, the filter may cache the results of each counterpart in
+C<next_ops> for a given connection rather than repeating calls.

 =head2 C<.pread>

  int (*pread) (struct nbdkit_next_ops *next_ops, void *nxdata,
-               void *handle, void *buf, uint32_t count, uint64_t offset);
+               void *handle, void *buf, uint32_t count, uint64_t offset,
+               uint32_t flags, int *err);

 This intercepts the plugin C<.pread> method and can be used to read or
 modify data read by the plugin.

+The parameter C<flags> exists in case of future NBD protocol
+extensions; at this time, it will be 0 on input, and the filter should
+not pass any flags to C<next_ops-E<gt>pread>.
+
 If there is an error (including a short read which couldn't be
 recovered from), C<.pread> should call C<nbdkit_error> with an
error
-message B<and> set C<errno>, then return C<-1>.
+message B<and> return -1 with C<err> set to the positive errno
value
+to return to the client.

 =head2 C<.pwrite>

  int (*pwrite) (struct nbdkit_next_ops *next_ops, void *nxdata,
                 void *handle,
-                const void *buf, uint32_t count, uint64_t offset);
+                const void *buf, uint32_t count, uint64_t offset,
+                uint32_t flags, int *err);

 This intercepts the plugin C<.pwrite> method and can be used to modify
 data written by the plugin.

+This function will not be called if C<.can_write> returned false; in
+turn, the filter should not call C<next_ops-E<gt>pwrite> if
+C<next_ops-E<gt>can_write> did not return true.
+
+The parameter C<flags> may include C<NBDKIT_FLAG_FUA> on input
based
+on the result of C<.can_flush>.  In turn, the filter should only pass
+C<NBDKIT_FLAG_FUA> on to C<next_ops-E<gt>pwrite> if
+C<next_ops-E<gt>can_flush> returned true.
+
 If there is an error (including a short write which couldn't be
 recovered from), C<.pwrite> should call C<nbdkit_error> with an
error
-message B<and> set C<errno>, then return C<-1>.
+message B<and> return -1 with C<err> set to the positive errno
value
+to return to the client.

 =head2 C<.flush>

  int (*flush) (struct nbdkit_next_ops *next_ops, void *nxdata,
-               void *handle);
+               void *handle, uint32_t flags, int *err);

 This intercepts the plugin C<.flush> method and can be used to modify
 flush requests.

+This function will not be called if C<.can_flush> returned false; in
+turn, the filter should not call C<next_ops-E<gt>flush> if
+C<next_ops-E<gt>can_flush> did not return true.
+
+The parameter C<flags> exists in case of future NBD protocol
+extensions; at this time, it will be 0 on input, and the filter should
+not pass any flags to C<next_ops-E<gt>flush>.
+
 If there is an error, C<.flush> should call C<nbdkit_error> with an
-error message B<and> set C<errno>, then return C<-1>.
+error message B<and> return -1 with C<err> set to the positive
errno
+value to return to the client.

 =head2 C<.trim>

  int (*trim) (struct nbdkit_next_ops *next_ops, void *nxdata,
-              void *handle, uint32_t count, uint64_t offset);
+              void *handle, uint32_t count, uint64_t offset,
+              uint32_t flags, int *err);

 This intercepts the plugin C<.trim> method and can be used to modify
 trim requests.

+This function will not be called if C<.can_trim> returned false; in
+turn, the filter should not call C<next_ops-E<gt>trim> if
+C<next_ops-E<gt>can_trim> did not return true.
+
+The parameter C<flags> may include C<NBDKIT_FLAG_FUA> on input
based
+on the result of C<.can_flush>.  In turn, the filter should only pass
+C<NBDKIT_FLAG_FUA> on to C<next_ops-E<gt>trim> if
+C<next_ops-E<gt>can_flush> returned true.
+
 If there is an error, C<.trim> should call C<nbdkit_error> with an
-error message B<and> set C<errno>, then return C<-1>.
+error message B<and> return -1 with C<err> set to the positive
errno
+value to return to the client.

 =head2 C<.zero>

  int (*zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
-              void *handle, uint32_t count, uint64_t offset, int may_trim);
+              void *handle, uint32_t count, uint64_t offset, uint32_t flags,
+              int *err);

 This intercepts the plugin C<.zero> method and can be used to modify
 zero requests.

+This function will not be called if C<.can_write> returned false; in
+turn, the filter should not call C<next_ops-E<gt>zero> if
+C<next_ops-E<gt>can_write> did not return true.
+
+On input, the parameter C<flags> may include
C<NBDKIT_FLAG_MAY_TRIM>
+unconditionally, and C<NBDKIT_FLAG_FUA> based on the result of
+C<.can_flush>.  In turn, the filter may pass
C<NBDKIT_FLAG_MAY_TRIM>
+unconditionally, but should only pass C<NBDKIT_FLAG_FUA> on to
+C<next_ops-E<gt>zero> if C<next_ops-E<gt>can_flush>
returned true.
+
+Note that unlike the plugin C<.zero> which is permitted to fail with
+C<EOPNOTSUPP> to force a fallback to C<.pwrite>, the function
+C<next_ops-E<gt>zero> will never fail with C<err> set to
C<EOPNOTSUPP>
+because the fallback has already taken place.
+
 If there is an error, C<.zero> should call C<nbdkit_error> with an
-error message B<and> set C<errno>, then return C<-1>.
+error message B<and> return -1 with C<err> set to the positive
errno
+value to return to the client.  The filter should never fail with
+C<EOPNOTSUPP> (while plugins have automatic fallback to C<.pwrite>,
+filters do not).
+
+=head1 ERROR HANDLING
+
+If there is an error in the filter itself, the filter should call
+C<nbdkit_error> to report an error message.  If the callback is
+involved in serving data, the explicit C<err> parameter determines the
+error code that will be sent to the client; other callbacks should
+return the appropriate error indication, eg. C<NULL> or C<-1>.
+
+C<nbdkit_error> has the following prototype and works like
+L<printf(3)>:
+
+ void nbdkit_error (const char *fs, ...);
+ void nbdkit_verror (const char *fs, va_list args);
+
+For convenience, C<nbdkit_error> preserves the value of C<errno>.

 =head1 DEBUGGING

diff --git a/include/nbdkit-common.h b/include/nbdkit-common.h
index 5e69579..f32bf76 100644
--- a/include/nbdkit-common.h
+++ b/include/nbdkit-common.h
@@ -50,6 +50,9 @@ extern "C" {
 #define NBDKIT_THREAD_MODEL_SERIALIZE_REQUESTS        2
 #define NBDKIT_THREAD_MODEL_PARALLEL                  3

+#define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */
+#define NBDKIT_FLAG_FUA      (1<<1) /* Maps to NBD_CMD_FLAG_FUA */
+
 extern void nbdkit_error (const char *msg, ...)
   __attribute__((__format__ (__printf__, 1, 2)));
 extern void nbdkit_verror (const char *msg, va_list args);
diff --git a/include/nbdkit-filter.h b/include/nbdkit-filter.h
index 0cfc3f8..95be130 100644
--- a/include/nbdkit-filter.h
+++ b/include/nbdkit-filter.h
@@ -42,7 +42,7 @@
 extern "C" {
 #endif

-#define NBDKIT_FILTER_API_VERSION 1
+#define NBDKIT_FILTER_API_VERSION 2

 typedef int nbdkit_next_config (void *nxdata,
                                 const char *key, const char *value);
@@ -58,12 +58,16 @@ struct nbdkit_next_ops {
   int (*is_rotational) (void *nxdata);
   int (*can_trim) (void *nxdata);

-  int (*pread) (void *nxdata, void *buf, uint32_t count, uint64_t offset);
+  int (*pread) (void *nxdata, void *buf, uint32_t count, uint64_t offset,
+                uint32_t flags, int *err);
   int (*pwrite) (void *nxdata,
-                 const void *buf, uint32_t count, uint64_t offset);
-  int (*flush) (void *nxdata);
-  int (*trim) (void *nxdata, uint32_t count, uint64_t offset);
-  int (*zero) (void *nxdata, uint32_t count, uint64_t offset, int may_trim);
+                 const void *buf, uint32_t count, uint64_t offset,
+                 uint32_t flags, int *err);
+  int (*flush) (void *nxdata, uint32_t flags, int *err);
+  int (*trim) (void *nxdata, uint32_t count, uint64_t offset, uint32_t flags,
+               int *err);
+  int (*zero) (void *nxdata, uint32_t count, uint64_t offset, uint32_t flags,
+               int *err);
 };

 struct nbdkit_filter {
@@ -113,16 +117,20 @@ struct nbdkit_filter {
                    void *handle);

   int (*pread) (struct nbdkit_next_ops *next_ops, void *nxdata,
-                void *handle, void *buf, uint32_t count, uint64_t offset);
+                void *handle, void *buf, uint32_t count, uint64_t offset,
+                uint32_t flags, int *err);
   int (*pwrite) (struct nbdkit_next_ops *next_ops, void *nxdata,
                  void *handle,
-                 const void *buf, uint32_t count, uint64_t offset);
+                 const void *buf, uint32_t count, uint64_t offset,
+                 uint32_t flags, int *err);
   int (*flush) (struct nbdkit_next_ops *next_ops, void *nxdata,
-                void *handle);
+                void *handle, uint32_t flags, int *err);
   int (*trim) (struct nbdkit_next_ops *next_ops, void *nxdata,
-               void *handle, uint32_t count, uint64_t offset);
+               void *handle, uint32_t count, uint64_t offset, uint32_t flags,
+               int *err);
   int (*zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
-               void *handle, uint32_t count, uint64_t offset, int may_trim);
+               void *handle, uint32_t count, uint64_t offset, uint32_t flags,
+               int *err);
 };

 #define NBDKIT_REGISTER_FILTER(filter)                                  \
diff --git a/src/internal.h b/src/internal.h
index be79e10..12dfc14 100644
--- a/src/internal.h
+++ b/src/internal.h
@@ -98,9 +98,6 @@
       (type *) ((char *) __mptr - offsetof(type, member));       \
     })

-#define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */
-#define NBDKIT_FLAG_FUA      (1<<1) /* Maps to NBD_CMD_FLAG_FUA */
-
 /* main.c */
 extern const char *exportname;
 extern const char *ipaddr;
diff --git a/src/filters.c b/src/filters.c
index ae5edfb..9d821f2 100644
--- a/src/filters.c
+++ b/src/filters.c
@@ -39,7 +39,6 @@
 #include <string.h>
 #include <inttypes.h>
 #include <assert.h>
-#include <errno.h>

 #include <dlfcn.h>

@@ -289,66 +288,44 @@ next_can_trim (void *nxdata)
 }

 static int
-next_pread (void *nxdata, void *buf, uint32_t count, uint64_t offset)
+next_pread (void *nxdata, void *buf, uint32_t count, uint64_t offset,
+            uint32_t flags, int *err)
 {
   struct b_conn *b_conn = nxdata;
-  int err = 0;
-  int r = b_conn->b->pread (b_conn->b, b_conn->conn, buf, count,
offset, 0,
-                            &err);
-  if (r == -1)
-    errno = err;
-  return r;
+  return b_conn->b->pread (b_conn->b, b_conn->conn, buf, count,
offset, flags,
+                           err);
 }

 static int
-next_pwrite (void *nxdata, const void *buf, uint32_t count, uint64_t offset)
+next_pwrite (void *nxdata, const void *buf, uint32_t count, uint64_t offset,
+             uint32_t flags, int *err)
 {
   struct b_conn *b_conn = nxdata;
-  int err = 0;
-  int r = b_conn->b->pwrite (b_conn->b, b_conn->conn, buf, count,
offset, 0,
-                             &err);
-  if (r == -1)
-    errno = err;
-  return r;
+  return b_conn->b->pwrite (b_conn->b, b_conn->conn, buf, count,
offset, flags,
+                            err);
 }

 static int
-next_flush (void *nxdata)
+next_flush (void *nxdata, uint32_t flags, int *err)
 {
   struct b_conn *b_conn = nxdata;
-  int err = 0;
-  int r = b_conn->b->flush (b_conn->b, b_conn->conn, 0, &err);
-  if (r == -1)
-    errno = err;
-  return r;
+  return b_conn->b->flush (b_conn->b, b_conn->conn, flags, err);
 }

 static int
-next_trim (void *nxdata, uint32_t count, uint64_t offset)
+next_trim (void *nxdata, uint32_t count, uint64_t offset, uint32_t flags,
+           int *err)
 {
   struct b_conn *b_conn = nxdata;
-  int err = 0;
-  int r = b_conn->b->trim (b_conn->b, b_conn->conn, count, offset,
0, &err);
-  if (r == -1)
-    errno = err;
-  return r;
+  return b_conn->b->trim (b_conn->b, b_conn->conn, count, offset,
flags, err);
 }

 static int
-next_zero (void *nxdata, uint32_t count, uint64_t offset, int may_trim)
+next_zero (void *nxdata, uint32_t count, uint64_t offset, uint32_t flags,
+           int *err)
 {
   struct b_conn *b_conn = nxdata;
-  uint32_t f = 0;
-  int err = 0;
-  int r;
-
-  if (may_trim)
-    f |= NBDKIT_FLAG_MAY_TRIM;
-
-  r = b_conn->b->zero (b_conn->b, b_conn->conn, count, offset, f,
&err);
-  if (r == -1)
-    errno = err;
-  return r;
+  return b_conn->b->zero (b_conn->b, b_conn->conn, count, offset,
flags, err);
 }

 static struct nbdkit_next_ops next_ops = {
@@ -483,13 +460,9 @@ filter_pread (struct backend *b, struct connection *conn,
   debug ("pread count=%" PRIu32 " offset=%" PRIu64 "
flags=0x%" PRIx32,
          count, offset, flags);

-  if (f->filter.pread) {
-    int r = f->filter.pread (&next_ops, &nxdata, handle,
-                             buf, count, offset);
-    if (r == -1)
-      *err = errno;
-    return r;
-  }
+  if (f->filter.pread)
+    return f->filter.pread (&next_ops, &nxdata, handle,
+                            buf, count, offset, flags, err);
   else
     return f->backend.next->pread (f->backend.next, conn,
                                    buf, count, offset, flags, err);
@@ -509,13 +482,9 @@ filter_pwrite (struct backend *b, struct connection *conn,
   debug ("pwrite count=%" PRIu32 " offset=%" PRIu64 "
flags=0x%" PRIx32,
          count, offset, flags);

-  if (f->filter.pwrite) {
-    int r = f->filter.pwrite (&next_ops, &nxdata, handle,
-                              buf, count, offset);
-    if (r == -1)
-      *err = errno;
-    return r;
-  }
+  if (f->filter.pwrite)
+    return f->filter.pwrite (&next_ops, &nxdata, handle,
+                             buf, count, offset, flags, err);
   else
     return f->backend.next->pwrite (f->backend.next, conn,
                                     buf, count, offset, flags, err);
@@ -533,12 +502,8 @@ filter_flush (struct backend *b, struct connection *conn,
uint32_t flags,

   debug ("flush flags=0x%" PRIx32, flags);

-  if (f->filter.flush) {
-    int r = f->filter.flush (&next_ops, &nxdata, handle);
-    if (r == -1)
-      *err = errno;
-    return r;
-  }
+  if (f->filter.flush)
+    return f->filter.flush (&next_ops, &nxdata, handle, flags, err);
   else
     return f->backend.next->flush (f->backend.next, conn, flags, err);
 }
@@ -557,12 +522,9 @@ filter_trim (struct backend *b, struct connection *conn,
   debug ("trim count=%" PRIu32 " offset=%" PRIu64 "
flags=0x%" PRIx32,
          count, offset, flags);

-  if (f->filter.trim) {
-    int r = f->filter.trim (&next_ops, &nxdata, handle, count,
offset);
-    if (r == -1)
-      *err = errno;
-    return r;
-  }
+  if (f->filter.trim)
+    return f->filter.trim (&next_ops, &nxdata, handle, count,
offset, flags,
+                           err);
   else
     return f->backend.next->trim (f->backend.next, conn, count,
offset, flags,
                                   err);
@@ -581,13 +543,9 @@ filter_zero (struct backend *b, struct connection *conn,
   debug ("zero count=%" PRIu32 " offset=%" PRIu64 "
flags=0x%" PRIx32,
          count, offset, flags);

-  if (f->filter.zero) {
-    int r = f->filter.zero (&next_ops, &nxdata, handle,
-                            count, offset, !!(flags &
NBDKIT_FLAG_MAY_TRIM));
-    if (r == -1)
-      *err = errno;
-    return r;
-  }
+  if (f->filter.zero)
+    return f->filter.zero (&next_ops, &nxdata, handle,
+                           count, offset, flags, err);
   else
     return f->backend.next->zero (f->backend.next, conn,
                                   count, offset, flags, err);
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index 9473f2c..c8dd791 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -45,6 +45,7 @@
 #include <errno.h>
 #include <sys/types.h>
 #include <sys/ioctl.h>
+#include <assert.h>

 #include <nbdkit-filter.h>

@@ -55,6 +56,8 @@

 /* Size of a block in the cache.  A 4K block size means that we need
  * 64 MB of memory to store the bitmaps for a 1 TB underlying image.
+ * It is also smaller than the usual hole size for sparse files, which
+ * means we have no reason to call next_ops->zero.
  */
 #define BLKSIZE 4096

@@ -263,7 +266,7 @@ blk_set_bitmap_entry (uint64_t blknum, enum bm_entry state)
  */
 static int
 blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
-          uint64_t blknum, uint8_t *block)
+          uint64_t blknum, uint8_t *block, int *err)
 {
   off_t offset = blknum * BLKSIZE;
   enum bm_entry state = blk_get_bitmap_entry (blknum);
@@ -276,9 +279,10 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
                 "unknown");

   if (state == BLOCK_NOT_CACHED) /* Read underlying plugin. */
-    return next_ops->pread (nxdata, block, BLKSIZE, offset);
+    return next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err);
   else {                         /* Read cache. */
     if (pread (fd, block, BLKSIZE, offset) == -1) {
+      *err = errno;
       nbdkit_error ("pread: %m");
       return -1;
     }
@@ -289,7 +293,8 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
 /* Write to the cache and the plugin. */
 static int
 blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
-                  uint64_t blknum, const uint8_t *block)
+                  uint64_t blknum, const uint8_t *block, uint32_t flags,
+                  int *err)
 {
   off_t offset = blknum * BLKSIZE;

@@ -298,11 +303,12 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void
*nxdata,
                 blknum, (uint64_t) offset);

   if (pwrite (fd, block, BLKSIZE, offset) == -1) {
+    *err = errno;
     nbdkit_error ("pwrite: %m");
     return -1;
   }

-  if (next_ops->pwrite (nxdata, block, BLKSIZE, offset) == -1)
+  if (next_ops->pwrite (nxdata, block, BLKSIZE, offset, flags, err) == -1)
     return -1;

   blk_set_bitmap_entry (blknum, BLOCK_CLEAN);
@@ -313,12 +319,14 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void
*nxdata,
 /* Write to the cache only. */
 static int
 blk_writeback (struct nbdkit_next_ops *next_ops, void *nxdata,
-               uint64_t blknum, const uint8_t *block)
+               uint64_t blknum, const uint8_t *block, uint32_t flags,
+               int *err)
 {
   off_t offset;

-  if (cache_mode == CACHE_MODE_WRITETHROUGH)
-    return blk_writethrough (next_ops, nxdata, blknum, block);
+  if (cache_mode == CACHE_MODE_WRITETHROUGH ||
+      (cache_mode == CACHE_MODE_WRITEBACK && (flags &
NBDKIT_FLAG_FUA)))
+    return blk_writethrough (next_ops, nxdata, blknum, block, flags, err);

   offset = blknum * BLKSIZE;

@@ -327,6 +335,7 @@ blk_writeback (struct nbdkit_next_ops *next_ops, void
*nxdata,
                 blknum, (uint64_t) offset);

   if (pwrite (fd, block, BLKSIZE, offset) == -1) {
+    *err = errno;
     nbdkit_error ("pwrite: %m");
     return -1;
   }
@@ -338,12 +347,15 @@ blk_writeback (struct nbdkit_next_ops *next_ops, void
*nxdata,
 /* Read data. */
 static int
 cache_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
-             void *handle, void *buf, uint32_t count, uint64_t offset)
+             void *handle, void *buf, uint32_t count, uint64_t offset,
+             uint32_t flags, int *err)
 {
   uint8_t *block;

+  assert (!flags);
   block = malloc (BLKSIZE);
   if (block == NULL) {
+    *err = errno;
     nbdkit_error ("malloc: %m");
     return -1;
   }
@@ -357,7 +369,7 @@ cache_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
     if (n > count)
       n = count;

-    if (blk_read (next_ops, nxdata, blknum, block) == -1) {
+    if (blk_read (next_ops, nxdata, blknum, block, err) == -1) {
       free (block);
       return -1;
     }
@@ -376,12 +388,14 @@ cache_pread (struct nbdkit_next_ops *next_ops, void
*nxdata,
 /* Write data. */
 static int
 cache_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
-              void *handle, const void *buf, uint32_t count, uint64_t offset)
+              void *handle, const void *buf, uint32_t count, uint64_t offset,
+              uint32_t flags, int *err)
 {
   uint8_t *block;

   block = malloc (BLKSIZE);
   if (block == NULL) {
+    *err = errno;
     nbdkit_error ("malloc: %m");
     return -1;
   }
@@ -396,12 +410,12 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
       n = count;

     /* Do a read-modify-write operation on the current block. */
-    if (blk_read (next_ops, nxdata, blknum, block) == -1) {
+    if (blk_read (next_ops, nxdata, blknum, block, err) == -1){
       free (block);
       return -1;
     }
     memcpy (&block[blkoffs], buf, n);
-    if (blk_writeback (next_ops, nxdata, blknum, block) == -1) {
+    if (blk_writeback (next_ops, nxdata, blknum, block, flags, err) == -1) {
       free (block);
       return -1;
     }
@@ -418,16 +432,19 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
 /* Zero data. */
 static int
 cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
-            void *handle, uint32_t count, uint64_t offset, int may_trim)
+            void *handle, uint32_t count, uint64_t offset, uint32_t flags,
+            int *err)
 {
   uint8_t *block;

   block = malloc (BLKSIZE);
   if (block == NULL) {
+    *err = errno;
     nbdkit_error ("malloc: %m");
     return -1;
   }

+  flags &= ~NBDKIT_FLAG_MAY_TRIM; /* See BLKSIZE comment above. */
   while (count > 0) {
     uint64_t blknum, blkoffs, n;

@@ -437,12 +454,12 @@ cache_zero (struct nbdkit_next_ops *next_ops, void
*nxdata,
     if (n > count)
       n = count;

-    if (blk_read (next_ops, nxdata, blknum, block) == -1) {
+    if (blk_read (next_ops, nxdata, blknum, block, err) == -1) {
       free (block);
       return -1;
     }
     memset (&block[blkoffs], 0, n);
-    if (blk_writeback (next_ops, nxdata, blknum, block) == -1) {
+    if (blk_writeback (next_ops, nxdata, blknum, block, flags, err) == -1) {
       free (block);
       return -1;
     }
@@ -457,13 +474,15 @@ cache_zero (struct nbdkit_next_ops *next_ops, void
*nxdata,

 /* Flush: Go through all the dirty blocks, flushing them to disk. */
 static int
-cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle,
+             uint32_t flags, int *err)
 {
   uint8_t *block = NULL;
   uint64_t i, j;
   uint64_t blknum;
   enum bm_entry state;
   unsigned errors = 0;
+  int tmp;

   if (cache_mode == CACHE_MODE_UNSAFE)
     return 0;
@@ -473,7 +492,7 @@ cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle)
    * to be sure.  Also we still need to issue the flush to the
    * underlying storage.
    */
-
+  assert (!flags);
   for (i = 0; i < bm_size; ++i) {
     if (bitmap[i] != 0) {
       /* The bitmap stores information about 4 blocks per byte,
@@ -487,6 +506,7 @@ cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle)
           if (!block) {
             block = malloc (BLKSIZE);
             if (block == NULL) {
+              *err = errno;
               nbdkit_error ("malloc: %m");
               return -1;
             }
@@ -494,8 +514,10 @@ cache_flush (struct nbdkit_next_ops *next_ops, void
*nxdata, void *handle)
           /* Perform a read + writethrough which will read from the
            * cache and write it through to the underlying storage.
            */
-          if (blk_read (next_ops, nxdata, blknum, block) == -1 ||
-              blk_writethrough (next_ops, nxdata, blknum, block)) {
+          if (blk_read (next_ops, nxdata, blknum, block,
+                        errors ? &tmp : err) == -1 ||
+              blk_writethrough (next_ops, nxdata, blknum, block, 0,
+                                errors ? &tmp : err) == -1) {
             nbdkit_error ("cache: flush of block %" PRIu64 "
failed", blknum);
             errors++;
           }
@@ -507,7 +529,7 @@ cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle)
   free (block);

   /* Now issue a flush request to the underlying storage. */
-  if (next_ops->flush (nxdata) == -1)
+  if (next_ops->flush (nxdata, 0, errors ? &tmp : err) == -1)
     errors++;

   return errors == 0 ? 0 : -1;
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index 5c2fcd0..cbf4ed4 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -270,7 +270,7 @@ blk_set_allocated (uint64_t blknum)
  */
 static int
 blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
-          uint64_t blknum, uint8_t *block)
+          uint64_t blknum, uint8_t *block, int *err)
 {
   off_t offset = blknum * BLKSIZE;
   bool allocated = blk_is_allocated (blknum);
@@ -280,9 +280,10 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
                 !allocated ? "a hole" : "allocated");

   if (!allocated)               /* Read underlying plugin. */
-    return next_ops->pread (nxdata, block, BLKSIZE, offset);
+    return next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err);
   else {                        /* Read overlay. */
     if (pread (fd, block, BLKSIZE, offset) == -1) {
+      *err = errno;
       nbdkit_error ("pread: %m");
       return -1;
     }
@@ -291,7 +292,7 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
 }

 static int
-blk_write (uint64_t blknum, const uint8_t *block)
+blk_write (uint64_t blknum, const uint8_t *block, int *err)
 {
   off_t offset = blknum * BLKSIZE;

@@ -299,6 +300,7 @@ blk_write (uint64_t blknum, const uint8_t *block)
                 blknum, (uint64_t) offset);

   if (pwrite (fd, block, BLKSIZE, offset) == -1) {
+    *err = errno;
     nbdkit_error ("pwrite: %m");
     return -1;
   }
@@ -310,12 +312,14 @@ blk_write (uint64_t blknum, const uint8_t *block)
 /* Read data. */
 static int
 cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
-           void *handle, void *buf, uint32_t count, uint64_t offset)
+           void *handle, void *buf, uint32_t count, uint64_t offset,
+           uint32_t flags, int *err)
 {
   uint8_t *block;

   block = malloc (BLKSIZE);
   if (block == NULL) {
+    *err = errno;
     nbdkit_error ("malloc: %m");
     return -1;
   }
@@ -329,7 +333,7 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
     if (n > count)
       n = count;

-    if (blk_read (next_ops, nxdata, blknum, block) == -1) {
+    if (blk_read (next_ops, nxdata, blknum, block, err) == -1) {
       free (block);
       return -1;
     }
@@ -348,12 +352,14 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
 /* Write data. */
 static int
 cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
-            void *handle, const void *buf, uint32_t count, uint64_t offset)
+            void *handle, const void *buf, uint32_t count, uint64_t offset,
+            uint32_t flags, int *err)
 {
   uint8_t *block;

   block = malloc (BLKSIZE);
   if (block == NULL) {
+    *err = errno;
     nbdkit_error ("malloc: %m");
     return -1;
   }
@@ -368,12 +374,12 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
       n = count;

     /* Do a read-modify-write operation on the current block. */
-    if (blk_read (next_ops, nxdata, blknum, block) == -1) {
+    if (blk_read (next_ops, nxdata, blknum, block, err) == -1) {
       free (block);
       return -1;
     }
     memcpy (&block[blkoffs], buf, n);
-    if (blk_write (blknum, block) == -1) {
+    if (blk_write (blknum, block, err) == -1) {
       free (block);
       return -1;
     }
@@ -390,12 +396,14 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
 /* Zero data. */
 static int
 cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
-          void *handle, uint32_t count, uint64_t offset, int may_trim)
+          void *handle, uint32_t count, uint64_t offset, uint32_t flags,
+          int *err)
 {
   uint8_t *block;

   block = malloc (BLKSIZE);
   if (block == NULL) {
+    *err = errno;
     nbdkit_error ("malloc: %m");
     return -1;
   }
@@ -412,12 +420,12 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
     /* XXX There is the possibility of optimizing this: ONLY if we are
      * writing a whole, aligned block, then use FALLOC_FL_ZERO_RANGE.
      */
-    if (blk_read (next_ops, nxdata, blknum, block) == -1) {
+    if (blk_read (next_ops, nxdata, blknum, block, err) == -1) {
       free (block);
       return -1;
     }
     memset (&block[blkoffs], 0, n);
-    if (blk_write (blknum, block) == -1) {
+    if (blk_write (blknum, block, err) == -1) {
       free (block);
       return -1;
     }
@@ -431,12 +439,14 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
 }

 static int
-cow_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+cow_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle,
+           uint32_t flags, int *err)
 {
   /* I think we don't care about file metadata for this temporary
    * file, so only flush the data.
    */
   if (fdatasync (fd) == -1) {
+    *err = errno;
     nbdkit_error ("fdatasync: %m");
     return -1;
   }
diff --git a/filters/delay/delay.c b/filters/delay/delay.c
index ebe3386..d8d0c8e 100644
--- a/filters/delay/delay.c
+++ b/filters/delay/delay.c
@@ -124,29 +124,32 @@ delay_config (nbdkit_next_config *next, void *nxdata,
 /* Read data. */
 static int
 delay_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
-             void *handle, void *buf, uint32_t count, uint64_t offset)
+             void *handle, void *buf, uint32_t count, uint64_t offset,
+             uint32_t flags, int *err)
 {
   read_delay ();
-  return next_ops->pread (nxdata, buf, count, offset);
+  return next_ops->pread (nxdata, buf, count, offset, flags, err);
 }

 /* Write data. */
 static int
 delay_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
               void *handle,
-              const void *buf, uint32_t count, uint64_t offset)
+              const void *buf, uint32_t count, uint64_t offset, uint32_t flags,
+              int *err)
 {
   write_delay ();
-  return next_ops->pwrite (nxdata, buf, count, offset);
+  return next_ops->pwrite (nxdata, buf, count, offset, flags, err);
 }

 /* Zero data. */
 static int
 delay_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
-            void *handle, uint32_t count, uint64_t offset, int may_trim)
+            void *handle, uint32_t count, uint64_t offset, uint32_t flags,
+            int *err)
 {
   write_delay ();
-  return next_ops->zero (nxdata, count, offset, may_trim);
+  return next_ops->zero (nxdata, count, offset, flags, err);
 }

 static struct nbdkit_filter filter = {
diff --git a/filters/offset/offset.c b/filters/offset/offset.c
index b6687f2..bb63ed4 100644
--- a/filters/offset/offset.c
+++ b/filters/offset/offset.c
@@ -97,34 +97,38 @@ offset_get_size (struct nbdkit_next_ops *next_ops, void
*nxdata,
 /* Read data. */
 static int
 offset_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
-              void *handle, void *buf, uint32_t count, uint64_t offs)
+              void *handle, void *buf, uint32_t count, uint64_t offs,
+              uint32_t flags, int *err)
 {
-  return next_ops->pread (nxdata, buf, count, offs + offset);
+  return next_ops->pread (nxdata, buf, count, offs + offset, flags, err);
 }

 /* Write data. */
 static int
 offset_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
                void *handle,
-               const void *buf, uint32_t count, uint64_t offs)
+               const void *buf, uint32_t count, uint64_t offs, uint32_t flags,
+               int *err)
 {
-  return next_ops->pwrite (nxdata, buf, count, offs + offset);
+  return next_ops->pwrite (nxdata, buf, count, offs + offset, flags, err);
 }

 /* Trim data. */
 static int
 offset_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
-             void *handle, uint32_t count, uint64_t offs)
+             void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+             int *err)
 {
-  return next_ops->trim (nxdata, count, offs + offset);
+  return next_ops->trim (nxdata, count, offs + offset, flags, err);
 }

 /* Zero data. */
 static int
 offset_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
-             void *handle, uint32_t count, uint64_t offs, int may_trim)
+             void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+             int *err)
 {
-  return next_ops->zero (nxdata, count, offs + offset, may_trim);
+  return next_ops->zero (nxdata, count, offs + offset, flags, err);
 }

 static struct nbdkit_filter filter = {
diff --git a/filters/partition/partition.c b/filters/partition/partition.c
index f74b3af..1fe0bc7 100644
--- a/filters/partition/partition.c
+++ b/filters/partition/partition.c
@@ -196,6 +196,7 @@ find_gpt_partition (struct nbdkit_next_ops *next_ops, void
*nxdata,
   struct gpt_header header;
   struct gpt_partition partition;
   int i;
+  int err;

   if (partnum > 128) {
   out_of_range:
@@ -216,7 +217,7 @@ find_gpt_partition (struct nbdkit_next_ops *next_ops, void
*nxdata,
      * partition_prepare call above.
      */
     if (next_ops->pread (nxdata, partition_bytes, sizeof partition_bytes,
-                         2*512 + i*128) == -1)
+                         2*512 + i*128, 0, &err) == -1)
       return -1;
     get_gpt_partition (partition_bytes, &partition);
     if (memcmp (partition.partition_type_guid,
@@ -240,6 +241,7 @@ partition_prepare (struct nbdkit_next_ops *next_ops, void
*nxdata,
   int64_t size;
   uint8_t lba01[1024];          /* LBA 0 and 1 */
   int r;
+  int err;

   size = next_ops->get_size (nxdata);
   if (size == -1)
@@ -251,7 +253,7 @@ partition_prepare (struct nbdkit_next_ops *next_ops, void
*nxdata,

   nbdkit_debug ("disk size=%" PRIi64, size);

-  if (next_ops->pread (nxdata, lba01, sizeof lba01, 0) != 0)
+  if (next_ops->pread (nxdata, lba01, sizeof lba01, 0, 0, &err) == -1)
     return -1;

   /* Is it GPT? */
@@ -295,42 +297,46 @@ partition_get_size (struct nbdkit_next_ops *next_ops, void
*nxdata,
 /* Read data. */
 static int
 partition_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
-                 void *handle, void *buf, uint32_t count, uint64_t offs)
+                 void *handle, void *buf, uint32_t count, uint64_t offs,
+                 uint32_t flags, int *err)
 {
   struct handle *h = handle;

-  return next_ops->pread (nxdata, buf, count, offs + h->offset);
+  return next_ops->pread (nxdata, buf, count, offs + h->offset, flags,
err);
 }

 /* Write data. */
 static int
 partition_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
                   void *handle,
-                  const void *buf, uint32_t count, uint64_t offs)
+                  const void *buf, uint32_t count, uint64_t offs,
+                  uint32_t flags, int *err)
 {
   struct handle *h = handle;

-  return next_ops->pwrite (nxdata, buf, count, offs + h->offset);
+  return next_ops->pwrite (nxdata, buf, count, offs + h->offset, flags,
err);
 }

 /* Trim data. */
 static int
 partition_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
-                void *handle, uint32_t count, uint64_t offs)
+                void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+                int *err)
 {
   struct handle *h = handle;

-  return next_ops->trim (nxdata, count, offs + h->offset);
+  return next_ops->trim (nxdata, count, offs + h->offset, flags, err);
 }

 /* Zero data. */
 static int
 partition_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
-                void *handle, uint32_t count, uint64_t offs, int may_trim)
+                void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+                int *err)
 {
   struct handle *h = handle;

-  return next_ops->zero (nxdata, count, offs + h->offset, may_trim);
+  return next_ops->zero (nxdata, count, offs + h->offset, flags, err);
 }

 static struct nbdkit_filter filter = {
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 04/15] filters: Add log filter

'nbdkit -v' is quite verbose, and traces everything.  Well, actually
it only traces if you also use -f; because if we daemonize, stderr
is redirected to /dev/null.  Sometimes, we want to trace JUST the
client interactions, and it would be nice to trace this even when
run in the background.  In particular, when debugging another
filter, being able to trace what an earlier filter changed can be
quite useful; and having the log filter in place will make it easier
to add testsuite coverage of other filters (this patch includes a
rudimentary test of just the log filter).  Also, it is nice to have
timestamps in the log, in order to see if message interleaving takes
place, and what delays are occurring.

The filter requires a logfile; I debated making the filter fall
back to logging to stderr if no logfile parameter was found, but
while that works for 'nbdkit -f', it fails miserably when we
daemonize (again, part of the argument for having this filter).

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v3: rework to changed api, get unit test working
---
 TODO                              |   2 -
 docs/nbdkit-filter.pod            |   1 +
 docs/nbdkit.pod                   |   1 +
 filters/log/nbdkit-log-filter.pod | 115 ++++++++++++
 configure.ac                      |   1 +
 filters/log/log.c                 | 365 ++++++++++++++++++++++++++++++++++++++
 filters/Makefile.am               |   1 +
 filters/log/Makefile.am           |  62 +++++++
 tests/Makefile.am                 |   4 +
 tests/test-log.sh                 |  88 +++++++++
 10 files changed, 638 insertions(+), 2 deletions(-)
 create mode 100644 filters/log/nbdkit-log-filter.pod
 create mode 100644 filters/log/log.c
 create mode 100644 filters/log/Makefile.am
 create mode 100755 tests/test-log.sh

diff --git a/TODO b/TODO
index 4731b1e..a691ff3 100644
--- a/TODO
+++ b/TODO
@@ -77,8 +77,6 @@ Suggestions for filters
 * injecting artificial errors or otherwise masking plugin features
   (such as hiding zero support) for testing clients

-* logging all client commands
-
 * fua filter: setting mode=none stops advertisement, mode=emulate uses
   flush emulation (or fails if !can_flush), mode=native passes on
   through (or fails if can_fua does not report native support).  When
diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index 8ef26e3..3ec8f2a 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -563,6 +563,7 @@ Filters:
 L<nbdkit-cache-filter(1)>,
 L<nbdkit-cow-filter(1)>,
 L<nbdkit-delay-filter(1)>,
+L<nbdkit-log-filter(1)>,
 L<nbdkit-offset-filter(1)>,
 L<nbdkit-partition-filter(1)>.

diff --git a/docs/nbdkit.pod b/docs/nbdkit.pod
index 1167245..22d91e7 100644
--- a/docs/nbdkit.pod
+++ b/docs/nbdkit.pod
@@ -920,6 +920,7 @@ Filters:
 L<nbdkit-cache-filter(1)>,
 L<nbdkit-cow-filter(1)>,
 L<nbdkit-delay-filter(1)>,
+L<nbdkit-log-filter(1)>,
 L<nbdkit-offset-filter(1)>,
 L<nbdkit-partition-filter(1)>.

diff --git a/filters/log/nbdkit-log-filter.pod
b/filters/log/nbdkit-log-filter.pod
new file mode 100644
index 0000000..f3484fe
--- /dev/null
+++ b/filters/log/nbdkit-log-filter.pod
@@ -0,0 +1,115 @@
+=encoding utf8
+
+=head1 NAME
+
+nbdkit-log-filter - nbdkit log filter
+
+=head1 SYNOPSIS
+
+ nbdkit --filter=log plugin logfile=FILE [plugin-args...]
+
+=head1 DESCRIPTION
+
+C<nbdkit-log-filter> is a filter that logs all transactions.  When
+used as the first filter, it can show the original client requests; as
+a later filter, it can show how earlier filters have modified the
+original request.  The log results are placed in a user-specified
+file; for more details on the log format, see L<FILES>.  Note that
+using C<nbdkit -v> produces much more verbose logging details to
+stderr about every aspect of nbdkit operation.
+
+=head1 PARAMETERS
+
+The nbdkit-log-filter requires a single parameter C<logfile> which
+specifies the path of the file to use for logging.  If the file
+already exists, it will be truncated.
+
+=head1 EXAMPLES
+
+Serve the file F<disk.img>, and log each client transaction in the
+file F<disk.log>:
+
+ nbdkit --filter=log file file=disk.img logfile=disk.log
+
+Repeat the task, but with the cow (copy-on-write) filter to perform
+local caching of data served from the original plugin:
+
+ nbdkit --filter=cow --filter=log file file=disk.img logfile=disk.log2
+
+After running a client that performs the same operations under each of
+the two servers, you can compare F<disk.log> and F<disk.log2> to
see
+the impact of the caching.
+
+=head1 FILES
+
+This filter writes to the file specified by the C<logfile=FILE>
+parameter.  All lines include a timestamp, a connection counter, then
+details about the command.  The following actions are logged: Connect,
+Read, Write, Zero, Trim, Flush, and Disconnect.  Except for Connect
+and Disconnect, an event is logged across two lines for call and
+return value, to allow tracking duration and tracing any parallel
+execution, using id for correlation (incremented per action on the
+connection).
+
+An example logging session of a client that performs a single
+successful read is:
+
+ 2018-01-27 20:38:22.959984 connection=1 Connect size=0x400 write=1 flush=1
rotational=0 trim=0
+ 2018-01-27 20:38:23.001720 connection=1 Read id=1 offset=0x0 count=0x100 ...
+ 2018-01-27 20:38:23.001995 connection=1 ...Read id=1 return=0 (Success)
+ 2018-01-27 20:38:23.044259 connection=1 Disconnect transactions=1
+
+=head1 SEE ALSO
+
+L<nbdkit(1)>,
+L<nbdkit-file-plugin(1)>,
+L<nbdkit-cow-filter(1)>,
+L<nbdkit-filter(3)>.
+
+=head1 AUTHORS
+
+Eric Blake
+
+=head1 COPYRIGHT
+
+Copyright (C) 2018 Red Hat Inc.
+
+=head1 LICENSE
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+=over 4
+
+=item *
+
+Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+
+=item *
+
+Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+=item *
+
+Neither the name of Red Hat nor the names of its contributors may be
+used to endorse or promote products derived from this software without
+specific prior written permission.
+
+=back
+
+THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS''
AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGE.
diff --git a/configure.ac b/configure.ac
index 6025ce0..3fcc776 100644
--- a/configure.ac
+++ b/configure.ac
@@ -516,6 +516,7 @@ AC_CONFIG_FILES([Makefile
                  filters/cache/Makefile
                  filters/cow/Makefile
                  filters/delay/Makefile
+                 filters/log/Makefile
                  filters/offset/Makefile
                  filters/partition/Makefile
                  src/Makefile
diff --git a/filters/log/log.c b/filters/log/log.c
new file mode 100644
index 0000000..58fd4f8
--- /dev/null
+++ b/filters/log/log.c
@@ -0,0 +1,365 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdarg.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <pthread.h>
+#include <sys/time.h>
+#include <assert.h>
+
+#include <nbdkit-filter.h>
+
+#define THREAD_MODEL NBDKIT_THREAD_MODEL_PARALLEL
+
+static uint64_t connections;
+static char *logfilename;
+static FILE *logfile;
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
+static void
+log_unload (void)
+{
+  if (logfilename)
+    fclose (logfile);
+  free (logfilename);
+}
+
+/* Called for each key=value passed on the command line. */
+static int
+log_config (nbdkit_next_config *next, void *nxdata,
+            const char *key, const char *value)
+{
+  if (strcmp (key, "logfile") == 0) {
+    free (logfilename);
+    logfilename = strdup (value);
+    if (!logfilename) {
+      nbdkit_error ("strdup: %m");
+      return -1;
+    }
+    return 0;
+  }
+  return next (nxdata, key, value);
+}
+
+/* Open the logfile. */
+static int
+log_config_complete (nbdkit_next_config_complete *next, void *nxdata)
+{
+  if (!logfilename) {
+    nbdkit_error ("missing logfile= parameter for the log filter");
+    return -1;
+  }
+  logfile = fopen (logfilename, "w");
+  if (!logfile) {
+    nbdkit_error ("fopen: %m");
+    return -1;
+  }
+
+  return next (nxdata);
+}
+
+#define log_config_help \
+  "logfile=<FILE>    The file to place the log in."
+
+struct handle {
+  uint64_t connection;
+  uint64_t id;
+};
+
+/* Compute the next id number on the current connection. */
+static uint64_t
+get_id (struct handle *h)
+{
+  uint64_t r;
+
+  pthread_mutex_lock (&lock);
+  r = ++h->id;
+  pthread_mutex_unlock (&lock);
+  return r;
+}
+
+/* Output a timestamp and the log message. */
+static void __attribute__ ((format (printf, 4, 5)))
+output (struct handle *h, const char *act, uint64_t id, const char *fmt, ...)
+{
+  va_list args;
+  struct timeval tv;
+  struct tm tm;
+  char timestamp[27] = "Time unknown";
+
+  /* Logging is best effort, so ignore failure to get timestamp */
+  if (!gettimeofday (&tv, NULL))
+    {
+      size_t s;
+
+      gmtime_r (&tv.tv_sec, &tm);
+      s = strftime (timestamp, sizeof timestamp, "%F %T", &tm);
+      assert (s);
+      s = snprintf (timestamp + s, sizeof timestamp - s, ".%06ld",
+                    0L + tv.tv_usec);
+    }
+  flockfile (logfile);
+  fprintf (logfile, "%s connection=%" PRIu64 " %s ",
timestamp, h->connection,
+           act);
+  if (id)
+    fprintf (logfile, "id=%" PRIu64 " ", id);
+  va_start (args, fmt);
+  vfprintf (logfile, fmt, args);
+  va_end (args);
+  fputc ('\n', logfile);
+  fflush (logfile);
+  funlockfile (logfile);
+}
+
+/* Shared code for a nicer log of return value */
+static void
+output_return (struct handle *h, const char *act, uint64_t id, int r, int *err)
+{
+  const char *s = "Other=>EINVAL";
+
+  /* Only decode what connections.c:nbd_errno() recognizes */
+  if (r == -1) {
+    switch (*err) {
+    case EROFS:
+      s = "EROFS=>EPERM";
+      break;
+    case EPERM:
+      s = "EPERM";
+      break;
+    case EIO:
+      s = "EIO";
+      break;
+    case ENOMEM:
+      s = "ENOMEM";
+      break;
+#ifdef EDQUOT
+    case EDQUOT:
+      s = "EDQUOT=>ENOSPC";
+      break;
+#endif
+    case EFBIG:
+      s = "EFBIG=>ENOSPC";
+      break;
+    case ENOSPC:
+      s = "ENOSPC";
+      break;
+#ifdef ESHUTDOWN
+    case ESHUTDOWN:
+      s = "ESHUTDOWN";
+      break;
+#endif
+    case EINVAL:
+      s = "EINVAL";
+      break;
+    }
+  }
+  else {
+    s = "Success";
+  }
+  output (h, act, id, "return=%d (%s)", r, s);
+}
+
+/* Open a connection. */
+static void *
+log_open (nbdkit_next_open *next, void *nxdata, int readonly)
+{
+  struct handle *h;
+
+  if (next (nxdata, readonly) == -1)
+    return NULL;
+
+  h = malloc (sizeof *h);
+  if (h == NULL) {
+    nbdkit_error ("malloc: %m");
+    return NULL;
+  }
+
+  pthread_mutex_lock (&lock);
+  h->connection = ++connections;
+  pthread_mutex_unlock (&lock);
+  h->id = 0;
+  return h;
+}
+
+static void
+log_close (void *handle)
+{
+  struct handle *h = handle;
+
+  free (h);
+}
+
+static int
+log_prepare (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  struct handle *h = handle;
+  int64_t size = next_ops->get_size (nxdata);
+  int w = next_ops->can_write (nxdata);
+  int f = next_ops->can_flush (nxdata);
+  int r = next_ops->is_rotational (nxdata);
+  int t = next_ops->can_trim (nxdata);
+
+  if (size < 0 || w < 0 || f < 0 || r < 0 || t < 0)
+    return -1;
+
+  /* TODO expose can_zero, can_fua to filters */
+  output (h, "Connect", 0, "size=0x%" PRIx64 "
write=%d flush=%d "
+          "rotational=%d trim=%d" /* zero=? fua=? */, size, w, f, r,
t);
+  return 0;
+}
+
+static int
+log_finalize (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  struct handle *h = handle;
+
+  output (h, "Disconnect", 0, "transactions=%" PRId64,
h->id);
+  return 0;
+}
+
+/* Read data. */
+static int
+log_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
+           void *handle, void *buf, uint32_t count, uint64_t offs,
+           uint32_t flags, int *err)
+{
+  struct handle *h = handle;
+  uint64_t id = get_id (h);
+  int r;
+
+  assert (!flags);
+  output (h, "Read", id, "offset=0x%" PRIx64 "
count=0x%x ...",
+          offs, count);
+  r = next_ops->pread (nxdata, buf, count, offs, flags, err);
+  output_return (h, "...Read", id, r, err);
+  return r;
+}
+
+/* Write data. */
+static int
+log_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
+            void *handle, const void *buf, uint32_t count, uint64_t offs,
+            uint32_t flags, int *err)
+{
+  struct handle *h = handle;
+  uint64_t id = get_id (h);
+  int r;
+
+  assert (!(flags & ~NBDKIT_FLAG_FUA));
+  output (h, "Write", id, "offset=0x%" PRIx64 "
count=0x%x fua=%d ...",
+          offs, count, !!(flags & NBDKIT_FLAG_FUA));
+  r = next_ops->pwrite (nxdata, buf, count, offs, flags, err);
+  output_return (h, "...Write", id, r, err);
+  return r;
+}
+
+/* Flush. */
+static int
+log_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle,
+           uint32_t flags, int *err)
+{
+  struct handle *h = handle;
+  uint64_t id = get_id (h);
+  int r;
+
+  assert (!flags);
+  output (h, "Flush", id, "...");
+  r = next_ops->flush (nxdata, flags, err);
+  output_return (h, "...Flush", id, r, err);
+  return r;
+}
+
+/* Trim data. */
+static int
+log_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
+          void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+          int *err)
+{
+  struct handle *h = handle;
+  uint64_t id = get_id (h);
+  int r;
+
+  assert (!(flags & ~NBDKIT_FLAG_FUA));
+  output (h, "Trim", id, "offset=0x%" PRIx64 "
count=0x%x fua=%d ...",
+          offs, count, !!(flags & NBDKIT_FLAG_FUA));
+  r = next_ops->trim (nxdata, count, offs, flags, err);
+  output_return (h, "...Trim", id, r, err);
+  return r;
+}
+
+/* Zero data. */
+static int
+log_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
+          void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+          int *err)
+{
+  struct handle *h = handle;
+  uint64_t id = get_id (h);
+  int r;
+
+  assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM)));
+  output (h, "Zero", id, "offset=0x%" PRIx64 "
count=0x%x trim=%d fua=%d ...",
+          offs, count, !!(flags & NBDKIT_FLAG_MAY_TRIM),
+          !!(flags & NBDKIT_FLAG_FUA));
+  r = next_ops->zero (nxdata, count, offs, flags, err);
+  output_return (h, "...Zero", id, r, err);
+  return r;
+}
+
+static struct nbdkit_filter filter = {
+  .name              = "log",
+  .longname          = "nbdkit log filter",
+  .version           = PACKAGE_VERSION,
+  .config            = log_config,
+  .config_complete   = log_config_complete,
+  .config_help       = log_config_help,
+  .unload            = log_unload,
+  .open              = log_open,
+  .close             = log_close,
+  .prepare           = log_prepare,
+  .finalize          = log_finalize,
+  .pread             = log_pread,
+  .pwrite            = log_pwrite,
+  .flush             = log_flush,
+  .trim              = log_trim,
+  .zero              = log_zero,
+};
+
+NBDKIT_REGISTER_FILTER(filter)
diff --git a/filters/Makefile.am b/filters/Makefile.am
index 9996d77..8e070e5 100644
--- a/filters/Makefile.am
+++ b/filters/Makefile.am
@@ -34,5 +34,6 @@ SUBDIRS = \
 	cache \
 	cow \
 	delay \
+	log \
 	offset \
 	partition
diff --git a/filters/log/Makefile.am b/filters/log/Makefile.am
new file mode 100644
index 0000000..18a0eba
--- /dev/null
+++ b/filters/log/Makefile.am
@@ -0,0 +1,62 @@
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+EXTRA_DIST = nbdkit-log-filter.pod
+
+CLEANFILES = *~
+
+filterdir = $(libdir)/nbdkit/filters
+
+filter_LTLIBRARIES = nbdkit-log-filter.la
+
+nbdkit_log_filter_la_SOURCES = \
+	log.c \
+	$(top_srcdir)/include/nbdkit-filter.h
+
+nbdkit_log_filter_la_CPPFLAGS = \
+	-I$(top_srcdir)/include
+nbdkit_log_filter_la_CFLAGS = \
+	$(WARNINGS_CFLAGS)
+nbdkit_log_filter_la_LDFLAGS = \
+	-module -avoid-version -shared
+
+if HAVE_POD2MAN
+
+man_MANS = nbdkit-log-filter.1
+CLEANFILES += $(man_MANS)
+
+nbdkit-log-filter.1: nbdkit-log-filter.pod
+	$(POD2MAN) $(POD2MAN_ARGS) --section=1 --name=`basename $@ .1` $< $@.t
&& \
+	if grep 'POD ERROR' $@.t; then rm $@.t; exit 1; fi && \
+	mv $@.t $@
+
+endif
diff --git a/tests/Makefile.am b/tests/Makefile.am
index ec33109..2d6393d 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -51,6 +51,7 @@ EXTRA_DIST = \
 	test-help.sh \
 	test-help-plugin.sh \
 	test-ip.sh \
+	test-log.sh \
 	test_ocaml_plugin.ml \
 	test-ocaml.c \
 	test-parallel-file.sh \
@@ -427,6 +428,9 @@ test_delay_SOURCES = test-delay.c test.h
 test_delay_CFLAGS = $(WARNINGS_CFLAGS) $(LIBGUESTFS_CFLAGS)
 test_delay_LDADD = libtest.la $(LIBGUESTFS_LIBS)

+# log filter test.
+TESTS += test-log.sh
+
 # offset filter test.
 check_DATA += offset-data
 MAINTAINERCLEANFILES += offset-data
diff --git a/tests/test-log.sh b/tests/test-log.sh
new file mode 100755
index 0000000..5d6a138
--- /dev/null
+++ b/tests/test-log.sh
@@ -0,0 +1,88 @@
+#!/bin/bash -
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+set -e
+
+files="log.img log.log log.sock log.pid"
+rm -f $files
+
+: ${QEMU_IO=qemu-io}
+
+# Test that qemu-io works
+truncate --size 10M log.img
+if ! $QEMU_IO -f raw -c 'w 1M 2M' log.img; then
+    echo "$0: missing or broken qemu-io"
+    exit 77
+fi
+
+# Run nbdkit with logging enabled to file.
+nbdkit -P log.pid -U log.sock --filter=log file file=log.img logfile=log.log
+
+# We may have to wait a short time for the pid file to appear.
+for i in `seq 1 10`; do
+    if test -f log.pid; then
+        break
+    fi
+    sleep 1
+done
+if ! test -f log.pid; then
+    echo "$0: PID file was not created"
+    exit 1
+fi
+
+pid="$(cat log.pid)"
+
+# Kill the nbdkit process on exit.
+cleanup ()
+{
+    status=$?
+
+    kill $pid
+    # For easier debugging, dump the final log file before removing it.
+    echo "Log file contents:"
+    cat log.log
+    rm -f $files
+
+    exit $status
+}
+trap cleanup INT QUIT TERM EXIT ERR
+
+# Write, then read some data in the file.
+$QEMU_IO -f raw -c 'w -P 11 1M 2M'
'nbd+unix://?socket=log.sock'
+$QEMU_IO -r -f raw -c 'r -P 11 2M 1M'
'nbd+unix://?socket=log.sock'
+
+# The log should show a write on connection 1, and read on connection 2.
+grep 'connection=1 Write id=1 offset=0x100000 count=0x200000 ' log.log
+grep 'connection=2 Read id=1 offset=0x200000 count=0x100000 ' log.log
+
+# The cleanup() function is called implicitly on exit.
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 05/15] filters: Add blocksize filter

The upstream NBD protocol recently clarified that servers can
advertise block size limitations to clients that ask with
NBD_OPT_GO (although we're still a ways off from implementing
that in nbdkit); and that in the absence of that, then clients
should agree on limits using out-of-band information or stick
to sane defaults (everything 512-byte-aligned, no reads or
writes larger than 32M).  But the protocol is not inherently
prevented from serving 1-byte requests, nor does it prohibit
a request of nearly 4G on bulk actions like trim and zero; and
nbdkit itself supports 1-byte requests, read and write up to
64M, and no limit on trim or zero, even though these requests
may fail miserably on some plugins.

So, nbdkit should make it easier to plug together clients and
servers that have different notions of blocksize limitations.
A new blocksize filter makes it possible to specify a minimum
blocksize handed to the plugin (everything smaller is rounded
out, using read-modify-write as needed), and maximum limits
(maxdata for read/write since they have a buffer, and maxlen
for zero/trim since they do not).

Testing is easy by reusing the log filter and observing that
requests were rewritten as expected (well, with the slight
complication that different versions of qemu-io vary in whether
they obey the NBD spec of limiting themselves to 512-byte I/O
if we don't advertise otherwise with NBD_OPT_GO, or in just
directly handing us byte-based I/O anyway - so the test has
to round to something even bigger to be sure it sees a
difference).

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v3: rebase to API changes; get unit test working; use (fixed)
nbdkit_parse_size
---
 TODO                                          |   5 -
 docs/nbdkit-filter.pod                        |   1 +
 docs/nbdkit.pod                               |   1 +
 filters/blocksize/nbdkit-blocksize-filter.pod | 141 +++++++++++
 configure.ac                                  |   3 +-
 filters/blocksize/blocksize.c                 | 350 ++++++++++++++++++++++++++
 filters/Makefile.am                           |   1 +
 filters/blocksize/Makefile.am                 |  62 +++++
 tests/Makefile.am                             |   4 +
 tests/test-blocksize.sh                       | 156 ++++++++++++
 10 files changed, 718 insertions(+), 6 deletions(-)
 create mode 100644 filters/blocksize/nbdkit-blocksize-filter.pod
 create mode 100644 filters/blocksize/blocksize.c
 create mode 100644 filters/blocksize/Makefile.am
 create mode 100755 tests/test-blocksize.sh

diff --git a/TODO b/TODO
index a691ff3..d02671d 100644
--- a/TODO
+++ b/TODO
@@ -86,11 +86,6 @@ Suggestions for filters
   unneeded intermediate flushing; hence, where this filter is placed
   in the stack may have a performance impact.

-* blocksize filter: setting minblock performs read-modify-write of
-  requests that are too small or unaligned for the plugin; setting
-  maxdata breaks up too-large read/write; setting maxlen breaks up
-  too-large trim/zero
-
 nbdkit-cache-filter needs considerable work:

 * allow the user to limit the total size of the cache
diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index 3ec8f2a..32d50cb 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -560,6 +560,7 @@ L<nbdkit-plugin(1)>.

 Filters:

+L<nbdkit-blocksize-filter(1)>,
 L<nbdkit-cache-filter(1)>,
 L<nbdkit-cow-filter(1)>,
 L<nbdkit-delay-filter(1)>,
diff --git a/docs/nbdkit.pod b/docs/nbdkit.pod
index 22d91e7..94ddb62 100644
--- a/docs/nbdkit.pod
+++ b/docs/nbdkit.pod
@@ -917,6 +917,7 @@ L<nbdkit-xz-plugin(1)>.

 Filters:

+L<nbdkit-blocksize-filter(1)>,
 L<nbdkit-cache-filter(1)>,
 L<nbdkit-cow-filter(1)>,
 L<nbdkit-delay-filter(1)>,
diff --git a/filters/blocksize/nbdkit-blocksize-filter.pod
b/filters/blocksize/nbdkit-blocksize-filter.pod
new file mode 100644
index 0000000..39a2ffc
--- /dev/null
+++ b/filters/blocksize/nbdkit-blocksize-filter.pod
@@ -0,0 +1,141 @@
+=encoding utf8
+
+=head1 NAME
+
+nbdkit-blocksize-filter - nbdkit blocksize filter
+
+=head1 SYNOPSIS
+
+ nbdkit --filter=blocksize plugin [minblock=SIZE] [maxdata=SIZE] \
+     [maxlen=SIZE] [plugin-args...]
+
+=head1 DESCRIPTION
+
+C<nbdkit-blocksize-filter> is a filter that ensures various block size
+limits are met on transactions presented to the plugin.  The NBD
+protocol permits clients to send requests with a granularity as small
+as 1 byte or as large as nearly 4 gigabytes, although it suggests that
+portable clients should align requests to 512 bytes and not exceed 32
+megabytes without prior coordination with the server.
+
+Meanwhile, some plugins require requests to be aligned to 512-byte
+multiples, or may enforce a maximum transaction size to bound the time
+or memory resources spent by any one command (note that nbdkit itself
+refuses a read or write larger than 64 megabytes, while many other NBD
+servers limit things to 32 megabytes).  The blocksize filter can be
+used to modify the client requests to meet the plugin restrictions.
+
+=head1 PARAMETERS
+
+The nbdkit-blocksize-filter accepts the following parameters.
+
+=over 4
+
+=item B<minblock=SIZE>
+
+The minimum block size and alignment to pass to the plugin.  This must
+be a power of two, and no larger than 64k.  If omitted, this defaults
+to 1 (that is, no minimum size restrictions).  The filter rounds up
+read requests to alignment boundaries, performs read-modify-write
+cycles for any unaligned head or tail of a write or zero request, and
+silently ignores any unaligned head or tail of a trim request.  The
+filter also truncates the plugin size down to an aligned value (as it
+cannot safely operate on the unaligned tail); it is an error if this
+would result in a size of 0.
+
+This parameter understands the suffix 'k' for 1024.
+
+=item B<maxdata=SIZE>
+
+The maximum block size for any single transaction with data (read and
+write).  If omitted, this defaults to 64 megabytes (that is, the
+nbdkit maximum).  This need not be a power of two, but must be an
+integer multiple of C<minblock>.  The filter fragments any larger
+client request into multiple plugin requests.
+
+This parameter understands the suffixes 'k', 'M', and
'G' for powers
+of 1024.
+
+=item B<maxlen=SIZE>
+
+The maximum length for any single transaction without data (trim and
+zero).  If omitted, this defaults to 0xffffffff rounded down to
+C<minsize> alignment (that is, the inherent 32-bit limit of the NBD
+protocol).  This need not be a power of two, but must be an integer
+multiple of C<minblock>, and should be at least as large as
+C<maxdata>.  The filter fragments any larger client request into
+multiple plugin requests.
+
+This parameter understands the suffixes 'k', 'M', and
'G' for powers
+of 1024.
+
+=back
+
+=head1 EXAMPLES
+
+Allow an arbitrary client to use the VDDK plugin (which is limited to
+512-byte blocks), without having to fix the client to avoid sending
+unaligned requests:
+
+ nbdkit --filter=blocksize vddk minblock=512 file=/absolute/path/to/file.vmdk
+
+Allow an arbitrary client tuned to nbdkit's 64 megabyte sizing to
+connect to a remote server that insists on 32 megabyte sizing, via the
+nbd plugin:
+
+ nbdkit --filter=blocksize nbd maxdata=32M socket=/path/to/socket
+
+=head1 SEE ALSO
+
+L<nbdkit(1)>,
+L<nbdkit-nbd-plugin(1)>,
+L<nbdkit-vddk-plugin(1)>,
+L<nbdkit-filter(3)>.
+
+=head1 AUTHORS
+
+Eric Blake
+
+=head1 COPYRIGHT
+
+Copyright (C) 2018 Red Hat Inc.
+
+=head1 LICENSE
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+=over 4
+
+=item *
+
+Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+
+=item *
+
+Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+=item *
+
+Neither the name of Red Hat nor the names of its contributors may be
+used to endorse or promote products derived from this software without
+specific prior written permission.
+
+=back
+
+THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS''
AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGE.
diff --git a/configure.ac b/configure.ac
index 3fcc776..c3a121d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1,5 +1,5 @@
 # nbdkit
-# Copyright (C) 2013-2017 Red Hat Inc.
+# Copyright (C) 2013-2018 Red Hat Inc.
 # All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
@@ -513,6 +513,7 @@ AC_CONFIG_FILES([Makefile
                  plugins/vddk/Makefile
                  plugins/xz/Makefile
                  filters/Makefile
+                 filters/blocksize/Makefile
                  filters/cache/Makefile
                  filters/cow/Makefile
                  filters/delay/Makefile
diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c
new file mode 100644
index 0000000..8834457
--- /dev/null
+++ b/filters/blocksize/blocksize.c
@@ -0,0 +1,350 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <errno.h>
+
+#include <nbdkit-filter.h>
+
+/* XXX See design comment in filters/cow/cow.c. */
+#define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS
+
+#define BLOCKSIZE_MIN_LIMIT (64U * 1024)
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+
+/* As long as we don't have parallel requests, we can reuse a common
+ * buffer for alignment purposes; size it to the maximum we allow for
+ * minblock. */
+static char bounce[BLOCKSIZE_MIN_LIMIT];
+static unsigned int minblock;
+static unsigned int maxdata;
+static unsigned int maxlen;
+
+static int
+blocksize_parse (const char *name, const char *s, unsigned int *v)
+{
+  int64_t size = nbdkit_parse_size (s);
+
+  if (UINT_MAX < size) {
+    nbdkit_error ("parameter '%s' too large", name);
+    return -1;
+  }
+  *v = size;
+  return 0;
+}
+
+/* Called for each key=value passed on the command line. */
+static int
+blocksize_config (nbdkit_next_config *next, void *nxdata,
+                  const char *key, const char *value)
+{
+
+  if (strcmp (key, "minblock") == 0)
+    return blocksize_parse (key, value, &minblock);
+  if (strcmp (key, "maxdata") == 0)
+    return blocksize_parse (key, value, &maxdata);
+  if (strcmp (key, "maxlen") == 0)
+    return blocksize_parse (key, value, &maxlen);
+  return next (nxdata, key, value);
+}
+
+/* Check that limits are sane. */
+static int
+blocksize_config_complete (nbdkit_next_config_complete *next, void *nxdata)
+{
+  if (minblock) {
+    if (minblock & (minblock - 1)) {
+      nbdkit_error ("minblock must be a power of 2");
+      return -1;
+    }
+    if (minblock > BLOCKSIZE_MIN_LIMIT) {
+      nbdkit_error ("minblock must not exceed %u",
BLOCKSIZE_MIN_LIMIT);
+      return -1;
+    }
+  }
+  else
+    minblock = 1;
+
+  if (maxdata) {
+    if (maxdata & (minblock - 1)) {
+      nbdkit_error ("maxdata must be a multiple of %u", minblock);
+      return -1;
+    }
+  }
+  else
+    maxdata = 64 * 1024 * 1024;
+
+  if (maxlen) {
+    if (maxlen & (minblock - 1)) {
+      nbdkit_error ("maxlen must be a multiple of %u", minblock);
+      return -1;
+    }
+  }
+  else
+    maxlen = -minblock;
+
+  return next (nxdata);
+}
+
+#define blocksize_config_help \
+  "minblock=<SIZE>      Minimum block size, power of 2 <= 64k
(default 1).\n" \
+  "maxdata=<SIZE>       Maximum size for read/write (default
64M).\n" \
+  "maxlen=<SIZE>        Maximum size for trim/zero (default
4G-minblock)."
+
+static int
+blocksize_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
+                   void *handle)
+{
+  /* Early call to get_size to ensure it doesn't truncate to 0. */
+  int64_t size = next_ops->get_size (nxdata);
+
+  if (size == -1)
+    return -1;
+  if (size < minblock) {
+    nbdkit_error ("disk is too small for minblock size %u",
minblock);
+    return -1;
+  }
+  return 0;
+}
+
+static int64_t
+blocksize_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
+                    void *handle)
+{
+  int64_t size = next_ops->get_size (nxdata);
+
+  return size == -1 ? size : size & ~(minblock - 1);
+}
+
+static int
+blocksize_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
+                 void *handle, void *b, uint32_t count, uint64_t offs,
+                 uint32_t flags, int *err)
+{
+  char *buf = b;
+  uint32_t keep;
+  uint32_t drop;
+
+  /* Unaligned head */
+  if (offs & (minblock - 1)) {
+    drop = offs & (minblock - 1);
+    keep = MIN (minblock - drop, count);
+    if (next_ops->pread (nxdata, bounce, minblock, offs - drop, flags,
+                         err) == -1)
+      return -1;
+    memcpy (buf, bounce + drop, keep);
+    buf += keep;
+    offs += keep;
+    count -= keep;
+  }
+
+  /* Unaligned tail */
+  if (count & (minblock - 1)) {
+    keep = count & (minblock - 1);
+    count -= keep;
+    if (next_ops->pread (nxdata, bounce, minblock, offs + count, flags,
+                         err) == -1)
+      return -1;
+    memcpy (buf + count, bounce, keep);
+  }
+
+  /* Aligned body */
+  while (count) {
+    keep = MIN (maxdata, count);
+    if (next_ops->pread (nxdata, buf, keep, offs, flags, err) == -1)
+      return -1;
+    buf += keep;
+    offs += keep;
+    count -= keep;
+  }
+
+  return 0;
+}
+
+static int
+blocksize_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
+                  void *handle, const void *b, uint32_t count, uint64_t offs,
+                  uint32_t flags, int *err)
+{
+  const char *buf = b;
+  uint32_t keep;
+  uint32_t drop;
+
+  /* FIXME: Smarter handling of FUA - pass it through if the next layer
+   * can handle it natively, but just once at end if next layer emulates. */
+
+  /* Unaligned head */
+  if (offs & (minblock - 1)) {
+    drop = offs & (minblock - 1);
+    keep = MIN (minblock - drop, count);
+    if (next_ops->pread (nxdata, bounce, minblock, offs - drop, 0, err) ==
-1)
+      return -1;
+    memcpy (bounce + drop, buf, keep);
+    if (next_ops->pwrite (nxdata, bounce, minblock, offs - drop, flags,
+                          err) == -1)
+      return -1;
+    buf += keep;
+    offs += keep;
+    count -= keep;
+  }
+
+  /* Unaligned tail */
+  if (count & (minblock - 1)) {
+    keep = count & (minblock - 1);
+    count -= keep;
+    if (next_ops->pread (nxdata, bounce, minblock, offs + count, 0, err) ==
-1)
+      return -1;
+    memcpy (bounce, buf + count, keep);
+    if (next_ops->pwrite (nxdata, bounce, minblock, offs + count, flags,
+                          err) == -1)
+      return -1;
+  }
+
+  /* Aligned body */
+  while (count) {
+    keep = MIN (maxdata, count);
+    if (next_ops->pwrite (nxdata, buf, keep, offs, flags, err) == -1)
+      return -1;
+    buf += keep;
+    offs += keep;
+    count -= keep;
+  }
+
+  return 0;
+}
+
+static int
+blocksize_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
+                void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+                int *err)
+{
+  uint32_t keep;
+
+  /* FIXME: Smarter handling of FUA - pass it through if the next layer
+   * can handle it natively, but just once at end if next layer emulates. */
+
+  /* Unaligned head */
+  if (offs & (minblock - 1)) {
+    keep = MIN (minblock - (offs & (minblock - 1)), count);
+    offs += keep;
+    count -= keep;
+  }
+
+  /* Unaligned tail */
+  if (count & (minblock - 1))
+    count -= count & (minblock - 1);
+
+  /* Aligned body */
+  while (count) {
+    keep = MIN (maxlen, count);
+    if (next_ops->trim (nxdata, keep, offs, flags, err) == -1)
+      return -1;
+    offs += keep;
+    count -= keep;
+  }
+
+  return 0;
+}
+
+static int
+blocksize_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
+                void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+                int *err)
+{
+  uint32_t keep;
+  uint32_t drop;
+
+  /* FIXME: Smarter handling of FUA - pass it through if the next layer
+   * can handle it natively, but just once at end if next layer emulates. */
+
+  /* Unaligned head */
+  if (offs & (minblock - 1)) {
+    drop = offs & (minblock - 1);
+    keep = MIN (minblock - drop, count);
+    if (next_ops->pread (nxdata, bounce, minblock, offs - drop, 0, err) ==
-1)
+      return -1;
+    memset (bounce + drop, 0, keep);
+    if (next_ops->pwrite (nxdata, bounce, minblock, offs - drop,
+                          flags & ~NBDKIT_FLAG_MAY_TRIM, err) == -1)
+      return -1;
+    offs += keep;
+    count -= keep;
+  }
+
+  /* Unaligned tail */
+  if (count & (minblock - 1)) {
+    keep = count & (minblock - 1);
+    count -= keep;
+    if (next_ops->pread (nxdata, bounce, minblock, offs + count, 0, err) ==
-1)
+      return -1;
+    memset (bounce, 0, keep);
+    if (next_ops->pwrite (nxdata, bounce, minblock, offs + count,
+                          flags & ~NBDKIT_FLAG_MAY_TRIM, err) == -1)
+      return -1;
+  }
+
+  /* Aligned body */
+  while (count) {
+    keep = MIN (maxlen, count);
+    if (next_ops->zero (nxdata, keep, offs, flags, err) == -1)
+      return -1;
+    offs += keep;
+    count -= keep;
+  }
+
+  return 0;
+}
+
+static struct nbdkit_filter filter = {
+  .name              = "blocksize",
+  .longname          = "nbdkit blocksize filter",
+  .version           = PACKAGE_VERSION,
+  .config            = blocksize_config,
+  .config_complete   = blocksize_config_complete,
+  .config_help       = blocksize_config_help,
+  .prepare           = blocksize_prepare,
+  .get_size          = blocksize_get_size,
+  .pread             = blocksize_pread,
+  .pwrite            = blocksize_pwrite,
+  .trim              = blocksize_trim,
+  .zero              = blocksize_zero,
+};
+
+NBDKIT_REGISTER_FILTER(filter)
diff --git a/filters/Makefile.am b/filters/Makefile.am
index 8e070e5..de98f43 100644
--- a/filters/Makefile.am
+++ b/filters/Makefile.am
@@ -31,6 +31,7 @@
 # SUCH DAMAGE.

 SUBDIRS = \
+	blocksize \
 	cache \
 	cow \
 	delay \
diff --git a/filters/blocksize/Makefile.am b/filters/blocksize/Makefile.am
new file mode 100644
index 0000000..0069403
--- /dev/null
+++ b/filters/blocksize/Makefile.am
@@ -0,0 +1,62 @@
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+EXTRA_DIST = nbdkit-blocksize-filter.pod
+
+CLEANFILES = *~
+
+filterdir = $(libdir)/nbdkit/filters
+
+filter_LTLIBRARIES = nbdkit-blocksize-filter.la
+
+nbdkit_blocksize_filter_la_SOURCES = \
+	blocksize.c \
+	$(top_srcdir)/include/nbdkit-filter.h
+
+nbdkit_blocksize_filter_la_CPPFLAGS = \
+	-I$(top_srcdir)/include
+nbdkit_blocksize_filter_la_CFLAGS = \
+	$(WARNINGS_CFLAGS)
+nbdkit_blocksize_filter_la_LDFLAGS = \
+	-module -avoid-version -shared
+
+if HAVE_POD2MAN
+
+man_MANS = nbdkit-blocksize-filter.1
+CLEANFILES += $(man_MANS)
+
+nbdkit-blocksize-filter.1: nbdkit-blocksize-filter.pod
+	$(POD2MAN) $(POD2MAN_ARGS) --section=1 --name=`basename $@ .1` $< $@.t
&& \
+	if grep 'POD ERROR' $@.t; then rm $@.t; exit 1; fi && \
+	mv $@.t $@
+
+endif
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 2d6393d..2b3082e 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -40,6 +40,7 @@ EXTRA_DIST = \
 	shebang.pl \
 	shebang.py \
 	shebang.rb \
+	test-blocksize.sh \
 	test-cache.sh \
 	test-captive.sh \
 	test-cow.sh \
@@ -414,6 +415,9 @@ endif HAVE_RUBY
 #----------------------------------------------------------------------
 # Tests of filters.

+# blocksize filter test.
+TESTS += test-blocksize.sh
+
 # cache filter test.
 TESTS += test-cache.sh

diff --git a/tests/test-blocksize.sh b/tests/test-blocksize.sh
new file mode 100755
index 0000000..2a78e9c
--- /dev/null
+++ b/tests/test-blocksize.sh
@@ -0,0 +1,156 @@
+#!/bin/bash -
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+set -e
+
+files="blocksize1.img blocksize1.log blocksize1.sock blocksize1.pid
+       blocksize2.img blocksize2.log blocksize2.sock blocksize2.pid"
+rm -f $files
+
+: ${QEMU_IO=qemu-io}
+
+# Prep images, and check that qemu-io understands the actions we plan on doing.
+# TODO: Until we implement NBD_OPT_GO, qemu-io does its own read-modify-write
+# at 512-byte alignment, while we'd like to ultimately test 1-byte accesses
+truncate --size 10M blocksize1.img
+if ! $QEMU_IO -f raw -c 'r 0 1' -c 'w -z 1000 2000' \
+     -c 'w -P 0 1M 2M' -c 'discard 3M 4M' blocksize1.img; then
+    echo "$0: missing or broken qemu-io"
+    rm blocksize1.img
+    exit 77
+fi
+truncate --size 10M blocksize2.img
+
+pid1= pid2+
+# Kill any nbdkit processes on exit.
+cleanup ()
+{
+    status=$?
+
+    test "$pid1" && kill $pid1
+    test "$pid2" && kill $pid2
+    # For easier debugging, dump the final log files before removing them.
+    echo "Log 1 file contents:"
+    cat blocksize1.log || :
+    echo "Log 2 file contents:"
+    cat blocksize2.log || :
+    rm -f $files
+
+    exit $status
+}
+trap cleanup INT QUIT TERM EXIT ERR
+
+# Run two parallel nbdkit; to compare the logs and see what changes.
+nbdkit -P blocksize1.pid -U blocksize1.sock \
+       --filter=log file logfile=blocksize1.log file=blocksize1.img
+nbdkit -P blocksize2.pid -U blocksize2.sock --filter=blocksize \
+       --filter=log file logfile=blocksize2.log file=blocksize2.img \
+       minblock=1024 maxdata=512k maxlen=1M
+
+# We may have to wait a short time for the pid files to appear.
+for i in `seq 1 10`; do
+    if test -f blocksize1.pid && test -f blocksize2.pid; then
+        break
+    fi
+    sleep 1
+done
+
+pid1="$(cat blocksize1.pid)" || :
+pid2="$(cat blocksize2.pid)" || :
+
+if ! test -f blocksize1.pid || ! test -f blocksize2.pid; then
+    echo "$0: PID files were not created"
+    exit 1
+fi
+
+# Test behavior on short accesses.
+$QEMU_IO -f raw -c 'r 1 1' -c 'w 10001 1' -c 'w -z 20001
1' \
+         -c 'discard 30001 1'
'nbd+unix://?socket=blocksize1.sock'
+$QEMU_IO -f raw -c 'r 1 1' -c 'w 10001 1' -c 'w -z 20001
1' \
+         -c 'discard 30001 1'
'nbd+unix://?socket=blocksize2.sock'
+
+# Read should round up (qemu-io may round to 512, but we must round to 1024
+grep 'connection=1 Read .* count=0x\(1\|200\) ' blocksize1.log ||
+    { echo "qemu-io can't pass 1-byte reads"; exit 77; }
+grep 'connection=1 Read .* offset=0x0 count=0x400 ' blocksize2.log
+# Write should become read-modify-write
+grep 'connection=1 Write .* count=0x\(1\|200\) ' blocksize1.log ||
+    { echo "qemu-io can't pass 1-byte writes"; exit 77; }
+grep 'connection=1 Read .* offset=0x2400 count=0x400 ' blocksize2.log
+grep 'connection=1 Write .* offset=0x2400 count=0x400 ' blocksize2.log
+# Zero should become read-modify-write
+if grep 'connection=1 Zero' blocksize2.log; then
+    echo "filter should have converted short zero to write"
+    exit 1
+fi
+grep 'connection=1 Read .* offset=0x4c00 count=0x400 ' blocksize2.log
+grep 'connection=1 Write .* offset=0x4c00 count=0x400 ' blocksize2.log
+# Trim should be discarded
+if grep 'connection=1 Trim' blocksize2.log; then
+    echo "filter should have dropped too-small trim"
+    exit 1
+fi
+
+# Test behavior on overlarge accesses.
+$QEMU_IO -f raw -c 'w -P 11 1048575 4094305' -c 'w -z 1050000
1100000' \
+         -c 'r -P 0 1050000 1100000' -c 'r -P 11 3000000
1048577' \
+         -c 'discard 7340031 2097153'
'nbd+unix://?socket=blocksize1.sock'
+$QEMU_IO -f raw -c 'w -P 11 1048575 4094305' -c 'w -z 1050000
1100000' \
+         -c 'r -P 0 1050000 1100000' -c 'r -P 11 3000000
1048577' \
+         -c 'discard 7340031 2097153'
'nbd+unix://?socket=blocksize2.sock'
+
+# Reads and writes should have been split.
+test "$(grep -c '\(Read\|Write\) .*count=0x80000 '
blocksize2.log)" -ge 10
+test "$(grep -c '\(Read\|Write\) .*count=0x[0-9a-f]\{6\} '
blocksize2.log)" = 0
+# Zero and trim should be split, but at different boundary
+grep 'Zero .*count=0x100000 ' blocksize2.log
+test "$(grep -c 'connection=2 Zero' blocksize2.log)" = 2
+if grep Trim blocksize1.log; then
+    test "$(grep -c 'connection=2 Trim .*count=0x100000 '
blocksize2.log)" = 2
+fi
+
+# Final sanity checks.
+if grep 'offset=0x[0-9a-f]*\([1235679abdef]00\|[0-9a-f]\(.[^0]\|[^0].\)\)
' \
+        blocksize2.log; then
+    echo "filter didn't align offset to 1024";
+    exit 1;
+fi
+if grep 'count=0x[0-9a-f]*\([1235679abdef]00\|[0-9a-f]\(.[^0]\|[^0].\)\)
' \
+        blocksize2.log; then
+    echo"filter didn't align count to 512";
+    exit 1;
+fi
+diff -u blocksize1.img blocksize2.img
+
+# The cleanup() function is called implicitly on exit.
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 06/15] backend: Add .can_zero/.can_fua helpers

While our plugin code always advertises WRITE_ZEROES on writable
connections (because we can emulate .zero by calling .pwrite),
and FUA support when .flush exists (because we can emulate the
persistence by calling .flush), it is conceivable that a filter
may want to explicitly avoid advertising particular bits.  More
to the point, upcoming patches will add a 'nozero' filter that
hides WRITE_ZEROES support, at least for the purposes of timing
comparisons between server emulation and the client having to
send zeroes over the wire; as well as support for letting plugins
and filters directly manage the level of FUA support.

Since we only have to honor FUA support on write requests, we do
not need to advertise it on a readonly connection.

Wire up two new backend callbacks.  .can_zero has the traditional
boolean+error semantics, but .can_fua has three return values
besides errors - this is because down the road, the decision on
whether to pass FUA to the next layer when subdividing a larger
request into smaller ones is dependent on knowing whether FUA is
emulated by .flush (flush only once after the last chunk) or
handled natively (pass FUA on every chunk).

Later patches will then expose both backend helpers to filters,
and .can_fua to plugins.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 src/internal.h    |  8 ++++++++
 src/connections.c | 31 +++++++++++++++++++++++++++----
 src/filters.c     | 24 ++++++++++++++++++++++++
 src/plugins.c     | 30 ++++++++++++++++++++++++++++++
 4 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/src/internal.h b/src/internal.h
index 12dfc14..b49f071 100644
--- a/src/internal.h
+++ b/src/internal.h
@@ -98,6 +98,10 @@
       (type *) ((char *) __mptr - offsetof(type, member));       \
     })

+#define NBDKIT_FUA_NONE       0
+#define NBDKIT_FUA_EMULATE    1
+#define NBDKIT_FUA_NATIVE     2
+
 /* main.c */
 extern const char *exportname;
 extern const char *ipaddr;
@@ -177,11 +181,15 @@ struct backend {
   int (*prepare) (struct backend *, struct connection *conn);
   int (*finalize) (struct backend *, struct connection *conn);
   void (*close) (struct backend *, struct connection *conn);
+
   int64_t (*get_size) (struct backend *, struct connection *conn);
   int (*can_write) (struct backend *, struct connection *conn);
   int (*can_flush) (struct backend *, struct connection *conn);
   int (*is_rotational) (struct backend *, struct connection *conn);
   int (*can_trim) (struct backend *, struct connection *conn);
+  int (*can_zero) (struct backend *, struct connection *conn);
+  int (*can_fua) (struct backend *, struct connection *conn);
+
   int (*pread) (struct backend *, struct connection *conn, void *buf,
                 uint32_t count, uint64_t offset, uint32_t flags, int *err);
   int (*pwrite) (struct backend *, struct connection *conn, const void *buf,
diff --git a/src/connections.c b/src/connections.c
index 03f59f7..fbcf5e5 100644
--- a/src/connections.c
+++ b/src/connections.c
@@ -80,6 +80,8 @@ struct connection {
   int can_flush;
   int is_rotational;
   int can_trim;
+  int can_zero;
+  int can_fua;
   int using_tls;

   int sockin, sockout;
@@ -426,7 +428,13 @@ compute_eflags (struct connection *conn, uint16_t *flags)
     conn->readonly = 1;
   }
   if (!conn->readonly) {
-    eflags |= NBD_FLAG_SEND_WRITE_ZEROES;
+    fl = backend->can_zero (backend, conn);
+    if (fl == -1)
+      return -1;
+    if (fl) {
+      eflags |= NBD_FLAG_SEND_WRITE_ZEROES;
+      conn->can_zero = 1;
+    }

     fl = backend->can_trim (backend, conn);
     if (fl == -1)
@@ -435,13 +443,21 @@ compute_eflags (struct connection *conn, uint16_t *flags)
       eflags |= NBD_FLAG_SEND_TRIM;
       conn->can_trim = 1;
     }
+
+    fl = backend->can_fua (backend, conn);
+    if (fl == -1)
+      return -1;
+    if (fl) {
+      eflags |= NBD_FLAG_SEND_FUA;
+      conn->can_fua = 1;
+    }
   }

   fl = backend->can_flush (backend, conn);
   if (fl == -1)
     return -1;
   if (fl) {
-    eflags |= NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA;
+    eflags |= NBD_FLAG_SEND_FLUSH;
     conn->can_flush = 1;
   }

@@ -880,7 +896,7 @@ validate_request (struct connection *conn,
     *error = EINVAL;
     return false;
   }
-  if (!conn->can_flush && (flags & NBD_CMD_FLAG_FUA)) {
+  if (!conn->can_fua && (flags & NBD_CMD_FLAG_FUA)) {
     nbdkit_error ("invalid request: FUA flag not supported");
     *error = EINVAL;
     return false;
@@ -909,6 +925,13 @@ validate_request (struct connection *conn,
     return false;
   }

+  /* Zero allowed? */
+  if (!conn->can_zero && cmd == NBD_CMD_WRITE_ZEROES) {
+    nbdkit_error ("invalid request: write zeroes operation not
supported");
+    *error = EINVAL;
+    return false;
+  }
+
   return true;                     /* Command validates. */
 }

@@ -928,7 +951,7 @@ handle_request (struct connection *conn,
                 void *buf)
 {
   uint32_t f = 0;
-  bool fua = conn->can_flush && (flags & NBD_CMD_FLAG_FUA);
+  bool fua = conn->can_fua && (flags & NBD_CMD_FLAG_FUA);
   int err = 0;

   /* Clear the error, so that we know if the plugin calls
diff --git a/src/filters.c b/src/filters.c
index 9d821f2..0cf7594 100644
--- a/src/filters.c
+++ b/src/filters.c
@@ -446,6 +446,28 @@ filter_can_trim (struct backend *b, struct connection
*conn)
     return f->backend.next->can_trim (f->backend.next, conn);
 }

+static int
+filter_can_zero (struct backend *b, struct connection *conn)
+{
+  struct backend_filter *f = container_of (b, struct backend_filter, backend);
+
+  debug ("can_zero");
+
+  /* TODO expose this to filters */
+  return f->backend.next->can_zero (f->backend.next, conn);
+}
+
+static int
+filter_can_fua (struct backend *b, struct connection *conn)
+{
+  struct backend_filter *f = container_of (b, struct backend_filter, backend);
+
+  debug ("can_fua");
+
+  /* TODO expose this to plugins and filters */
+  return f->backend.next->can_fua (f->backend.next, conn);
+}
+
 static int
 filter_pread (struct backend *b, struct connection *conn,
               void *buf, uint32_t count, uint64_t offset,
@@ -570,6 +592,8 @@ static struct backend filter_functions = {
   .can_flush = filter_can_flush,
   .is_rotational = filter_is_rotational,
   .can_trim = filter_can_trim,
+  .can_zero = filter_can_zero,
+  .can_fua = filter_can_fua,
   .pread = filter_pread,
   .pwrite = filter_pwrite,
   .flush = filter_flush,
diff --git a/src/plugins.c b/src/plugins.c
index d5a2a65..10c911d 100644
--- a/src/plugins.c
+++ b/src/plugins.c
@@ -361,6 +361,34 @@ get_error (struct backend_plugin *p)
   return ret ? ret : EIO;
 }

+static int
+plugin_can_zero (struct backend *b, struct connection *conn)
+{
+  debug ("can_zero");
+
+  /* We always allow .zero to fall back to .write, so plugins don't
+   * need to override this. */
+  return plugin_can_write (b, conn);
+}
+
+static int
+plugin_can_fua (struct backend *b, struct connection *conn)
+{
+  struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
+  int r;
+
+  debug ("can_fua");
+
+  /* TODO - wire FUA flag support into plugins. Until then, this copies
+   * can_flush, since that's how we emulate FUA. */
+  r = plugin_can_flush (b, conn);
+  if (r == -1)
+    return -1;
+  if (r == 0 || !p->plugin.flush)
+    return NBDKIT_FUA_NONE;
+  return NBDKIT_FUA_EMULATE;
+}
+
 static int
 plugin_pread (struct backend *b, struct connection *conn,
               void *buf, uint32_t count, uint64_t offset, uint32_t flags,
@@ -546,6 +574,8 @@ static struct backend plugin_functions = {
   .can_flush = plugin_can_flush,
   .is_rotational = plugin_is_rotational,
   .can_trim = plugin_can_trim,
+  .can_zero = plugin_can_zero,
+  .can_fua = plugin_can_fua,
   .pread = plugin_pread,
   .pwrite = plugin_pwrite,
   .flush = plugin_flush,
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 07/15] filters: Expose new .can_zero callback

While our plugin code always advertises WRITE_ZEROES on writable
connections (because we can emulate .zero by calling .pwrite),
it is conceivable that a filter may want to explicitly avoid
advertising particular bits.  More to the point, an upcoming
patch will add a 'nozero' filter that hides WRITE_ZEROES support,
at least for the purposes of timing comparisons between server
emulation and the client having to send zeroes over the wire.

This change alters filter API; but given that we've already bumped
the API in a previous patch, we can group all of the API changes
within the same release without having to bump the #define value
in this patch.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 docs/nbdkit-filter.pod  | 20 +++++++++++++++++++-
 include/nbdkit-filter.h |  3 +++
 src/filters.c           | 16 ++++++++++++++--
 filters/log/log.c       |  7 ++++---
 4 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index 32d50cb..3af97b0 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -353,7 +353,8 @@ calls.
  int (*can_trim) (struct nbdkit_next_ops *next_ops, void *nxdata,
                   void *handle);

-These intercept the corresponding plugin methods.
+These intercept the corresponding plugin methods, and control feature
+bits advertised to the client.

 If there is an error, the callback should call C<nbdkit_error> with an
 error message and return C<-1>.  If these functions are called more
@@ -361,6 +362,23 @@ than once for the same connection, they should return the
same value;
 similarly, the filter may cache the results of each counterpart in
 C<next_ops> for a given connection rather than repeating calls.

+=head2 C<.can_zero>
+
+ int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
+                  void *handle);
+
+This controls whether write zero support will be advertised to the
+client.  This function has no counterpart in plugins, because nbdkit
+can always emulate zero by using pwrite; but a filter may want to
+force different handling than the nbdkit implementation.  If this
+callback is omitted, the default returned for the plugin layer is
+true.
+
+If there is an error, the callback should call C<nbdkit_error> with an
+error message and return C<-1>.  Like the other initial queries
+documented above, caching the return value of this function is
+allowed.
+
 =head2 C<.pread>

  int (*pread) (struct nbdkit_next_ops *next_ops, void *nxdata,
diff --git a/include/nbdkit-filter.h b/include/nbdkit-filter.h
index 95be130..533713b 100644
--- a/include/nbdkit-filter.h
+++ b/include/nbdkit-filter.h
@@ -57,6 +57,7 @@ struct nbdkit_next_ops {
   int (*can_flush) (void *nxdata);
   int (*is_rotational) (void *nxdata);
   int (*can_trim) (void *nxdata);
+  int (*can_zero) (void *nxdata);

   int (*pread) (void *nxdata, void *buf, uint32_t count, uint64_t offset,
                 uint32_t flags, int *err);
@@ -115,6 +116,8 @@ struct nbdkit_filter {
                         void *handle);
   int (*can_trim) (struct nbdkit_next_ops *next_ops, void *nxdata,
                    void *handle);
+  int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
+                   void *handle);

   int (*pread) (struct nbdkit_next_ops *next_ops, void *nxdata,
                 void *handle, void *buf, uint32_t count, uint64_t offset,
diff --git a/src/filters.c b/src/filters.c
index 0cf7594..491c676 100644
--- a/src/filters.c
+++ b/src/filters.c
@@ -287,6 +287,13 @@ next_can_trim (void *nxdata)
   return b_conn->b->can_trim (b_conn->b, b_conn->conn);
 }

+static int
+next_can_zero (void *nxdata)
+{
+  struct b_conn *b_conn = nxdata;
+  return b_conn->b->can_zero (b_conn->b, b_conn->conn);
+}
+
 static int
 next_pread (void *nxdata, void *buf, uint32_t count, uint64_t offset,
             uint32_t flags, int *err)
@@ -334,6 +341,7 @@ static struct nbdkit_next_ops next_ops = {
   .can_flush = next_can_flush,
   .is_rotational = next_is_rotational,
   .can_trim = next_can_trim,
+  .can_zero = next_can_zero,
   .pread = next_pread,
   .pwrite = next_pwrite,
   .flush = next_flush,
@@ -450,11 +458,15 @@ static int
 filter_can_zero (struct backend *b, struct connection *conn)
 {
   struct backend_filter *f = container_of (b, struct backend_filter, backend);
+  void *handle = connection_get_handle (conn, f->backend.i);
+  struct b_conn nxdata = { .b = f->backend.next, .conn = conn };

   debug ("can_zero");

-  /* TODO expose this to filters */
-  return f->backend.next->can_zero (f->backend.next, conn);
+  if (f->filter.can_zero)
+    return f->filter.can_zero (&next_ops, &nxdata, handle);
+  else
+    return f->backend.next->can_zero (f->backend.next, conn);
 }

 static int
diff --git a/filters/log/log.c b/filters/log/log.c
index 58fd4f8..2dd61c0 100644
--- a/filters/log/log.c
+++ b/filters/log/log.c
@@ -235,13 +235,14 @@ log_prepare (struct nbdkit_next_ops *next_ops, void
*nxdata, void *handle)
   int f = next_ops->can_flush (nxdata);
   int r = next_ops->is_rotational (nxdata);
   int t = next_ops->can_trim (nxdata);
+  int z = next_ops->can_zero (nxdata);

-  if (size < 0 || w < 0 || f < 0 || r < 0 || t < 0)
+  if (size < 0 || w < 0 || f < 0 || r < 0 || t < 0 || z < 0)
     return -1;

-  /* TODO expose can_zero, can_fua to filters */
+  /* TODO expose can_fua to filters */
   output (h, "Connect", 0, "size=0x%" PRIx64 "
write=%d flush=%d "
-          "rotational=%d trim=%d" /* zero=? fua=? */, size, w, f, r,
t);
+          "rotational=%d trim=%d zero=%d" /* fua=? */, size, w, f, r,
t, z);
   return 0;
 }

-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 08/15] filters: Add nozero filter

Sometimes, it's nice to see what a difference it makes in timing
or in destination file size when sparse handling is enabled or
disabled.  Add a new filter that makes it trivial to disable
write zero support, both at the client side (the feature is not
advertised, so the client must do fallbacks) and at the server
side (the feature is always emulated, rather than using any
fast paths built into the plugin).

The log filter makes it easy to test the difference between the
two modes.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v3: rebase to improved API, get unit test working
---
 TODO                                    |   2 +-
 docs/nbdkit-filter.pod                  |   1 +
 docs/nbdkit.pod                         |   1 +
 filters/nozero/nbdkit-nozero-filter.pod |  99 ++++++++++++++++++++++
 configure.ac                            |   1 +
 filters/nozero/nozero.c                 | 106 +++++++++++++++++++++++
 filters/Makefile.am                     |   1 +
 filters/nozero/Makefile.am              |  62 ++++++++++++++
 tests/Makefile.am                       |   4 +
 tests/test-nozero.sh                    | 145 ++++++++++++++++++++++++++++++++
 10 files changed, 421 insertions(+), 1 deletion(-)
 create mode 100644 filters/nozero/nbdkit-nozero-filter.pod
 create mode 100644 filters/nozero/nozero.c
 create mode 100644 filters/nozero/Makefile.am
 create mode 100755 tests/test-nozero.sh

diff --git a/TODO b/TODO
index d02671d..e7a8405 100644
--- a/TODO
+++ b/TODO
@@ -75,7 +75,7 @@ Suggestions for filters
 -----------------------

 * injecting artificial errors or otherwise masking plugin features
-  (such as hiding zero support) for testing clients
+  for testing clients (see 'nozero' filter for example)

 * fua filter: setting mode=none stops advertisement, mode=emulate uses
   flush emulation (or fails if !can_flush), mode=native passes on
diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index 3af97b0..6766216 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -583,6 +583,7 @@ L<nbdkit-cache-filter(1)>,
 L<nbdkit-cow-filter(1)>,
 L<nbdkit-delay-filter(1)>,
 L<nbdkit-log-filter(1)>,
+L<nbdkit-nozero-filter(1)>,
 L<nbdkit-offset-filter(1)>,
 L<nbdkit-partition-filter(1)>.

diff --git a/docs/nbdkit.pod b/docs/nbdkit.pod
index 94ddb62..9247ff5 100644
--- a/docs/nbdkit.pod
+++ b/docs/nbdkit.pod
@@ -922,6 +922,7 @@ L<nbdkit-cache-filter(1)>,
 L<nbdkit-cow-filter(1)>,
 L<nbdkit-delay-filter(1)>,
 L<nbdkit-log-filter(1)>,
+L<nbdkit-nozero-filter(1)>,
 L<nbdkit-offset-filter(1)>,
 L<nbdkit-partition-filter(1)>.

diff --git a/filters/nozero/nbdkit-nozero-filter.pod
b/filters/nozero/nbdkit-nozero-filter.pod
new file mode 100644
index 0000000..4065302
--- /dev/null
+++ b/filters/nozero/nbdkit-nozero-filter.pod
@@ -0,0 +1,99 @@
+=encoding utf8
+
+=head1 NAME
+
+nbdkit-nozero-filter - nbdkit nozero filter
+
+=head1 SYNOPSIS
+
+ nbdkit --filter=nozero plugin [zeromode=MODE] [plugin-args...]
+
+=head1 DESCRIPTION
+
+C<nbdkit-nozero-filter> is a filter that intentionally disables
+efficient handling of sparse file holes (ranges of all-zero bytes)
+across the NBD protocol.  It is mainly useful for evaluating timing
+differences between naive vs. sparse-aware connections, and for
+testing client or server fallbacks.
+
+=head1 PARAMETERS
+
+=over 4
+
+=item B<zeromode=MODE>
+
+Optional, controls which mode the filter will use.  Mode B<none>
+(default) means that zero support is not advertised to the client;
+mode B<emulate> means that zero support is emulated by the filter
+using the plugin's C<pwrite> callback, regardless of whether the
+plugin itself implemented the C<zero> callback with a more efficient
+way to write zeros.
+
+=back
+
+=head1 EXAMPLES
+
+Serve the file F<disk.img>, but force the client to write zeroes
+explicitly rather than with C<NBD_CMD_WRITE_ZEROES>:
+
+ nbdkit --filter=nozero file file=disk.img
+
+Serve the file F<disk.img>, allowing the client to take advantage of
+less network traffic via C<NBD_CMD_WRITE_ZEROES>, but still forcing
+the data to be written explicitly rather than punching any holes:
+
+ nbdkit --filter=nozero file zeromode=emulate file=disk.img
+
+=head1 SEE ALSO
+
+L<nbdkit(1)>,
+L<nbdkit-file-plugin(1)>,
+L<nbdkit-filter(3)>.
+
+=head1 AUTHORS
+
+Eric Blake
+
+=head1 COPYRIGHT
+
+Copyright (C) 2018 Red Hat Inc.
+
+=head1 LICENSE
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+=over 4
+
+=item *
+
+Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+
+=item *
+
+Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+=item *
+
+Neither the name of Red Hat nor the names of its contributors may be
+used to endorse or promote products derived from this software without
+specific prior written permission.
+
+=back
+
+THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS''
AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGE.
diff --git a/configure.ac b/configure.ac
index c3a121d..29ef109 100644
--- a/configure.ac
+++ b/configure.ac
@@ -518,6 +518,7 @@ AC_CONFIG_FILES([Makefile
                  filters/cow/Makefile
                  filters/delay/Makefile
                  filters/log/Makefile
+                 filters/nozero/Makefile
                  filters/offset/Makefile
                  filters/partition/Makefile
                  src/Makefile
diff --git a/filters/nozero/nozero.c b/filters/nozero/nozero.c
new file mode 100644
index 0000000..dac47ad
--- /dev/null
+++ b/filters/nozero/nozero.c
@@ -0,0 +1,106 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdbool.h>
+#include <assert.h>
+
+#include <nbdkit-filter.h>
+
+#define THREAD_MODEL NBDKIT_THREAD_MODEL_PARALLEL
+
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+#define MAX_WRITE (64 * 1024 * 1024)
+
+static char buffer[MAX_WRITE];
+static bool emulate;
+
+static int
+nozero_config (nbdkit_next_config *next, void *nxdata,
+               const char *key, const char *value)
+{
+  if (strcmp (key, "zeromode") == 0) {
+    if (strcmp (value, "emulate") == 0)
+      emulate = true;
+    else if (strcmp (value, "none") != 0) {
+      nbdkit_error ("unknown zeromode '%s'", value);
+      return -1;
+    }
+    return 0;
+  }
+  return next (nxdata, key, value);
+}
+
+#define nozero_config_help \
+  "zeromode=<MODE>      Either 'none' (default) or
'emulate'.\n" \
+
+/* Advertise desired WRITE_ZEROES mode. */
+static int
+nozero_can_zero (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  return emulate;
+}
+
+static int
+nozero_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
+             void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+             int *err)
+{
+  assert (emulate);
+  while (count) {
+    uint32_t size = MIN (count, MAX_WRITE);
+    if (next_ops->pwrite (nxdata, buffer, size, offs,
+                          flags & ~NBDKIT_FLAG_MAY_TRIM, err) == -1)
+      return -1;
+    offs += size;
+    count -= size;
+  }
+  return 0;
+}
+
+static struct nbdkit_filter filter = {
+  .name              = "nozero",
+  .longname          = "nbdkit nozero filter",
+  .version           = PACKAGE_VERSION,
+  .config            = nozero_config,
+  .config_help       = nozero_config_help,
+  .can_zero          = nozero_can_zero,
+  .zero              = nozero_zero,
+};
+
+NBDKIT_REGISTER_FILTER(filter)
diff --git a/filters/Makefile.am b/filters/Makefile.am
index de98f43..170ee09 100644
--- a/filters/Makefile.am
+++ b/filters/Makefile.am
@@ -36,5 +36,6 @@ SUBDIRS = \
 	cow \
 	delay \
 	log \
+	nozero \
 	offset \
 	partition
diff --git a/filters/nozero/Makefile.am b/filters/nozero/Makefile.am
new file mode 100644
index 0000000..1c66ed8
--- /dev/null
+++ b/filters/nozero/Makefile.am
@@ -0,0 +1,62 @@
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+EXTRA_DIST = nbdkit-nozero-filter.pod
+
+CLEANFILES = *~
+
+filterdir = $(libdir)/nbdkit/filters
+
+filter_LTLIBRARIES = nbdkit-nozero-filter.la
+
+nbdkit_nozero_filter_la_SOURCES = \
+	nozero.c \
+	$(top_srcdir)/include/nbdkit-filter.h
+
+nbdkit_nozero_filter_la_CPPFLAGS = \
+	-I$(top_srcdir)/include
+nbdkit_nozero_filter_la_CFLAGS = \
+	$(WARNINGS_CFLAGS)
+nbdkit_nozero_filter_la_LDFLAGS = \
+	-module -avoid-version -shared
+
+if HAVE_POD2MAN
+
+man_MANS = nbdkit-nozero-filter.1
+CLEANFILES += $(man_MANS)
+
+nbdkit-nozero-filter.1: nbdkit-nozero-filter.pod
+	$(POD2MAN) $(POD2MAN_ARGS) --section=1 --name=`basename $@ .1` $< $@.t
&& \
+	if grep 'POD ERROR' $@.t; then rm $@.t; exit 1; fi && \
+	mv $@.t $@
+
+endif
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 2b3082e..0013fda 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -53,6 +53,7 @@ EXTRA_DIST = \
 	test-help-plugin.sh \
 	test-ip.sh \
 	test-log.sh \
+	test-nozero.sh \
 	test_ocaml_plugin.ml \
 	test-ocaml.c \
 	test-parallel-file.sh \
@@ -435,6 +436,9 @@ test_delay_LDADD = libtest.la $(LIBGUESTFS_LIBS)
 # log filter test.
 TESTS += test-log.sh

+# nozero filter test.
+TESTS += test-nozero.sh
+
 # offset filter test.
 check_DATA += offset-data
 MAINTAINERCLEANFILES += offset-data
diff --git a/tests/test-nozero.sh b/tests/test-nozero.sh
new file mode 100755
index 0000000..85caa61
--- /dev/null
+++ b/tests/test-nozero.sh
@@ -0,0 +1,145 @@
+#!/bin/bash -
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+set -e
+
+files="nozero1.img nozero1.log nozero1.sock nozero1.pid
+       nozero2.img nozero2.log nozero2.sock nozero2.pid
+       nozero3.img nozero3.log nozero3.sock nozero3.pid
+       nozero4.img nozero4.log nozero4.sock nozero4.pid"
+rm -f $files
+
+: ${QEMU_IO=qemu-io}
+
+# Prep images, and check that qemu-io understands the actions we plan on
+# doing, and that zero with trim results in a sparse image.
+for f in {0..1023}; do printf '%1024s' . >> nozero1.img; done
+cp nozero1.img nozero2.img
+cp nozero1.img nozero3.img
+cp nozero1.img nozero4.img
+if ! $QEMU_IO -f raw -d unmap -c 'w -z -u 0 1M' nozero1.img; then
+    echo "$0: missing or broken qemu-io"
+    rm nozero?.img
+    exit 77
+fi
+if test "$(stat -c %b nozero1.img)" = "$(stat -c %b
nozero2.img)"; then
+    echo "$0: can't trim file by writing zeros"
+    rm nozero?.img
+    exit 77
+fi
+cp nozero2.img nozero1.img
+
+pid1= pid2= pid3= pid4+
+# Kill any nbdkit processes on exit.
+cleanup ()
+{
+    status=$?
+
+    test "$pid1" && kill $pid1
+    test "$pid2" && kill $pid2
+    test "$pid3" && kill $pid3
+    test "$pid4" && kill $pid4
+    # For easier debugging, dump the final log files before removing them.
+    echo "Log 1 file contents:"
+    cat nozero1.log || :
+    echo "Log 2 file contents:"
+    cat nozero2.log || :
+    echo "Log 3 file contents:"
+    cat nozero3.log || :
+    echo "Log 4 file contents:"
+    cat nozero4.log || :
+    rm -f $files
+
+    exit $status
+}
+trap cleanup INT QUIT TERM EXIT ERR
+
+# Run four parallel nbdkit; to compare the logs and see what changes.
+# 1: unfiltered, to check that qemu-io sends ZERO request and plugin trims
+# 2: log before filter with zeromode=none (default), to ensure no ZERO request
+# 3: log before filter with zeromode=emulate, to ensure ZERO from client
+# 4: log after filter with zeromode=emulate, to ensure no ZERO to plugin
+nbdkit -P nozero1.pid -U nozero1.sock --filter=log \
+       file logfile=nozero1.log file=nozero1.img
+nbdkit -P nozero2.pid -U nozero2.sock --filter=log --filter=nozero \
+       file logfile=nozero2.log file=nozero2.img
+nbdkit -P nozero3.pid -U nozero3.sock --filter=log --filter=nozero \
+       file logfile=nozero3.log file=nozero3.img zeromode=emulate
+nbdkit -P nozero4.pid -U nozero4.sock --filter=nozero --filter=log \
+       file logfile=nozero4.log file=nozero4.img zeromode=emulate
+
+# We may have to wait a short time for the pid files to appear.
+for i in `seq 1 10`; do
+    if test -f nozero1.pid && test -f nozero2.pid && test -f
nozero3.pid &&
+       test -f nozero4.pid; then
+        break
+    fi
+    sleep 1
+done
+
+pid1="$(cat nozero1.pid)" || :
+pid2="$(cat nozero2.pid)" || :
+pid3="$(cat nozero3.pid)" || :
+pid4="$(cat nozero4.pid)" || :
+
+if ! test -f nozero1.pid || ! test -f nozero2.pid || ! test -f nozero3.pid ||
+   ! test -f nozero4.pid; then
+    echo "$0: PID files were not created"
+    exit 1
+fi
+
+# Perform the zero write.
+$QEMU_IO -f raw -c 'w -z -u 0 1M'
'nbd+unix://?socket=nozero1.sock'
+$QEMU_IO -f raw -c 'w -z -u 0 1M'
'nbd+unix://?socket=nozero2.sock'
+$QEMU_IO -f raw -c 'w -z -u 0 1M'
'nbd+unix://?socket=nozero3.sock'
+$QEMU_IO -f raw -c 'w -z -u 0 1M'
'nbd+unix://?socket=nozero4.sock'
+
+# Check for expected ZERO vs. WRITE results
+grep 'connection=1 Zero' nozero1.log
+if grep 'connection=1 Zero' nozero2.log; then
+    echo "filter should have prevented zero"
+    exit 1
+fi
+grep 'connection=1 Zero' nozero3.log
+if grep 'connection=1 Zero' nozero4.log; then
+    echo "filter should have converted zero into write"
+    exit 1
+fi
+
+# Sanity check on contents - all 4 files should read identically
+cmp nozero1.img nozero2.img
+cmp nozero2.img nozero3.img
+cmp nozero3.img nozero4.img
+
+# The cleanup() function is called implicitly on exit.
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 09/15] filters: Expose new .can_fua callback

We already added a .can_fua callback internally; it is now time to
expose it to the filters, and to update the particular filters that
can perform more efficient FUA when subdividing requests.

Note that this is a change to the filter API; but given that we've
already bumped the API in a previous patch, we can group all of the
API changes within the same release without bumping the #define
value in this patch.

For the cow filter, emulate FUA requests locally (arguably, as long as
we are serializing all requests and the overlay file is temporary,
waiting for data to hit the disk is pointless and we could treat FUA
as a no-op, but for now we keep the fdatasync() call that matches what
we would have had with a client calling flush instead of using FUA).

For the cache and blocksize filters, optimize whether the FUA flag is
passed on to the next layer according to what style of FUA the next
layer implements.  However, in the emulate case, it was simpler to
just always do a final flush, compared to figuring out how to send
the FUA flag on just the final sub-request.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 docs/nbdkit-filter.pod        | 46 +++++++++++++++++++++++++++++++++++++------
 include/nbdkit-common.h       |  4 ++++
 include/nbdkit-filter.h       |  3 +++
 src/internal.h                |  4 ----
 src/filters.c                 | 16 +++++++++++++--
 filters/blocksize/blocksize.c | 32 ++++++++++++++++++++++++------
 filters/cache/cache.c         | 25 ++++++++++++++++++++++-
 filters/cow/cow.c             | 21 ++++++++++++++++++++
 filters/log/log.c             |  6 +++---
 9 files changed, 135 insertions(+), 22 deletions(-)

diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index 6766216..d53daf4 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -379,6 +379,39 @@ error message and return C<-1>.  Like the other
initial queries
 documented above, caching the return value of this function is
 allowed.

+=head2 C<.can_fua>
+
+ int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata,
+                 void *handle);
+
+This controls how the Forced Unit Access flag will be handled; it is
+only relevant for connections that are not read-only.  For now, this
+function has no counterpart in plugins, although it will be added.
+Unlike other feature functions with just two success states, this one
+returns three success values: C<NBDKIT_FUA_NONE> to avoid advertising
+FUA support to the client, C<NBDKIT_FUA_EMULATE> if FUA is emulated by
+nbdkit calling flush after a write transaction, and
+C<NBDKIT_FUA_NATIVE> if FUA semantics are handled by the plugin
+without the overhead of an extra flush from nbdkit.
+
+The difference between the two non-zero values may affect choices made
+in the filter: when splitting a write request that requested FUA from
+the client, native support should pass the FUA flag on to each
+sub-request; while if it is known that FUA is emulated by a flush, it
+is more efficient to only flush once after all sub-requests have
+completed (often by passing C<NBDKIT_FLAG_FUA> on to only the final
+sub-request, or by dropping the flag and ending with a direct call to
+C<next_ops-E<gt>flush>).
+
+If this callback is omitted, the default returned for the plugin layer
+is C<NBDKIT_FUA_EMULATE> if C<can_flush> returned true, otherwise
it
+is C<NBDKIT_FUA_NONE>.
+
+If there is an error, the callback should call C<nbdkit_error> with an
+error message and return C<-1>.  Like the other initial queries
+documented above, caching the return value of this function is
+allowed.
+
 =head2 C<.pread>

  int (*pread) (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -412,9 +445,9 @@ turn, the filter should not call
C<next_ops-E<gt>pwrite> if
 C<next_ops-E<gt>can_write> did not return true.

 The parameter C<flags> may include C<NBDKIT_FLAG_FUA> on input
based
-on the result of C<.can_flush>.  In turn, the filter should only pass
+on the result of C<.can_fua>.  In turn, the filter should only pass
 C<NBDKIT_FLAG_FUA> on to C<next_ops-E<gt>pwrite> if
-C<next_ops-E<gt>can_flush> returned true.
+C<next_ops-E<gt>can_fua> returned a positive value.

 If there is an error (including a short write which couldn't be
 recovered from), C<.pwrite> should call C<nbdkit_error> with an
error
@@ -455,9 +488,9 @@ turn, the filter should not call
C<next_ops-E<gt>trim> if
 C<next_ops-E<gt>can_trim> did not return true.

 The parameter C<flags> may include C<NBDKIT_FLAG_FUA> on input
based
-on the result of C<.can_flush>.  In turn, the filter should only pass
+on the result of C<.can_fua>.  In turn, the filter should only pass
 C<NBDKIT_FLAG_FUA> on to C<next_ops-E<gt>trim> if
-C<next_ops-E<gt>can_flush> returned true.
+C<next_ops-E<gt>can_fua> returned a positive value.

 If there is an error, C<.trim> should call C<nbdkit_error> with an
 error message B<and> return -1 with C<err> set to the positive
errno
@@ -478,9 +511,10 @@ C<next_ops-E<gt>can_write> did not return true.

 On input, the parameter C<flags> may include
C<NBDKIT_FLAG_MAY_TRIM>
 unconditionally, and C<NBDKIT_FLAG_FUA> based on the result of
-C<.can_flush>.  In turn, the filter may pass
C<NBDKIT_FLAG_MAY_TRIM>
+C<.can_fua>.  In turn, the filter may pass C<NBDKIT_FLAG_MAY_TRIM>
 unconditionally, but should only pass C<NBDKIT_FLAG_FUA> on to
-C<next_ops-E<gt>zero> if C<next_ops-E<gt>can_flush>
returned true.
+C<next_ops-E<gt>zero> if C<next_ops-E<gt>can_fua>
returned a positive
+value.

 Note that unlike the plugin C<.zero> which is permitted to fail with
 C<EOPNOTSUPP> to force a fallback to C<.pwrite>, the function
diff --git a/include/nbdkit-common.h b/include/nbdkit-common.h
index f32bf76..851b890 100644
--- a/include/nbdkit-common.h
+++ b/include/nbdkit-common.h
@@ -53,6 +53,10 @@ extern "C" {
 #define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */
 #define NBDKIT_FLAG_FUA      (1<<1) /* Maps to NBD_CMD_FLAG_FUA */

+#define NBDKIT_FUA_NONE       0
+#define NBDKIT_FUA_EMULATE    1
+#define NBDKIT_FUA_NATIVE     2
+
 extern void nbdkit_error (const char *msg, ...)
   __attribute__((__format__ (__printf__, 1, 2)));
 extern void nbdkit_verror (const char *msg, va_list args);
diff --git a/include/nbdkit-filter.h b/include/nbdkit-filter.h
index 533713b..931d923 100644
--- a/include/nbdkit-filter.h
+++ b/include/nbdkit-filter.h
@@ -58,6 +58,7 @@ struct nbdkit_next_ops {
   int (*is_rotational) (void *nxdata);
   int (*can_trim) (void *nxdata);
   int (*can_zero) (void *nxdata);
+  int (*can_fua) (void *nxdata);

   int (*pread) (void *nxdata, void *buf, uint32_t count, uint64_t offset,
                 uint32_t flags, int *err);
@@ -118,6 +119,8 @@ struct nbdkit_filter {
                    void *handle);
   int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
                    void *handle);
+  int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata,
+                  void *handle);

   int (*pread) (struct nbdkit_next_ops *next_ops, void *nxdata,
                 void *handle, void *buf, uint32_t count, uint64_t offset,
diff --git a/src/internal.h b/src/internal.h
index b49f071..d873cd5 100644
--- a/src/internal.h
+++ b/src/internal.h
@@ -98,10 +98,6 @@
       (type *) ((char *) __mptr - offsetof(type, member));       \
     })

-#define NBDKIT_FUA_NONE       0
-#define NBDKIT_FUA_EMULATE    1
-#define NBDKIT_FUA_NATIVE     2
-
 /* main.c */
 extern const char *exportname;
 extern const char *ipaddr;
diff --git a/src/filters.c b/src/filters.c
index 491c676..3d2c07e 100644
--- a/src/filters.c
+++ b/src/filters.c
@@ -294,6 +294,13 @@ next_can_zero (void *nxdata)
   return b_conn->b->can_zero (b_conn->b, b_conn->conn);
 }

+static int
+next_can_fua (void *nxdata)
+{
+  struct b_conn *b_conn = nxdata;
+  return b_conn->b->can_fua (b_conn->b, b_conn->conn);
+}
+
 static int
 next_pread (void *nxdata, void *buf, uint32_t count, uint64_t offset,
             uint32_t flags, int *err)
@@ -342,6 +349,7 @@ static struct nbdkit_next_ops next_ops = {
   .is_rotational = next_is_rotational,
   .can_trim = next_can_trim,
   .can_zero = next_can_zero,
+  .can_fua = next_can_fua,
   .pread = next_pread,
   .pwrite = next_pwrite,
   .flush = next_flush,
@@ -473,11 +481,15 @@ static int
 filter_can_fua (struct backend *b, struct connection *conn)
 {
   struct backend_filter *f = container_of (b, struct backend_filter, backend);
+  void *handle = connection_get_handle (conn, f->backend.i);
+  struct b_conn nxdata = { .b = f->backend.next, .conn = conn };

   debug ("can_fua");

-  /* TODO expose this to plugins and filters */
-  return f->backend.next->can_fua (f->backend.next, conn);
+  if (f->filter.can_fua)
+    return f->filter.can_fua (&next_ops, &nxdata, handle);
+  else
+    return f->backend.next->can_fua (f->backend.next, conn);
 }

 static int
diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c
index 8834457..88240cd 100644
--- a/filters/blocksize/blocksize.c
+++ b/filters/blocksize/blocksize.c
@@ -37,6 +37,7 @@
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
+#include <stdbool.h>
 #include <inttypes.h>
 #include <limits.h>
 #include <errno.h>
@@ -141,6 +142,7 @@ blocksize_prepare (struct nbdkit_next_ops *next_ops, void
*nxdata,
     nbdkit_error ("disk is too small for minblock size %u",
minblock);
     return -1;
   }
+  /* TODO: cache per-connection FUA mode? */
   return 0;
 }

@@ -206,9 +208,13 @@ blocksize_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
   const char *buf = b;
   uint32_t keep;
   uint32_t drop;
+  bool need_flush = false;

-  /* FIXME: Smarter handling of FUA - pass it through if the next layer
-   * can handle it natively, but just once at end if next layer emulates. */
+  if ((flags & NBDKIT_FLAG_FUA) &&
+      next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+    flags &= ~NBDKIT_FLAG_FUA;
+    need_flush = true;
+  }

   /* Unaligned head */
   if (offs & (minblock - 1)) {
@@ -247,6 +253,8 @@ blocksize_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
     count -= keep;
   }

+  if (need_flush)
+    return next_ops->flush (nxdata, 0, err);
   return 0;
 }

@@ -256,9 +264,13 @@ blocksize_trim (struct nbdkit_next_ops *next_ops, void
*nxdata,
                 int *err)
 {
   uint32_t keep;
+  bool need_flush = false;

-  /* FIXME: Smarter handling of FUA - pass it through if the next layer
-   * can handle it natively, but just once at end if next layer emulates. */
+  if ((flags & NBDKIT_FLAG_FUA) &&
+      next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+    flags &= ~NBDKIT_FLAG_FUA;
+    need_flush = true;
+  }

   /* Unaligned head */
   if (offs & (minblock - 1)) {
@@ -280,6 +292,8 @@ blocksize_trim (struct nbdkit_next_ops *next_ops, void
*nxdata,
     count -= keep;
   }

+  if (need_flush)
+    return next_ops->flush (nxdata, 0, err);
   return 0;
 }

@@ -290,9 +304,13 @@ blocksize_zero (struct nbdkit_next_ops *next_ops, void
*nxdata,
 {
   uint32_t keep;
   uint32_t drop;
+  bool need_flush = false;

-  /* FIXME: Smarter handling of FUA - pass it through if the next layer
-   * can handle it natively, but just once at end if next layer emulates. */
+  if ((flags & NBDKIT_FLAG_FUA) &&
+      next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+    flags &= ~NBDKIT_FLAG_FUA;
+    need_flush = true;
+  }

   /* Unaligned head */
   if (offs & (minblock - 1)) {
@@ -329,6 +347,8 @@ blocksize_zero (struct nbdkit_next_ops *next_ops, void
*nxdata,
     count -= keep;
   }

+  if (need_flush)
+    return next_ops->flush (nxdata, 0, err);
   return 0;
 }

diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index c8dd791..76a0438 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -90,6 +90,10 @@ static enum cache_mode {
   CACHE_MODE_UNSAFE,
 } cache_mode = CACHE_MODE_WRITEBACK;

+static int
+cache_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle,
+             uint32_t flags, int *err);
+
 static void
 cache_load (void)
 {
@@ -228,7 +232,10 @@ cache_prepare (struct nbdkit_next_ops *next_ops, void
*nxdata,
   int64_t r;

   r = cache_get_size (next_ops, nxdata, handle);
-  return r >= 0 ? 0 : -1;
+  if (r < 0)
+    return -1;
+  /* TODO: cache per-connection FUA mode? */
+  return 0;
 }

 /* Return true if the block is allocated.  Consults the bitmap. */
@@ -392,6 +399,7 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
               uint32_t flags, int *err)
 {
   uint8_t *block;
+  bool need_flush = false;

   block = malloc (BLKSIZE);
   if (block == NULL) {
@@ -400,6 +408,11 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
     return -1;
   }

+  if ((flags & NBDKIT_FLAG_FUA) &&
+      next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+    flags &= ~NBDKIT_FLAG_FUA;
+    need_flush = true;
+  }
   while (count > 0) {
     uint64_t blknum, blkoffs, n;

@@ -426,6 +439,8 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void
*nxdata,
   }

   free (block);
+  if (need_flush)
+    return cache_flush (next_ops, nxdata, handle, 0, err);
   return 0;
 }

@@ -436,6 +451,7 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
             int *err)
 {
   uint8_t *block;
+  bool need_flush = false;

   block = malloc (BLKSIZE);
   if (block == NULL) {
@@ -445,6 +461,11 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
   }

   flags &= ~NBDKIT_FLAG_MAY_TRIM; /* See BLKSIZE comment above. */
+  if ((flags & NBDKIT_FLAG_FUA) &&
+      next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+    flags &= ~NBDKIT_FLAG_FUA;
+    need_flush = true;
+  }
   while (count > 0) {
     uint64_t blknum, blkoffs, n;

@@ -469,6 +490,8 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
   }

   free (block);
+  if (need_flush)
+    return cache_flush (next_ops, nxdata, handle, 0, err);
   return 0;
 }

diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index cbf4ed4..2f86f6c 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -61,6 +61,12 @@
  * this we limit the thread model to SERIALIZE_ALL_REQUESTS so that
  * there cannot be concurrent pwrite requests.  We could relax this
  * restriction with a bit of work.
+ *
+ * We allow the client to request FUA, and emulate it with a flush
+ * (arguably, since the write overlay is temporary, and since we
+ * serialize all requests, we could ignore FUA altogether, but
+ * flushing will be more correct if we improve the thread model to be
+ * more parallel).
  */

 #include <config.h>
@@ -235,6 +241,12 @@ cow_can_flush (struct nbdkit_next_ops *next_ops, void
*nxdata, void *handle)
   return 1;
 }

+static int
+cow_can_fua (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  return NBDKIT_FUA_EMULATE;
+}
+
 /* Return true if the block is allocated.  Consults the bitmap. */
 static bool
 blk_is_allocated (uint64_t blknum)
@@ -309,6 +321,10 @@ blk_write (uint64_t blknum, const uint8_t *block, int *err)
   return 0;
 }

+static int
+cow_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle,
+           uint32_t flags, int *err);
+
 /* Read data. */
 static int
 cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -390,6 +406,8 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
   }

   free (block);
+  if (flags & NBDKIT_FLAG_FUA)
+    return cow_flush (next_ops, nxdata, handle, 0, err);
   return 0;
 }

@@ -435,6 +453,8 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
   }

   free (block);
+  if (flags & NBDKIT_FLAG_FUA)
+    return cow_flush (next_ops, nxdata, handle, 0, err);
   return 0;
 }

@@ -466,6 +486,7 @@ static struct nbdkit_filter filter = {
   .can_write         = cow_can_write,
   .can_flush         = cow_can_flush,
   .can_trim          = cow_can_trim,
+  .can_fua           = cow_can_fua,
   .pread             = cow_pread,
   .pwrite            = cow_pwrite,
   .zero              = cow_zero,
diff --git a/filters/log/log.c b/filters/log/log.c
index 2dd61c0..2079084 100644
--- a/filters/log/log.c
+++ b/filters/log/log.c
@@ -236,13 +236,13 @@ log_prepare (struct nbdkit_next_ops *next_ops, void
*nxdata, void *handle)
   int r = next_ops->is_rotational (nxdata);
   int t = next_ops->can_trim (nxdata);
   int z = next_ops->can_zero (nxdata);
+  int F = next_ops->can_fua (nxdata);

-  if (size < 0 || w < 0 || f < 0 || r < 0 || t < 0 || z < 0)
+  if (size < 0 || w < 0 || f < 0 || r < 0 || t < 0 || z < 0
|| F < 0)
     return -1;

-  /* TODO expose can_fua to filters */
   output (h, "Connect", 0, "size=0x%" PRIx64 "
write=%d flush=%d "
-          "rotational=%d trim=%d zero=%d" /* fua=? */, size, w, f, r,
t, z);
+          "rotational=%d trim=%d zero=%d fua=%d", size, w, f, r, t,
z, F);
   return 0;
 }

-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 10/15] filters: Add fua filter

Sometimes, it's nice to see what a difference it makes in timing
based on whether Forced Unit Access or a full flush is used when
waiting for particular data to land in persistent storage.  Add a
new filter that makes it trivial to switch between different FUA
policies (none: the client must use flush, emulate: the filter
uses flush even if the plugin supports efficient FUA, native:
the filter advertises native even if the plugin relies on flush,
force: the filter forces FUA even when the client didn't request
it).  The log filter makes it easy to test the difference between
the various modes, and also demonstrates that the previous
patches that optimized various filters based on whether FUA is
emulated or native actually work.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 TODO                              |  11 +-
 docs/nbdkit-filter.pod            |   1 +
 docs/nbdkit.pod                   |   1 +
 filters/fua/nbdkit-fua-filter.pod | 119 ++++++++++++++++++
 configure.ac                      |   1 +
 filters/fua/fua.c                 | 251 ++++++++++++++++++++++++++++++++++++++
 filters/Makefile.am               |   1 +
 filters/fua/Makefile.am           |  62 ++++++++++
 tests/Makefile.am                 |   4 +
 tests/test-fua.sh                 | 153 +++++++++++++++++++++++
 10 files changed, 594 insertions(+), 10 deletions(-)
 create mode 100644 filters/fua/nbdkit-fua-filter.pod
 create mode 100644 filters/fua/fua.c
 create mode 100644 filters/fua/Makefile.am
 create mode 100755 tests/test-fua.sh

diff --git a/TODO b/TODO
index e7a8405..b6fb0b1 100644
--- a/TODO
+++ b/TODO
@@ -75,16 +75,7 @@ Suggestions for filters
 -----------------------

 * injecting artificial errors or otherwise masking plugin features
-  for testing clients (see 'nozero' filter for example)
-
-* fua filter: setting mode=none stops advertisement, mode=emulate uses
-  flush emulation (or fails if !can_flush), mode=native passes on
-  through (or fails if can_fua does not report native support).  When
-  breaking up a large request into smaller pieces, then native support
-  requires passing fua through to all sub-requests; but emulation by
-  flushing only needs one flush after the last sub-request, to avoid
-  unneeded intermediate flushing; hence, where this filter is placed
-  in the stack may have a performance impact.
+  for testing clients (see 'nozero' and 'fua' filters for
example)

 nbdkit-cache-filter needs considerable work:

diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index d53daf4..851bc37 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -616,6 +616,7 @@ L<nbdkit-blocksize-filter(1)>,
 L<nbdkit-cache-filter(1)>,
 L<nbdkit-cow-filter(1)>,
 L<nbdkit-delay-filter(1)>,
+L<nbdkit-fua-filter(1)>,
 L<nbdkit-log-filter(1)>,
 L<nbdkit-nozero-filter(1)>,
 L<nbdkit-offset-filter(1)>,
diff --git a/docs/nbdkit.pod b/docs/nbdkit.pod
index 9247ff5..dacf11c 100644
--- a/docs/nbdkit.pod
+++ b/docs/nbdkit.pod
@@ -921,6 +921,7 @@ L<nbdkit-blocksize-filter(1)>,
 L<nbdkit-cache-filter(1)>,
 L<nbdkit-cow-filter(1)>,
 L<nbdkit-delay-filter(1)>,
+L<nbdkit-fua-filter(1)>,
 L<nbdkit-log-filter(1)>,
 L<nbdkit-nozero-filter(1)>,
 L<nbdkit-offset-filter(1)>,
diff --git a/filters/fua/nbdkit-fua-filter.pod
b/filters/fua/nbdkit-fua-filter.pod
new file mode 100644
index 0000000..cec54d5
--- /dev/null
+++ b/filters/fua/nbdkit-fua-filter.pod
@@ -0,0 +1,119 @@
+=encoding utf8
+
+=head1 NAME
+
+nbdkit-fua-filter - nbdkit FUA filter
+
+=head1 SYNOPSIS
+
+ nbdkit --filter=fua plugin [fuamode=MODE] [plugin-args...]
+
+=head1 DESCRIPTION
+
+C<nbdkit-fua-filter> is a filter that intentionally modifies handling
+of the 'Forced Unit Access' (FUA) flag across the NBD protocol.  It is
+mainly useful for testing client or server fallbacks, and for
+evaluating timing differences between proper use of FUA compared to a
+full flush.
+
+=head1 PARAMETERS
+
+=over 4
+
+=item B<fuamode=MODE>
+
+Optional, controls which mode the filter will use.  Mode B<none>
+(default) means that FUA support is not advertised to the client.
+Mode B<emulate> causes the filter to emulate FUA support using the
+plugin's C<flush> callback, regardless of whether the plugin itself
+supports more efficient FUA (or refuses to load if the plugin's
+C<can_flush> callback returns false).  Mode B<native> causes the
+filter to advertise native FUA support to earlier filters, useful for
+comparing optimizations of FUA handling when splitting large requests
+into sub-requests (or refuses to load if the plugin's C<can_fua>
+callback returns C<NBDKIT_FUA_NONE>).  Mode B<force> causes the
filter
+to request FUA on all write transactions, even when the client did not
+request it, and in turn treats client flush requests as a no-op (or
+refuses to load if the plugin's C<can_fua> callback returns
+C<NBDKIT_FUA_NONE>).
+
+=back
+
+=head1 EXAMPLES
+
+Serve the file F<disk.img>, but force the client to submit explicit
+flush requests instead of using C<NBD_CMD_FLAG_FUA>:
+
+ nbdkit --filter=fua file file=disk.img
+
+Observe that the blocksize filter optimizes its handling of the FUA
+flag based on whether it knows nbdkit will be emulating FUA with a
+flush, by comparing the log filter output on top of different fua
+filter modes:
+
+ nbdkit --filter=blocksize --filter=log --filter=fua file file=disk.img \
+   maxlen=4k logfile=fua_emulated fuamode=emulate
+ nbdkit --filter=blocksize --filter=log --filter=fua file file=disk.img \
+   maxlen=4k logfile=fua_native fuamode=native
+
+Serve the file F<disk.img> in write-through mode, where all writes
+from the client are immediately flushed to disk as if the client had
+always requested FUA:
+
+ nbdkit --filter=fua file fuamode=force file=disk.img
+
+=head1 SEE ALSO
+
+L<nbdkit(1)>,
+L<nbdkit-file-plugin(1)>,
+L<nbdkit-filter(3)>,
+L<nbdkit-blocksize-filter(3)>,
+L<nbdkit-log-filter(3)>.
+
+=head1 AUTHORS
+
+Eric Blake
+
+=head1 COPYRIGHT
+
+Copyright (C) 2018 Red Hat Inc.
+
+=head1 LICENSE
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+=over 4
+
+=item *
+
+Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+
+=item *
+
+Redistributions in binary form must reproduce the above copyright
+notice, this list of conditions and the following disclaimer in the
+documentation and/or other materials provided with the distribution.
+
+=item *
+
+Neither the name of Red Hat nor the names of its contributors may be
+used to endorse or promote products derived from this software without
+specific prior written permission.
+
+=back
+
+THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS''
AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+SUCH DAMAGE.
diff --git a/configure.ac b/configure.ac
index 29ef109..c914e00 100644
--- a/configure.ac
+++ b/configure.ac
@@ -517,6 +517,7 @@ AC_CONFIG_FILES([Makefile
                  filters/cache/Makefile
                  filters/cow/Makefile
                  filters/delay/Makefile
+                 filters/fua/Makefile
                  filters/log/Makefile
                  filters/nozero/Makefile
                  filters/offset/Makefile
diff --git a/filters/fua/fua.c b/filters/fua/fua.c
new file mode 100644
index 0000000..f3bcbc9
--- /dev/null
+++ b/filters/fua/fua.c
@@ -0,0 +1,251 @@
+/* nbdkit
+ * Copyright (C) 2018 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdbool.h>
+#include <assert.h>
+
+#include <nbdkit-filter.h>
+
+#define THREAD_MODEL NBDKIT_THREAD_MODEL_PARALLEL
+
+enum FuaMode {
+  NONE,
+  EMULATE,
+  NATIVE,
+  FORCE,
+} fuamode;
+
+static int
+fua_config (nbdkit_next_config *next, void *nxdata,
+            const char *key, const char *value)
+{
+  if (strcmp (key, "fuamode") == 0) {
+    if (strcmp (value, "none") == 0)
+      fuamode = NONE;
+    if (strcmp (value, "emulate") == 0)
+      fuamode = EMULATE;
+    else if (strcmp (value, "native") == 0)
+      fuamode = NATIVE;
+    else if (strcmp (value, "force") == 0)
+      fuamode = FORCE;
+    else {
+      nbdkit_error ("unknown fuamode '%s'", value);
+      return -1;
+    }
+    return 0;
+  }
+  return next (nxdata, key, value);
+}
+
+#define fua_config_help \
+  "fuamode=<MODE>       One of 'none' (default),
'emulate', 'native', 'force'.\n" \
+
+/* Check that desired mode is supported by plugin. */
+static int
+fua_prepare (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  int r;
+
+  switch (fuamode) {
+  case NONE:
+    break;
+  case EMULATE:
+    r = next_ops->can_flush (nxdata);
+    if (r == -1)
+      return -1;
+    if (r == 0) {
+      nbdkit_error ("fuamode 'emulate' requires plugin flush
support");
+      return -1;
+    }
+    break;
+  case NATIVE:
+  case FORCE:
+    r = next_ops->can_fua (nxdata);
+    if (r == -1)
+      return -1;
+    if (r == NBDKIT_FUA_NONE) {
+      nbdkit_error ("fuamode '%s' requires plugin fua
support",
+                    fuamode == EMULATE ? "emulate" :
"force");
+      return -1;
+    }
+    break;
+  }
+  return 0;
+}
+
+/* Advertise proper flush support. */
+static int
+fua_can_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  if (fuamode == FORCE)
+    return 1; /* Advertise our no-op flush, even if plugin lacks it */
+  return next_ops->can_flush (nxdata);
+}
+
+/* Advertise desired fua mode. */
+static int
+fua_can_fua (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
+{
+  switch (fuamode) {
+  case NONE:
+    return NBDKIT_FUA_NONE;
+  case EMULATE:
+    return NBDKIT_FUA_EMULATE;
+  case NATIVE:
+  case FORCE:
+    return NBDKIT_FUA_NATIVE;
+  }
+  abort ();
+}
+
+static int
+fua_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
+            void *handle, const void *buf, uint32_t count, uint64_t offs,
+            uint32_t flags, int *err)
+{
+  int r;
+  bool need_flush = false;
+
+  switch (fuamode) {
+  case NONE:
+    assert (!(flags & NBDKIT_FLAG_FUA));
+    break;
+  case EMULATE:
+    if (flags & NBDKIT_FLAG_FUA) {
+      need_flush = true;
+      flags &= ~NBDKIT_FLAG_FUA;
+    }
+    break;
+  case NATIVE:
+    break;
+  case FORCE:
+    flags |= NBDKIT_FLAG_FUA;
+    break;
+  }
+  r = next_ops->pwrite (nxdata, buf, count, offs, flags, err);
+  if (r != -1 && need_flush)
+    r = next_ops->flush (nxdata, 0, err);
+  return r;
+}
+
+static int
+fua_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
+           void *handle, uint32_t flags, int *err)
+{
+  if (fuamode == FORCE)
+    return 0; /* Nothing to flush, since all writes already used FUA */
+  return next_ops->flush (nxdata, flags, err);
+}
+
+static int
+fua_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
+          void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+          int *err)
+{
+  int r;
+  bool need_flush = false;
+
+  switch (fuamode) {
+  case NONE:
+    assert (!(flags & NBDKIT_FLAG_FUA));
+    break;
+  case EMULATE:
+    if (flags & NBDKIT_FLAG_FUA) {
+      need_flush = true;
+      flags &= ~NBDKIT_FLAG_FUA;
+    }
+    break;
+  case NATIVE:
+    break;
+  case FORCE:
+    flags |= NBDKIT_FLAG_FUA;
+    break;
+  }
+  r = next_ops->trim (nxdata, count, offs, flags, err);
+  if (r != -1 && need_flush)
+    r = next_ops->flush (nxdata, 0, err);
+  return r;
+}
+
+static int
+fua_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
+          void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+          int *err)
+{
+  int r;
+  bool need_flush = false;
+
+  switch (fuamode) {
+  case NONE:
+    assert (!(flags & NBDKIT_FLAG_FUA));
+    break;
+  case EMULATE:
+    if (flags & NBDKIT_FLAG_FUA) {
+      need_flush = true;
+      flags &= ~NBDKIT_FLAG_FUA;
+    }
+    break;
+  case NATIVE:
+    break;
+  case FORCE:
+    flags |= NBDKIT_FLAG_FUA;
+    break;
+  }
+  r = next_ops->zero (nxdata, count, offs, flags, err);
+  if (r != -1 && need_flush)
+    r = next_ops->flush (nxdata, 0, err);
+  return r;
+}
+
+static struct nbdkit_filter filter = {
+  .name              = "fua",
+  .longname          = "nbdkit fua filter",
+  .version           = PACKAGE_VERSION,
+  .config            = fua_config,
+  .config_help       = fua_config_help,
+  .prepare           = fua_prepare,
+  .can_flush         = fua_can_flush,
+  .can_fua           = fua_can_fua,
+  .pwrite            = fua_pwrite,
+  .flush             = fua_flush,
+  .trim              = fua_trim,
+  .zero              = fua_zero,
+};
+
+NBDKIT_REGISTER_FILTER(filter)
diff --git a/filters/Makefile.am b/filters/Makefile.am
index 170ee09..fd1cf13 100644
--- a/filters/Makefile.am
+++ b/filters/Makefile.am
@@ -35,6 +35,7 @@ SUBDIRS = \
 	cache \
 	cow \
 	delay \
+	fua \
 	log \
 	nozero \
 	offset \
diff --git a/filters/fua/Makefile.am b/filters/fua/Makefile.am
new file mode 100644
index 0000000..03df0c3
--- /dev/null
+++ b/filters/fua/Makefile.am
@@ -0,0 +1,62 @@
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+EXTRA_DIST = nbdkit-fua-filter.pod
+
+CLEANFILES = *~
+
+filterdir = $(libdir)/nbdkit/filters
+
+filter_LTLIBRARIES = nbdkit-fua-filter.la
+
+nbdkit_fua_filter_la_SOURCES = \
+	fua.c \
+	$(top_srcdir)/include/nbdkit-filter.h
+
+nbdkit_fua_filter_la_CPPFLAGS = \
+	-I$(top_srcdir)/include
+nbdkit_fua_filter_la_CFLAGS = \
+	$(WARNINGS_CFLAGS)
+nbdkit_fua_filter_la_LDFLAGS = \
+	-module -avoid-version -shared
+
+if HAVE_POD2MAN
+
+man_MANS = nbdkit-fua-filter.1
+CLEANFILES += $(man_MANS)
+
+nbdkit-fua-filter.1: nbdkit-fua-filter.pod
+	$(POD2MAN) $(POD2MAN_ARGS) --section=1 --name=`basename $@ .1` $< $@.t
&& \
+	if grep 'POD ERROR' $@.t; then rm $@.t; exit 1; fi && \
+	mv $@.t $@
+
+endif
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 0013fda..cf61c40 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -49,6 +49,7 @@ EXTRA_DIST = \
 	test-dump-plugin.sh \
 	test-dump-plugin-example4.sh \
 	test-foreground.sh \
+	test-fua.sh \
 	test-help.sh \
 	test-help-plugin.sh \
 	test-ip.sh \
@@ -433,6 +434,9 @@ test_delay_SOURCES = test-delay.c test.h
 test_delay_CFLAGS = $(WARNINGS_CFLAGS) $(LIBGUESTFS_CFLAGS)
 test_delay_LDADD = libtest.la $(LIBGUESTFS_LIBS)

+# fua filter test.
+TESTS += test-fua.sh
+
 # log filter test.
 TESTS += test-log.sh

diff --git a/tests/test-fua.sh b/tests/test-fua.sh
new file mode 100755
index 0000000..c102f49
--- /dev/null
+++ b/tests/test-fua.sh
@@ -0,0 +1,153 @@
+#!/bin/bash -
+# nbdkit
+# Copyright (C) 2018 Red Hat Inc.
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+set -e
+
+files="fua.img
+       fua1.log fua1.sock fua1.pid
+       fua2.log fua2.sock fua2.pid
+       fua3.log fua3.sock fua3.pid
+       fua4.log fua4.sock fua4.pid"
+rm -f $files
+
+: ${QEMU_IO=qemu-io}
+
+# Prep images, and check that qemu-io understands the actions we plan on
+# doing.  Too bad qemu-io 2.11 doesn't support trim with FUA.
+truncate --size 1M fua.img
+if ! $QEMU_IO -f raw -t none -c flush -c 'w -f -z 0 64k' fua.img; then
+    echo "$0: missing or broken qemu-io"
+    rm fua.img
+    exit 77
+fi
+
+pid1= pid2= pid3= pid4+
+# Kill any nbdkit processes on exit.
+cleanup ()
+{
+    status=$?
+
+    test "$pid1" && kill $pid1
+    test "$pid2" && kill $pid2
+    test "$pid3" && kill $pid3
+    test "$pid4" && kill $pid4
+    # For easier debugging, dump the final log files before removing them.
+    echo "Log 1 file contents:"
+    cat fua1.log || :
+    echo "Log 2 file contents:"
+    cat fua2.log || :
+    echo "Log 3 file contents:"
+    cat fua3.log || :
+    echo "Log 4 file contents:"
+    cat fua4.log || :
+    rm -f $files
+
+    exit $status
+}
+trap cleanup INT QUIT TERM EXIT ERR
+
+# Run four parallel nbdkit; to compare the logs and see what changes.
+# 1: fuamode=none (default): client should send flush instead
+# 2: fuamode=emulate: log shows that blocksize optimizes fua to flush
+# 3: fuamode=native: log shows that blocksize preserves fua
+# 4: fuamode=force: log shows that fua is always enabled
+nbdkit -P fua1.pid -U fua1.sock --filter=log --filter=fua \
+       file logfile=fua1.log file=fua.img
+nbdkit -P fua2.pid -U fua2.sock --filter=blocksize --filter=log --filter=fua \
+       file logfile=fua2.log file=fua.img fuamode=emulate maxdata=4k maxlen=4k
+nbdkit -P fua3.pid -U fua3.sock --filter=blocksize --filter=log --filter=fua \
+       file logfile=fua3.log file=fua.img fuamode=native maxdata=4k maxlen=4k
+nbdkit -P fua4.pid -U fua4.sock --filter=fua --filter=log \
+       file logfile=fua4.log file=fua.img fuamode=force
+
+# We may have to wait a short time for the pid files to appear.
+for i in `seq 1 10`; do
+    if test -f fua1.pid && test -f fua2.pid && test -f fua3.pid
&&
+       test -f fua4.pid; then
+        break
+    fi
+    sleep 1
+done
+
+pid1="$(cat fua1.pid)" || :
+pid2="$(cat fua2.pid)" || :
+pid3="$(cat fua3.pid)" || :
+pid4="$(cat fua4.pid)" || :
+
+if ! test -f fua1.pid || ! test -f fua2.pid || ! test -f fua3.pid ||
+   ! test -f fua4.pid; then
+    echo "$0: PID files were not created"
+    exit 1
+fi
+
+# Perform a flush, write, and zero, with and without FUA
+for f in '' -f; do
+    for i in {1..4}; do
+	$QEMU_IO -f raw -t none -c flush -c "w $f 0 64k" -c "w -z $f
64k 64k" \
+		 "nbd+unix://?socket=fua$i.sock"
+    done
+done
+
+# Test 1: no fua sent over wire, qemu-io sent more flushes in place of fua
+if grep 'fua=1' fua1.log; then
+    echo "filter should have prevented fua"
+    exit 1
+fi
+test $(grep -c 'connection=1 Flush' fua1.log) -lt \
+     $(grep -c 'connection=2 Flush' fua1.log)
+
+# Test 2: either last part of split has fua, or a flush is added, but
+# all earlier parts of the transaction do not have fua
+flush1=$(grep -c 'connection=1 Flush' fua2.log || :)
+flush2=$(grep -c 'connection=2 Flush' fua2.log || :)
+fua=$(grep -c 'fua=1' fua2.log || :)
+test $(( $flush2 - $flush1 + $fua )) = 2
+
+# Test 3: every part of split has fua, and no flushes are added
+flush1=$(grep -c 'connection=1 Flush' fua3.log || :)
+flush2=$(grep -c 'connection=2 Flush' fua3.log || :)
+test $flush1 = $flush2
+test $(grep -c 'fua=1' fua3.log) = 32
+
+# Test 4: flush is no-op, and every transaction has fua
+if grep 'fua=0' fua4.log; then
+    echo "filter should have forced fua"
+    exit 1
+fi
+if grep 'Flush' fua4.log; then
+    echo "filter should have elided flush"
+    exit 1
+fi
+
+# The cleanup() function is called implicitly on exit.
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 11/15] plugins: Expose new FUA callbacks

The NBD protocol supports Forced Unit Access (FUA) as a more efficient
way to wait for just one write to land in persistent storage, rather
than all outstanding writes at the time of a flush; modeled after
the kernel's block I/O flag of the same name.  While we can emulate
the proper semantics with a full-blown flush, there are some plugins
that can properly pass the FUA flag on to the end storage and thereby
avoid some overhead.

This patch introduces new callbacks, and updates the documentation
to the new API, while ensuring that plugins compiled to the old API
still work.  The new API adds .can_fua, then adds a flags parameter
to all five data callbacks, even though only three of them will use
a flag at the moment.  A plugin client has to opt in to both the
version 2 API and provide .can_fua with a return of NBDKIT_FUA_NATIVE
before nbdkit will pass the NBDKIT_FLAG_FUA to the plugin.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 docs/nbdkit-filter.pod  |  12 ++--
 docs/nbdkit-plugin.pod  | 151 +++++++++++++++++++++++++++++++++++++++++++-----
 docs/nbdkit.pod         |   7 ++-
 include/nbdkit-plugin.h |  30 ++++++++++
 src/internal.h          |   1 +
 src/plugins.c           | 136 +++++++++++++++++++++++++++----------------
 6 files changed, 265 insertions(+), 72 deletions(-)

diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index 851bc37..471c4bc 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -385,12 +385,12 @@ allowed.
                  void *handle);

 This controls how the Forced Unit Access flag will be handled; it is
-only relevant for connections that are not read-only.  For now, this
-function has no counterpart in plugins, although it will be added.
-Unlike other feature functions with just two success states, this one
-returns three success values: C<NBDKIT_FUA_NONE> to avoid advertising
-FUA support to the client, C<NBDKIT_FUA_EMULATE> if FUA is emulated by
-nbdkit calling flush after a write transaction, and
+only relevant for connections that are not read-only.  This intercepts
+the corresponding plugin method, to control feature bits advertised to
+the client.  Unlike other feature functions with just two success
+states, this one returns three success values: C<NBDKIT_FUA_NONE> to
+avoid advertising FUA support to the client, C<NBDKIT_FUA_EMULATE> if
+FUA is emulated by nbdkit calling flush after a write transaction, and
 C<NBDKIT_FUA_NATIVE> if FUA semantics are handled by the plugin
 without the overhead of an extra flush from nbdkit.

diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod
index 44822fc..933f710 100644
--- a/docs/nbdkit-plugin.pod
+++ b/docs/nbdkit-plugin.pod
@@ -6,6 +6,8 @@ nbdkit-plugin - How to write nbdkit plugins

 =head1 SYNOPSIS

+ #define NBDKIT_API_VERSION 2
+ 
  #include <nbdkit-plugin.h>
  
  #define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS
@@ -63,9 +65,22 @@ L<nbdkit-perl-plugin(3)>,
 L<nbdkit-python-plugin(3)>,
 L<nbdkit-ruby-plugin(3)>.

+=head1 C<#define NBDKIT_API_VERSION 2>
+
+Plugins must choose which API version they want to use, by defining
+NBDKIT_API_VERSION to a positive integer prior to including
+C<nbdkit-plugin.h> (or any other nbdkit header).  The default version
+is 1 for backwards-compatibility with nbdkit v1.1.26 and earlier;
+however, it is recommended that new plugins be written to the maximum
+version (currently 2) as it enables more features and better
+interaction with nbdkit filters.  Therefore, the rest of this document
+only covers the version 2 interface.  A newer nbdkit will always
+support plugins compiled against any prior API version.
+
 =head1 C<nbdkit-plugin.h>

-All plugins should start by including this header file:
+All plugins should start by including this header file, after
+optionally choosing an API version:

  #include <nbdkit-plugin.h>

@@ -127,14 +142,20 @@ A new client has connected.

 =item C<.can_write>, C<.get_size> and other option negotiation
callbacks

-These are called during option negotiation with the client, but
-before any data is served.
+These are called during option negotiation with the client, but before
+any data is served.  These callbacks may return different values
+across different C<.open> calls, but within a single connection, must
+always return the same value; other code in nbdkit may cache the
+per-connection value returned rather than using the callback a second
+time.

 =item C<.pread>, C<.pwrite> and other data serving callbacks

 After option negotiation has finished, these may be called to serve
 data.  Depending on the thread model chosen, they might be called in
-parallel from multiple threads.
+parallel from multiple threads.  The data serving callbacks include a
+flags argument; the results of the negotiation callbacks influence
+whether particular flags will ever be passed to a data callback.

 =item C<.close>

@@ -152,6 +173,58 @@ is called once just before the plugin is unloaded from
memory.

 =back

+=head1 Flags
+
+The following flags are defined by nbdkit, and used in various data
+serving callbacks as follows:
+
+=over 4
+
+=item C<NBDKIT_FLAG_MAY_TRIM>
+
+This flag is used by the C<.zero> callback; a plugin may ignore it
+without changing valid semantics, so there is no way to disable this
+flag.
+
+=item C<NBDKIT_FLAG_FUA>
+
+This flag represents Forced Unit Access semantics.  It is used by the
+C<.pwrite>, C<.zero>, and C<.trim> callbacks to indicate that
the
+plugin must not return a result until the action has landed in
+persistent storage.  This flag will not be sent to the plugin unless
+C<.can_fua> is provided and returns C<NBDKIT_FUA_NATIVE>.
+
+=back
+
+The following defines are valid as successful return values for
+C<.can_fua>:
+
+=over 4
+
+=item C<NBDKIT_FUA_NONE>
+
+Forced Unit Access is not supported; the client must manually request
+a flush after writes have completed.  The C<NBDKIT_FLAG_FUA> flag will
+not be passed to the plugin's write callbacks.
+
+=item C<NBDKIT_FUA_EMULATE>
+
+The client may request Forced Unit Access, but it is implemented by
+emulation, where nbdkit calls C<.flush> after a write operation; this
+is semantically correct, but may hurt performance as it tends to flush
+more data than just what the client requested.  The C<NBDKIT_FLAG_FUA>
+flag will not be passed to the plugin's write callbacks.
+
+=item C<NBDKIT_FUA_NATIVE>
+
+The client may request Forced Unit Access, which results in the
+C<NBDKIT_FLAG_FUA> flag being passed to the plugin's write callbacks
+(C<.pwrite>, C<.trim>, and C<.zero>).  When the flag is set,
these
+callbacks must not return success until the client's request has
+landed in persistent storage.
+
+=back
+
 =head1 ERROR HANDLING

 If there is an error in the plugin, the plugin should call
@@ -414,9 +487,33 @@ error message and return C<-1>.
 This callback is not required.  If omitted, then we return true iff a
 C<.trim> callback has been defined.

+=head2 C<.can_fua>
+
+ int can_fua (void *handle);
+
+This is called during the option negotiation phase to find out if the
+plugin supports the Forced Unit Access (FUA) flag on write, zero, and
+trim requests.  If this returns C<NBDKIT_FUA_NONE>, FUA support is not
+advertised to the guest; if this returns C<NBDKIT_FUA_EMULATE>, the
+C<.flush> callback must work (even if C<.can_flush> returns false),
+and FUA support is emulated by calling C<.flush> after any write
+operation; if this returns C<NBDKIT_FUA_NATIVE>, then the
C<.pwrite>,
+C<.zero>, and C<.trim> callbacks (if implemented) must handle the
flag
+C<NBDKIT_FLAG_FUA>, by not returning until that action has landed in
+persistent storage.
+
+If there is an error, C<.can_fua> should call C<nbdkit_error> with
an
+error message and return C<-1>.
+
+This callback is not required unless a plugin wants to specifically
+handle FUA requests.  If omitted, nbdkit checks C<.can_flush>, and
+behaves as if this function returns C<NBDKIT_FUA_NONE> or
+C<NBDKIT_FUA_EMULATE> as appropriate.
+
 =head2 C<.pread>

- int pread (void *handle, void *buf, uint32_t count, uint64_t offset);
+ int pread (void *handle, void *buf, uint32_t count, uint64_t offset,
+            uint32_t flags);

 During the data serving phase, nbdkit calls this callback to read data
 from the backing store.  C<count> bytes starting at C<offset> in
the
@@ -424,6 +521,9 @@ backing store should be read and copied into C<buf>. 
nbdkit takes
 care of all bounds- and sanity-checking, so the plugin does not need
 to worry about that.

+The parameter C<flags> exists in case of future NBD protocol
+extensions; at this time, it will be 0 on input.
+
 The callback must read the whole C<count> bytes if it can.  The NBD
 protocol doesn't allow partial reads (instead, these would be errors).
 If the whole C<count> bytes was read, the callback should return
C<0>
@@ -436,7 +536,8 @@ message, and C<nbdkit_set_error> to record an
appropriate error

 =head2 C<.pwrite>

- int pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset);
+ int pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
+             uint32_t flags);

 During the data serving phase, nbdkit calls this callback to write
 data to the backing store.  C<count> bytes starting at C<offset> in
@@ -444,6 +545,10 @@ the backing store should be written using the data in
C<buf>.  nbdkit
 takes care of all bounds- and sanity-checking, so the plugin does not
 need to worry about that.

+This function will not be called if C<.can_write> returned false.  The
+parameter C<flags> may include C<NBDKIT_FLAG_FUA> on input based on
+the result of C<.can_fua>.
+
 The callback must write the whole C<count> bytes if it can.  The NBD
 protocol doesn't allow partial writes (instead, these would be
 errors).  If the whole C<count> bytes was written successfully, the
@@ -456,40 +561,56 @@ message, and C<nbdkit_set_error> to record an
appropriate error

 =head2 C<.flush>

- int flush (void *handle);
+ int flush (void *handle, uint32_t flags);

 During the data serving phase, this callback is used to
 L<fdatasync(2)> the backing store, ie. to ensure it has been
 completely written to a permanent medium.  If that is not possible
 then you can omit this callback.

+This function will not be called directly by the client if
+C<.can_flush> returned false; however, it may still be called by
+nbdkit if C<.can_fua> returned C<NBDKIT_FUA_EMULATE>.  The
parameter
+C<flags> exists in case of future NBD protocol extensions; at this
+time, it will be 0 on input.
+
 If there is an error, C<.flush> should call C<nbdkit_error> with an
 error message, and C<nbdkit_set_error> to record an appropriate error
 (unless C<errno> is sufficient), then return C<-1>.

 =head2 C<.trim>

- int trim (void *handle, uint32_t count, uint64_t offset);
+ int trim (void *handle, uint32_t count, uint64_t offset, uint32_t flags);

 During the data serving phase, this callback is used to "punch holes"
 in the backing store.  If that is not possible then you can omit this
 callback.

+This function will not be called if C<.can_trim> returned false.  The
+parameter C<flags> may include C<NBDKIT_FLAG_FUA> on input based on
+the result of C<.can_fua>.
+
 If there is an error, C<.trim> should call C<nbdkit_error> with an
 error message, and C<nbdkit_set_error> to record an appropriate error
 (unless C<errno> is sufficient), then return C<-1>.

 =head2 C<.zero>

- int zero (void *handle, uint32_t count, uint64_t offset, int may_trim);
+ int zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags);

 During the data serving phase, this callback is used to write C<count>
-bytes of zeroes at C<offset> in the backing store.  If C<may_trim>
is
-non-zero, the operation can punch a hole instead of writing actual
-zero bytes, but only if subsequent reads from the hole read as zeroes.
-If this callback is omitted, or if it fails with C<EOPNOTSUPP>
-(whether by C<nbdkit_set_error> or C<errno>), then C<.pwrite>
will be
-used instead.
+bytes of zeroes at C<offset> in the backing store.
+
+This function will not be called if C<.can_write> returned false.  On
+input, the parameter C<flags> may include C<NBDKIT_FLAG_MAY_TRIM>
+unconditionally, and C<NBDKIT_FLAG_FUA> based on the result of
+C<.can_fua>.
+
+If C<NBDKIT_FLAG_MAY_TRIM> is requested, the operation can punch a
+hole instead of writing actual zero bytes, but only if subsequent
+reads from the hole read as zeroes.  If this callback is omitted, or
+if it fails with C<EOPNOTSUPP> (whether by C<nbdkit_set_error> or
+C<errno>), then C<.pwrite> will be used instead.

 The callback must write the whole C<count> bytes if it can.  The NBD
 protocol doesn't allow partial writes (instead, these would be
diff --git a/docs/nbdkit.pod b/docs/nbdkit.pod
index dacf11c..d49ae53 100644
--- a/docs/nbdkit.pod
+++ b/docs/nbdkit.pod
@@ -827,7 +827,12 @@ information about that plugin, eg:
  [etc]

 Plugins which ship with nbdkit usually have the same version as the
-corresponding nbdkit binary.
+corresponding nbdkit binary.  The nbdkit binary will always be able to
+utilize plugins compiled against an older version of the header;
+however, newer plugins may not be fully supported by an older nbdkit
+binary (for example, a plugin compiled with C<NBDKIT_API_VERSION> of 2
+fails to load with an older nbdkit that only knows
+C<NBDKIT_API_VERSION> 1).

 =head2 Detect if a plugin is installed

diff --git a/include/nbdkit-plugin.h b/include/nbdkit-plugin.h
index 1981061..3f541a1 100644
--- a/include/nbdkit-plugin.h
+++ b/include/nbdkit-plugin.h
@@ -42,7 +42,13 @@
 extern "C" {
 #endif

+/* By default, a plugin gets API version 1; but you may request
+ * version 2 prior to including this header */
+#ifndef NBDKIT_API_VERSION
 #define NBDKIT_API_VERSION                            1
+#elif (NBDKIT_API_VERSION - 0) < 1 || NBDKIT_API_VERSION > 2
+#error Unsupported API version
+#endif

 struct nbdkit_plugin {
   /* Do not set these fields directly; use NBDKIT_REGISTER_PLUGIN.
@@ -80,16 +86,40 @@ struct nbdkit_plugin {
   int (*is_rotational) (void *handle);
   int (*can_trim) (void *handle);

+#if NBDKIT_API_VERSION == 1
   int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);
   int (*pwrite) (void *handle, const void *buf, uint32_t count, uint64_t
offset);
   int (*flush) (void *handle);
   int (*trim) (void *handle, uint32_t count, uint64_t offset);
   int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim);
+#else
+  int (*_pread_old) (void *, void *, uint32_t, uint64_t);
+  int (*_pwrite_old) (void *, const void *, uint32_t, uint64_t);
+  int (*_flush_old) (void *);
+  int (*_trim_old) (void *, uint32_t, uint64_t);
+  int (*_zero_old) (void *, uint32_t, uint64_t, int);
+#endif

   int errno_is_preserved;

   void (*dump_plugin) (void);

+  int (*can_fua) (void *handle);
+#if NBDKIT_API_VERSION == 1
+  int (*_unused1) (void *, void *, uint32_t, uint64_t);
+  int (*_unused2) (void *, const void *, uint32_t, uint64_t, uint32_t);
+  int (*_unused3) (void *, uint32_t);
+  int (*_unused4) (void *, uint32_t, uint64_t, uint32_t);
+  int (*_unused5) (void *, uint32_t, uint64_t, uint32_t);
+#else
+  int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset,
+                uint32_t flags);
+  int (*pwrite) (void *handle, const void *buf, uint32_t count,
+                 uint64_t offset, uint32_t flags);
+  int (*flush) (void *handle, uint32_t flags);
+  int (*trim) (void *handle, uint32_t count, uint64_t offset, uint32_t flags);
+  int (*zero) (void *handle, uint32_t count, uint64_t offset, uint32_t flags);
+#endif
   /* int (*set_exportname) (void *handle, const char *exportname); */
 };

diff --git a/src/internal.h b/src/internal.h
index d873cd5..ea3155c 100644
--- a/src/internal.h
+++ b/src/internal.h
@@ -40,6 +40,7 @@
 #include <sys/socket.h>
 #include <pthread.h>

+#define NBDKIT_API_VERSION 2
 #include "nbdkit-plugin.h"
 #include "nbdkit-filter.h"

diff --git a/src/plugins.c b/src/plugins.c
index 10c911d..d44e724 100644
--- a/src/plugins.c
+++ b/src/plugins.c
@@ -180,6 +180,12 @@ plugin_dump_fields (struct backend *b)
   HAS (can_flush);
   HAS (is_rotational);
   HAS (can_trim);
+  HAS (_pread_old);
+  HAS (_pwrite_old);
+  HAS (_flush_old);
+  HAS (_trim_old);
+  HAS (_zero_old);
+  HAS (can_fua);
   HAS (pread);
   HAS (pwrite);
   HAS (flush);
@@ -301,7 +307,7 @@ plugin_can_write (struct backend *b, struct connection
*conn)
   if (p->plugin.can_write)
     return p->plugin.can_write (connection_get_handle (conn, 0));
   else
-    return p->plugin.pwrite != NULL;
+    return p->plugin.pwrite || p->plugin._pwrite_old;
 }

 static int
@@ -316,7 +322,7 @@ plugin_can_flush (struct backend *b, struct connection
*conn)
   if (p->plugin.can_flush)
     return p->plugin.can_flush (connection_get_handle (conn, 0));
   else
-    return p->plugin.flush != NULL;
+    return p->plugin.flush || p->plugin._flush_old;
 }

 static int
@@ -346,7 +352,7 @@ plugin_can_trim (struct backend *b, struct connection *conn)
   if (p->plugin.can_trim)
     return p->plugin.can_trim (connection_get_handle (conn, 0));
   else
-    return p->plugin.trim != NULL;
+    return p->plugin.trim || p->plugin._trim_old;
 }

 /* Grab the appropriate error value.
@@ -377,14 +383,20 @@ plugin_can_fua (struct backend *b, struct connection
*conn)
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
   int r;

+  assert (connection_get_handle (conn, 0));
+
   debug ("can_fua");

-  /* TODO - wire FUA flag support into plugins. Until then, this copies
-   * can_flush, since that's how we emulate FUA. */
+  if (p->plugin.can_fua) {
+    r = p->plugin.can_fua (connection_get_handle (conn, 0));
+    if (r > NBDKIT_FUA_EMULATE && p->plugin._api_version == 1)
+      r = NBDKIT_FUA_EMULATE;
+    return r;
+  }
   r = plugin_can_flush (b, conn);
   if (r == -1)
     return -1;
-  if (r == 0 || !p->plugin.flush)
+  if (r == 0 || !(p->plugin.flush || p->plugin._flush_old))
     return NBDKIT_FUA_NONE;
   return NBDKIT_FUA_EMULATE;
 }
@@ -398,12 +410,17 @@ plugin_pread (struct backend *b, struct connection *conn,
   int r;

   assert (connection_get_handle (conn, 0));
-  assert (p->plugin.pread != NULL);
+  assert (p->plugin.pread || p->plugin._pread_old);
   assert (!flags);

   debug ("pread count=%" PRIu32 " offset=%" PRIu64, count,
offset);

-  r = p->plugin.pread (connection_get_handle (conn, 0), buf, count, offset);
+  if (p->plugin.pread)
+    r = p->plugin.pread (connection_get_handle (conn, 0), buf, count,
offset,
+                         0);
+  else
+    r = p->plugin._pread_old (connection_get_handle (conn, 0), buf, count,
+                              offset);
   if (r == -1)
     *err = get_error (p);
   return r;
@@ -421,16 +438,17 @@ plugin_flush (struct backend *b, struct connection *conn,
uint32_t flags,

   debug ("flush");

-  if (p->plugin.flush != NULL) {
-    r = p->plugin.flush (connection_get_handle (conn, 0));
-    if (r == -1)
-      *err = get_error (p);
-    return r;
-  }
+  if (p->plugin.flush)
+    r = p->plugin.flush (connection_get_handle (conn, 0), 0);
+  else if (p->plugin._flush_old)
+    r = p->plugin._flush_old (connection_get_handle (conn, 0));
   else {
     *err = EINVAL;
     return -1;
   }
+  if (r == -1)
+    *err = get_error (p);
+  return r;
 }

 static int
@@ -441,6 +459,7 @@ plugin_pwrite (struct backend *b, struct connection *conn,
   int r;
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
   bool fua = flags & NBDKIT_FLAG_FUA;
+  bool need_flush = false;

   assert (connection_get_handle (conn, 0));
   assert (!(flags & ~NBDKIT_FLAG_FUA));
@@ -448,17 +467,22 @@ plugin_pwrite (struct backend *b, struct connection *conn,
   debug ("pwrite count=%" PRIu32 " offset=%" PRIu64 "
fua=%d", count, offset,
          fua);

-  if (p->plugin.pwrite != NULL)
-    r = p->plugin.pwrite (connection_get_handle (conn, 0),
-                          buf, count, offset);
+  if (fua && plugin_can_fua (b, conn) != NBDKIT_FUA_NATIVE) {
+    flags &= ~NBDKIT_FLAG_FUA;
+    need_flush = true;
+  }
+  if (p->plugin.pwrite)
+    r = p->plugin.pwrite (connection_get_handle (conn, 0), buf, count,
offset,
+                          flags);
+  else if (p->plugin._pwrite_old)
+    r = p->plugin._pwrite_old (connection_get_handle (conn, 0),
+                               buf, count, offset);
   else {
     *err = EROFS;
     return -1;
   }
-  if (r != -1 && fua) {
-    assert (p->plugin.flush);
-    r = p->plugin.flush (connection_get_handle (conn, 0));
-  }
+  if (r != -1 && need_flush)
+    r = plugin_flush (b, conn, 0, err);
   if (r == -1)
     *err = get_error (p);
   return r;
@@ -471,6 +495,7 @@ plugin_trim (struct backend *b, struct connection *conn,
   int r;
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
   bool fua = flags & NBDKIT_FLAG_FUA;
+  bool need_flush = false;

   assert (connection_get_handle (conn, 0));
   assert (!(flags & ~NBDKIT_FLAG_FUA));
@@ -478,16 +503,20 @@ plugin_trim (struct backend *b, struct connection *conn,
   debug ("trim count=%" PRIu32 " offset=%" PRIu64 "
fua=%d", count, offset,
          fua);

-  if (p->plugin.trim != NULL)
-    r = p->plugin.trim (connection_get_handle (conn, 0), count, offset);
+  if (fua && plugin_can_fua (b, conn) != NBDKIT_FUA_NATIVE) {
+    flags &= ~NBDKIT_FLAG_FUA;
+    need_flush = true;
+  }
+  if (p->plugin.trim)
+    r = p->plugin.trim (connection_get_handle (conn, 0), count, offset,
flags);
+  else if (p->plugin._trim_old)
+    r = p->plugin._trim_old (connection_get_handle (conn, 0), count,
offset);
   else {
     *err = EINVAL;
     return -1;
   }
-  if (r != -1 && fua) {
-    assert (p->plugin.flush);
-    r = p->plugin.flush (connection_get_handle (conn, 0));
-  }
+  if (r != -1 && need_flush)
+    r = plugin_flush (b, conn, 0, err);
   if (r == -1)
     *err = get_error (p);
   return r;
@@ -500,9 +529,11 @@ plugin_zero (struct backend *b, struct connection *conn,
   struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
   char *buf;
   uint32_t limit;
-  int result;
-  int may_trim = (flags & NBDKIT_FLAG_MAY_TRIM) != 0;
+  int r = -1;
+  bool may_trim = flags & NBDKIT_FLAG_MAY_TRIM;
   bool fua = flags & NBDKIT_FLAG_FUA;
+  bool emulate = false;
+  bool need_flush = false;

   assert (connection_get_handle (conn, 0));
   assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA)));
@@ -510,19 +541,27 @@ plugin_zero (struct backend *b, struct connection *conn,
   debug ("zero count=%" PRIu32 " offset=%" PRIu64 "
may_trim=%d fua=%d",
          count, offset, may_trim, fua);

+  if (fua && plugin_can_fua (b, conn) != NBDKIT_FUA_NATIVE) {
+    flags &= ~NBDKIT_FLAG_FUA;
+    need_flush = true;
+  }
   if (!count)
     return 0;
-  if (p->plugin.zero) {
-    errno = 0;
-    result = p->plugin.zero (connection_get_handle (conn, 0),
-                             count, offset, may_trim);
-    if (result == -1)
-      *err = get_error (p);
-    if (result == 0 || *err != EOPNOTSUPP)
-      goto done;
-  }
+  errno = 0;
+  if (p->plugin.zero)
+    r = p->plugin.zero (connection_get_handle (conn, 0), count, offset,
flags);
+  else if (p->plugin._zero_old)
+    r = p->plugin._zero_old (connection_get_handle (conn, 0), count, offset,
+                             may_trim);
+  else
+    emulate = true;
+  if (r == -1)
+    *err = emulate ? EOPNOTSUPP : get_error (p);
+  if (r == 0 || *err != EOPNOTSUPP)
+    goto done;

-  assert (p->plugin.pwrite);
+  assert (p->plugin.pwrite || p->plugin._pwrite_old);
+  flags &= ~NBDKIT_FLAG_MAY_TRIM;
   threadlocal_set_error (0);
   limit = count < MAX_REQUEST_SIZE ? count : MAX_REQUEST_SIZE;
   buf = calloc (limit, 1);
@@ -532,9 +571,8 @@ plugin_zero (struct backend *b, struct connection *conn,
   }

   while (count) {
-    result = p->plugin.pwrite (connection_get_handle (conn, 0),
-                               buf, limit, offset);
-    if (result == -1)
+    r = plugin_pwrite (b, conn, buf, limit, offset, flags, err);
+    if (r == -1)
       break;
     count -= limit;
     if (count < limit)
@@ -546,13 +584,11 @@ plugin_zero (struct backend *b, struct connection *conn,
   errno = *err;

  done:
-  if (result != -1 && fua) {
-    assert (p->plugin.flush);
-    result = p->plugin.flush (connection_get_handle (conn, 0));
-  }
-  if (result == -1)
+  if (r != -1 && need_flush)
+    r = plugin_flush (b, conn, 0, err);
+  if (r == -1)
     *err = get_error (p);
-  return result;
+  return r;
 }

 static struct backend plugin_functions = {
@@ -619,7 +655,7 @@ plugin_register (size_t index, const char *filename,
   }

   /* Check for incompatible future versions. */
-  if (plugin->_api_version != 1) {
+  if (plugin->_api_version < 0 || plugin->_api_version > 2) {
     fprintf (stderr, "%s: %s: plugin is incompatible with this version of
nbdkit (_api_version = %d)\n",
              program_name, p->filename, plugin->_api_version);
     exit (EXIT_FAILURE);
@@ -654,7 +690,7 @@ plugin_register (size_t index, const char *filename,
              program_name, p->filename);
     exit (EXIT_FAILURE);
   }
-  if (p->plugin.pread == NULL) {
+  if (p->plugin.pread == NULL && p->plugin._pread_old == NULL) {
     fprintf (stderr, "%s: %s: plugin must have a .pread callback\n",
              program_name, p->filename);
     exit (EXIT_FAILURE);
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 12/15] nbd: Wire up FUA flag passthrough

If the target server supports FUA, then we should pass the client
flag through rather than emulating things with a less-efficient
flush.

Signed-off-by: Eric Blake <eblake@redhat.com>

---
v3: rebase to API changes
---
 plugins/nbd/nbd.c | 42 +++++++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c
index c9727f7..db53746 100644
--- a/plugins/nbd/nbd.c
+++ b/plugins/nbd/nbd.c
@@ -46,6 +46,8 @@
 #include <assert.h>
 #include <pthread.h>

+#define NBDKIT_API_VERSION 2
+
 #include <nbdkit-plugin.h>
 #include "protocol.h"

@@ -616,68 +618,89 @@ nbd_can_trim (void *handle)
   return h->flags & NBD_FLAG_SEND_TRIM;
 }

+static int
+nbd_can_fua (void *handle)
+{
+  struct handle *h = handle;
+
+  return h->flags & NBD_FLAG_SEND_FUA ? NBDKIT_FUA_NATIVE :
NBDKIT_FUA_NONE;
+}
+
 /* Read data from the file. */
 static int
-nbd_pread (void *handle, void *buf, uint32_t count, uint64_t offset)
+nbd_pread (void *handle, void *buf, uint32_t count, uint64_t offset,
+           uint32_t flags)
 {
   struct handle *h = handle;
   int c;

   /* TODO Auto-fragment this if the client has a larger max transfer
      limit than the server */
+  assert (!flags);
   c = nbd_request_full (h, 0, NBD_CMD_READ, offset, count, NULL, buf);
   return c < 0 ? c : nbd_reply (h, c);
 }

 /* Write data to the file. */
 static int
-nbd_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset)
+nbd_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
+            uint32_t flags)
 {
   struct handle *h = handle;
   int c;

   /* TODO Auto-fragment this if the client has a larger max transfer
      limit than the server */
-  c = nbd_request_full (h, 0, NBD_CMD_WRITE, offset, count, buf, NULL);
+  assert (!(flags & ~NBDKIT_FLAG_FUA));
+  c = nbd_request_full (h, flags & NBDKIT_FLAG_FUA ? NBD_CMD_FLAG_FUA : 0,
+                        NBD_CMD_WRITE, offset, count, buf, NULL);
   return c < 0 ? c : nbd_reply (h, c);
 }

 /* Write zeroes to the file. */
 static int
-nbd_zero (void *handle, uint32_t count, uint64_t offset, int may_trim)
+nbd_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags)
 {
   struct handle *h = handle;
   int c;
+  int f = 0;

+  assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM)));
   if (!(h->flags & NBD_FLAG_SEND_WRITE_ZEROES)) {
     /* Trigger a fall back to regular writing */
     errno = EOPNOTSUPP;
     return -1;
   }

-  c = nbd_request (h, may_trim ? 0 : NBD_CMD_FLAG_NO_HOLE,
-                   NBD_CMD_WRITE_ZEROES, offset, count);
+  if (!(flags & NBDKIT_FLAG_MAY_TRIM))
+    f |= NBD_CMD_FLAG_NO_HOLE;
+  if (flags & NBDKIT_FLAG_FUA)
+    f |= NBD_CMD_FLAG_FUA;
+  c = nbd_request (h, f, NBD_CMD_WRITE_ZEROES, offset, count);
   return c < 0 ? c : nbd_reply (h, c);
 }

 /* Trim a portion of the file. */
 static int
-nbd_trim (void *handle, uint32_t count, uint64_t offset)
+nbd_trim (void *handle, uint32_t count, uint64_t offset, uint32_t flags)
 {
   struct handle *h = handle;
   int c;

-  c = nbd_request (h, 0, NBD_CMD_TRIM, offset, count);
+  assert (!(flags & ~NBDKIT_FLAG_FUA));
+  c = nbd_request (h, flags & NBDKIT_FLAG_FUA ? NBD_CMD_FLAG_FUA : 0,
+                   NBD_CMD_TRIM, offset, count);
   return c < 0 ? c : nbd_reply (h, c);
 }

 /* Flush the file to disk. */
 static int
-nbd_flush (void *handle)
+nbd_flush (void *handle, uint32_t flags)
 {
   struct handle *h = handle;
   int c;

+  assert (!flags);
   c = nbd_request (h, 0, NBD_CMD_FLUSH, 0, 0);
   return c < 0 ? c : nbd_reply (h, c);
 }
@@ -697,6 +720,7 @@ static struct nbdkit_plugin plugin = {
   .can_flush          = nbd_can_flush,
   .is_rotational      = nbd_is_rotational,
   .can_trim           = nbd_can_trim,
+  .can_fua            = nbd_can_fua,
   .pread              = nbd_pread,
   .pwrite             = nbd_pwrite,
   .zero               = nbd_zero,
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 13/15] null: Wire up FUA flag support

Since the null plugin already does nothing during writes, it can be
argued that those writes have already landed on persistent storage
(we know that we will read back zeros without any further delays).
Instead of advertising that the client cannot flush or FUA, we can
enable a few more callbacks to get instant native FUA support.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 plugins/null/null.c | 42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/plugins/null/null.c b/plugins/null/null.c
index d469963..905cc64 100644
--- a/plugins/null/null.c
+++ b/plugins/null/null.c
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2017 Red Hat Inc.
+ * Copyright (C) 2017-2018 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -39,6 +39,8 @@
 #include <string.h>
 #include <errno.h>

+#define NBDKIT_API_VERSION 2
+
 #include <nbdkit-plugin.h>

 /* The size of disk in bytes (initialized by size=<SIZE> parameter). */
@@ -111,7 +113,8 @@ null_get_size (void *handle)

 /* Read data. */
 static int
-null_pread (void *handle, void *buf, uint32_t count, uint64_t offset)
+null_pread (void *handle, void *buf, uint32_t count, uint64_t offset,
+            uint32_t flags)
 {
   memset (buf, 0, count);
   return 0;
@@ -119,7 +122,16 @@ null_pread (void *handle, void *buf, uint32_t count,
uint64_t offset)

 /* Write data. */
 static int
-null_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset)
+null_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
+             uint32_t flags)
+{
+  /* nothing */
+  return 0;
+}
+
+/* Write zeros. */
+static int
+null_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags)
 {
   /* nothing */
   return 0;
@@ -133,14 +145,28 @@ null_can_write (void *handle)
   return !h->readonly;
 }

-/* Write zeroes - very efficient! */
+/* Flush is a no-op, so advertise native FUA support */
 static int
-null_zero (void *handle, uint32_t count, uint64_t offset, int may_trim)
+null_can_fua (void *handle)
+{
+  return NBDKIT_FUA_NATIVE;
+}
+
+/* Trim. */
+static int
+null_trim (void *handle, uint32_t count, uint64_t offset, uint32_t flags)
 {
   /* nothing */
   return 0;
 }

+/* Nothing is persistent, so flush is trivially supported */
+static int
+null_flush (void *handle, uint32_t flags)
+{
+  return 0;
+}
+
 static struct nbdkit_plugin plugin = {
   .name              = "null",
   .version           = PACKAGE_VERSION,
@@ -152,7 +178,11 @@ static struct nbdkit_plugin plugin = {
   .pread             = null_pread,
   .pwrite            = null_pwrite,
   .can_write         = null_can_write,
-  .zero              = null_zero,
+  .zero              = null_trim,
+  .trim              = null_zero,
+  .can_trim          = null_can_write,
+  .can_fua           = null_can_fua,
+  .flush             = null_flush,
   /* In this plugin, errno is preserved properly along error return
    * paths from failed system calls.
    */
-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 14/15] todo: Mention possibility of caching .can_FOO callbacks

Recent patches clarified documentation to point out that within
the life of a single connection, the .can_FOO helpers should
return consistent results, and that callers may cache those
results.  But at least in the case of .can_fua, we aren't really
caching things; depending on the overhead involved, calling out
to the plugin's .can_fua on every .pwrite with FUA requested
may be noticeable overhead compared to caching it.  Any cache
must not be a static variable, as it can differ between
connections.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 TODO | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/TODO b/TODO
index b6fb0b1..1e0f483 100644
--- a/TODO
+++ b/TODO
@@ -40,6 +40,10 @@ General ideas for improvements
   ones like offset) can fail to initialize if they can't guarantee
   strict alignment and don't want to deal with bounce buffers.

+* Add per-connection caching of .can_FOO callbacks (we already have
+  some: .can_write is only called once, but .can_fua is called on
+  every request with the FUA flag set).
+
 Suggestions for plugins
 -----------------------

-- 
2.14.3

Eric Blake

2018-Mar-08 23:03 UTC

head link

[Libguestfs] [nbdkit PATCH v3 15/15] RFC: plugins: Add back-compat for new plugin with old nbdkit

If we bump NBDKIT_API_VERSION, we have forcefully made older
nbdkit reject all plugins that opt to the newer API:

$ nbdkit ./plugins/nbd/.libs/nbdkit-nbd-plugin.so --dump-plugin
nbdkit: ./plugins/nbd/.libs/nbdkit-nbd-plugin.so: plugin is incompatible with
this version of nbdkit (_api_version = 2)

But with a bit of finesse, we can make opting in to FUA support
orthogonal to NBDKIT_API_VERSION, by introducing a new witness
level that the user controls, and by providing sane fallbacks
so that newer plugins correctly populate the fields expected by
older nbdkit.

---
v3: rework entirely, add new variable instead of NBDKIT_API_VERSION

This patch is still RFC; if you like it, it's probably better to
rebase portions of this patch into the rest of the series, rather
than churning NBDKIT_API_VERSION temporarily to 2 and now back to 1

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 docs/nbdkit-plugin.pod  | 12 ++++-----
 docs/nbdkit.pod         |  6 ++---
 include/nbdkit-plugin.h | 71 ++++++++++++++++++++++++++++++++++++++++++++-----
 src/internal.h          |  2 +-
 src/plugins.c           |  4 +--
 plugins/nbd/nbd.c       |  2 +-
 plugins/null/null.c     |  2 +-
 7 files changed, 77 insertions(+), 22 deletions(-)

diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod
index 933f710..a82ced8 100644
--- a/docs/nbdkit-plugin.pod
+++ b/docs/nbdkit-plugin.pod
@@ -6,7 +6,7 @@ nbdkit-plugin - How to write nbdkit plugins

 =head1 SYNOPSIS

- #define NBDKIT_API_VERSION 2
+ #define NBDKIT_PLUGIN_LEVEL 2
  
  #include <nbdkit-plugin.h>
  
@@ -65,22 +65,22 @@ L<nbdkit-perl-plugin(3)>,
 L<nbdkit-python-plugin(3)>,
 L<nbdkit-ruby-plugin(3)>.

-=head1 C<#define NBDKIT_API_VERSION 2>
+=head1 C<#define NBDKIT_PLUGIN_LEVEL 2>

 Plugins must choose which API version they want to use, by defining
-NBDKIT_API_VERSION to a positive integer prior to including
+NBDKIT_PLUGIN_LEVEL to a positive integer prior to including
 C<nbdkit-plugin.h> (or any other nbdkit header).  The default version
 is 1 for backwards-compatibility with nbdkit v1.1.26 and earlier;
 however, it is recommended that new plugins be written to the maximum
-version (currently 2) as it enables more features and better
+level (currently 2) as it enables more features and better
 interaction with nbdkit filters.  Therefore, the rest of this document
 only covers the version 2 interface.  A newer nbdkit will always
-support plugins compiled against any prior API version.
+support plugins compiled against any prior API level.

 =head1 C<nbdkit-plugin.h>

 All plugins should start by including this header file, after
-optionally choosing an API version:
+optionally choosing an API level:

  #include <nbdkit-plugin.h>

diff --git a/docs/nbdkit.pod b/docs/nbdkit.pod
index d49ae53..ee27d54 100644
--- a/docs/nbdkit.pod
+++ b/docs/nbdkit.pod
@@ -829,10 +829,8 @@ information about that plugin, eg:
 Plugins which ship with nbdkit usually have the same version as the
 corresponding nbdkit binary.  The nbdkit binary will always be able to
 utilize plugins compiled against an older version of the header;
-however, newer plugins may not be fully supported by an older nbdkit
-binary (for example, a plugin compiled with C<NBDKIT_API_VERSION> of 2
-fails to load with an older nbdkit that only knows
-C<NBDKIT_API_VERSION> 1).
+however, although the design tries hard to allow it, newer plugins may
+not be fully supported by an older nbdkit binary.

 =head2 Detect if a plugin is installed

diff --git a/include/nbdkit-plugin.h b/include/nbdkit-plugin.h
index 3f541a1..9ecddeb 100644
--- a/include/nbdkit-plugin.h
+++ b/include/nbdkit-plugin.h
@@ -42,11 +42,15 @@
 extern "C" {
 #endif

-/* By default, a plugin gets API version 1; but you may request
- * version 2 prior to including this header */
-#ifndef NBDKIT_API_VERSION
-#define NBDKIT_API_VERSION                            1
-#elif (NBDKIT_API_VERSION - 0) < 1 || NBDKIT_API_VERSION > 2
+/* Remaining compatible with API version 1 allows plugins compiled
+ * against newer nbdkit headers to run with older nbdkit binaries.  */
+#define NBDKIT_API_VERSION 1
+
+/* By default, a plugin gets API version 1 level 1; but you may
+ * request version 1 level 2 prior to including this header. */
+#ifndef NBDKIT_PLUGIN_LEVEL
+#define NBDKIT_PLUGIN_LEVEL 1
+#elif (NBDKIT_PLUGIN_LEVEL - 0) < 1 || NBDKIT_PLUGIN_LEVEL > 2
 #error Unsupported API version
 #endif

@@ -86,7 +90,7 @@ struct nbdkit_plugin {
   int (*is_rotational) (void *handle);
   int (*can_trim) (void *handle);

-#if NBDKIT_API_VERSION == 1
+#if NBDKIT_PLUGIN_LEVEL == 1
   int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);
   int (*pwrite) (void *handle, const void *buf, uint32_t count, uint64_t
offset);
   int (*flush) (void *handle);
@@ -105,7 +109,7 @@ struct nbdkit_plugin {
   void (*dump_plugin) (void);

   int (*can_fua) (void *handle);
-#if NBDKIT_API_VERSION == 1
+#if NBDKIT_PLUGIN_LEVEL == 1
   int (*_unused1) (void *, void *, uint32_t, uint64_t);
   int (*_unused2) (void *, const void *, uint32_t, uint64_t, uint32_t);
   int (*_unused3) (void *, uint32_t);
@@ -125,6 +129,7 @@ struct nbdkit_plugin {

 extern void nbdkit_set_error (int err);

+#if NBDKIT_PLUGIN_LEVEL == 1
 #define NBDKIT_REGISTER_PLUGIN(plugin)                                  \
   NBDKIT_CXX_LANG_C                                                     \
   struct nbdkit_plugin *                                                \
@@ -136,6 +141,58 @@ extern void nbdkit_set_error (int err);
     return &(plugin);                                                   \
   }

+#else
+#define NBDKIT_REGISTER_PLUGIN(plugin)                                  \
+  static int                                                            \
+  nbdkit_pread_old (void *handle, void *buf, uint32_t count,            \
+                    uint64_t offset)                                    \
+  {                                                                     \
+    return (plugin).pread (handle, buf, count, offset, 0);              \
+  }                                                                     \
+  static int                                                            \
+  nbdkit_pwrite_old (void *handle, const void *buf, uint32_t count,     \
+                     uint64_t offset)                                   \
+  {                                                                     \
+    return (plugin).pwrite (handle, buf, count, offset, 0);             \
+  }                                                                     \
+  static int                                                            \
+  nbdkit_flush_old (void *handle)                                       \
+  {                                                                     \
+    return (plugin).flush (handle, 0);                                  \
+  }                                                                     \
+  static int                                                            \
+  nbdkit_trim_old (void *handle, uint32_t count, uint64_t offset)       \
+  {                                                                     \
+    return (plugin).trim (handle, count, offset, 0);                    \
+  }                                                                     \
+  static int                                                            \
+  nbdkit_zero_old (void *handle, uint32_t count, uint64_t offset,       \
+                   int may_trim)                                        \
+  {                                                                     \
+    return (plugin).zero (handle, count, offset,                        \
+                          may_trim ? NBDKIT_FLAG_MAY_TRIM : 0);         \
+  }                                                                     \
+  NBDKIT_CXX_LANG_C                                                     \
+  struct nbdkit_plugin *                                                \
+  plugin_init (void)                                                    \
+  {                                                                     \
+    (plugin)._struct_size = sizeof (plugin);                            \
+    (plugin)._api_version = NBDKIT_API_VERSION;                         \
+    (plugin)._thread_model = THREAD_MODEL;                              \
+    (plugin)._pread_old = nbdkit_pread_old;                             \
+    if ((plugin).pwrite)                                                \
+      (plugin)._pwrite_old = nbdkit_pwrite_old;                         \
+    if ((plugin).flush)                                                 \
+      (plugin)._flush_old = nbdkit_flush_old;                           \
+    if ((plugin).trim)                                                  \
+      (plugin)._trim_old = nbdkit_trim_old;                             \
+    if ((plugin).zero)                                                  \
+      (plugin)._zero_old = nbdkit_zero_old;                             \
+    return &(plugin);                                                   \
+  }
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/src/internal.h b/src/internal.h
index ea3155c..e803875 100644
--- a/src/internal.h
+++ b/src/internal.h
@@ -40,7 +40,7 @@
 #include <sys/socket.h>
 #include <pthread.h>

-#define NBDKIT_API_VERSION 2
+#define NBDKIT_PLUGIN_LEVEL 2
 #include "nbdkit-plugin.h"
 #include "nbdkit-filter.h"

diff --git a/src/plugins.c b/src/plugins.c
index d44e724..e583fac 100644
--- a/src/plugins.c
+++ b/src/plugins.c
@@ -389,7 +389,7 @@ plugin_can_fua (struct backend *b, struct connection *conn)

   if (p->plugin.can_fua) {
     r = p->plugin.can_fua (connection_get_handle (conn, 0));
-    if (r > NBDKIT_FUA_EMULATE && p->plugin._api_version == 1)
+    if (r > NBDKIT_FUA_EMULATE && !p->plugin.pread)
       r = NBDKIT_FUA_EMULATE;
     return r;
   }
@@ -655,7 +655,7 @@ plugin_register (size_t index, const char *filename,
   }

   /* Check for incompatible future versions. */
-  if (plugin->_api_version < 0 || plugin->_api_version > 2) {
+  if (plugin->_api_version != NBDKIT_API_VERSION) {
     fprintf (stderr, "%s: %s: plugin is incompatible with this version of
nbdkit (_api_version = %d)\n",
              program_name, p->filename, plugin->_api_version);
     exit (EXIT_FAILURE);
diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c
index db53746..bd1b48b 100644
--- a/plugins/nbd/nbd.c
+++ b/plugins/nbd/nbd.c
@@ -46,7 +46,7 @@
 #include <assert.h>
 #include <pthread.h>

-#define NBDKIT_API_VERSION 2
+#define NBDKIT_PLUGIN_LEVEL 2

 #include <nbdkit-plugin.h>
 #include "protocol.h"
diff --git a/plugins/null/null.c b/plugins/null/null.c
index 905cc64..436892d 100644
--- a/plugins/null/null.c
+++ b/plugins/null/null.c
@@ -39,7 +39,7 @@
 #include <string.h>
 #include <errno.h>

-#define NBDKIT_API_VERSION 2
+#define NBDKIT_PLUGIN_LEVEL 2

 #include <nbdkit-plugin.h>

-- 
2.14.3

Richard W.M. Jones

2018-Mar-20 09:56 UTC

head link

Re: [Libguestfs] [nbdkit PATCH v3 14/15] todo: Mention possibility of caching .can_FOO callbacks

On Thu, Mar 08, 2018 at 05:03:10PM -0600, Eric Blake
wrote:> Recent patches clarified documentation to point out that within
> the life of a single connection, the .can_FOO helpers should
> return consistent results, and that callers may cache those
> results.  But at least in the case of .can_fua, we aren't really
> caching things; depending on the overhead involved, calling out
> to the plugin's .can_fua on every .pwrite with FUA requested
> may be noticeable overhead compared to caching it.  Any cache
> must not be a static variable, as it can differ between
> connections.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  TODO | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/TODO b/TODO
> index b6fb0b1..1e0f483 100644
> --- a/TODO
> +++ b/TODO
> @@ -40,6 +40,10 @@ General ideas for improvements
>    ones like offset) can fail to initialize if they can't guarantee
>    strict alignment and don't want to deal with bounce buffers.
> 
> +* Add per-connection caching of .can_FOO callbacks (we already have
> +  some: .can_write is only called once, but .can_fua is called on
> +  every request with the FUA flag set).
> +
>  Suggestions for plugins
>  -----------------------
> 
ACK series up to and including this patch.

I've got some comments on the final (RFC) patch.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Richard W.M. Jones

2018-Mar-20 10:00 UTC

head link

Re: [Libguestfs] [nbdkit PATCH v3 15/15] RFC: plugins: Add back-compat for new plugin with old nbdkit

On Thu, Mar 08, 2018 at 05:03:11PM -0600, Eric Blake
wrote:> If we bump NBDKIT_API_VERSION, we have forcefully made older
> nbdkit reject all plugins that opt to the newer API:
> 
> $ nbdkit ./plugins/nbd/.libs/nbdkit-nbd-plugin.so --dump-plugin
> nbdkit: ./plugins/nbd/.libs/nbdkit-nbd-plugin.so: plugin is incompatible
with this version of nbdkit (_api_version = 2)
> 
> But with a bit of finesse, we can make opting in to FUA support
> orthogonal to NBDKIT_API_VERSION, by introducing a new witness
> level that the user controls, and by providing sane fallbacks
> so that newer plugins correctly populate the fields expected by
> older nbdkit.
> 
> ---
> v3: rework entirely, add new variable instead of NBDKIT_API_VERSION
> 
> This patch is still RFC; if you like it, it's probably better to
> rebase portions of this patch into the rest of the series, rather
> than churning NBDKIT_API_VERSION temporarily to 2 and now back to 1
I do wonder if this patch is really worthwhile.  Old nbdkit / new
plugin is probably going to go untested in general, and therefore will
just cause problems/complexity.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Eric Blake

2018-Mar-20 15:30 UTC

head link

Re: [Libguestfs] [nbdkit PATCH v3 05/15] filters: Add blocksize filter

On 03/08/2018 05:03 PM, Eric Blake wrote:> Testing is easy by reusing the log filter and observing that
> requests were rewritten as expected (well, with the slight
> complication that different versions of qemu-io vary in whether
> they obey the NBD spec of limiting themselves to 512-byte I/O
> if we don't advertise otherwise with NBD_OPT_GO, or in just
> directly handing us byte-based I/O anyway - so the test has
> to round to something even bigger to be sure it sees a
> difference).
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> +++ b/tests/test-blocksize.sh
> @@ -0,0 +1,156 @@
> +#!/bin/bash -
> +# nbdkit
> +# Copyright (C) 2018 Red Hat Inc.
> +
> +: ${QEMU_IO=qemu-io}
Your pending patch to remove QEMU_IO/HAVE_QEMU_IO impacts whether I 
should do this, or just rely on a PATH override (I can rebase this and 
similar patches on top of your cleanup if you want to push that first)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Eric Blake

2018-Apr-11 04:10 UTC

head link

Re: [Libguestfs] [nbdkit PATCH v3 11/15] plugins: Expose new FUA callbacks

On 03/08/2018 05:03 PM, Eric Blake wrote:> The NBD protocol supports Forced Unit Access (FUA) as a more efficient
> way to wait for just one write to land in persistent storage, rather
> than all outstanding writes at the time of a flush; modeled after
> the kernel's block I/O flag of the same name.  While we can emulate
> the proper semantics with a full-blown flush, there are some plugins
> that can properly pass the FUA flag on to the end storage and thereby
> avoid some overhead.
> 
> This patch introduces new callbacks, and updates the documentation
> to the new API, while ensuring that plugins compiled to the old API
> still work.  The new API adds .can_fua, then adds a flags parameter
> to all five data callbacks, even though only three of them will use
> a flag at the moment.  A plugin client has to opt in to both the
> version 2 API and provide .can_fua with a return of NBDKIT_FUA_NATIVE
> before nbdkit will pass the NBDKIT_FLAG_FUA to the plugin.
> 
> 
> +=head2 C<.can_fua>
> +
> + int can_fua (void *handle);
> +
> +This is called during the option negotiation phase to find out if the
> +plugin supports the Forced Unit Access (FUA) flag on write, zero, and
> +trim requests.  If this returns C<NBDKIT_FUA_NONE>, FUA support is
not
> +advertised to the guest; if this returns C<NBDKIT_FUA_EMULATE>, the
> +C<.flush> callback must work (even if C<.can_flush> returns
false),
> +and FUA support is emulated by calling C<.flush> after any write
> +operation; if this returns C<NBDKIT_FUA_NATIVE>, then the
C<.pwrite>,
> +C<.zero>, and C<.trim> callbacks (if implemented) must handle
the flag
> +C<NBDKIT_FLAG_FUA>, by not returning until that action has landed in
> +persistent storage.
> +
> +If there is an error, C<.can_fua> should call C<nbdkit_error>
with an
> +error message and return C<-1>.
> +
> +This callback is not required unless a plugin wants to specifically
> +handle FUA requests.  If omitted, nbdkit checks C<.can_flush>, and
> +behaves as if this function returns C<NBDKIT_FUA_NONE> or
> +C<NBDKIT_FUA_EMULATE> as appropriate.
That documentation doesn't quite match the code, although I prefer the
documentation...
> +++ b/src/plugins.c
> @@ -377,14 +383,20 @@ plugin_can_fua (struct backend *b, struct connection
*conn)
>    struct backend_plugin *p = container_of (b, struct backend_plugin,
backend);
>    int r;
> 
> +  assert (connection_get_handle (conn, 0));
> +
>    debug ("can_fua");
> 
> -  /* TODO - wire FUA flag support into plugins. Until then, this copies
> -   * can_flush, since that's how we emulate FUA. */
> +  if (p->plugin.can_fua) {
> +    r = p->plugin.can_fua (connection_get_handle (conn, 0));
> +    if (r > NBDKIT_FUA_EMULATE && p->plugin._api_version ==
1)
> +      r = NBDKIT_FUA_EMULATE;
> +    return r;
> +  }
>    r = plugin_can_flush (b, conn);
>    if (r == -1)
>      return -1;
> -  if (r == 0 || !p->plugin.flush)
> +  if (r == 0 || !(p->plugin.flush || p->plugin._flush_old))
>      return NBDKIT_FUA_NONE;
>    return NBDKIT_FUA_EMULATE;
According to the documentation, the result of .can_flush should have no
impact on the return; we want to return NBDKIT_FUA_EMULATE if .flush
exists, even when can_flush returns false.  I'll be pushing this as
followup:

diff --git i/src/plugins.c w/src/plugins.c
index ff4facf..8ee58fb 100644
--- i/src/plugins.c
+++ w/src/plugins.c
@@ -386,18 +386,18 @@ plugin_can_fua (struct backend *b, struct
connection *conn)

   debug ("can_fua");

+  /* The plugin MUST provide .can_fua with a return of NBDKIT_FUA_NATIVE
+     before we will pass FUA on to the plugin. */
   if (p->plugin.can_fua) {
     r = p->plugin.can_fua (connection_get_handle (conn, 0));
     if (r > NBDKIT_FUA_EMULATE && p->plugin._api_version == 1)
       r = NBDKIT_FUA_EMULATE;
     return r;
   }
-  r = plugin_can_flush (b, conn);
-  if (r == -1)
-    return -1;
-  if (r == 0 || !(p->plugin.flush || p->plugin._flush_old))
-    return NBDKIT_FUA_NONE;
-  return NBDKIT_FUA_EMULATE;
+  /* We intend to call .flush even if .can_flush returns false */
+  if (p->plugin.flush || p->plugin._flush_old)
+    return NBDKIT_FUA_EMULATE;
+  return NBDKIT_FUA_NONE;
 }

 /* Grab the appropriate error value.


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Apparently Analagous Threads

Search for more apparently analagous threads

Libguestfs - Mar 2018 - [nbdkit PATCH v3 00/15] Add FUA support to nbdkit

[Libguestfs] [nbdkit PATCH v3 00/15] Add FUA support to nbdkit

[Libguestfs] [nbdkit PATCH v3 01/15] src: Let internal.h handle common includes

[Libguestfs] [nbdkit PATCH v3 02/15] backend: Rework internal error return semantics

[Libguestfs] [nbdkit PATCH v3 03/15] filters: Adjust callback API for flags/errors

[Libguestfs] [nbdkit PATCH v3 04/15] filters: Add log filter

[Libguestfs] [nbdkit PATCH v3 05/15] filters: Add blocksize filter

[Libguestfs] [nbdkit PATCH v3 06/15] backend: Add .can_zero/.can_fua helpers

[Libguestfs] [nbdkit PATCH v3 07/15] filters: Expose new .can_zero callback

[Libguestfs] [nbdkit PATCH v3 08/15] filters: Add nozero filter

[Libguestfs] [nbdkit PATCH v3 09/15] filters: Expose new .can_fua callback

[Libguestfs] [nbdkit PATCH v3 10/15] filters: Add fua filter

[Libguestfs] [nbdkit PATCH v3 11/15] plugins: Expose new FUA callbacks

[Libguestfs] [nbdkit PATCH v3 12/15] nbd: Wire up FUA flag passthrough

[Libguestfs] [nbdkit PATCH v3 13/15] null: Wire up FUA flag support

[Libguestfs] [nbdkit PATCH v3 14/15] todo: Mention possibility of caching .can_FOO callbacks

[Libguestfs] [nbdkit PATCH v3 15/15] RFC: plugins: Add back-compat for new plugin with old nbdkit

Re: [Libguestfs] [nbdkit PATCH v3 14/15] todo: Mention possibility of caching .can_FOO callbacks

Re: [Libguestfs] [nbdkit PATCH v3 15/15] RFC: plugins: Add back-compat for new plugin with old nbdkit

Re: [Libguestfs] [nbdkit PATCH v3 05/15] filters: Add blocksize filter

Re: [Libguestfs] [nbdkit PATCH v3 11/15] plugins: Expose new FUA callbacks

Apparently Analagous Threads