Eric Blake
2019-Aug-23 14:30 UTC
[Libguestfs] cross-project patches: Add NBD Fast Zero support
This is a cover letter to a series of patches being proposed in tandem to four different projects:

- nbd: Document a new NBD_CMD_FLAG_FAST_ZERO command flag
- qemu: Implement the flag for both clients and server
- libnbd: Implement the flag for clients
- nbdkit: Implement the flag for servers, including the nbd passthrough client

If you want to test the patches together, I've pushed a 'fast-zero' branch to each of:

https://repo.or.cz/nbd/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/qemu/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/libnbd/ericb.git/shortlog/refs/heads/fast-zero
https://repo.or.cz/nbdkit/ericb.git/shortlog/refs/heads/fast-zero

I've run several tests to demonstrate why this is useful, as well as to prove that, because I have multiple interoperable projects, it is worth including in the NBD standard. The original proposal was here:

https://lists.debian.org/nbd/2019/03/msg00004.html

where I stated:

> I will not push this without both:
> - a positive review (for example, we may decide that burning another
>   NBD_FLAG_* is undesirable, and that we should instead have some sort
>   of NBD_OPT_ handshake for determining when the server supports
>   NBD_CMD_FLAG_FAST_ZERO)
> - a reference client and server implementation (probably both via qemu,
>   since it was qemu that raised the problem in the first place)

Consensus on that thread seemed to be that a new NBD_FLAG was okay; this thread addresses the second bullet by providing the reference implementations.
Here's what I did for testing full-path interoperability:

nbdkit memory -> qemu-nbd -> nbdkit nbd -> nbdsh

$ nbdkit -p 10810 --filter=nozero --filter=delay memory 1m delay-write=3 zeromode=emulate
$ qemu-nbd -p 10811 -f raw nbd://localhost:10810
$ nbdkit -p 10812 nbd nbd://localhost:10811
$ time nbdsh --connect nbd://localhost:10812 -c 'buf = h.zero(512, 0)'
  # takes more than 3 seconds, but succeeds
$ time nbdsh --connect nbd://localhost:10812 -c 'buf = h.zero(512, 0, nbd.CMD_FLAG_FAST_ZERO)'
  # takes less than 1 second to fail with ENOTSUP

And here are some demonstrations of why the feature matters, starting with this qemu thread as justification:

https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html

First, I had to create a scenario where falling back to writes is noticeably slower than performing a zero operation, and where pre-zeroing also shows an effect. My choice: let's test 'qemu-img convert' on an image that is half-sparse (every other megabyte is a hole) to an in-memory nbd destination. Then I use a series of nbdkit filters to force the destination to behave in various manners:

log logfile=>(sed ...|uniq -c)
  (track how many normal/fast zero requests the client makes)
nozero $params
  (fine-tune how zero requests behave - the parameters zeromode and
  fastzeromode are the real drivers of my various tests)
blocksize maxdata=256k
  (allows large zero requests, but forces large writes into smaller
  chunks, to magnify the effects of write delays and allow testing to
  provide obvious results with a smaller image)
delay delay-write=20ms delay-zero=5ms
  (also to magnify the effects on a smaller image, with writes
  penalized more than zeroing)
stats statsfile=/dev/stderr
  (track overall time and a decent summary of how much I/O occurred)
noextents
  (forces the entire image to report that it is allocated, which
  eliminates any testing variability based on whether qemu-img uses
  that to bypass a zeroing operation [1])

So here's my one-time setup, followed by repetitions of the nbdkit command with different parameters to the nozero filter to explore different behaviors.

$ qemu-img create -f qcow2 src 100m
$ for i in `seq 0 2 99`; do qemu-io -f qcow2 -c "w ${i}m 1m" src; done
$ nbdkit -U - --filter=log --filter=nozero --filter=blocksize \
    --filter=delay --filter=stats --filter=noextents memory 100m \
    logfile=>(sed -n '/Zero.*\.\./ s/.*\(fast=.\).*/\1/p' |sort|uniq -c) \
    statsfile=/dev/stderr delay-write=20ms delay-zero=5s maxdata=256k \
    --run 'qemu-img convert -n -f qcow2 -O raw src $nbd' $params

Establish a baseline: when qemu-img does not see write zero support at all (such as when talking to /dev/nbd0, because the kernel NBD implementation still does not support write zeroes), qemu is forced to write the entire disk, including the holes, but doesn't waste any time pre-zeroing or checking block status for whether the disk is zero (the default of the nozero filter is to turn off write zero advertisement):

params=
elapsed time: 8.54488 s
write: 400 ops, 104857600 bytes, 9.81712e+07 bits/s

Next, let's emulate what qemu 3.1 was like, with a blind pre-zeroing pass over the entire image without regard to whether that pass is fast or slow. For this test, it was easier to use modern qemu and merely ignore the fast zero bit in nbdkit, but the numbers should be similar when actually using older qemu.
If qemu guessed right that pre-zeroing is fast, we see:

params='zeromode=plugin fastzeromode=ignore'
elapsed time: 4.30183 s
write: 200 ops, 52428800 bytes, 9.75005e+07 bits/s
zero: 4 ops, 104857600 bytes, 1.95001e+08 bits/s
      4 fast=1

which is a definite win - instead of having to write the half of the image that was zero on the source, the fast pre-zeroing pass already cleared it (qemu-img currently breaks write zeroes into 32M chunks [1], and thus requires 4 zero requests to pre-zero the image). But if qemu guesses wrong:

params='zeromode=emulate fastzeromode=ignore'
elapsed time: 12.5065 s
write: 600 ops, 157286400 bytes, 1.00611e+08 bits/s
      4 fast=1

Ouch - that is actually slower than the case where zeroing is not used at all, because the zeroes that turned into writes result in performing double the I/O over the data portions of the file (once during the pre-zero pass, then again during the data pass). The qemu 3.1 behavior is very bipolar in nature, and we don't like that.

So qemu 4.0 introduced BDRV_REQ_NO_FALLBACK, which qemu uses during the pre-zero request to fail quickly if pre-zeroing is not viable. At the time, NBD did not have a way to support fast zero requests, so qemu blindly assumes that pre-zeroing is not viable over NBD:

params='zeromode=emulate fastzeromode=none'
elapsed time: 8.32433 s
write: 400 ops, 104857600 bytes, 1.00772e+08 bits/s
      50 fast=0

When zeroing is slow, our time actually beats the baseline by about 0.2 seconds (although zeroing still turned into writes, the use of zero requests results in less network traffic; you also see that there are 50 zero requests, one per hole, rather than 4 requests for pre-zeroing the image). So we've avoided the pre-zeroing penalty.
However:

params='zeromode=plugin fastzeromode=none'
elapsed time: 4.53951 s
write: 200 ops, 52428800 bytes, 9.23955e+07 bits/s
zero: 50 ops, 52428800 bytes, 9.23955e+07 bits/s
      50 fast=0

when zeroing is fast, we're still 0.2 seconds slower than the pre-zeroing behavior (zeroing runs fast, but one request per hole is still more transactions than pre-zeroing used to use). The qemu 4.0 decision thus avoided the worst degradation seen in 3.1 when zeroing is slow, but at a penalty to the case when zeroing is fast.

Since guessing is never as nice as knowing, let's repeat the test, but now exploiting the new NBD fast zero:

params='zeromode=emulate'
elapsed time: 8.41174 s
write: 400 ops, 104857600 bytes, 9.9725e+07 bits/s
      50 fast=0
      1 fast=1

Good: when zeroes are not fast, qemu-img's initial fast-zero request immediately fails, and then it switches back to writing the entire image using regular zeroing for the holes; performance is comparable to the baseline and to the qemu 4.0 behavior.

params='zeromode=plugin'
elapsed time: 4.31356 s
write: 200 ops, 52428800 bytes, 9.72354e+07 bits/s
zero: 4 ops, 104857600 bytes, 1.94471e+08 bits/s
      4 fast=1

Good: when zeroes are fast, qemu-img is able to use pre-zeroing on the entire image, resulting in fewer zero transactions overall, getting us back to the qemu 3.1 maximum performance (and better than the 4.0 behavior).

I hope you enjoyed reading this far, and agree with my interpretation of the numbers about why this feature is useful!
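The strategies compared above can be condensed into a small model. This is an illustrative simulation, not qemu code: it merely counts the client-visible requests for the 100m half-sparse image, assuming the 32M zero-chunking and the 256k maxdata split described above, so the op counts can be checked against the logs.

```python
MB = 1024 * 1024
IMAGE = 100 * MB          # test image: even MiBs hold data, odd MiBs are holes
MAX_ZERO = 32 * MB        # qemu-img currently splits zero requests at 32M
MAX_DATA = 256 * 1024     # blocksize filter: maxdata=256k splits writes

def convert(server_zero_is_fast, probe_fast_zero):
    """Count client-visible (zero_ops, write_ops) for one simulated convert."""
    zero_ops = write_ops = 0
    prezeroed = False
    if probe_fast_zero:
        if server_zero_is_fast:
            # Fast pre-zero pass succeeds: ceil(100M / 32M) = 4 requests.
            zero_ops += (IMAGE + MAX_ZERO - 1) // MAX_ZERO
            prezeroed = True
        else:
            zero_ops += 1  # the first FAST_ZERO request fails with ENOTSUP
    for mib in range(IMAGE // MB):
        if mib % 2 == 0:
            write_ops += MB // MAX_DATA   # data MiB, split into 256k writes
        elif not prezeroed:
            zero_ops += 1                 # one regular zero request per hole
    return zero_ops, write_ops

print(convert(True, True))    # fast zeroing: 4 pre-zero ops, like '4 fast=1'
print(convert(False, True))   # slow zeroing: '1 fast=1' probe plus '50 fast=0'
```

The two printed cases line up with the '4 fast=1' and '50 fast=0 / 1 fast=1' request counts in the logs above; the write byte counts differ from the stats filter output only because the stats also see the server-side fallback writes.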
[1] Orthogonal to these patches are other ideas I have for improving the NBD protocol in its effects on qemu-img convert, which will result in later cross-project patches:

- NBD should have a way to advertise (probably via NBD_INFO_ during
  NBD_OPT_GO) if the initial image is known to begin life with all
  zeroes (if that is the case, qemu-img can skip the extents calls and
  the pre-zeroing pass altogether)
- improving support to allow NBD to pass larger zero requests (qemu is
  currently capping zero requests at 32m based on NBD_INFO_BLOCK_SIZE,
  but could easily go up to ~4G with proper info advertisement of
  maximum zero request sizing, or beyond if we introduce 64-bit
  commands to the NBD protocol)

Given that NBD extensions need not be present in every server, each orthogonal improvement should be tested in isolation to show that it helps, even though qemu-img will probably use all of the extensions at once when the server supports all of them.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
Eric Blake
2019-Aug-23 14:34 UTC
[Libguestfs] [PATCH 0/1] NBD protocol change to add fast zero support
See the cross-post cover letter for details:
https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html

Eric Blake (1):
  protocol: Add NBD_CMD_FLAG_FAST_ZERO

 doc/proto.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

-- 
2.21.0
Eric Blake
2019-Aug-23 14:34 UTC
[Libguestfs] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO
While it may be counterintuitive at first, the introduction of NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance regression in qemu [1] when copying a sparse file. When the destination file must contain the same contents as the source, but it is not known in advance whether the destination started life with all zero content, there are cases where it is faster to request a bulk zero of the entire device followed by writing only the portions of the device that are to contain data, as that results in fewer I/O transactions overall. In fact, there are even situations where trimming the entire device prior to writing zeroes may be faster than a bare write zero request [2]. However, if a bulk zero request ever falls back to the same speed as a normal write, a bulk pre-zeroing algorithm is actually a pessimization, as it ends up writing portions of the disk twice.

[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html
[2] https://github.com/libguestfs/nbdkit/commit/407f8dde

Hence, it is desirable to have a way for clients to specify that a particular write zero request is being attempted for a fast wipe, and get an immediate failure if the zero request would otherwise take the same time as a write. Conversely, if the client is not performing a pre-initialization pass, it is still more efficient in terms of network traffic to send NBD_CMD_WRITE_ZEROES requests where the server implements the fallback to the slower write, than it is for the client to have to perform the fallback by sending NBD_CMD_WRITE with a zeroed buffer.

Add a protocol flag and corresponding transmission advertisement flag to make it easier for clients to inform the server of their intent.
If the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two things: to perform a fallback to write when the client does not request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the lower network overhead); and to fail quickly with ENOTSUP, preferably without modifying the export, if the client requested the flag but the server cannot write zeroes more efficiently than a normal write (so that the client is not penalized with the time of writing data areas of the disk twice).

Note that the semantics are chosen so that servers should advertise the new flag whether or not they have fast zeroing (that is, this is NOT the server advertising that it has fast zeroes, but rather advertising that the client can get fast feedback as needed on whether zeroing is fast). It is also intentional that the new advertisement includes a new errno value, ENOTSUP, with rules that this error should not be returned for any pre-existing behaviors, must not happen when the client does not request a fast zero, and must be returned quickly if the client requested fast zero but anything other than the error would not be fast; this still leaves it possible for clients to distinguish other errors, such as EINVAL when alignment constraints are not met.

Clients should not send the flag unless the server advertised support, but well-behaved servers should already be reporting EINVAL for unrecognized flags. If the server does not advertise the new feature, clients can safely fall back to assuming that writing zeroes is no faster than normal writes (whether or not the assumption actually holds).

Note that the Linux fallocate(2) interface may or may not be powerful enough to easily determine if zeroing will be efficient - in particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that insight; likewise, for block devices, it is known that ioctl(BLKZEROOUT) does NOT have a way for userspace to probe if it is efficient or slow.
But with enough demand, the kernel may add another FALLOC_FL_ flag to use with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed ENOTSUP failures if a fast path cannot be taken. If a server cannot easily determine if write zeroes will be efficient, the server should either fail all NBD_CMD_FLAG_FAST_ZERO with ENOTSUP, or else choose to not advertise NBD_FLAG_SEND_FAST_ZERO.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 doc/proto.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/doc/proto.md b/doc/proto.md
index 52d3e7b..702688b 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -1070,6 +1070,18 @@ The field has the following format:
   which support the command without advertising this bit, and
   conversely that this bit does not guarantee that the command will
   succeed or have an impact.
+- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether
+  `NBD_CMD_WRITE_ZEROES` is faster than a corresponding write. The
+  server MUST set this transmission flag to 1 if the
+  `NBD_CMD_WRITE_ZEROES` request supports the `NBD_CMD_FLAG_FAST_ZERO`
+  flag, and MUST set this transmission flag to 0 if
+  `NBD_FLAG_SEND_WRITE_ZEROES` is not set. Servers MAY set this
+  transmission flag even if it will always use `NBD_ENOTSUP` failures for
+  requests with `NBD_CMD_FLAG_FAST_ZERO` set (such as if the server
+  cannot quickly determine whether a particular write zeroes request
+  will be faster than a regular write). Clients MUST NOT set the
+  `NBD_CMD_FLAG_FAST_ZERO` request flag unless this transmission flag
+  is set.

 Clients SHOULD ignore unknown flags.

@@ -1647,6 +1659,12 @@ valid may depend on negotiation during the handshake phase.
   MUST NOT send metadata on more than one extent in the reply. Client
   implementors should note that using this flag on multiple contiguous
   requests is likely to be inefficient.
+- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during
+  `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the
+  write zeroes any faster than it would for an equivalent
+  `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of
+  `NBD_ENOTSUP`. The client MUST NOT set this unless the server
+  advertised `NBD_FLAG_SEND_FAST_ZERO`.

 ##### Structured reply flags

@@ -2015,7 +2033,10 @@ The following request types exist:
     reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use.

     A client MUST NOT send a write zeroes request unless
-    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
+    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags
+    field. Additionally, a client MUST NOT send the
+    `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was
+    set in the transmission flags field.

     By default, the server MAY use trimming to zero out the area, even
     if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure
@@ -2025,6 +2046,28 @@ The following request types exist:
     same area will not cause fragmentation or cause failure due to
     insufficient space.

+    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
+    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
+    with `NBD_ENOTSUP`, even if the operation is no faster than a
+    corresponding `NBD_CMD_WRITE`. Conversely, if
+    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
+    `NBD_ENOTSUP` unless the request can be serviced in less time than
+    a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents
+    of the export when returning this failure. The server's
+    determination of a fast request MAY depend on a number of factors,
+    such as whether the request was suitably aligned, on whether the
+    `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a
+    previous `NBD_CMD_TRIM` had been performed on the region. If the
+    server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD
+    NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing
+    a request, and SHOULD fail with `NBD_EINVAL` if the
+    `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
+    `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
+    zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
+    the flag is set if the server cannot quickly determine in advance
+    whether that request would have been fast, even if it turns out
+    that the same request without the flag would be fast after all.
+
     If an error occurs, the server MUST set the appropriate error code
     in the error field.

@@ -2125,6 +2168,7 @@ The following error values are defined:
 * `NBD_EINVAL` (22), Invalid argument.
 * `NBD_ENOSPC` (28), No space left on device.
 * `NBD_EOVERFLOW` (75), Value too large.
+* `NBD_ENOTSUP` (95), Operation not supported.
 * `NBD_ESHUTDOWN` (108), Server is in the process of being shut down.

 The server SHOULD return `NBD_ENOSPC` if it receives a write request
@@ -2139,6 +2183,10 @@ read-only export.
 The server SHOULD NOT return `NBD_EOVERFLOW` except as documented in
 response to `NBD_CMD_READ` when `NBD_CMD_FLAG_DF` is supported.

+The server SHOULD NOT return `NBD_ENOTSUP` except as documented in
+response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is
+supported.
+
 The server SHOULD return `NBD_EINVAL` if it receives an unknown command.

 The server SHOULD return `NBD_EINVAL` if it receives an unknown
-- 
2.21.0
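To restate the new error rules compactly, here is an illustrative decision function - a sketch of the spec text in this patch, not code from any of the four projects; only the NBD_E* numeric values are the real protocol constants:

```python
NBD_EINVAL, NBD_ENOTSUP = 22, 95

def write_zeroes_error(advertised_fast_zero, flag_fast_zero, zero_is_fast):
    """Return 0, or the NBD_E* error a compliant server sends for WRITE_ZEROES."""
    if flag_fast_zero and not advertised_fast_zero:
        # Client violated the protocol; NBD_ENOTSUP stays reserved for the
        # advertised case, so a generic EINVAL is the suggested response.
        return NBD_EINVAL
    if flag_fast_zero and not zero_is_fast:
        # Fail quickly, preferably without altering the export.
        return NBD_ENOTSUP
    return 0  # proceed; the server may still fall back to a slow write
```

Note the asymmetry: without the flag the server must never use NBD_ENOTSUP, even when zeroing happens to be slow.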
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 0/5] Add NBD fast zero support to qemu client and server
See the cross-post cover letter for more details:
https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html

Based-on: https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg04805.html
[Andrey Shinkevich: block: workaround for unaligned byte range in fallocate()]

Eric Blake (5):
  nbd: Improve per-export flag handling in server
  nbd: Prepare for NBD_CMD_FLAG_FAST_ZERO
  nbd: Implement client use of NBD FAST_ZERO
  nbd: Implement server use of NBD FAST_ZERO
  nbd: Tolerate more errors to structured reply request

 docs/interop/nbd.txt |  3 +-
 include/block/nbd.h  |  6 +++-
 block/nbd.c          |  7 +++++
 blockdev-nbd.c       |  3 +-
 nbd/client.c         | 39 ++++++++++++++----------
 nbd/common.c         |  5 ++++
 nbd/server.c         | 70 ++++++++++++++++++++++++++------------------
 qemu-nbd.c           |  7 +++--
 8 files changed, 88 insertions(+), 52 deletions(-)

-- 
2.21.0
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 1/5] nbd: Improve per-export flag handling in server
When creating a read-only image, we are still advertising support for TRIM and WRITE_ZEROES to the client, even though the client should not be issuing those commands. But seeing this requires looking across multiple functions: all callers to nbd_export_new() pass a single flag based solely on whether the export allows writes; later, we pass a constant set of flags to nbd_negotiate_options() (namely, the set of flags which we always support, at least for writable images), which is then further dynamically modified based on client requests for structured options; finally, when processing NBD_OPT_EXPORT_NAME or NBD_OPT_GO we bitwise-or the original caller's flag with the runtime set of flags we've built up over several functions.

Let's refactor things to instead compute a baseline of flags as soon as possible, in nbd_export_new(), and change the signature for the callers to pass in a simpler bool rather than having to figure out flags. We can then get rid of the 'myflags' parameter to various functions, and instead refer to the client for everything we need (we still have to perform a bitwise-OR during NBD_OPT_EXPORT_NAME and NBD_OPT_GO, but it's easier to see what is being computed). This lets us quit advertising senseless flags for read-only images, as well as making the next patch for exposing FAST_ZERO support easier to write.
Signed-off-by: Eric Blake <eblake@redhat.com>
---
 include/block/nbd.h |  2 +-
 blockdev-nbd.c      |  3 +--
 nbd/server.c        | 62 +++++++++++++++++++++++++--------------------
 qemu-nbd.c          |  6 ++---
 4 files changed, 39 insertions(+), 34 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index 991fd52a5134..2c87b42dfd48 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -326,7 +326,7 @@ typedef struct NBDClient NBDClient;
 NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset,
                           uint64_t size, const char *name, const char *desc,
-                          const char *bitmap, uint16_t nbdflags, bool shared,
+                          const char *bitmap, bool readonly, bool shared,
                           void (*close)(NBDExport *), bool writethrough,
                           BlockBackend *on_eject_blk, Error **errp);
 void nbd_export_close(NBDExport *exp);
diff --git a/blockdev-nbd.c b/blockdev-nbd.c
index ecfa2ef0adb5..1cdffab4fce1 100644
--- a/blockdev-nbd.c
+++ b/blockdev-nbd.c
@@ -187,8 +187,7 @@ void qmp_nbd_server_add(const char *device, bool has_name, const char *name,
         writable = false;
     }

-    exp = nbd_export_new(bs, 0, len, name, NULL, bitmap,
-                         writable ? 0 : NBD_FLAG_READ_ONLY, !writable,
+    exp = nbd_export_new(bs, 0, len, name, NULL, bitmap, !writable, !writable,
                          NULL, false, on_eject_blk, errp);
     if (!exp) {
         return;
diff --git a/nbd/server.c b/nbd/server.c
index 0fb41c6c50ea..b5577828aa44 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -423,14 +423,14 @@ static void nbd_check_meta_export(NBDClient *client)

 /* Send a reply to NBD_OPT_EXPORT_NAME.
  * Return -errno on error, 0 on success. */
-static int nbd_negotiate_handle_export_name(NBDClient *client,
-                                            uint16_t myflags, bool no_zeroes,
+static int nbd_negotiate_handle_export_name(NBDClient *client, bool no_zeroes,
                                             Error **errp)
 {
     char name[NBD_MAX_NAME_SIZE + 1];
     char buf[NBD_REPLY_EXPORT_NAME_SIZE] = "";
     size_t len;
     int ret;
+    uint16_t myflags;

     /* Client sends:
         [20 ..  xx]   export name (length bytes)
@@ -458,10 +458,13 @@ static int nbd_negotiate_handle_export_name(NBDClient *client,
         return -EINVAL;
     }

-    trace_nbd_negotiate_new_style_size_flags(client->exp->size,
-                                             client->exp->nbdflags | myflags);
+    myflags = client->exp->nbdflags;
+    if (client->structured_reply) {
+        myflags |= NBD_FLAG_SEND_DF;
+    }
+    trace_nbd_negotiate_new_style_size_flags(client->exp->size, myflags);
     stq_be_p(buf, client->exp->size);
-    stw_be_p(buf + 8, client->exp->nbdflags | myflags);
+    stw_be_p(buf + 8, myflags);
     len = no_zeroes ? 10 : sizeof(buf);
     ret = nbd_write(client->ioc, buf, len, errp);
     if (ret < 0) {
@@ -526,8 +529,7 @@ static int nbd_reject_length(NBDClient *client, bool fatal, Error **errp)
 /* Handle NBD_OPT_INFO and NBD_OPT_GO.
  * Return -errno on error, 0 if ready for next option, and 1 to move
  * into transmission phase.  */
-static int nbd_negotiate_handle_info(NBDClient *client, uint16_t myflags,
-                                     Error **errp)
+static int nbd_negotiate_handle_info(NBDClient *client, Error **errp)
 {
     int rc;
     char name[NBD_MAX_NAME_SIZE + 1];
@@ -540,6 +542,7 @@ static int nbd_negotiate_handle_info(NBDClient *client, uint16_t myflags,
     uint32_t sizes[3];
     char buf[sizeof(uint64_t) + sizeof(uint16_t)];
     uint32_t check_align = 0;
+    uint16_t myflags;

     /* Client sends:
         4 bytes: L, name length (can be 0)
@@ -637,10 +640,13 @@ static int nbd_negotiate_handle_info(NBDClient *client, uint16_t myflags,
     }

     /* Send NBD_INFO_EXPORT always */
-    trace_nbd_negotiate_new_style_size_flags(exp->size,
-                                             exp->nbdflags | myflags);
+    myflags = exp->nbdflags;
+    if (client->structured_reply) {
+        myflags |= NBD_FLAG_SEND_DF;
+    }
+    trace_nbd_negotiate_new_style_size_flags(exp->size, myflags);
     stq_be_p(buf, exp->size);
-    stw_be_p(buf + 8, exp->nbdflags | myflags);
+    stw_be_p(buf + 8, myflags);
     rc = nbd_negotiate_send_info(client, NBD_INFO_EXPORT,
                                  sizeof(buf), buf, errp);
     if (rc < 0) {
@@ -1037,8 +1043,7 @@ static int nbd_negotiate_meta_queries(NBDClient *client,
  * 1    if client sent NBD_OPT_ABORT, i.e. on valid disconnect,
  *      errp is not set
  */
-static int nbd_negotiate_options(NBDClient *client, uint16_t myflags,
-                                 Error **errp)
+static int nbd_negotiate_options(NBDClient *client, Error **errp)
 {
     uint32_t flags;
     bool fixedNewstyle = false;
@@ -1172,13 +1177,12 @@ static int nbd_negotiate_options(NBDClient *client, uint16_t myflags,
                 return 1;

             case NBD_OPT_EXPORT_NAME:
-                return nbd_negotiate_handle_export_name(client,
-                                                        myflags, no_zeroes,
+                return nbd_negotiate_handle_export_name(client, no_zeroes,
                                                         errp);

             case NBD_OPT_INFO:
             case NBD_OPT_GO:
-                ret = nbd_negotiate_handle_info(client, myflags, errp);
+                ret = nbd_negotiate_handle_info(client, errp);
                 if (ret == 1) {
                     assert(option == NBD_OPT_GO);
                     return 0;
@@ -1209,7 +1213,6 @@ static int nbd_negotiate_options(NBDClient *client, uint16_t myflags,
                 } else {
                     ret = nbd_negotiate_send_rep(client, NBD_REP_ACK, errp);
                     client->structured_reply = true;
-                    myflags |= NBD_FLAG_SEND_DF;
                 }
                 break;
@@ -1232,8 +1235,7 @@ static int nbd_negotiate_options(NBDClient *client, uint16_t myflags,
              */
            switch (option) {
            case NBD_OPT_EXPORT_NAME:
-                return nbd_negotiate_handle_export_name(client,
-                                                        myflags, no_zeroes,
+                return nbd_negotiate_handle_export_name(client, no_zeroes,
                                                         errp);

            default:
@@ -1259,9 +1261,6 @@ static coroutine_fn int nbd_negotiate(NBDClient *client, Error **errp)
 {
     char buf[NBD_OLDSTYLE_NEGOTIATE_SIZE] = "";
     int ret;
-    const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
-                              NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA |
-                              NBD_FLAG_SEND_WRITE_ZEROES | NBD_FLAG_SEND_CACHE);

     /* Old style negotiation header, no room for options
         [ 0 ..   7]   passwd       ("NBDMAGIC")
@@ -1289,7 +1288,7 @@ static coroutine_fn int nbd_negotiate(NBDClient *client, Error **errp)
         error_prepend(errp, "write failed: ");
         return -EINVAL;
     }
-    ret = nbd_negotiate_options(client, myflags, errp);
+    ret = nbd_negotiate_options(client, errp);
     if (ret != 0) {
         if (ret < 0) {
             error_prepend(errp, "option negotiation failed: ");
@@ -1461,7 +1460,7 @@ static void nbd_eject_notifier(Notifier *n, void *data)

 NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset,
                           uint64_t size, const char *name, const char *desc,
-                          const char *bitmap, uint16_t nbdflags, bool shared,
+                          const char *bitmap, bool readonly, bool shared,
                           void (*close)(NBDExport *), bool writethrough,
                           BlockBackend *on_eject_blk, Error **errp)
 {
@@ -1485,10 +1484,8 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset,
     /* Don't allow resize while the NBD server is running, otherwise we don't
      * care what happens with the node. */
     perm = BLK_PERM_CONSISTENT_READ;
-    if ((nbdflags & NBD_FLAG_READ_ONLY) == 0) {
+    if (!readonly) {
         perm |= BLK_PERM_WRITE;
-    } else if (shared) {
-        nbdflags |= NBD_FLAG_CAN_MULTI_CONN;
     }
     blk = blk_new(bdrv_get_aio_context(bs), perm,
                   BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
@@ -1507,7 +1504,16 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset,
     exp->dev_offset = dev_offset;
     exp->name = g_strdup(name);
     exp->description = g_strdup(desc);
-    exp->nbdflags = nbdflags;
+    exp->nbdflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_FLUSH |
+                     NBD_FLAG_SEND_FUA | NBD_FLAG_SEND_CACHE);
+    if (readonly) {
+        exp->nbdflags |= NBD_FLAG_READ_ONLY;
+        if (shared) {
+            exp->nbdflags |= NBD_FLAG_CAN_MULTI_CONN;
+        }
+    } else {
+        exp->nbdflags |= NBD_FLAG_SEND_TRIM | NBD_FLAG_SEND_WRITE_ZEROES;
+    }
     assert(size <= INT64_MAX - dev_offset);
     exp->size = QEMU_ALIGN_DOWN(size, BDRV_SECTOR_SIZE);
@@ -1532,7 +1538,7 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset,
             goto fail;
         }

-        if ((nbdflags & NBD_FLAG_READ_ONLY) && bdrv_is_writable(bs) &&
+        if (readonly && bdrv_is_writable(bs) &&
             bdrv_dirty_bitmap_enabled(bm)) {
             error_setg(errp,
                        "Enabled bitmap '%s' incompatible with readonly export",
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 55f5ceaf5c92..079702bb837f 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -600,7 +600,7 @@ int main(int argc, char **argv)
     BlockBackend *blk;
     BlockDriverState *bs;
     uint64_t dev_offset = 0;
-    uint16_t nbdflags = 0;
+    bool readonly = false;
     bool disconnect = false;
     const char *bindto = NULL;
     const char *port = NULL;
@@ -782,7 +782,7 @@ int main(int argc, char **argv)
             }
             /* fall through */
         case 'r':
-            nbdflags |= NBD_FLAG_READ_ONLY;
+            readonly = true;
             flags &= ~BDRV_O_RDWR;
             break;
         case 'P':
@@ -1173,7 +1173,7 @@ int main(int argc, char **argv)
     }

     export = nbd_export_new(bs, dev_offset, fd_size, export_name,
-                            export_description, bitmap, nbdflags, shared > 1,
+                            export_description, bitmap, readonly, shared > 1,
                             nbd_export_closed, writethrough, NULL,
                             &error_fatal);
-- 
2.21.0
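The baseline flag computation that this refactoring centralizes in nbd_export_new() can be restated in a few lines. This is an illustrative Python sketch, not qemu code; the bit positions are the real NBD_FLAG_* values from include/block/nbd.h:

```python
# NBD transmission flag bits
HAS_FLAGS, READ_ONLY, SEND_FLUSH, SEND_FUA = 1 << 0, 1 << 1, 1 << 2, 1 << 3
SEND_TRIM, SEND_WRITE_ZEROES = 1 << 5, 1 << 6
CAN_MULTI_CONN, SEND_CACHE = 1 << 8, 1 << 10

def export_flags(readonly, shared):
    """Baseline nbdflags, computed once at export creation."""
    flags = HAS_FLAGS | SEND_FLUSH | SEND_FUA | SEND_CACHE
    if readonly:
        flags |= READ_ONLY
        if shared:
            flags |= CAN_MULTI_CONN   # only safe to advertise when read-only
    else:
        # Commands that modify the export are only advertised when writable.
        flags |= SEND_TRIM | SEND_WRITE_ZEROES
    return flags
```

With the flags decided up front, read-only exports no longer advertise TRIM or WRITE_ZEROES, which is the bug the commit message describes.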
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 2/5] nbd: Prepare for NBD_CMD_FLAG_FAST_ZERO
Commit fe0480d6 and friends added BDRV_REQ_NO_FALLBACK as a way to avoid wasting time on a preliminary write-zero request that will later be rewritten by actual data, if it is known that the write-zero request will use a slow fallback; but in doing so, it could not optimize for NBD. The NBD specification is now considering an extension that will allow passing on those semantics; this patch updates the protocol bits and the 'qemu-nbd --list' output to recognize the new bit, as well as the new errno value possible when using the new flag; upcoming patches will then improve the client to use the feature when present, and the server to advertise support for it.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 docs/interop/nbd.txt | 3 ++-
 include/block/nbd.h  | 4 ++++
 nbd/common.c         | 5 +++++
 nbd/server.c         | 2 ++
 qemu-nbd.c           | 1 +
 5 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index 6dfec7f47647..45118809618e 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -53,4 +53,5 @@ the operation of that feature.
 * 2.12: NBD_CMD_BLOCK_STATUS for "base:allocation"
 * 3.0: NBD_OPT_STARTTLS with TLS Pre-Shared Keys (PSK),
 NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
-* 4.2: NBD_FLAG_CAN_MULTI_CONN for sharable read-only exports
+* 4.2: NBD_FLAG_CAN_MULTI_CONN for sharable read-only exports,
+NBD_CMD_FLAG_FAST_ZERO
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 2c87b42dfd48..21550747cf35 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -140,6 +140,7 @@ enum {
     NBD_FLAG_CAN_MULTI_CONN_BIT     =  8, /* Multi-client cache consistent */
     NBD_FLAG_SEND_RESIZE_BIT        =  9, /* Send resize */
     NBD_FLAG_SEND_CACHE_BIT         = 10, /* Send CACHE (prefetch) */
+    NBD_FLAG_SEND_FAST_ZERO_BIT     = 11, /* FAST_ZERO flag for WRITE_ZEROES */
 };

 #define NBD_FLAG_HAS_FLAGS         (1 << NBD_FLAG_HAS_FLAGS_BIT)
@@ -153,6 +154,7 @@ enum {
 #define NBD_FLAG_CAN_MULTI_CONN    (1 << NBD_FLAG_CAN_MULTI_CONN_BIT)
 #define NBD_FLAG_SEND_RESIZE       (1 << NBD_FLAG_SEND_RESIZE_BIT)
 #define NBD_FLAG_SEND_CACHE        (1 << NBD_FLAG_SEND_CACHE_BIT)
+#define NBD_FLAG_SEND_FAST_ZERO    (1 << NBD_FLAG_SEND_FAST_ZERO_BIT)

 /* New-style handshake (global) flags, sent from server to client, and
    control what will happen during handshake phase. */
@@ -205,6 +207,7 @@ enum {
 #define NBD_CMD_FLAG_DF         (1 << 2) /* don't fragment structured read */
 #define NBD_CMD_FLAG_REQ_ONE    (1 << 3) /* only one extent in BLOCK_STATUS
                                           * reply chunk */
+#define NBD_CMD_FLAG_FAST_ZERO  (1 << 4) /* fail if WRITE_ZEROES is not fast */

 /* Supported request types */
 enum {
@@ -270,6 +273,7 @@ static inline bool nbd_reply_type_is_error(int type)
 #define NBD_EINVAL     22
 #define NBD_ENOSPC     28
 #define NBD_EOVERFLOW  75
+#define NBD_ENOTSUP    95
 #define NBD_ESHUTDOWN  108

 /* Details collected by NBD_OPT_EXPORT_NAME and NBD_OPT_GO */
diff --git a/nbd/common.c b/nbd/common.c
index cc8b278e541d..ddfe7d118371 100644
--- a/nbd/common.c
+++ b/nbd/common.c
@@ -201,6 +201,8 @@ const char *nbd_err_lookup(int err)
         return "ENOSPC";
     case NBD_EOVERFLOW:
         return "EOVERFLOW";
+    case NBD_ENOTSUP:
+        return "ENOTSUP";
     case NBD_ESHUTDOWN:
         return "ESHUTDOWN";
     default:
@@ -231,6 +233,9 @@ int nbd_errno_to_system_errno(int err)
     case NBD_EOVERFLOW:
         ret = EOVERFLOW;
         break;
+    case NBD_ENOTSUP:
+        ret = ENOTSUP;
+        break;
     case NBD_ESHUTDOWN:
         ret = ESHUTDOWN;
         break;
diff --git a/nbd/server.c b/nbd/server.c
index b5577828aa44..981bc3cb1151 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -55,6 +55,8 @@ static int system_errno_to_nbd_errno(int err)
         return NBD_ENOSPC;
     case EOVERFLOW:
         return NBD_EOVERFLOW;
+    case ENOTSUP:
+        return NBD_ENOTSUP;
     case ESHUTDOWN:
         return NBD_ESHUTDOWN;
     case EINVAL:
diff --git a/qemu-nbd.c b/qemu-nbd.c
index 079702bb837f..dce52f564b5a 100644
--- a/qemu-nbd.c
+++ b/qemu-nbd.c
@@ -294,6 +294,7 @@ static int qemu_nbd_client_list(SocketAddress *saddr, QCryptoTLSCreds *tls,
         [NBD_FLAG_CAN_MULTI_CONN_BIT]       = "multi",
         [NBD_FLAG_SEND_RESIZE_BIT]          = "resize",
         [NBD_FLAG_SEND_CACHE_BIT]           = "cache",
+        [NBD_FLAG_SEND_FAST_ZERO_BIT]       = "fast-zero",
     };

     printf("  size:  %" PRIu64 "\n", list[i].size);
-- 
2.21.0
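The errno translation this patch extends can be sketched as a table. This is an illustrative Python restatement, not qemu code: the integer keys are the wire-protocol NBD_E* values, mapped onto the host errno constants; collapsing unknown wire values to EINVAL is an assumption of this sketch that mirrors qemu's fallback case.

```python
import errno

NBD_TO_HOST = {
    1: errno.EPERM,     5: errno.EIO,      12: errno.ENOMEM,
    22: errno.EINVAL,   28: errno.ENOSPC,  75: errno.EOVERFLOW,
    95: errno.ENOTSUP,  108: errno.ESHUTDOWN,   # 95 is the new entry
}

def nbd_errno_to_system_errno(err):
    """Translate a wire-protocol NBD error to a host errno value."""
    # Unknown wire values collapse to EINVAL (assumed fallback, as in qemu).
    return NBD_TO_HOST.get(err, errno.EINVAL)
```

Keeping the mapping symmetric on both client and server is what lets a server's ENOTSUP from the block layer surface as ENOTSUP in the client's API.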
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 3/5] nbd: Implement client use of NBD FAST_ZERO
The client side is fairly straightforward: if the server advertised fast zero support, then we can map that to BDRV_REQ_NO_FALLBACK support. A server that advertises FAST_ZERO but not WRITE_ZEROES is technically broken, but we can ignore that situation as it does not change our behavior. Signed-off-by: Eric Blake <eblake@redhat.com> --- Perhaps this is worth merging with the previous patch. --- block/nbd.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index beed46fb3414..8339d7106366 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -1044,6 +1044,10 @@ static int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset, if (!(flags & BDRV_REQ_MAY_UNMAP)) { request.flags |= NBD_CMD_FLAG_NO_HOLE; } + if (flags & BDRV_REQ_NO_FALLBACK) { + assert(s->info.flags & NBD_FLAG_SEND_FAST_ZERO); + request.flags |= NBD_CMD_FLAG_FAST_ZERO; + } if (!bytes) { return 0; @@ -1239,6 +1243,9 @@ static int nbd_client_connect(BlockDriverState *bs, Error **errp) } if (s->info.flags & NBD_FLAG_SEND_WRITE_ZEROES) { bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP; + if (s->info.flags & NBD_FLAG_SEND_FAST_ZERO) { + bs->supported_zero_flags |= BDRV_REQ_NO_FALLBACK; + } } s->sioc = sioc; -- 2.21.0
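[Editorial note] The mapping this patch performs is small enough to restate outside of qemu. Below is an illustrative Python sketch — not qemu code; the BDRV_REQ_* bit values are hypothetical placeholders, while the NBD_* values match the protocol constants added by this series — of how block-layer zero flags become NBD_CMD_WRITE_ZEROES flags:

```python
# Editorial model of the client-side flag mapping (not qemu code).
# BDRV_REQ_* bit positions here are hypothetical; the NBD_* values
# match the protocol constants used in the patch.

BDRV_REQ_MAY_UNMAP = 1 << 0        # hypothetical bit values
BDRV_REQ_NO_FALLBACK = 1 << 1

NBD_CMD_FLAG_NO_HOLE = 1 << 1      # NBD command flags
NBD_CMD_FLAG_FAST_ZERO = 1 << 4

NBD_FLAG_SEND_FAST_ZERO = 1 << 11  # server export flag

def zero_request_flags(bdrv_flags, server_eflags):
    """Translate block-layer zero flags into NBD_CMD_WRITE_ZEROES flags."""
    nbd_flags = 0
    if not (bdrv_flags & BDRV_REQ_MAY_UNMAP):
        nbd_flags |= NBD_CMD_FLAG_NO_HOLE
    if bdrv_flags & BDRV_REQ_NO_FALLBACK:
        # The patch asserts this: BDRV_REQ_NO_FALLBACK only reaches the
        # driver when NBD_FLAG_SEND_FAST_ZERO was negotiated, because
        # that is the only case where it is added to supported_zero_flags.
        assert server_eflags & NBD_FLAG_SEND_FAST_ZERO
        nbd_flags |= NBD_CMD_FLAG_FAST_ZERO
    return nbd_flags
```

The inversion on the first flag is worth noting: MAY_UNMAP is opt-in at the block layer but NO_HOLE is opt-out on the wire.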
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 4/5] nbd: Implement server use of NBD FAST_ZERO
The server side is fairly straightforward: we can always advertise support for detection of fast zero, and implement it by mapping the request to the block layer BDRV_REQ_NO_FALLBACK. Signed-off-by: Eric Blake <eblake@redhat.com> --- Again, this may be worth merging with the previous patch. --- nbd/server.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/nbd/server.c b/nbd/server.c index 981bc3cb1151..3c0799eda87f 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -1514,7 +1514,8 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, exp->nbdflags |= NBD_FLAG_CAN_MULTI_CONN; } } else { - exp->nbdflags |= NBD_FLAG_SEND_TRIM | NBD_FLAG_SEND_WRITE_ZEROES; + exp->nbdflags |= (NBD_FLAG_SEND_TRIM | NBD_FLAG_SEND_WRITE_ZEROES | + NBD_FLAG_SEND_FAST_ZERO); } assert(size <= INT64_MAX - dev_offset); exp->size = QEMU_ALIGN_DOWN(size, BDRV_SECTOR_SIZE); @@ -2167,7 +2168,7 @@ static int nbd_co_receive_request(NBDRequestData *req, NBDRequest *request, if (request->type == NBD_CMD_READ && client->structured_reply) { valid_flags |= NBD_CMD_FLAG_DF; } else if (request->type == NBD_CMD_WRITE_ZEROES) { - valid_flags |= NBD_CMD_FLAG_NO_HOLE; + valid_flags |= NBD_CMD_FLAG_NO_HOLE | NBD_CMD_FLAG_FAST_ZERO; } else if (request->type == NBD_CMD_BLOCK_STATUS) { valid_flags |= NBD_CMD_FLAG_REQ_ONE; } @@ -2306,6 +2307,9 @@ static coroutine_fn int nbd_handle_request(NBDClient *client, if (!(request->flags & NBD_CMD_FLAG_NO_HOLE)) { flags |= BDRV_REQ_MAY_UNMAP; } + if (request->flags & NBD_CMD_FLAG_FAST_ZERO) { + flags |= BDRV_REQ_NO_FALLBACK; + } ret = blk_pwrite_zeroes(exp->blk, request->from + exp->dev_offset, request->len, flags); return nbd_send_generic_reply(client, request->handle, ret, -- 2.21.0
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 5/5] nbd: Tolerate more errors to structured reply request
A server may have a reason to reject a request for structured replies, beyond just not recognizing them as a valid request. It doesn't hurt us to continue talking to such a server; otherwise 'qemu-nbd --list' of such a server fails to display all possible details about the export. Encountered when temporarily tweaking nbdkit to reply with NBD_REP_ERR_POLICY. Present since structured reply support was first added (commit d795299b reused starttls handling, but starttls has to reject all errors). Signed-off-by: Eric Blake <eblake@redhat.com> --- nbd/client.c | 39 +++++++++++++++++++++++---------------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/nbd/client.c b/nbd/client.c index 49bf9906f94b..92444f5e9a62 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -1,5 +1,5 @@ /* - * Copyright (C) 2016-2018 Red Hat, Inc. + * Copyright (C) 2016-2019 Red Hat, Inc. * Copyright (C) 2005 Anthony Liguori <anthony@codemonkey.ws> * * Network Block Device Client Side @@ -142,17 +142,19 @@ static int nbd_receive_option_reply(QIOChannel *ioc, uint32_t opt, return 0; } -/* If reply represents success, return 1 without further action. - * If reply represents an error, consume the optional payload of - * the packet on ioc. Then return 0 for unsupported (so the client - * can fall back to other approaches), or -1 with errp set for other - * errors. +/* + * If reply represents success, return 1 without further action. If + * reply represents an error, consume the optional payload of the + * packet on ioc. Then return 0 for unsupported (so the client can + * fall back to other approaches), where @strict determines if only + * ERR_UNSUP or all errors fit that category, or -1 with errp set for + * other errors. */ static int nbd_handle_reply_err(QIOChannel *ioc, NBDOptionReply *reply, - Error **errp) + bool strict, Error **errp) { char *msg = NULL; - int result = -1; + int result = strict ? 
-1 : 0; if (!(reply->type & (1 << 31))) { return 1; @@ -163,6 +165,7 @@ static int nbd_handle_reply_err(QIOChannel *ioc, NBDOptionReply *reply, error_setg(errp, "server error %" PRIu32 " (%s) message is too long", reply->type, nbd_rep_lookup(reply->type)); + result = -1; goto cleanup; } msg = g_malloc(reply->length + 1); @@ -170,6 +173,7 @@ static int nbd_handle_reply_err(QIOChannel *ioc, NBDOptionReply *reply, error_prepend(errp, "Failed to read option error %" PRIu32 " (%s) message: ", reply->type, nbd_rep_lookup(reply->type)); + result = -1; goto cleanup; } msg[reply->length] = '\0'; @@ -258,7 +262,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, char **description, if (nbd_receive_option_reply(ioc, NBD_OPT_LIST, &reply, errp) < 0) { return -1; } - error = nbd_handle_reply_err(ioc, &reply, errp); + error = nbd_handle_reply_err(ioc, &reply, true, errp); if (error <= 0) { return error; } @@ -371,7 +375,7 @@ static int nbd_opt_info_or_go(QIOChannel *ioc, uint32_t opt, if (nbd_receive_option_reply(ioc, opt, &reply, errp) < 0) { return -1; } - error = nbd_handle_reply_err(ioc, &reply, errp); + error = nbd_handle_reply_err(ioc, &reply, true, errp); if (error <= 0) { return error; } @@ -546,12 +550,15 @@ static int nbd_receive_query_exports(QIOChannel *ioc, } } -/* nbd_request_simple_option: Send an option request, and parse the reply +/* + * nbd_request_simple_option: Send an option request, and parse the reply. + * @strict controls whether ERR_UNSUP or all errors produce 0 status. 
* return 1 for successful negotiation, * 0 if operation is unsupported, * -1 with errp set for any other error */ -static int nbd_request_simple_option(QIOChannel *ioc, int opt, Error **errp) +static int nbd_request_simple_option(QIOChannel *ioc, int opt, bool strict, + Error **errp) { NBDOptionReply reply; int error; @@ -563,7 +570,7 @@ static int nbd_request_simple_option(QIOChannel *ioc, int opt, Error **errp) if (nbd_receive_option_reply(ioc, opt, &reply, errp) < 0) { return -1; } - error = nbd_handle_reply_err(ioc, &reply, errp); + error = nbd_handle_reply_err(ioc, &reply, strict, errp); if (error <= 0) { return error; } @@ -595,7 +602,7 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc, QIOChannelTLS *tioc; struct NBDTLSHandshakeData data = { 0 }; - ret = nbd_request_simple_option(ioc, NBD_OPT_STARTTLS, errp); + ret = nbd_request_simple_option(ioc, NBD_OPT_STARTTLS, true, errp); if (ret <= 0) { if (ret == 0) { error_setg(errp, "Server don't support STARTTLS option"); @@ -695,7 +702,7 @@ static int nbd_receive_one_meta_context(QIOChannel *ioc, return -1; } - ret = nbd_handle_reply_err(ioc, &reply, errp); + ret = nbd_handle_reply_err(ioc, &reply, true, errp); if (ret <= 0) { return ret; } @@ -951,7 +958,7 @@ static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc, if (structured_reply) { result = nbd_request_simple_option(ioc, NBD_OPT_STRUCTURED_REPLY, - errp); + false, errp); if (result < 0) { return -EINVAL; } -- 2.21.0
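[Editorial note] The new @strict parameter changes only what counts as a soft failure. A minimal editorial model (Python, not qemu code) of the three-way return convention:

```python
# Editorial model of nbd_handle_reply_err()'s return convention after
# this patch (not qemu code): 1 = success, 0 = unsupported (caller may
# fall back), -1 = hard error.  @strict controls whether errors other
# than NBD_REP_ERR_UNSUP are hard errors or merely "unsupported".

NBD_REP_ERR_UNSUP = (1 << 31) | 1    # NBD option-reply error types
NBD_REP_ERR_POLICY = (1 << 31) | 2

def handle_reply(reply_type, strict):
    if not (reply_type & (1 << 31)):
        return 1                     # success reply
    if reply_type == NBD_REP_ERR_UNSUP:
        return 0                     # always treated as a soft failure
    return -1 if strict else 0       # other errors: fatal only if strict
```

Because NBD_OPT_STRUCTURED_REPLY negotiation now passes strict=false, a server answering with NBD_REP_ERR_POLICY no longer aborts the connection; starttls keeps strict=true since it must reject all errors.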
Eric Blake
2019-Aug-23 14:38 UTC
[Libguestfs] [libnbd PATCH 0/1] libnbd support for new fast zero
See the cross-post cover letter for details: https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html Eric Blake (1): api: Add support for FAST_ZERO flag lib/nbd-protocol.h | 2 ++ generator/generator | 29 +++++++++++++++++++++++------ lib/flags.c | 12 ++++++++++++ lib/protocol.c | 1 + lib/rw.c | 9 ++++++++- tests/Makefile.am | 20 ++++++++++++++++++++ tests/eflags-plugin.sh | 5 +++++ 7 files changed, 71 insertions(+), 7 deletions(-) -- 2.21.0
Eric Blake
2019-Aug-23 14:38 UTC
[Libguestfs] [libnbd PATCH 1/1] api: Add support for FAST_ZERO flag
Qemu was able to demonstrate that knowing whether a zero operation is fast is useful when copying from one image to another: there is a choice between bulk pre-zeroing and then revisiting the data sections (fewer transactions, but depends on the zeroing to be fast), vs. visiting every portion of the disk only once (more transactions, but no time lost to duplicated I/O due to slow zeroes). As such, the NBD protocol is adding an extension to allow clients to request fast failure when zero is not efficient, from servers that advertise support for the new flag. In libnbd, this results in the addition of a new flag, a new function nbd_can_fast_zero, and the ability to use the new flag in nbd_zero variants. It also enhances the testsuite to test the flag against new enough nbdkit, which is being patched at the same time. Signed-off-by: Eric Blake <eblake@redhat.com> --- lib/nbd-protocol.h | 2 ++ generator/generator | 29 +++++++++++++++++++++++------ lib/flags.c | 12 ++++++++++++ lib/protocol.c | 1 + lib/rw.c | 9 ++++++++- tests/Makefile.am | 20 ++++++++++++++++++++ tests/eflags-plugin.sh | 5 +++++ 7 files changed, 71 insertions(+), 7 deletions(-) diff --git a/lib/nbd-protocol.h b/lib/nbd-protocol.h index 3e3fb4e..04e93d3 100644 --- a/lib/nbd-protocol.h +++ b/lib/nbd-protocol.h @@ -108,6 +108,7 @@ struct nbd_fixed_new_option_reply { #define NBD_FLAG_SEND_DF (1 << 7) #define NBD_FLAG_CAN_MULTI_CONN (1 << 8) #define NBD_FLAG_SEND_CACHE (1 << 10) +#define NBD_FLAG_SEND_FAST_ZERO (1 << 11) /* NBD options (new style handshake only). 
*/ #define NBD_OPT_EXPORT_NAME 1 @@ -250,6 +251,7 @@ struct nbd_structured_reply_error { #define NBD_EINVAL 22 #define NBD_ENOSPC 28 #define NBD_EOVERFLOW 75 +#define NBD_ENOTSUP 95 #define NBD_ESHUTDOWN 108 #endif /* NBD_PROTOCOL_H */ diff --git a/generator/generator b/generator/generator index c509573..9b1f5d8 100755 --- a/generator/generator +++ b/generator/generator @@ -958,10 +958,11 @@ let all_enums = [ tls_enum ] let cmd_flags = { flag_prefix = "CMD_FLAG"; flags = [ - "FUA", 1 lsl 0; - "NO_HOLE", 1 lsl 1; - "DF", 1 lsl 2; - "REQ_ONE", 1 lsl 3; + "FUA", 1 lsl 0; + "NO_HOLE", 1 lsl 1; + "DF", 1 lsl 2; + "REQ_ONE", 1 lsl 3; + "FAST_ZERO", 1 lsl 4; ] } let all_flags = [ cmd_flags ] @@ -1389,6 +1390,19 @@ the server does not." ^ non_blocking_test_call_description; }; + "can_fast_zero", { + default_call with + args = []; ret = RBool; + permitted_states = [ Connected; Closed ]; + shortdesc = "does the server support the fast zero flag?"; + longdesc = "\ +Returns true if the server supports the use of the +C<LIBNBD_CMD_FLAG_FAST_ZERO> flag to the zero command +(see C<nbd_zero>, C<nbd_aio_zero>). Returns false if +the server does not." 
+^ non_blocking_test_call_description; + }; + "can_df", { default_call with args = []; ret = RBool; @@ -1668,9 +1682,12 @@ The C<flags> parameter may be C<0> for no flags, or may contain C<LIBNBD_CMD_FLAG_FUA> meaning that the server should not return until the data has been committed to permanent storage (if that is supported - some servers cannot do this, see -L<nbd_can_fua(3)>), and/or C<LIBNBD_CMD_FLAG_NO_HOLE> meaning that +L<nbd_can_fua(3)>), C<LIBNBD_CMD_FLAG_NO_HOLE> meaning that the server should favor writing actual allocated zeroes over -punching a hole."; +punching a hole, and/or C<LIBNBD_CMD_FLAG_FAST_ZERO> meaning +that the server must fail quickly if writing zeroes is no +faster than a normal write (if that is supported - some servers +cannot do this, see L<nbd_can_fast_zero(3)>)."; }; "block_status", { diff --git a/lib/flags.c b/lib/flags.c index 2bcacb8..d55d10a 100644 --- a/lib/flags.c +++ b/lib/flags.c @@ -47,6 +47,12 @@ nbd_internal_set_size_and_flags (struct nbd_handle *h, eflags &= ~NBD_FLAG_SEND_DF; } + if (eflags & NBD_FLAG_SEND_FAST_ZERO && + !(eflags & NBD_FLAG_SEND_WRITE_ZEROES)) { + debug (h, "server lacks write zeroes, ignoring claim of fast zero"); + eflags &= ~NBD_FLAG_SEND_FAST_ZERO; + } + h->exportsize = exportsize; h->eflags = eflags; return 0; @@ -100,6 +106,12 @@ nbd_unlocked_can_zero (struct nbd_handle *h) return get_flag (h, NBD_FLAG_SEND_WRITE_ZEROES); } +int +nbd_unlocked_can_fast_zero (struct nbd_handle *h) +{ + return get_flag (h, NBD_FLAG_SEND_FAST_ZERO); +} + int nbd_unlocked_can_df (struct nbd_handle *h) { diff --git a/lib/protocol.c b/lib/protocol.c index 6087887..acee203 100644 --- a/lib/protocol.c +++ b/lib/protocol.c @@ -36,6 +36,7 @@ nbd_internal_errno_of_nbd_error (uint32_t error) case NBD_EINVAL: return EINVAL; case NBD_ENOSPC: return ENOSPC; case NBD_EOVERFLOW: return EOVERFLOW; + case NBD_ENOTSUP: return ENOTSUP; case NBD_ESHUTDOWN: return ESHUTDOWN; default: return EINVAL; } diff --git a/lib/rw.c b/lib/rw.c 
index d427681..adb6bc2 100644 --- a/lib/rw.c +++ b/lib/rw.c @@ -426,7 +426,8 @@ nbd_unlocked_aio_zero (struct nbd_handle *h, return -1; } - if ((flags & ~(LIBNBD_CMD_FLAG_FUA | LIBNBD_CMD_FLAG_NO_HOLE)) != 0) { + if ((flags & ~(LIBNBD_CMD_FLAG_FUA | LIBNBD_CMD_FLAG_NO_HOLE | + LIBNBD_CMD_FLAG_FAST_ZERO)) != 0) { set_error (EINVAL, "invalid flag: %" PRIu32, flags); return -1; } @@ -437,6 +438,12 @@ nbd_unlocked_aio_zero (struct nbd_handle *h, return -1; } + if ((flags & LIBNBD_CMD_FLAG_FAST_ZERO) != 0 && + nbd_unlocked_can_fast_zero (h) != 1) { + set_error (EINVAL, "server does not support the fast zero flag"); + return -1; + } + if (count == 0) { /* NBD protocol forbids this. */ set_error (EINVAL, "count cannot be 0"); return -1; diff --git a/tests/Makefile.am b/tests/Makefile.am index 2b4ea93..a7bc1b5 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -145,6 +145,8 @@ check_PROGRAMS += \ can-not-trim-flag \ can-zero-flag \ can-not-zero-flag \ + can-fast-zero-flag \ + can-not-fast-zero-flag \ can-multi-conn-flag \ can-not-multi-conn-flag \ can-cache-flag \ @@ -177,6 +179,8 @@ TESTS += \ can-not-trim-flag \ can-zero-flag \ can-not-zero-flag \ + can-fast-zero-flag \ + can-not-fast-zero-flag \ can-multi-conn-flag \ can-not-multi-conn-flag \ can-cache-flag \ @@ -289,6 +293,22 @@ can_not_zero_flag_CPPFLAGS = \ can_not_zero_flag_CFLAGS = $(WARNINGS_CFLAGS) can_not_zero_flag_LDADD = $(top_builddir)/lib/libnbd.la +can_fast_zero_flag_SOURCES = eflags.c +can_fast_zero_flag_CPPFLAGS = \ + -I$(top_srcdir)/include -Dflag=can_fast_zero \ + -Drequire='"has_can_fast_zero=1"' \ + $(NULL) +can_fast_zero_flag_CFLAGS = $(WARNINGS_CFLAGS) +can_fast_zero_flag_LDADD = $(top_builddir)/lib/libnbd.la + +can_not_fast_zero_flag_SOURCES = eflags.c +can_not_fast_zero_flag_CPPFLAGS = \ + -I$(top_srcdir)/include -Dflag=can_fast_zero -Dvalue=false \ + -Drequire='"has_can_fast_zero=1"' \ + $(NULL) +can_not_fast_zero_flag_CFLAGS = $(WARNINGS_CFLAGS) +can_not_fast_zero_flag_LDADD = 
$(top_builddir)/lib/libnbd.la + can_multi_conn_flag_SOURCES = eflags.c can_multi_conn_flag_CPPFLAGS = \ -I$(top_srcdir)/include -Dflag=can_multi_conn \ diff --git a/tests/eflags-plugin.sh b/tests/eflags-plugin.sh index 8fccff8..3c4cc65 100755 --- a/tests/eflags-plugin.sh +++ b/tests/eflags-plugin.sh @@ -52,6 +52,11 @@ case "$1" in # be read-only and methods like can_trim will never be called. exit 0 ;; + can_zero) + # We have to default to answering this with true before + # can_fast_zero has an effect. + exit 0 + ;; *) exit 2 ;; -- 2.21.0
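[Editorial note] The trade-off described in the commit message can be illustrated with a self-contained cost model. The unit costs below are hypothetical (chosen to mirror the delay-filter setup in the cover letter, not measurements):

```python
# Toy cost model (hypothetical numbers, not benchmarks) of the two copy
# strategies from the commit message.  Costs are arbitrary time units:
# writes are expensive, fast zeroes nearly free, emulated zeroes as
# slow as writes.

def copy_cost(data_blocks, hole_blocks, zero_is_fast):
    write_cost = 20
    zero_cost = 1 if zero_is_fast else 20

    # Strategy A: bulk pre-zero the whole image (one cheap request if
    # zeroing is fast, per-block work if emulated), then revisit only
    # the data sections.
    bulk_zero = zero_cost if zero_is_fast else \
        (data_blocks + hole_blocks) * zero_cost
    prezero = bulk_zero + data_blocks * write_cost

    # Strategy B: visit every block exactly once.
    one_pass = data_blocks * write_cost + hole_blocks * zero_cost

    return prezero, one_pass
```

With fast zeroes, the pre-zero pass wins (one cheap request covers every hole); with emulated zeroes it loses, because the data sections are zeroed and then rewritten. That asymmetry is exactly why a client wants to probe with the FAST_ZERO flag before committing to a strategy.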
Eric Blake
2019-Aug-23 14:40 UTC
[Libguestfs] [nbdkit PATCH 0/3] nbdkit support for new NBD fast zero
See the cross-post cover letter for details: https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html Notably, this series did NOT add fast zero support to the file plugin yet; there, I probably need to do more testing and/or kernel source code reading to learn whether to mark fallocate() as potentially slow, as well as to mark ioctl(BLKZEROOUT) as definitely slow. That will be a follow-up patch. Eric Blake (3): server: Add internal support for NBDKIT_FLAG_FAST_ZERO filters: Add .can_fast_zero hook plugins: Add .can_fast_zero hook docs/nbdkit-filter.pod | 27 ++++-- docs/nbdkit-plugin.pod | 74 +++++++++++--- docs/nbdkit-protocol.pod | 11 +++ filters/delay/nbdkit-delay-filter.pod | 15 ++- filters/log/nbdkit-log-filter.pod | 2 +- filters/nozero/nbdkit-nozero-filter.pod | 41 ++++++-- plugins/sh/nbdkit-sh-plugin.pod | 13 ++- server/internal.h | 2 + common/protocol/protocol.h | 11 ++- server/filters.c | 33 ++++++- server/plugins.c | 47 ++++++++- server/protocol-handshake.c | 7 ++ server/protocol.c | 46 +++++++-- include/nbdkit-common.h | 7 +- include/nbdkit-filter.h | 3 + include/nbdkit-plugin.h | 2 + filters/blocksize/blocksize.c | 12 +++ filters/cache/cache.c | 20 ++++ filters/cow/cow.c | 20 ++++ filters/delay/delay.c | 28 +++++- filters/log/log.c | 16 ++-- filters/nozero/nozero.c | 62 +++++++++++- filters/truncate/truncate.c | 15 +++ plugins/data/data.c | 14 ++- plugins/full/full.c | 12 +-- plugins/memory/memory.c | 14 ++- plugins/nbd/nbd.c | 28 +++++- plugins/null/null.c | 8 ++ plugins/ocaml/ocaml.c | 26 ++++- plugins/sh/call.c | 1 - plugins/sh/sh.c | 39 ++++++-- plugins/ocaml/NBDKit.ml | 10 +- plugins/ocaml/NBDKit.mli | 2 + plugins/rust/src/lib.rs | 3 + tests/test-eflags.sh | 122 +++++++++++++++++++++--- tests/test-fua.sh | 4 +- 36 files changed, 686 insertions(+), 111 deletions(-) -- 2.21.0
Eric Blake
2019-Aug-23 14:40 UTC
[Libguestfs] [nbdkit PATCH 1/3] server: Add internal support for NBDKIT_FLAG_FAST_ZERO
Qemu was able to demonstrate that knowing whether a zero operation is fast is useful when copying from one image to another: there is a choice between bulk pre-zeroing and then revisiting the data sections (fewer transactions, but depends on the zeroing to be fast), vs. visiting every portion of the disk only once (more transactions, but no time lost to duplicated I/O due to slow zeroes). As such, the NBD protocol is adding an extension to allow clients to request fast failure when zero is not efficient, from servers that advertise support for the new flag. In nbdkit, start by plumbing a new .can_fast_zero through the backends (although it stays 0 in this patch, later patches will actually expose it to plugins and filters to override). Advertise the flag to the client when the plugin provides a .can_fast_zero, and wire in passing the flag down to .zero in the plugin. In turn, if the flag is set and the implementation .zero fails with ENOTSUP/EOPNOTSUPP, we no longer attempt the .pwrite fallback. Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/nbdkit-protocol.pod | 11 +++++++++ server/internal.h | 2 ++ common/protocol/protocol.h | 11 +++++---- server/filters.c | 20 +++++++++++++--- server/plugins.c | 28 +++++++++++++++++++--- server/protocol-handshake.c | 7 ++++++ server/protocol.c | 46 +++++++++++++++++++++++++++++-------- include/nbdkit-common.h | 7 +++--- plugins/ocaml/ocaml.c | 1 - plugins/sh/call.c | 1 - 10 files changed, 110 insertions(+), 24 deletions(-) diff --git a/docs/nbdkit-protocol.pod b/docs/nbdkit-protocol.pod index 35db07b3..76c50bb8 100644 --- a/docs/nbdkit-protocol.pod +++ b/docs/nbdkit-protocol.pod @@ -173,6 +173,17 @@ This protocol extension allows a client to inform the server about intent to access a portion of the export, to allow the server an opportunity to cache things appropriately. +=item C<NBD_CMD_FLAG_FAST_ZERO> + +Supported in nbdkit E<ge> 1.13.9. 
+ +This protocol extension allows a server to advertise that it can rank +all zero requests as fast or slow, at which point the client can make +fast zero requests which fail immediately with C<ENOTSUP> if the +request is no faster than a counterpart write would be, while normal +zero requests still benefit from compressed network traffic regardless +of the time taken. + =item Resize Extension I<Not supported>. diff --git a/server/internal.h b/server/internal.h index 22e13b6d..784c8b5c 100644 --- a/server/internal.h +++ b/server/internal.h @@ -168,6 +168,7 @@ struct connection { bool is_rotational; bool can_trim; bool can_zero; + bool can_fast_zero; bool can_fua; bool can_multi_conn; bool can_cache; @@ -275,6 +276,7 @@ struct backend { int (*is_rotational) (struct backend *, struct connection *conn); int (*can_trim) (struct backend *, struct connection *conn); int (*can_zero) (struct backend *, struct connection *conn); + int (*can_fast_zero) (struct backend *, struct connection *conn); int (*can_extents) (struct backend *, struct connection *conn); int (*can_fua) (struct backend *, struct connection *conn); int (*can_multi_conn) (struct backend *, struct connection *conn); diff --git a/common/protocol/protocol.h b/common/protocol/protocol.h index e9386430..bf548390 100644 --- a/common/protocol/protocol.h +++ b/common/protocol/protocol.h @@ -96,6 +96,7 @@ extern const char *name_of_nbd_flag (int); #define NBD_FLAG_SEND_DF (1 << 7) #define NBD_FLAG_CAN_MULTI_CONN (1 << 8) #define NBD_FLAG_SEND_CACHE (1 << 10) +#define NBD_FLAG_SEND_FAST_ZERO (1 << 11) /* NBD options (new style handshake only). 
*/ extern const char *name_of_nbd_opt (int); @@ -223,10 +224,11 @@ extern const char *name_of_nbd_cmd (int); #define NBD_CMD_BLOCK_STATUS 7 extern const char *name_of_nbd_cmd_flag (int); -#define NBD_CMD_FLAG_FUA (1<<0) -#define NBD_CMD_FLAG_NO_HOLE (1<<1) -#define NBD_CMD_FLAG_DF (1<<2) -#define NBD_CMD_FLAG_REQ_ONE (1<<3) +#define NBD_CMD_FLAG_FUA (1<<0) +#define NBD_CMD_FLAG_NO_HOLE (1<<1) +#define NBD_CMD_FLAG_DF (1<<2) +#define NBD_CMD_FLAG_REQ_ONE (1<<3) +#define NBD_CMD_FLAG_FAST_ZERO (1<<4) /* Error codes (previously errno). * See http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ca4414804114fd0095b317785bc0b51862e62ebb @@ -239,6 +241,7 @@ extern const char *name_of_nbd_error (int); #define NBD_EINVAL 22 #define NBD_ENOSPC 28 #define NBD_EOVERFLOW 75 +#define NBD_ENOTSUP 95 #define NBD_ESHUTDOWN 108 #endif /* NBDKIT_PROTOCOL_H */ diff --git a/server/filters.c b/server/filters.c index 14ca0cc6..0dd2393e 100644 --- a/server/filters.c +++ b/server/filters.c @@ -403,8 +403,11 @@ next_zero (void *nxdata, uint32_t count, uint64_t offset, uint32_t flags, int r; r = b_conn->b->zero (b_conn->b, b_conn->conn, count, offset, flags, err); - if (r == -1) - assert (*err && *err != ENOTSUP && *err != EOPNOTSUPP); + if (r == -1) { + assert (*err); + if (!(flags & NBDKIT_FLAG_FAST_ZERO)) + assert (*err != ENOTSUP && *err != EOPNOTSUPP); + } return r; } @@ -586,6 +589,15 @@ filter_can_zero (struct backend *b, struct connection *conn) return f->backend.next->can_zero (f->backend.next, conn); } +static int +filter_can_fast_zero (struct backend *b, struct connection *conn) +{ + struct backend_filter *f = container_of (b, struct backend_filter, backend); + + debug ("%s: can_fast_zero", f->name); + return 0; /* Next patch will query or pass through */ +} + static int filter_can_extents (struct backend *b, struct connection *conn) { @@ -738,7 +750,8 @@ filter_zero (struct backend *b, struct connection *conn, void *handle = connection_get_handle (conn, f->backend.i); struct b_conn 
nxdata = { .b = f->backend.next, .conn = conn }; - assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA))); + assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA | + NBDKIT_FLAG_FAST_ZERO))); debug ("%s: zero count=%" PRIu32 " offset=%" PRIu64 " flags=0x%" PRIx32, f->name, count, offset, flags); @@ -818,6 +831,7 @@ static struct backend filter_functions = { .is_rotational = filter_is_rotational, .can_trim = filter_can_trim, .can_zero = filter_can_zero, + .can_fast_zero = filter_can_fast_zero, .can_extents = filter_can_extents, .can_fua = filter_can_fua, .can_multi_conn = filter_can_multi_conn, diff --git a/server/plugins.c b/server/plugins.c index 34cc3cb5..c6dcf408 100644 --- a/server/plugins.c +++ b/server/plugins.c @@ -401,6 +401,16 @@ plugin_can_zero (struct backend *b, struct connection *conn) return plugin_can_write (b, conn); } +static int +plugin_can_fast_zero (struct backend *b, struct connection *conn) +{ + assert (connection_get_handle (conn, 0)); + + debug ("can_fast_zero"); + + return 0; /* Upcoming patch will actually add support. */ +} + static int plugin_can_extents (struct backend *b, struct connection *conn) { @@ -621,15 +631,18 @@ plugin_zero (struct backend *b, struct connection *conn, int r = -1; bool may_trim = flags & NBDKIT_FLAG_MAY_TRIM; bool fua = flags & NBDKIT_FLAG_FUA; + bool fast_zero = flags & NBDKIT_FLAG_FAST_ZERO; bool emulate = false; bool need_flush = false; int can_zero = 1; /* TODO cache this per-connection? 
*/ assert (connection_get_handle (conn, 0)); - assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA))); + assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA | + NBDKIT_FLAG_FAST_ZERO))); - debug ("zero count=%" PRIu32 " offset=%" PRIu64 " may_trim=%d fua=%d", - count, offset, may_trim, fua); + debug ("zero count=%" PRIu32 " offset=%" PRIu64 + " may_trim=%d fua=%d fast_zero=%d", + count, offset, may_trim, fua, fast_zero); if (fua && plugin_can_fua (b, conn) != NBDKIT_FUA_NATIVE) { flags &= ~NBDKIT_FLAG_FUA; @@ -643,6 +656,8 @@ plugin_zero (struct backend *b, struct connection *conn, } if (can_zero) { + /* if (!can_fast_zero) */ + flags &= ~NBDKIT_FLAG_FAST_ZERO; errno = 0; if (p->plugin.zero) r = p->plugin.zero (connection_get_handle (conn, 0), count, offset, @@ -658,6 +673,12 @@ plugin_zero (struct backend *b, struct connection *conn, goto done; } + if (fast_zero) { + assert (r == -1); + *err = EOPNOTSUPP; + goto done; + } + assert (p->plugin.pwrite || p->plugin._pwrite_old); flags &= ~NBDKIT_FLAG_MAY_TRIM; threadlocal_set_error (0); @@ -762,6 +783,7 @@ static struct backend plugin_functions = { .is_rotational = plugin_is_rotational, .can_trim = plugin_can_trim, .can_zero = plugin_can_zero, + .can_fast_zero = plugin_can_fast_zero, .can_extents = plugin_can_extents, .can_fua = plugin_can_fua, .can_multi_conn = plugin_can_multi_conn, diff --git a/server/protocol-handshake.c b/server/protocol-handshake.c index 0f3bd280..84fcacfd 100644 --- a/server/protocol-handshake.c +++ b/server/protocol-handshake.c @@ -66,6 +66,13 @@ protocol_compute_eflags (struct connection *conn, uint16_t *flags) if (fl) { eflags |= NBD_FLAG_SEND_WRITE_ZEROES; conn->can_zero = true; + fl = backend->can_fast_zero (backend, conn); + if (fl == -1) + return -1; + if (fl) { + eflags |= NBD_FLAG_SEND_FAST_ZERO; + conn->can_fast_zero = true; + } } fl = backend->can_trim (backend, conn); diff --git a/server/protocol.c b/server/protocol.c index 72f6f2a8..77a045dc 100644 --- 
a/server/protocol.c +++ b/server/protocol.c @@ -110,7 +110,8 @@ validate_request (struct connection *conn, /* Validate flags */ if (flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE | - NBD_CMD_FLAG_DF | NBD_CMD_FLAG_REQ_ONE)) { + NBD_CMD_FLAG_DF | NBD_CMD_FLAG_REQ_ONE | + NBD_CMD_FLAG_FAST_ZERO)) { nbdkit_error ("invalid request: unknown flag (0x%x)", flags); *error = EINVAL; return false; @@ -121,6 +122,21 @@ validate_request (struct connection *conn, *error = EINVAL; return false; } + if (flags & NBD_CMD_FLAG_FAST_ZERO) { + if (cmd != NBD_CMD_WRITE_ZEROES) { + nbdkit_error ("invalid request: " + "FAST_ZERO flag needs WRITE_ZEROES request"); + *error = EINVAL; + return false; + } + if (!conn->can_fast_zero) { + nbdkit_error ("invalid request: " + "%s: fast zero support was not advertised", + name_of_nbd_cmd (cmd)); + *error = EINVAL; + return false; + } + } if (flags & NBD_CMD_FLAG_DF) { if (cmd != NBD_CMD_READ) { nbdkit_error ("invalid request: DF flag needs READ request"); @@ -297,8 +313,12 @@ handle_request (struct connection *conn, f |= NBDKIT_FLAG_MAY_TRIM; if (fua) f |= NBDKIT_FLAG_FUA; + if (flags & NBD_CMD_FLAG_FAST_ZERO) + f |= NBDKIT_FLAG_FAST_ZERO; if (backend->zero (backend, conn, count, offset, f, &err) == -1) { - assert (err && err != ENOTSUP && err != EOPNOTSUPP); + assert (err); + if (!(flags & NBD_CMD_FLAG_FAST_ZERO)) + assert (err != ENOTSUP && err != EOPNOTSUPP); return err; } break; @@ -368,7 +388,7 @@ skip_over_write_buffer (int sock, size_t count) /* Convert a system errno to an NBD_E* error code. 
*/ static int -nbd_errno (int error, bool flag_df) +nbd_errno (int error, uint16_t flags) { switch (error) { case 0: @@ -390,10 +410,17 @@ nbd_errno (int error, bool flag_df) case ESHUTDOWN: return NBD_ESHUTDOWN; #endif + case ENOTSUP: +#if ENOTSUP != EOPNOTSUPP + case EOPNOTSUPP: +#endif + if (flags & NBD_CMD_FLAG_FAST_ZERO) + return NBD_ENOTSUP; + return NBD_EINVAL; case EOVERFLOW: - if (flag_df) + if (flags & NBD_CMD_FLAG_DF) return NBD_EOVERFLOW; - /* fallthrough */ + return NBD_EINVAL; case EINVAL: default: return NBD_EINVAL; @@ -402,7 +429,7 @@ nbd_errno (int error, bool flag_df) static int send_simple_reply (struct connection *conn, - uint64_t handle, uint16_t cmd, + uint64_t handle, uint16_t cmd, uint16_t flags, const char *buf, uint32_t count, uint32_t error) { @@ -413,7 +440,7 @@ send_simple_reply (struct connection *conn, reply.magic = htobe32 (NBD_SIMPLE_REPLY_MAGIC); reply.handle = handle; - reply.error = htobe32 (nbd_errno (error, false)); + reply.error = htobe32 (nbd_errno (error, flags)); r = conn->send (conn, &reply, sizeof reply, f); if (r == -1) { @@ -640,7 +667,7 @@ send_structured_reply_error (struct connection *conn, } /* Send the error. 
*/ - error_data.error = htobe32 (nbd_errno (error, flags & NBD_CMD_FLAG_DF)); + error_data.error = htobe32 (nbd_errno (error, flags)); error_data.len = htobe16 (0); r = conn->send (conn, &error_data, sizeof error_data, 0); if (r == -1) { @@ -790,5 +817,6 @@ protocol_recv_request_send_reply (struct connection *conn) error); } else - return send_simple_reply (conn, request.handle, cmd, buf, count, error); + return send_simple_reply (conn, request.handle, cmd, flags, buf, count, + error); } diff --git a/include/nbdkit-common.h b/include/nbdkit-common.h index e004aa18..f33ce133 100644 --- a/include/nbdkit-common.h +++ b/include/nbdkit-common.h @@ -57,9 +57,10 @@ extern "C" { #define NBDKIT_THREAD_MODEL_SERIALIZE_REQUESTS 2 #define NBDKIT_THREAD_MODEL_PARALLEL 3 -#define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */ -#define NBDKIT_FLAG_FUA (1<<1) /* Maps to NBD_CMD_FLAG_FUA */ -#define NBDKIT_FLAG_REQ_ONE (1<<2) /* Maps to NBD_CMD_FLAG_REQ_ONE */ +#define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */ +#define NBDKIT_FLAG_FUA (1<<1) /* Maps to NBD_CMD_FLAG_FUA */ +#define NBDKIT_FLAG_REQ_ONE (1<<2) /* Maps to NBD_CMD_FLAG_REQ_ONE */ +#define NBDKIT_FLAG_FAST_ZERO (1<<3) /* Maps to NBD_CMD_FLAG_FAST_ZERO */ #define NBDKIT_FUA_NONE 0 #define NBDKIT_FUA_EMULATE 1 diff --git a/plugins/ocaml/ocaml.c b/plugins/ocaml/ocaml.c index c472281f..144a449e 100644 --- a/plugins/ocaml/ocaml.c +++ b/plugins/ocaml/ocaml.c @@ -857,7 +857,6 @@ ocaml_nbdkit_set_error (value nv) case 5: err = ENOSPC; break; case 6: err = ESHUTDOWN; break; case 7: err = EOVERFLOW; break; - /* Necessary for .zero support */ case 8: err = EOPNOTSUPP; break; /* Other errno values that server/protocol.c treats specially */ case 9: err = EROFS; break; diff --git a/plugins/sh/call.c b/plugins/sh/call.c index b86e7c9c..ab43e5ea 100644 --- a/plugins/sh/call.c +++ b/plugins/sh/call.c @@ -347,7 +347,6 @@ handle_script_error (char *ebuf, size_t len) err = ESHUTDOWN; skip = 9; } - /* 
Necessary for .zero support */ else if (strncasecmp (ebuf, "ENOTSUP", 7) == 0) { err = ENOTSUP; skip = 7; -- 2.21.0
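[Editorial note] The nbd_errno() change is subtle enough to deserve a standalone restatement. This editorial Python model (not nbdkit code) covers only the cases the patch touches; the real function handles additional errno values not shown here:

```python
# Editorial model of the errno mapping the patch adds: ENOTSUP is only
# a valid wire error for a request that actually set
# NBD_CMD_FLAG_FAST_ZERO, and EOVERFLOW is only valid for a
# NBD_CMD_FLAG_DF read; everything else degrades to NBD_EINVAL.

import errno

NBD_CMD_FLAG_DF = 1 << 2
NBD_CMD_FLAG_FAST_ZERO = 1 << 4

NBD_EINVAL = 22
NBD_EOVERFLOW = 75
NBD_ENOTSUP = 95

def nbd_errno(err, flags):
    """Map a system errno to an NBD_E* wire code, given the request flags."""
    if err in (errno.ENOTSUP, errno.EOPNOTSUPP):
        return NBD_ENOTSUP if flags & NBD_CMD_FLAG_FAST_ZERO else NBD_EINVAL
    if err == errno.EOVERFLOW:
        return NBD_EOVERFLOW if flags & NBD_CMD_FLAG_DF else NBD_EINVAL
    return NBD_EINVAL
```

This is why the signature changed from a bool flag_df to the full flags word: two different flags now gate which extended error codes are legal on the wire.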
Eric Blake
2019-Aug-23 14:40 UTC
[Libguestfs] [nbdkit PATCH 2/3] filters: Add .can_fast_zero hook
Allow filters to affect the handling of the new NBD_CMD_FLAG_FAST_ZERO flag, then update affected filters. In particular, the log filter logs the state of the new flag (requires a tweak to expected output in test-fua.sh), the delay filter gains a bool parameter delay-fast-zero, several filters reject all fast zero requests because of local writes or because they split a single client request into multiple plugin requests, and the nozero filter gains additional modes for controlling fast zero advertisement and support:

            zeromode→   none  emulate  notrim  plugin
  ↓fastzeromode (new)
  ---------------------------------------------------
  default                0      2        4       4
  none                   0      1        1       1
  slow                   0      2        2       2
  ignore                 0      3        3       3

  0 - no zero advertised, thus no fast zero advertised
  1 - fast zero not advertised
  2 - fast zero advertised, but fast zero requests fail with ENOTSUP
      (i.e. a fast zero was not possible)
  3 - fast zero advertised, but fast zero requests are treated the same
      as normal requests (ignoring the fast zero flag, which aids testing
      at the probable cost of spec non-compliance)
  4 - fast zero advertisement/reaction is up to the plugin

Mode none/default remains the default for back-compat, and mode plugin/default has no semantic change compared to omitting the nozero filter from the command line. Filters untouched by this patch are fine inheriting whatever fast-zero behavior the underlying plugin uses. 
Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/nbdkit-filter.pod | 27 +++++++---- filters/delay/nbdkit-delay-filter.pod | 15 +++++- filters/log/nbdkit-log-filter.pod | 2 +- filters/nozero/nbdkit-nozero-filter.pod | 41 +++++++++++++--- server/filters.c | 15 +++++- include/nbdkit-filter.h | 3 ++ filters/blocksize/blocksize.c | 12 +++++ filters/cache/cache.c | 20 ++++++++ filters/cow/cow.c | 20 ++++++++ filters/delay/delay.c | 28 ++++++++++- filters/log/log.c | 16 ++++--- filters/nozero/nozero.c | 62 +++++++++++++++++++++++-- filters/truncate/truncate.c | 15 ++++++ tests/test-fua.sh | 4 +- 14 files changed, 248 insertions(+), 32 deletions(-) diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod index 6e2bea61..ebce8961 100644 --- a/docs/nbdkit-filter.pod +++ b/docs/nbdkit-filter.pod @@ -361,6 +361,8 @@ calls. =head2 C<.can_zero> +=head2 C<.can_fast_zero> + =head2 C<.can_extents> =head2 C<.can_fua> @@ -380,6 +382,8 @@ calls. void *handle); int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle); + int (*can_fast_zero) (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle); int (*can_extents) (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle); int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata, @@ -517,22 +521,25 @@ turn, the filter should not call C<next_ops-E<gt>zero> if C<next_ops-E<gt>can_zero> did not return true. On input, the parameter C<flags> may include C<NBDKIT_FLAG_MAY_TRIM> -unconditionally, and C<NBDKIT_FLAG_FUA> based on the result of -C<.can_fua>. In turn, the filter may pass C<NBDKIT_FLAG_MAY_TRIM> -unconditionally, but should only pass C<NBDKIT_FLAG_FUA> on to -C<next_ops-E<gt>zero> if C<next_ops-E<gt>can_fua> returned a positive -value. +unconditionally, C<NBDKIT_FLAG_FUA> based on the result of +C<.can_fua>, and C<NBDKIT_FLAG_FAST_ZERO> based on the result of +C<.can_fast_zero>. 
In turn, the filter may pass +C<NBDKIT_FLAG_MAY_TRIM> unconditionally, but should only pass +C<NBDKIT_FLAG_FUA> or C<NBDKIT_FLAG_FAST_ZERO> on to +C<next_ops-E<gt>zero> if the corresponding C<next_ops-E<gt>can_fua> or +C<next_ops-E<gt>can_fast_zero> returned a positive value. Note that unlike the plugin C<.zero> which is permitted to fail with C<ENOTSUP> or C<EOPNOTSUPP> to force a fallback to C<.pwrite>, the -function C<next_ops-E<gt>zero> will never fail with C<err> set to -C<ENOTSUP> or C<EOPNOTSUPP> because the fallback has already taken -place. +function C<next_ops-E<gt>zero> will not fail with C<err> set to +C<ENOTSUP> or C<EOPNOTSUPP> unless C<NBDKIT_FLAG_FAST_ZERO> was used, +because otherwise the fallback has already taken place. If there is an error, C<.zero> should call C<nbdkit_error> with an error message B<and> return -1 with C<err> set to the positive errno -value to return to the client. The filter should never fail with -C<ENOTSUP> or C<EOPNOTSUPP> (while plugins have automatic fallback to +value to return to the client. The filter should not fail with +C<ENOTSUP> or C<EOPNOTSUPP> unless C<flags> includes +C<NBDKIT_FLAG_FAST_ZERO> (while plugins have automatic fallback to C<.pwrite>, filters do not). =head2 C<.extents> diff --git a/filters/delay/nbdkit-delay-filter.pod b/filters/delay/nbdkit-delay-filter.pod index 730cea4c..0a9c77f7 100644 --- a/filters/delay/nbdkit-delay-filter.pod +++ b/filters/delay/nbdkit-delay-filter.pod @@ -12,6 +12,7 @@ nbdkit-delay-filter - nbdkit delay filter delay-read=(SECS|NNms) delay-write=(SECS|NNms) delay-zero=(SECS|NNms) delay-trim=(SECS|NNms) delay-extents=(SECS|NNms) delay-cache=(SECS|NNms) + delay-fast-zero=BOOL =head1 DESCRIPTION @@ -56,7 +57,8 @@ Delay write operations by C<SECS> seconds or C<NN> milliseconds. =item B<delay-zero=>NNB<ms> -Delay zero operations by C<SECS> seconds or C<NN> milliseconds. +Delay zero operations by C<SECS> seconds or C<NN> milliseconds. See +also B<delay-fast-zero>. 
=item B<delay-trim=>SECS @@ -85,6 +87,17 @@ milliseconds. Delay write, zero and trim operations by C<SECS> seconds or C<NN> milliseconds. +=item B<delay-fast-zero=>BOOL + +The NBD specification documents an extension called fast zero, in +which the client may request that a server should reply with +C<ENOTSUP> as soon as possible if the zero operation offers no real +speedup over a corresponding write. By default, this parameter is +true, and fast zero requests are serviced by the plugin after the same +delay as any other zero request; but setting this parameter to false +instantly fails a fast zero response without waiting for or consulting +the plugin. + =back =head1 SEE ALSO diff --git a/filters/log/nbdkit-log-filter.pod b/filters/log/nbdkit-log-filter.pod index 9e102bc0..5d9625a1 100644 --- a/filters/log/nbdkit-log-filter.pod +++ b/filters/log/nbdkit-log-filter.pod @@ -55,7 +55,7 @@ on the connection). An example logging session of a client that performs a single successful read is: - 2018-01-27 20:38:22.959984 connection=1 Connect size=0x400 write=1 flush=1 rotational=0 trim=0 zero=1 fua=1 + 2018-01-27 20:38:22.959984 connection=1 Connect size=0x400 write=1 flush=1 rotational=0 trim=0 zero=1 fua=1 extents=1 cache=0 fast_zero=0 2018-01-27 20:38:23.001720 connection=1 Read id=1 offset=0x0 count=0x100 ... 2018-01-27 20:38:23.001995 connection=1 ...Read id=1 return=0 (Success) 2018-01-27 20:38:23.044259 connection=1 Disconnect transactions=1 diff --git a/filters/nozero/nbdkit-nozero-filter.pod b/filters/nozero/nbdkit-nozero-filter.pod index 144b8230..4fc7dc63 100644 --- a/filters/nozero/nbdkit-nozero-filter.pod +++ b/filters/nozero/nbdkit-nozero-filter.pod @@ -4,7 +4,8 @@ nbdkit-nozero-filter - nbdkit nozero filter =head1 SYNOPSIS - nbdkit --filter=nozero plugin [zeromode=MODE] [plugin-args...] + nbdkit --filter=nozero plugin [plugin-args...] \ + [zeromode=MODE] [fastzeromode=MODE] =head1 DESCRIPTION @@ -18,7 +19,7 @@ testing client or server fallbacks. 
=over 4 -=item B<zeromode=none|emulate|notrim> +=item B<zeromode=none|emulate|notrim|plugin> Optional, controls which mode the filter will use. Mode B<none> (default) means that zero support is not advertised to the @@ -29,8 +30,30 @@ efficient way to write zeros. Since nbdkit E<ge> 1.13.4, mode B<notrim> means that zero requests are forwarded on to the plugin, except that the plugin will never see the NBDKIT_MAY_TRIM flag, to determine if the client permitting trimming during zero operations -makes a difference (it is an error to request this mode if the plugin -does not support the C<zero> callback). +makes a difference. Since nbdkit E<ge> 1.13.9, mode B<plugin> leaves +normal zero requests up to the plugin, useful when combined with +C<fastzeromode> for experimenting with the effects of fast zero +requests. It is an error to request B<notrim> or B<plugin> if the +plugin does not support the C<zero> callback. + +=item B<fastzeromode=none|slow|ignore|default> + +Optional since nbdkit E<ge> 1.13.9, controls whether fast zeroes are +advertised to the client, and if so, how the filter will react to a +client fast zero request. Mode B<none> avoids advertising fast zero +support. Mode B<slow> advertises fast zero support unconditionally, +but treats all fast zero requests as an immediate C<ENOTSUP> failure +rather than performing a fallback. Mode B<ignore> advertises fast +zero support, but treats all client fast zero requests as if the flag +had not been used (this behavior is typically contrary to the NBD +specification, but can be useful for comparison against the actual +fast zero implementation to see if fast zeroes make a difference). 
+Mode B<default> is selected by default; when paired with +C<zeromode=emulate>, fast zeroes are advertised but fast zero requests +always fail (similar to C<slow>); when paired with C<zeromode=notrim> +or C<zeromode=plugin>, fast zero support is left to the plugin +(although in the latter case, the nozero filter could be omitted for +the same behavior). =back @@ -42,11 +65,17 @@ explicitly rather than with C<NBD_CMD_WRITE_ZEROES>: nbdkit --filter=nozero file disk.img Serve the file F<disk.img>, allowing the client to take advantage of -less network traffic via C<NBD_CMD_WRITE_ZEROES>, but still forcing -the data to be written explicitly rather than punching any holes: +less network traffic via C<NBD_CMD_WRITE_ZEROES>, but fail any fast +zero requests up front and force all other zero requests to write data +explicitly rather than punching any holes: nbdkit --filter=nozero file zeromode=emulate disk.img +Serve the file F<disk.img>, but do not advertise fast zero support to +the client even if the plugin supports it: + + nbdkit --filter=nozero file zeromode=plugin fastzeromode=none disk.img + =head1 SEE ALSO L<nbdkit(1)>, diff --git a/server/filters.c b/server/filters.c index 0dd2393e..f2de5e4e 100644 --- a/server/filters.c +++ b/server/filters.c @@ -314,6 +314,13 @@ next_can_zero (void *nxdata) return b_conn->b->can_zero (b_conn->b, b_conn->conn); } +static int +next_can_fast_zero (void *nxdata) +{ + struct b_conn *b_conn = nxdata; + return b_conn->b->can_fast_zero (b_conn->b, b_conn->conn); +} + static int next_can_extents (void *nxdata) { @@ -445,6 +452,7 @@ static struct nbdkit_next_ops next_ops = { .is_rotational = next_is_rotational, .can_trim = next_can_trim, .can_zero = next_can_zero, + .can_fast_zero = next_can_fast_zero, .can_extents = next_can_extents, .can_fua = next_can_fua, .can_multi_conn = next_can_multi_conn, @@ -593,9 +601,14 @@ static int filter_can_fast_zero (struct backend *b, struct connection *conn) { struct backend_filter *f = container_of (b, 
struct backend_filter, backend); + void *handle = connection_get_handle (conn, f->backend.i); + struct b_conn nxdata = { .b = f->backend.next, .conn = conn }; debug ("%s: can_fast_zero", f->name); - return 0; /* Next patch will query or pass through */ + if (f->filter.can_fast_zero) + return f->filter.can_fast_zero (&next_ops, &nxdata, handle); + else + return f->backend.next->can_fast_zero (f->backend.next, conn); } static int diff --git a/include/nbdkit-filter.h b/include/nbdkit-filter.h index 94f17789..d11cf881 100644 --- a/include/nbdkit-filter.h +++ b/include/nbdkit-filter.h @@ -71,6 +71,7 @@ struct nbdkit_next_ops { int (*is_rotational) (void *nxdata); int (*can_trim) (void *nxdata); int (*can_zero) (void *nxdata); + int (*can_fast_zero) (void *nxdata); int (*can_extents) (void *nxdata); int (*can_fua) (void *nxdata); int (*can_multi_conn) (void *nxdata); @@ -139,6 +140,8 @@ struct nbdkit_filter { void *handle); int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle); + int (*can_fast_zero) (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle); int (*can_extents) (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle); int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata, diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c index 0978887f..47638c74 100644 --- a/filters/blocksize/blocksize.c +++ b/filters/blocksize/blocksize.c @@ -307,6 +307,18 @@ blocksize_zero (struct nbdkit_next_ops *next_ops, void *nxdata, uint32_t drop; bool need_flush = false; + if (flags & NBDKIT_FLAG_FAST_ZERO) { + /* If we have to split the transaction, an ENOTSUP fast failure in + * a later call would be unnecessarily delayed behind earlier + * calls; it's easier to just declare that anything that can't be + * done in one call to the plugin is not fast. 
+ */ + if ((offs | count) & (minblock - 1) || count > maxlen) { + *err = ENOTSUP; + return -1; + } + } + if ((flags & NBDKIT_FLAG_FUA) && next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) { flags &= ~NBDKIT_FLAG_FUA; diff --git a/filters/cache/cache.c b/filters/cache/cache.c index b5dbccd2..7c1d6c4f 100644 --- a/filters/cache/cache.c +++ b/filters/cache/cache.c @@ -250,6 +250,17 @@ cache_can_cache (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) return NBDKIT_CACHE_NATIVE; } +/* Override the plugin's .can_fast_zero, because our .zero is not fast */ +static int +cache_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + /* It is better to advertise support even when we always reject fast + * zero attempts. + */ + return 1; +} + /* Read data. */ static int cache_pread (struct nbdkit_next_ops *next_ops, void *nxdata, @@ -418,6 +429,14 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata, int r; bool need_flush = false; + /* We are purposefully avoiding next_ops->zero, so a zero request is + * never faster than plain writes. 
+ */ + if (flags & NBDKIT_FLAG_FAST_ZERO) { + *err = ENOTSUP; + return -1; + } + block = malloc (blksize); if (block == NULL) { *err = errno; @@ -624,6 +643,7 @@ static struct nbdkit_filter filter = { .prepare = cache_prepare, .get_size = cache_get_size, .can_cache = cache_can_cache, + .can_fast_zero = cache_can_fast_zero, .pread = cache_pread, .pwrite = cache_pwrite, .zero = cache_zero, diff --git a/filters/cow/cow.c b/filters/cow/cow.c index 9d91d432..a5c1f978 100644 --- a/filters/cow/cow.c +++ b/filters/cow/cow.c @@ -179,6 +179,17 @@ cow_can_cache (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) return NBDKIT_FUA_NATIVE; } +/* Override the plugin's .can_fast_zero, because our .zero is not fast */ +static int +cow_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + /* It is better to advertise support even when we always reject fast + * zero attempts. + */ + return 1; +} + static int cow_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle, uint32_t flags, int *err); /* Read data. */ @@ -340,6 +351,14 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata, uint64_t blknum, blkoffs; int r; + /* We are purposefully avoiding next_ops->zero, so a zero request is + * never faster than plain writes. 
+ */ + if (flags & NBDKIT_FLAG_FAST_ZERO) { + *err = ENOTSUP; + return -1; + } + block = malloc (BLKSIZE); if (block == NULL) { *err = errno; @@ -496,6 +515,7 @@ static struct nbdkit_filter filter = { .can_extents = cow_can_extents, .can_fua = cow_can_fua, .can_cache = cow_can_cache, + .can_fast_zero = cow_can_fast_zero, .pread = cow_pread, .pwrite = cow_pwrite, .zero = cow_zero, diff --git a/filters/delay/delay.c b/filters/delay/delay.c index c92a12d7..207d101e 100644 --- a/filters/delay/delay.c +++ b/filters/delay/delay.c @@ -37,6 +37,8 @@ #include <stdlib.h> #include <stdint.h> #include <string.h> +#include <stdbool.h> +#include <errno.h> #include <nbdkit-filter.h> @@ -46,6 +48,7 @@ static int delay_zero_ms = 0; /* zero delay (milliseconds) */ static int delay_trim_ms = 0; /* trim delay (milliseconds) */ static int delay_extents_ms = 0;/* extents delay (milliseconds) */ static int delay_cache_ms = 0; /* cache delay (milliseconds) */ +static int delay_fast_zero = 1; /* whether delaying zero includes fast zero */ static int parse_delay (const char *key, const char *value) @@ -182,6 +185,12 @@ delay_config (nbdkit_next_config *next, void *nxdata, return -1; return 0; } + else if (strcmp (key, "delay-fast-zero") == 0) { + delay_fast_zero = nbdkit_parse_bool (value); + if (delay_fast_zero < 0) + return -1; + return 0; + } else return next (nxdata, key, value); } @@ -194,7 +203,19 @@ delay_config (nbdkit_next_config *next, void *nxdata, "delay-trim=<NN>[ms] Trim delay in seconds/milliseconds.\n" \ "delay-extents=<NN>[ms] Extents delay in seconds/milliseconds.\n" \ "delay-cache=<NN>[ms] Cache delay in seconds/milliseconds.\n" \ - "wdelay=<NN>[ms] Write, zero and trim delay in secs/msecs." 
+ "wdelay=<NN>[ms] Write, zero and trim delay in secs/msecs.\n" \ + "delay-fast-zero=<BOOL> Delay fast zero requests (default true).\n" + +/* Override the plugin's .can_fast_zero if needed */ +static int +delay_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + /* Advertise if we are handling fast zero requests locally */ + if (delay_zero_ms && !delay_fast_zero) + return 1; + return next_ops->can_fast_zero (nxdata); +} /* Read data. */ static int @@ -225,6 +246,10 @@ delay_zero (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle, uint32_t count, uint64_t offset, uint32_t flags, int *err) { + if ((flags & NBDKIT_FLAG_FAST_ZERO) && delay_zero_ms && !delay_fast_zero) { + *err = ENOTSUP; + return -1; + } if (zero_delay (err) == -1) return -1; return next_ops->zero (nxdata, count, offset, flags, err); @@ -269,6 +294,7 @@ static struct nbdkit_filter filter = { .version = PACKAGE_VERSION, .config = delay_config, .config_help = delay_config_help, + .can_fast_zero = delay_can_fast_zero, .pread = delay_pread, .pwrite = delay_pwrite, .zero = delay_zero, diff --git a/filters/log/log.c b/filters/log/log.c index 7cf741e1..95667c61 100644 --- a/filters/log/log.c +++ b/filters/log/log.c @@ -260,14 +260,15 @@ log_prepare (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) int F = next_ops->can_fua (nxdata); int e = next_ops->can_extents (nxdata); int c = next_ops->can_cache (nxdata); + int Z = next_ops->can_fast_zero (nxdata); if (size < 0 || w < 0 || f < 0 || r < 0 || t < 0 || z < 0 || F < 0 || - e < 0 || c < 0) + e < 0 || c < 0 || Z < 0) return -1; output (h, "Connect", 0, "size=0x%" PRIx64 " write=%d flush=%d " - "rotational=%d trim=%d zero=%d fua=%d extents=%d cache=%d", - size, w, f, r, t, z, F, e, c); + "rotational=%d trim=%d zero=%d fua=%d extents=%d cache=%d " + "fast_zero=%d", size, w, f, r, t, z, F, e, c, Z); return 0; } @@ -360,10 +361,13 @@ log_zero (struct nbdkit_next_ops *next_ops, void *nxdata, uint64_t id = 
get_id (h); int r; - assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM))); - output (h, "Zero", id, "offset=0x%" PRIx64 " count=0x%x trim=%d fua=%d ...", + assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM | + NBDKIT_FLAG_FAST_ZERO))); + output (h, "Zero", id, + "offset=0x%" PRIx64 " count=0x%x trim=%d fua=%d fast=%d...", offs, count, !!(flags & NBDKIT_FLAG_MAY_TRIM), - !!(flags & NBDKIT_FLAG_FUA)); + !!(flags & NBDKIT_FLAG_FUA), + !!(flags & NBDKIT_FLAG_FAST_ZERO)); r = next_ops->zero (nxdata, count, offs, flags, err); output_return (h, "...Zero", id, r, err); return r; diff --git a/filters/nozero/nozero.c b/filters/nozero/nozero.c index 964cce9f..e54f7c62 100644 --- a/filters/nozero/nozero.c +++ b/filters/nozero/nozero.c @@ -38,6 +38,7 @@ #include <string.h> #include <stdbool.h> #include <assert.h> +#include <errno.h> #include <nbdkit-filter.h> @@ -49,8 +50,16 @@ static enum ZeroMode { NONE, EMULATE, NOTRIM, + PLUGIN, } zeromode; +static enum FastZeroMode { + DEFAULT, + SLOW, + IGNORE, + NOFAST, +} fastzeromode; + static int nozero_config (nbdkit_next_config *next, void *nxdata, const char *key, const char *value) @@ -60,17 +69,35 @@ nozero_config (nbdkit_next_config *next, void *nxdata, zeromode = EMULATE; else if (strcmp (value, "notrim") == 0) zeromode = NOTRIM; + else if (strcmp (value, "plugin") == 0) + zeromode = PLUGIN; else if (strcmp (value, "none") != 0) { nbdkit_error ("unknown zeromode '%s'", value); return -1; } return 0; } + + if (strcmp (key, "fastzeromode") == 0) { + if (strcmp (value, "none") == 0) + fastzeromode = NOFAST; + else if (strcmp (value, "ignore") == 0) + fastzeromode = IGNORE; + else if (strcmp (value, "slow") == 0) + fastzeromode = SLOW; + else if (strcmp (value, "default") != 0) { + nbdkit_error ("unknown fastzeromode '%s'", value); + return -1; + } + return 0; + } + return next (nxdata, key, value); } #define nozero_config_help \ - "zeromode=<MODE> Either 'none' (default), 'emulate', or 'notrim'.\n" \ + 
"zeromode=<MODE> One of 'none' (default), 'emulate', 'notrim', 'plugin'.\n" \ + "fastzeromode=<MODE> One of 'default', 'none', 'slow', 'ignore'.\n" /* Check that desired mode is supported by plugin. */ static int @@ -78,12 +105,13 @@ nozero_prepare (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) { int r; - if (zeromode == NOTRIM) { + if (zeromode == NOTRIM || zeromode == PLUGIN) { r = next_ops->can_zero (nxdata); if (r == -1) return -1; if (!r) { - nbdkit_error ("zeromode 'notrim' requires plugin zero support"); + nbdkit_error ("zeromode '%s' requires plugin zero support", + zeromode == NOTRIM ? "notrim" : "plugin"); return -1; } } @@ -94,9 +122,22 @@ nozero_prepare (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) static int nozero_can_zero (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) { + /* For NOTRIM and PLUGIN modes, we've already verified next_ops->can_zero */ return zeromode != NONE; } +/* Advertise desired FAST_ZERO mode. */ +static int +nozero_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + if (zeromode == NONE) + return 0; + if (zeromode != EMULATE && fastzeromode == DEFAULT) + return next_ops->can_fast_zero (nxdata); + return fastzeromode != NOFAST; +} + static int nozero_zero (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle, uint32_t count, uint64_t offs, uint32_t flags, @@ -106,9 +147,21 @@ nozero_zero (struct nbdkit_next_ops *next_ops, void *nxdata, bool need_flush = false; assert (zeromode != NONE); - flags &= ~NBDKIT_FLAG_MAY_TRIM; + if (flags & NBDKIT_FLAG_FAST_ZERO) { + assert (fastzeromode != NOFAST); + if (fastzeromode == SLOW || + (fastzeromode == DEFAULT && zeromode == EMULATE)) { + *err = ENOTSUP; + return -1; + } + if (fastzeromode == IGNORE) + flags &= ~NBDKIT_FLAG_FAST_ZERO; + } if (zeromode == NOTRIM) + flags &= ~NBDKIT_FLAG_MAY_TRIM; + + if (zeromode != EMULATE) return next_ops->zero (nxdata, count, offs, flags, err); if (flags & 
NBDKIT_FLAG_FUA) { @@ -144,6 +197,7 @@ static struct nbdkit_filter filter = { .config_help = nozero_config_help, .prepare = nozero_prepare, .can_zero = nozero_can_zero, + .can_fast_zero = nozero_can_fast_zero, .zero = nozero_zero, }; diff --git a/filters/truncate/truncate.c b/filters/truncate/truncate.c index 93d8f074..47d70b31 100644 --- a/filters/truncate/truncate.c +++ b/filters/truncate/truncate.c @@ -201,6 +201,14 @@ truncate_get_size (struct nbdkit_next_ops *next_ops, void *nxdata, return h->size; } +/* Override the plugin's .can_fast_zero, because zeroing a tail is fast. */ +static int +truncate_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + return 1; +} + /* Read data. */ static int truncate_pread (struct nbdkit_next_ops *next_ops, void *nxdata, @@ -297,6 +305,12 @@ truncate_zero (struct nbdkit_next_ops *next_ops, void *nxdata, n = count; else n = h->real_size - offset; + if (flags & NBDKIT_FLAG_FAST_ZERO && + next_ops->can_fast_zero (nxdata) <= 0) { + /* TODO: Cache per connection? */ + *err = ENOTSUP; + return -1; + } return next_ops->zero (nxdata, n, offset, flags, err); } return 0; @@ -392,6 +406,7 @@ static struct nbdkit_filter filter = { .close = truncate_close, .prepare = truncate_prepare, .get_size = truncate_get_size, + .can_fast_zero = truncate_can_fast_zero, .pread = truncate_pread, .pwrite = truncate_pwrite, .trim = truncate_trim, diff --git a/tests/test-fua.sh b/tests/test-fua.sh index c0d82db7..1c869e96 100755 --- a/tests/test-fua.sh +++ b/tests/test-fua.sh @@ -106,14 +106,14 @@ test $(grep -c 'connection=1 Flush' fua1.log) -lt \ # all earlier parts of the transaction do not have fua flush1=$(grep -c 'connection=1 Flush' fua2.log || :) flush2=$(grep -c 'connection=2 Flush' fua2.log || :) -fua=$(grep -c 'connection=2.*fua=1 \.' fua2.log || :) +fua=$(grep -c 'connection=2.*fua=1 .*\.' 
fua2.log || :) test $(( $flush2 - $flush1 + $fua )) = 2 # Test 3: every part of split has fua, and no flushes are added flush1=$(grep -c 'connection=1 Flush' fua3.log || :) flush2=$(grep -c 'connection=2 Flush' fua3.log || :) test $flush1 = $flush2 -test $(grep -c 'connection=2.*fua=1 \.' fua3.log) = 32 +test $(grep -c 'connection=2.*fua=1 .*\.' fua3.log) = 32 # Test 4: flush is no-op, and every transaction has fua if grep 'fua=0' fua4.log; then -- 2.21.0
Eric Blake
2019-Aug-23 14:40 UTC
[Libguestfs] [nbdkit PATCH 3/3] plugins: Add .can_fast_zero hook
Allow plugins to affect the handling of the new NBD_CMD_FLAG_FAST_ZERO flag, then update affected plugins. In particular, in-memory plugins are always fast; the full plugin is better served by omitting .zero and relying on .pwrite fallback; nbd takes advantage of libnbd extensions proposed in parallel to pass through support; and v2 language bindings expose the choice to their scripts. The testsuite is updated thanks to the sh plugin to cover this. In turn, the sh plugin has to be a bit smarter about handling missing can_fast_zero to get decent fallback support from nbdkit. Plugins untouched by this patch either do not support .zero with flags (including v1 plugins; these are all okay with the default behavior of advertising but always failing fast zeroes), or are too difficult to analyze in this patch (so not advertising is easier than having to decide - in particular, the file plugin is tricky, since BLKZEROOUT is not reliably fast). The nozero filter can be used to adjust fast zero settings for a plugin that has not yet updated. Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/nbdkit-plugin.pod | 74 +++++++++++++++---- plugins/sh/nbdkit-sh-plugin.pod | 13 +++- server/plugins.c | 25 +++++-- include/nbdkit-plugin.h | 2 + plugins/data/data.c | 14 +++- plugins/full/full.c | 12 ++-- plugins/memory/memory.c | 14 +++- plugins/nbd/nbd.c | 28 +++++++- plugins/null/null.c | 8 +++ plugins/ocaml/ocaml.c | 25 +++++++ plugins/sh/sh.c | 39 +++++++--- plugins/ocaml/NBDKit.ml | 10 ++- plugins/ocaml/NBDKit.mli | 2 + plugins/rust/src/lib.rs | 3 + tests/test-eflags.sh | 122 ++++++++++++++++++++++++++++---- 15 files changed, 332 insertions(+), 59 deletions(-) diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod index bc3d9749..f3793e7a 100644 --- a/docs/nbdkit-plugin.pod +++ b/docs/nbdkit-plugin.pod @@ -609,19 +609,47 @@ C<.trim> callback has been defined. 
This is called during the option negotiation phase to find out if the plugin wants the C<.zero> callback to be utilized. Support for -writing zeroes is still advertised to the client (unless the nbdkit -filter nozero is also used), so returning false merely serves as a way -to avoid complicating the C<.zero> callback to have to fail with -C<ENOTSUP> or C<EOPNOTSUPP> on the connections where it will never be -more efficient than using C<.pwrite> up front. +writing zeroes is still advertised to the client (unless the +L<nbdkit-nozero-filter(1)> is also used), so returning false merely +serves as a way to avoid complicating the C<.zero> callback to have to +fail with C<ENOTSUP> or C<EOPNOTSUPP> on the connections where it will +never be more efficient than using C<.pwrite> up front. If there is an error, C<.can_zero> should call C<nbdkit_error> with an error message and return C<-1>. -This callback is not required. If omitted, then nbdkit always tries -C<.zero> first if it is present, and gracefully falls back to -C<.pwrite> if C<.zero> was absent or failed with C<ENOTSUP> or -C<EOPNOTSUPP>. +This callback is not required. If omitted, then for a normal zero +request, nbdkit always tries C<.zero> first if it is present, and +gracefully falls back to C<.pwrite> if C<.zero> was absent or failed +with C<ENOTSUP> or C<EOPNOTSUPP>. + +=head2 C<.can_fast_zero> + + int can_fast_zero (void *handle); + +This is called during the option negotiation phase to find out if the +plugin wants to advertise support for fast zero requests. If this +support is not advertised, a client cannot attempt fast zero requests, +and has no way to tell if writing zeroes offers any speedups compared +to using C<.pwrite> (other than compressed network traffic). 
If +support is advertised, then C<.zero> will have +C<NBDKIT_FLAG_FAST_ZERO> set when the client has requested a fast +zero, in which case the plugin must fail with C<ENOTSUP> or +C<EOPNOTSUPP> up front if the request would not offer any benefits +over C<.pwrite>. Advertising support for fast zero requests does not +require that writing zeroes be fast, only that the result (whether +success or failure) is fast, so this should be advertised when +feasible. + +If there is an error, C<.can_fast_zero> should call C<nbdkit_error> +with an error message and return C<-1>. + +This callback is not required. If omitted, then nbdkit returns true +if C<.zero> is absent or C<.can_zero> returns false (in those cases, +nbdkit fails all fast zero requests, as its fallback to C<.pwrite> is +not inherently faster), otherwise false (since it cannot be determined +in advance if the plugin's C<.zero> will properly honor the semantics +of C<NBDKIT_FLAG_FAST_ZERO>). =head2 C<.can_extents> @@ -804,15 +832,25 @@ bytes of zeroes at C<offset> in the backing store. This function will not be called if C<.can_zero> returned false. On input, the parameter C<flags> may include C<NBDKIT_FLAG_MAY_TRIM> -unconditionally, and C<NBDKIT_FLAG_FUA> based on the result of -C<.can_fua>. +unconditionally, C<NBDKIT_FLAG_FUA> based on the result of +C<.can_fua>, and C<NBDKIT_FLAG_FAST_ZERO> based on the result of +C<.can_fast_zero>. If C<NBDKIT_FLAG_MAY_TRIM> is requested, the operation can punch a hole instead of writing actual zero bytes, but only if subsequent -reads from the hole read as zeroes. If this callback is omitted, or -if it fails with C<ENOTSUP> or C<EOPNOTSUPP> (whether by -C<nbdkit_set_error> or C<errno>), then C<.pwrite> will be used -instead. +reads from the hole read as zeroes. 
+ +If C<NBDKIT_FLAG_FAST_ZERO> is requested, the plugin must decide up +front if the implementation is likely to be faster than a +corresponding C<.pwrite>; if not, then it must immediately fail with +C<ENOTSUP> or C<EOPNOTSUPP> (whether by C<nbdkit_set_error> or +C<errno>) and preferably without modifying the exported image. It is +acceptable to always fail a fast zero request (as a fast failure is +better than attempting the write only to find out after the fact that +it was not fast after all). Note that on Linux, support for +C<ioctl(BLKZEROOUT)> is insufficient for determining whether a zero +request to a block device will be fast (because the kernel will +perform a slow fallback when needed). The callback must write the whole C<count> bytes if it can. The NBD protocol doesn't allow partial writes (instead, these would be @@ -823,6 +861,11 @@ If there is an error, C<.zero> should call C<nbdkit_error> with an error message, and C<nbdkit_set_error> to record an appropriate error (unless C<errno> is sufficient), then return C<-1>. +If this callback is omitted, or if it fails with C<ENOTSUP> or +C<EOPNOTSUPP> (whether by C<nbdkit_set_error> or C<errno>), then +C<.pwrite> will be used as an automatic fallback except when the +client requested a fast zero. + =head2 C<.extents> int extents (void *handle, uint32_t count, uint64_t offset, @@ -1221,6 +1264,7 @@ and then users will be able to run it like this: =head1 SEE ALSO L<nbdkit(1)>, +L<nbdkit-nozero-filter(3)>, L<nbdkit-filter(3)>. Standard plugins provided by nbdkit: diff --git a/plugins/sh/nbdkit-sh-plugin.pod b/plugins/sh/nbdkit-sh-plugin.pod index 9e9a133e..adb8a0db 100644 --- a/plugins/sh/nbdkit-sh-plugin.pod +++ b/plugins/sh/nbdkit-sh-plugin.pod @@ -289,7 +289,10 @@ The script should exit with code C<0> for true or code C<3> for false. 
=item C<is_rotational> +=item C<can_fast_zero> + /path/to/script is_rotational <handle> + /path/to/script can_fast_zero <handle> The script should exit with code C<0> for true or code C<3> for false. @@ -361,12 +364,18 @@ also provide a C<can_trim> method which exits with code C<0> (true). /path/to/script zero <handle> <count> <offset> <flags> The C<flags> parameter can be an empty string or a comma-separated -list of the flags: C<"fua"> and C<"may_trim"> (eg. C<"">, C<"fua">, -C<"fua,may_trim"> are all possible values). +list of the flags: C<"fua">, C<"may_trim">, and C<"fast"> (eg. C<"">, +C<"fua">, C<"fua,may_trim,fast"> are some of the 8 possible values). Unlike in other languages, if you provide a C<zero> method you B<must> also provide a C<can_zero> method which exits with code C<0> (true). +To trigger a fallback to C<.pwrite> on a normal zero request, or to +respond quickly to the C<"fast"> flag that a specific zero request is +no faster than a corresponding write, the script must output +C<ENOTSUP> or C<EOPNOTSUPP> to stderr (possibly followed by a +description of the problem) before exiting with code C<1> (failure). + =item C<extents> /path/to/script extents <handle> <count> <offset> <flags> diff --git a/server/plugins.c b/server/plugins.c index c6dcf408..84329cf4 100644 --- a/server/plugins.c +++ b/server/plugins.c @@ -404,11 +404,25 @@ plugin_can_zero (struct backend *b, struct connection *conn) static int plugin_can_fast_zero (struct backend *b, struct connection *conn) { + struct backend_plugin *p = container_of (b, struct backend_plugin, backend); + int r; + assert (connection_get_handle (conn, 0)); debug ("can_fast_zero"); - return 0; /* Upcoming patch will actually add support. */ + if (p->plugin.can_fast_zero) + return p->plugin.can_fast_zero (connection_get_handle (conn, 0)); + /* Advertise support for fast zeroes if no .zero or .can_zero is + * false: in those cases, we fail fast instead of using .pwrite. 
+ * This also works when v1 plugin has only ._zero_old. + */ + if (p->plugin.zero == NULL) + return 1; + r = plugin_can_zero (b, conn); + if (r == -1) + return -1; + return !r; } static int @@ -656,15 +670,18 @@ plugin_zero (struct backend *b, struct connection *conn, } if (can_zero) { - /* if (!can_fast_zero) */ - flags &= ~NBDKIT_FLAG_FAST_ZERO; errno = 0; if (p->plugin.zero) r = p->plugin.zero (connection_get_handle (conn, 0), count, offset, flags); - else if (p->plugin._zero_old) + else if (p->plugin._zero_old) { + if (fast_zero) { + *err = EOPNOTSUPP; + return -1; + } r = p->plugin._zero_old (connection_get_handle (conn, 0), count, offset, may_trim); + } else emulate = true; if (r == -1) diff --git a/include/nbdkit-plugin.h b/include/nbdkit-plugin.h index 632df867..45ae7053 100644 --- a/include/nbdkit-plugin.h +++ b/include/nbdkit-plugin.h @@ -132,6 +132,8 @@ struct nbdkit_plugin { int (*cache) (void *handle, uint32_t count, uint64_t offset, uint32_t flags); int (*thread_model) (void); + + int (*can_fast_zero) (void *handle); }; extern void nbdkit_set_error (int err); diff --git a/plugins/data/data.c b/plugins/data/data.c index 14fb8997..9004a487 100644 --- a/plugins/data/data.c +++ b/plugins/data/data.c @@ -349,6 +349,13 @@ data_can_cache (void *handle) return NBDKIT_CACHE_NATIVE; } +/* Fast zero. */ +static int +data_can_fast_zero (void *handle) +{ + return 1; +} + /* Read data. */ static int data_pread (void *handle, void *buf, uint32_t count, uint64_t offset, @@ -375,8 +382,10 @@ data_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, static int data_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags) { - /* Flushing, and thus FUA flag, is a no-op */ - assert ((flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM)) == 0); + /* Flushing, and thus FUA flag, is a no-op. Assume that + * sparse_array_zero generally beats writes, so FAST_ZERO is a no-op. 
*/ + assert ((flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM | + NBDKIT_FLAG_FAST_ZERO)) == 0); ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock); sparse_array_zero (sa, count, offset); return 0; @@ -423,6 +432,7 @@ static struct nbdkit_plugin plugin = { .can_multi_conn = data_can_multi_conn, .can_fua = data_can_fua, .can_cache = data_can_cache, + .can_fast_zero = data_can_fast_zero, .pread = data_pread, .pwrite = data_pwrite, .zero = data_zero, diff --git a/plugins/full/full.c b/plugins/full/full.c index 9cfbcfcd..0b69a8c9 100644 --- a/plugins/full/full.c +++ b/plugins/full/full.c @@ -129,13 +129,10 @@ full_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, return -1; } -/* Write zeroes. */ -static int -full_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags) -{ - errno = ENOSPC; - return -1; -} +/* Omitting full_zero is intentional: that way, nbdkit defaults to + * permitting fast zeroes which respond with ENOTSUP, while normal + * zeroes fall back to pwrite and respond with ENOSPC. + */ /* Trim. */ static int @@ -172,7 +169,6 @@ static struct nbdkit_plugin plugin = { .can_cache = full_can_cache, .pread = full_pread, .pwrite = full_pwrite, - .zero = full_zero, .trim = full_trim, .extents = full_extents, /* In this plugin, errno is preserved properly along error return diff --git a/plugins/memory/memory.c b/plugins/memory/memory.c index 09162ea2..e831a7b5 100644 --- a/plugins/memory/memory.c +++ b/plugins/memory/memory.c @@ -147,6 +147,13 @@ memory_can_cache (void *handle) return NBDKIT_CACHE_NATIVE; } +/* Fast zero. */ +static int +memory_can_fast_zero (void *handle) +{ + return 1; +} + /* Read data. 
*/ static int memory_pread (void *handle, void *buf, uint32_t count, uint64_t offset, @@ -173,8 +180,10 @@ memory_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, static int memory_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags) { - /* Flushing, and thus FUA flag, is a no-op */ - assert ((flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM)) == 0); + /* Flushing, and thus FUA flag, is a no-op. Assume that + * sparse_array_zero generally beats writes, so FAST_ZERO is a no-op. */ + assert ((flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM | + NBDKIT_FLAG_FAST_ZERO)) == 0); ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock); sparse_array_zero (sa, count, offset); return 0; @@ -221,6 +230,7 @@ static struct nbdkit_plugin plugin = { .can_fua = memory_can_fua, .can_multi_conn = memory_can_multi_conn, .can_cache = memory_can_cache, + .can_fast_zero = memory_can_fast_zero, .pread = memory_pread, .pwrite = memory_pwrite, .zero = memory_zero, diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c index 09c8891e..cddcfde2 100644 --- a/plugins/nbd/nbd.c +++ b/plugins/nbd/nbd.c @@ -633,6 +633,24 @@ nbdplug_can_zero (void *handle) return i; } +static int +nbdplug_can_fast_zero (void *handle) +{ +#if LIBNBD_HAVE_NBD_CAN_FAST_ZERO + struct handle *h = handle; + int i = nbd_can_fast_zero (h->nbd); + + if (i == -1) { + nbdkit_error ("failure to check fast zero flag: %s", nbd_get_error ()); + return -1; + } + return i; +#else + /* libnbd 0.9.8 lacks fast zero support */ + return 0; +#endif +} + static int nbdplug_can_fua (void *handle) { @@ -724,12 +742,19 @@ nbdplug_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags) struct transaction s; uint32_t f = 0; - assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM))); + assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM | + NBDKIT_FLAG_FAST_ZERO))); if (!(flags & NBDKIT_FLAG_MAY_TRIM)) f |= LIBNBD_CMD_FLAG_NO_HOLE; if (flags & NBDKIT_FLAG_FUA) f |= LIBNBD_CMD_FLAG_FUA; +#if 
LIBNBD_HAVE_NBD_CAN_FAST_ZERO + if (flags & NBDKIT_FLAG_FAST_ZERO) + f |= LIBNBD_CMD_FLAG_FAST_ZERO; +#else + assert (!(flags & NBDKIT_FLAG_FAST_ZERO)); +#endif nbdplug_prepare (&s); nbdplug_register (h, &s, nbd_aio_zero (h->nbd, count, offset, s.cb, f)); return nbdplug_reply (h, &s); @@ -831,6 +856,7 @@ static struct nbdkit_plugin plugin = { .is_rotational = nbdplug_is_rotational, .can_trim = nbdplug_can_trim, .can_zero = nbdplug_can_zero, + .can_fast_zero = nbdplug_can_fast_zero, .can_fua = nbdplug_can_fua, .can_multi_conn = nbdplug_can_multi_conn, .can_extents = nbdplug_can_extents, diff --git a/plugins/null/null.c b/plugins/null/null.c index 647624ba..559cb815 100644 --- a/plugins/null/null.c +++ b/plugins/null/null.c @@ -100,6 +100,13 @@ null_can_cache (void *handle) return NBDKIT_CACHE_NATIVE; } +/* Fast zero. */ +static int +null_can_fast_zero (void *handle) +{ + return 1; +} + /* Read data. */ static int null_pread (void *handle, void *buf, uint32_t count, uint64_t offset, @@ -167,6 +174,7 @@ static struct nbdkit_plugin plugin = { .get_size = null_get_size, .can_multi_conn = null_can_multi_conn, .can_cache = null_can_cache, + .can_fast_zero = null_can_fast_zero, .pread = null_pread, .pwrite = null_pwrite, .zero = null_zero, diff --git a/plugins/ocaml/ocaml.c b/plugins/ocaml/ocaml.c index 144a449e..a655f9ca 100644 --- a/plugins/ocaml/ocaml.c +++ b/plugins/ocaml/ocaml.c @@ -134,6 +134,8 @@ static value cache_fn; static value thread_model_fn; +static value can_fast_zero_fn; + /*----------------------------------------------------------------------*/ /* Wrapper functions that translate calls from C (ie. nbdkit) to OCaml. 
*/ @@ -705,6 +707,25 @@ thread_model_wrapper (void) CAMLreturnT (int, Int_val (rv)); } +static int +can_fast_zero_wrapper (void *h) +{ + CAMLparam0 (); + CAMLlocal1 (rv); + + caml_leave_blocking_section (); + + rv = caml_callback_exn (can_fast_zero_fn, *(value *) h); + if (Is_exception_result (rv)) { + nbdkit_error ("%s", caml_format_exception (Extract_exception (rv))); + caml_enter_blocking_section (); + CAMLreturnT (int, -1); + } + + caml_enter_blocking_section (); + CAMLreturnT (int, Bool_val (rv)); +} + /*----------------------------------------------------------------------*/ /* set_* functions called from OCaml code at load time to initialize * fields in the plugin struct. @@ -792,6 +813,8 @@ SET(cache) SET(thread_model) +SET(can_fast_zero) + #undef SET static void @@ -836,6 +859,8 @@ remove_roots (void) REMOVE (thread_model); + REMOVE (can_fast_zero); + #undef REMOVE } diff --git a/plugins/sh/sh.c b/plugins/sh/sh.c index c73b08b7..d5db8b59 100644 --- a/plugins/sh/sh.c +++ b/plugins/sh/sh.c @@ -478,6 +478,9 @@ flags_string (uint32_t flags, char *buf, size_t len) if (flags & NBDKIT_FLAG_REQ_ONE) flag_append ("req_one", &comma, &buf, &len); + + if (flags & NBDKIT_FLAG_FAST_ZERO) + flag_append("fast", &comma, &buf, &len); } static void @@ -536,7 +539,7 @@ sh_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, /* Common code for handling all boolean methods like can_write etc. 
*/ static int -boolean_method (void *handle, const char *method_name) +boolean_method (void *handle, const char *method_name, int def) { char *h = handle; const char *args[] = { script, method_name, h, NULL }; @@ -546,8 +549,8 @@ boolean_method (void *handle, const char *method_name) return 1; case RET_FALSE: /* false */ return 0; - case MISSING: /* missing => assume false */ - return 0; + case MISSING: /* missing => caller chooses default */ + return def; case ERROR: /* error cases */ return -1; default: abort (); @@ -557,37 +560,37 @@ boolean_method (void *handle, const char *method_name) static int sh_can_write (void *handle) { - return boolean_method (handle, "can_write"); + return boolean_method (handle, "can_write", 0); } static int sh_can_flush (void *handle) { - return boolean_method (handle, "can_flush"); + return boolean_method (handle, "can_flush", 0); } static int sh_is_rotational (void *handle) { - return boolean_method (handle, "is_rotational"); + return boolean_method (handle, "is_rotational", 0); } static int sh_can_trim (void *handle) { - return boolean_method (handle, "can_trim"); + return boolean_method (handle, "can_trim", 0); } static int sh_can_zero (void *handle) { - return boolean_method (handle, "can_zero"); + return boolean_method (handle, "can_zero", 0); } static int sh_can_extents (void *handle) { - return boolean_method (handle, "can_extents"); + return boolean_method (handle, "can_extents", 0); } /* Not a boolean method, the method prints "none", "emulate" or "native". */ @@ -646,7 +649,7 @@ sh_can_fua (void *handle) static int sh_can_multi_conn (void *handle) { - return boolean_method (handle, "can_multi_conn"); + return boolean_method (handle, "can_multi_conn", 0); } /* Not a boolean method, the method prints "none", "emulate" or "native". 
*/ @@ -696,6 +699,21 @@ sh_can_cache (void *handle) } } +static int +sh_can_fast_zero (void *handle) +{ + int r = boolean_method (handle, "can_fast_zero", 2); + if (r < 2) + return r; + /* We need to duplicate the logic of plugins.c: if can_fast_zero is + * missing, we advertise fast fail support when can_zero is false. + */ + r = sh_can_zero (handle); + if (r == -1) + return -1; + return !r; +} + static int sh_flush (void *handle, uint32_t flags) { @@ -962,6 +980,7 @@ static struct nbdkit_plugin plugin = { .can_fua = sh_can_fua, .can_multi_conn = sh_can_multi_conn, .can_cache = sh_can_cache, + .can_fast_zero = sh_can_fast_zero, .pread = sh_pread, .pwrite = sh_pwrite, diff --git a/plugins/ocaml/NBDKit.ml b/plugins/ocaml/NBDKit.ml index e54a7705..7002ac03 100644 --- a/plugins/ocaml/NBDKit.ml +++ b/plugins/ocaml/NBDKit.ml @@ -96,6 +96,8 @@ type 'a plugin = { cache : ('a -> int32 -> int64 -> flags -> unit) option; thread_model : (unit -> thread_model) option; + + can_fast_zero : ('a -> bool) option; } let default_callbacks = { @@ -141,6 +143,8 @@ let default_callbacks = { cache = None; thread_model = None; + + can_fast_zero = None; } external set_name : string -> unit = "ocaml_nbdkit_set_name" "noalloc" @@ -186,6 +190,8 @@ external set_cache : ('a -> int32 -> int64 -> flags -> unit) -> unit = "ocaml_nb external set_thread_model : (unit -> thread_model) -> unit = "ocaml_nbdkit_set_thread_model" +external set_can_fast_zero : ('a -> bool) -> unit = "ocaml_nbdkit_set_can_fast_zero" + let may f = function None -> () | Some a -> f a let register_plugin plugin @@ -249,7 +255,9 @@ let register_plugin plugin may set_can_cache plugin.can_cache; may set_cache plugin.cache; - may set_thread_model plugin.thread_model + may set_thread_model plugin.thread_model; + + may set_can_fast_zero plugin.can_fast_zero external _set_error : int -> unit = "ocaml_nbdkit_set_error" "noalloc" diff --git a/plugins/ocaml/NBDKit.mli b/plugins/ocaml/NBDKit.mli index 778250ef..06648b7f 100644 --- 
a/plugins/ocaml/NBDKit.mli +++ b/plugins/ocaml/NBDKit.mli @@ -101,6 +101,8 @@ type 'a plugin = { cache : ('a -> int32 -> int64 -> flags -> unit) option; thread_model : (unit -> thread_model) option; + + can_fast_zero : ('a -> bool) option; } (** The plugin fields and callbacks. ['a] is the handle type. *) diff --git a/plugins/rust/src/lib.rs b/plugins/rust/src/lib.rs index 53619dd9..313b4ca6 100644 --- a/plugins/rust/src/lib.rs +++ b/plugins/rust/src/lib.rs @@ -105,6 +105,8 @@ pub struct Plugin { flags: u32) -> c_int>, pub thread_model: Option<extern fn () -> ThreadModel>, + + pub can_fast_zero: Option<extern fn (h: *mut c_void) -> c_int>, } #[repr(C)] @@ -163,6 +165,7 @@ impl Plugin { can_cache: None, cache: None, thread_model: None, + can_fast_zero: None, } } } diff --git a/tests/test-eflags.sh b/tests/test-eflags.sh index f5cd43ed..9b3a6a3a 100755 --- a/tests/test-eflags.sh +++ b/tests/test-eflags.sh @@ -68,6 +68,7 @@ SEND_DF=$(( 1 << 7 )) CAN_MULTI_CONN=$(( 1 << 8 )) SEND_RESIZE=$(( 1 << 9 )) SEND_CACHE=$(( 1 << 10 )) +SEND_FAST_ZERO=$(( 1 << 11 )) do_nbdkit () { @@ -133,8 +134,8 @@ EOF #---------------------------------------------------------------------- # can_write=true # -# NBD_FLAG_SEND_WRITE_ZEROES is set on writable connections -# even when can_zero returns false, because nbdkit reckons it +# NBD_FLAG_SEND_WRITE_ZEROES and NBD_FLAG_SEND_FAST_ZERO are set on writable +# connections even when can_zero returns false, because nbdkit reckons it # can emulate zeroing using pwrite. 
do_nbdkit <<'EOF' @@ -145,8 +146,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # --filter=nozero @@ -255,8 +256,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_TRIM|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_TRIM|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_TRIM|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_TRIM|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # can_write=true @@ -271,8 +272,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|ROTATIONAL|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|ROTATIONAL|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|ROTATIONAL|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|ROTATIONAL|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # -r @@ -304,8 +305,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # -r @@ -341,8 +342,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_FLUSH|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_FLUSH|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( 
HAS_FLAGS|SEND_FLUSH|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_FLUSH|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # can_write=true @@ -361,8 +362,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_FLUSH|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_FLUSH|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_FLUSH|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_FLUSH|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # -r @@ -448,3 +449,96 @@ EOF [ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" + +#---------------------------------------------------------------------- +# -r +# can_fast_zero=true +# +# Fast zero support isn't advertised without regular zero support + +do_nbdkit -r <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_fast_zero) exit 0 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" + +#---------------------------------------------------------------------- +# --filter=nozero +# can_write=true +# can_fast_zero=true +# +# Fast zero support isn't advertised without regular zero support + +do_nbdkit --filter=nozero <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_write) exit 0 ;; + can_fast_zero) exit 0 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_DF" + +#---------------------------------------------------------------------- +# can_write=true +# can_zero=true +# +# Fast zero support is omitted for a plugin that has .zero but did not opt in + +do_nbdkit -r <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_write) exit 0 ;; + can_zero) exit 0 ;; + *) exit 2 
;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" + +#---------------------------------------------------------------------- +# can_write=true +# can_zero=true +# can_fast_zero=false +# +# Fast zero support is omitted if the plugin says so + +do_nbdkit -r <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_write) exit 0 ;; + can_zero) exit 0 ;; + can_fast_zero) exit 3 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" + +#---------------------------------------------------------------------- +# can_write=true +# can_zero=false +# can_fast_zero=false +# +# Fast zero support is omitted if the plugin says so + +do_nbdkit -r <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_write) exit 0 ;; + can_fast_zero) exit 3 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" -- 2.21.0
Richard W.M. Jones
2019-Aug-27 12:14 UTC
Re: [Libguestfs] cross-project patches: Add NBD Fast Zero support
On Fri, Aug 23, 2019 at 09:30:36AM -0500, Eric Blake wrote:
> I've run several tests to demonstrate why this is useful, as well as
> prove that because I have multiple interoperable projects, it is worth
> including in the NBD standard. The original proposal was here:
> https://lists.debian.org/nbd/2019/03/msg00004.html
> where I stated:
>
> > I will not push this without both:
> > - a positive review (for example, we may decide that burning another
> >   NBD_FLAG_* is undesirable, and that we should instead have some sort
> >   of NBD_OPT_ handshake for determining when the server supports
> >   NBD_CMF_FLAG_FAST_ZERO)
> > - a reference client and server implementation (probably both via qemu,
> >   since it was qemu that raised the problem in the first place)

Is the plan to wait until NBD_CMF_FLAG_FAST_ZERO gets into the NBD
protocol doc before doing the rest?  Also I would like to release both
libnbd 1.0 and nbdkit 1.14 before we introduce any large new features.
Both should be released this week, in fact maybe even today or
tomorrow.

[...]
> First, I had to create a scenario where falling back to writes is
> noticeably slower than performing a zero operation, and where
> pre-zeroing also shows an effect. My choice: let's test 'qemu-img
> convert' on an image that is half-sparse (every other megabyte is a
> hole) to an in-memory nbd destination. Then I use a series of nbdkit
> filters to force the destination to behave in various manners:
> log logfile=>(sed ...|uniq -c) (track how many normal/fast zero
> requests the client makes)
> nozero $params (fine-tune how zero requests behave - the parameters
> zeromode and fastzeromode are the real drivers of my various tests)
> blocksize maxdata=256k (allows large zero requests, but forces large
> writes into smaller chunks, to magnify the effects of write delays and
> allow testing to provide obvious results with a smaller image)
> delay delay-write=20ms delay-zero=5ms (also to magnify the effects on a
> smaller image, with writes penalized more than zeroing)
> stats statsfile=/dev/stderr (to track overall time and a decent summary
> of how much I/O occurred).
> noextents (forces the entire image to report that it is allocated,
> which eliminates any testing variability based on whether qemu-img uses
> that to bypass a zeroing operation [1])

I can't help thinking that a sh plugin might have been simpler ...

> I hope you enjoyed reading this far, and agree with my interpretation of
> the numbers about why this feature is useful!

Yes it seems reasonable.

The only thought I had is whether the qemu block layer does or should
combine requests in flight so that a write-zero (offset) followed by a
write-data (same offset) would erase the earlier request.  In some
circumstances that might provide a performance improvement without
needing any changes to protocols.

> - NBD should have a way to advertise (probably via NBD_INFO_ during
>   NBD_OPT_GO) if the initial image is known to begin life with all zeroes
>   (if that is the case, qemu-img can skip the extents calls and
>   pre-zeroing pass altogether)

Yes, I really think we should do this one as well.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. 
Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Eric Blake
2019-Aug-27 13:23 UTC
Re: [Libguestfs] cross-project patches: Add NBD Fast Zero support
On 8/27/19 7:14 AM, Richard W.M. Jones wrote:
>
> Is the plan to wait until NBD_CMF_FLAG_FAST_ZERO gets into the NBD
> protocol doc before doing the rest?  Also I would like to release both
> libnbd 1.0 and nbdkit 1.14 before we introduce any large new features.
> Both should be released this week, in fact maybe even today or
> tomorrow.

Sure, I don't mind this being the first feature for the eventual
libnbd 1.2 and nbdkit 1.16.

> [...]
>> First, I had to create a scenario where falling back to writes is
>> noticeably slower than performing a zero operation, and where
>> pre-zeroing also shows an effect. My choice: let's test 'qemu-img
>> convert' on an image that is half-sparse (every other megabyte is a
>> hole) to an in-memory nbd destination. Then I use a series of nbdkit
>> filters to force the destination to behave in various manners:
>> log logfile=>(sed ...|uniq -c) (track how many normal/fast zero
>> requests the client makes)
>> nozero $params (fine-tune how zero requests behave - the parameters
>> zeromode and fastzeromode are the real drivers of my various tests)
>> blocksize maxdata=256k (allows large zero requests, but forces large
>> writes into smaller chunks, to magnify the effects of write delays and
>> allow testing to provide obvious results with a smaller image)
>> delay delay-write=20ms delay-zero=5ms (also to magnify the effects on a
>> smaller image, with writes penalized more than zeroing)
>> stats statsfile=/dev/stderr (to track overall time and a decent summary
>> of how much I/O occurred).
>> noextents (forces the entire image to report that it is allocated,
>> which eliminates any testing variability based on whether qemu-img uses
>> that to bypass a zeroing operation [1])
>
> I can't help thinking that a sh plugin might have been simpler ...

Maybe, but the extra cost of forking per request may have also made
obvious timing comparisons harder.  I'm just glad that nbdkit's
filtering system was flexible enough to do what I wanted, even if I
did have fun stringing together 6 filters :)

>>> I hope you enjoyed reading this far, and agree with my interpretation of
>>> the numbers about why this feature is useful!
>
> Yes it seems reasonable.
>
> The only thought I had is whether the qemu block layer does or should
> combine requests in flight so that a write-zero (offset) followed by a
> write-data (same offset) would erase the earlier request.  In some
> circumstances that might provide a performance improvement without
> needing any changes to protocols.

As in, maintain a backlog of requests that are needed but have not yet
been sent over the wire because of backlog, and merge those requests
(by splitting an existing large zero request into smaller pieces) if
write requests come in that window before actually transmitting to the
NBD server?  I know qemu has some write coalescing when servicing
guest behaviors; but I was testing on 'qemu-img convert' which does
not depend on guest behavior and therefore has already sent the zero
request to the NBD server before sending any data writes, so
coalescing wouldn't see anything to combine.  Or are you worried about
qemu as the NBD server, performing coalescing of incoming requests
from the client?

But you are right that some smarts about I/O coalescing at various
points in the data path may show some slight optimizations.

>> - NBD should have a way to advertise (probably via NBD_INFO_ during
>>   NBD_OPT_GO) if the initial image is known to begin life with all zeroes
>>   (if that is the case, qemu-img can skip the extents calls and
>>   pre-zeroing pass altogether)
>
> Yes, I really think we should do this one as well.

Stay tuned for my next cross-project post ;)  Hopefully in the next
week or so.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
Eric Blake
2019-Aug-28 13:04 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO
On 8/28/19 4:57 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Hence, it is desirable to have a way for clients to specify that a
>> particular write zero request is being attempted for a fast wipe, and
>> get an immediate failure if the zero request would otherwise take the
>> same time as a write.  Conversely, if the client is not performing a
>> pre-initialization pass, it is still more efficient in terms of
>> networking traffic to send NBD_CMD_WRITE_ZERO requests where the
>> server implements the fallback to the slower write, than it is for the
>> client to have to perform the fallback to send NBD_CMD_WRITE with a
>> zeroed buffer.
>
> How are you going to finally use it in qemu-img convert?

It's already in use there (in fact, the cover letter shows actual
timing examples of how qemu-img's use of BDRV_REQ_NO_FALLBACK, which
translates to NBD_CMD_FLAG_FAST_ZERO, observably affects timing).

> Ok, we have a loop
> of sending write-zero requests. And on first ENOTSUP we'll assume that there
> is no benefit to continue? But what if actually server returns ENOTSUP only
> once when we have 1000 iterations? Seems we should still do zeroing if we
> have only a few ENOTSUPs...

When attempting a bulk zero, you try to wipe the entire device,
presumably with something that is large and aligned.  Yes, if you have
to split the write zero request due to size limitations, then you risk
that the first write zero succeeds but later ones fail, then you
didn't wipe the entire disk, but you also don't need to redo the work
on the first half of the image.  But it is much more likely that the
first write of the bulk zero is representative of the overall
operation (and so in practice, it only takes one fast zero attempt to
learn if bulk zeroing is worthwhile, then continue with fast zeroes
without issues).

> I understand that fail-on-first ENOTSUP is OK for raw-without-fallocate vs qcow2,
> as first will always return ENOTSUP and second will never fail..
> But in such way
> we'll be OK with a simpler extension, which only has one server-advertised
> negotiation flag NBD_FLAG_ZERO_IS_FAST.

Advertising that a server's zero behavior is always going to be
successfully fast is a much harder flag to implement.  The
justification for the semantics I chose (advertising that the server
can quickly report failure if success is not fast, but not requiring
fast zero) covers the case when the decision of whether a zero is fast
may also depend on other factors - for example, if the server knows
the image starts in an all-zero state, then it can track a boolean:
all write zero requests while the boolean is set return immediate
success (nothing to do), but after the first regular write, the
boolean is set to false, and all further write zero requests fail as
being potentially slow; and such an implementation is still useful for
the qemu-img convert case.

> There is not such problem if we have only one iteration, so may be new command
> FILL_ZERO, filling the whole device by zeros?

Or better yet, implement support for 64-bit commands.  Yes, my cover
letter called out further orthogonal extensions, and implementing
64-bit zeroing (so that you can issue a write zero request over the
entire image in one command), as well as a way for a server to
advertise when the image begins life in an all-zero state, are also
further extensions coming down the pipeline.  But as not all servers
have to implement all of the extensions, each extension that can be
orthogonally implemented and show an improvement on its own is still
worthwhile; and my cover letter has shown that fast zeroes on their
own make a measurable difference to certain workloads.

>> + If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
>> + `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
>> + with `NBD_ENOTSUP`, even if the operation is no faster than a
>> + corresponding `NBD_CMD_WRITE`.
>> + Conversely, if `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST
>> + fail quickly with `NBD_ENOTSUP` unless the request can be serviced
>> + in less time than a corresponding `NBD_CMD_WRITE`, and SHOULD NOT
>> + alter the contents of the export when returning this failure. The
>> + server's determination of a fast request MAY depend on a number of
>> + factors, such as whether the request was suitably aligned, on
>> + whether the `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on
>> + whether a previous `NBD_CMD_TRIM` had been performed on the
>> + region. If the server did not advertise `NBD_FLAG_SEND_FAST_ZERO`,
>> + then it SHOULD NOT fail with `NBD_ENOTSUP`, regardless of the
>> + speed of servicing a request, and SHOULD fail with `NBD_EINVAL` if
>> + the `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise
>> + `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast
>> + zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when
>> + the flag is set if the server cannot quickly determine in advance
>> + whether that request would have been fast, even if it turns out
>> + that the same request without the flag would be fast after all.
>
> What if WRITE_ZERO is on average faster than WRITE (for example by
> 20%), but the server can never guarantee performance for a single
> WRITE_ZERO operation - do you restrict such a case? Hmm, OK, SHOULD is
> not MUST actually..

I think my followup mail, based on Wouter's questions, covers this: the
goal is to document the use case of optimizing the copy of a sparse
image, by probing whether a bulk pre-zeroing pass is worthwhile. That
should be the measuring rod: if the implementation can perform a faster
sparse copy because of write zeroes that are sometimes, but not always,
faster than writes - in spite of the duplicated I/O to the data portions
of the image that are touched twice, first by the pre-zero pass and then
by the actual data pass - then succeeding on fast zero requests is okay.
But if it makes the overall image copy slower, then failing with ENOTSUP
is probably better. And at the end of the day, it is really just a
heuristic - if the server guessed wrong, the worst that happens is
slower performance (and not data corruption).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
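The two ideas debated above (the client's probe-once heuristic for a bulk pre-zeroing pass, and a server that tracks an all-zero state to decide whether a fast zero can succeed) can be sketched together. This is an illustrative Python model, not the real qemu/nbdkit code; `Export`, `bulk_pre_zero`, and their signatures are hypothetical.

```python
ENOTSUP = 95  # errno value carried on the wire as NBD_ENOTSUP


class Export:
    """Server side: while the image is known to be all zeroes, a fast
    zero is trivially fast; after the first data write, a request with
    NBD_CMD_FLAG_FAST_ZERO must fail rather than fall back to slow
    writes."""

    def __init__(self, size):
        self.size = size
        self.all_zero = True  # image begins life fully zeroed

    def write(self, offset, data):
        self.all_zero = False  # the image may now contain data
        return 0

    def write_zeroes(self, offset, length, fast_zero=False):
        if self.all_zero:
            return 0          # nothing to do: immediate success
        if fast_zero:
            return ENOTSUP    # would need the slow fallback: fail fast
        return 0              # slow path: emulate with zeroed writes


def bulk_pre_zero(exp, chunk=1 << 26):
    """Client side: attempt a pre-zeroing pass with fast zeroes; on the
    first ENOTSUP, abandon the pass and let zeroes be written inline
    with the data pass instead."""
    offset = 0
    while offset < exp.size:
        length = min(chunk, exp.size - offset)
        if exp.write_zeroes(offset, length, fast_zero=True) == ENOTSUP:
            return False      # zeroing is no faster than writing
        offset += length
    return True
```

In this model the first request is representative of all the rest, which is the argument made above: one fast zero attempt is normally enough to learn whether the bulk pass is worthwhile.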
Eric Blake
2019-Aug-28 14:05 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 0/5] Add NBD fast zero support to qemu client and server
On 8/28/19 8:55 AM, Vladimir Sementsov-Ogievskiy wrote:

> 23.08.2019 17:37, Eric Blake wrote:
>> See the cross-post cover letter for more details:
>> https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html
>>
>> Based-on: https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg04805.html
>> [Andrey Shinkevich block: workaround for unaligned byte range in fallocate()]
>
> I assume I can look at the git://repo.or.cz/qemu/ericb.git fast-zero branch?

Yes, I posted a fast-zero branch for all four projects that I touched
(with the obvious similar URLs). They might have non-fast-forward
changes as I rebase things (for example, the nbdkit stuff needs to
s/1.13.9/1.15.0/ in docs about which version introduced things), but
should be sufficient to reproduce experiments with the feature
supported.
Eric Blake
2019-Aug-30 23:10 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 1/5] nbd: Improve per-export flag handling in server
On 8/30/19 1:00 PM, Vladimir Sementsov-Ogievskiy wrote:

> 23.08.2019 17:37, Eric Blake wrote:
>> When creating a read-only image, we are still advertising support for
>> TRIM and WRITE_ZEROES to the client, even though the client should not
>> be issuing those commands. But seeing this requires looking across
>> multiple functions:
>>
>> @@ -458,10 +458,13 @@ static int nbd_negotiate_handle_export_name(NBDClient *client,
>>          return -EINVAL;
>>      }
>>
>> -    trace_nbd_negotiate_new_style_size_flags(client->exp->size,
>> -                                             client->exp->nbdflags | myflags);
>> +    myflags = client->exp->nbdflags;
>> +    if (client->structured_reply) {
>> +        myflags |= NBD_FLAG_SEND_DF;
>> +    }
>
> why can't we do just
> client->exp->nbdflags |= NBD_FLAG_SEND_DF ?

Because myflags is the runtime flags for _this_ client, while
client->exp->nbdflags are the base flags shared by _all_ clients. If
client A requests structured reply, but client B does not, then we don't
want to advertise DF to client B; but amending client->exp->nbdflags
would have that effect.
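The distinction being made here (a per-client copy of the flags versus the shared per-export flags) can be sketched in a few lines. The flag values below match the NBD specification; the function name is illustrative, not qemu's actual code.

```python
# NBD transmission flag values, per the NBD protocol specification.
NBD_FLAG_HAS_FLAGS = 1 << 0
NBD_FLAG_SEND_DF   = 1 << 7  # only meaningful with structured replies


def effective_flags(export_flags, structured_reply):
    """Compute the flags advertised to one client: start from the
    shared per-export flags, then OR in per-client capabilities without
    mutating the shared value."""
    myflags = export_flags
    if structured_reply:
        myflags |= NBD_FLAG_SEND_DF
    return myflags
```

The 0x80 difference between the two results below is exactly NBD_FLAG_SEND_DF, which is the same 0x48f-versus-0x40f difference visible in the qemu-io traces later in this thread.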
Eric Blake
2019-Aug-30 23:32 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 1/5] nbd: Improve per-export flag handling in server
On 8/30/19 6:10 PM, Eric Blake wrote:

> On 8/30/19 1:00 PM, Vladimir Sementsov-Ogievskiy wrote:
>> 23.08.2019 17:37, Eric Blake wrote:
>>> When creating a read-only image, we are still advertising support for
>>> TRIM and WRITE_ZEROES to the client, even though the client should not
>>> be issuing those commands. But seeing this requires looking across
>>> multiple functions:
>>>
>>> @@ -458,10 +458,13 @@ static int nbd_negotiate_handle_export_name(NBDClient *client,
>>>          return -EINVAL;
>>>      }
>>>
>>> -    trace_nbd_negotiate_new_style_size_flags(client->exp->size,
>>> -                                             client->exp->nbdflags | myflags);
>>> +    myflags = client->exp->nbdflags;
>>> +    if (client->structured_reply) {
>>> +        myflags |= NBD_FLAG_SEND_DF;
>>> +    }
>>
>> why can't we do just
>> client->exp->nbdflags |= NBD_FLAG_SEND_DF ?
>
> Because myflags is the runtime flags for _this_ client, while
> client->exp->nbdflags are the base flags shared by _all_ clients. If
> client A requests structured reply, but client B does not, then we
> don't want to advertise DF to client B; but amending
> client->exp->nbdflags would have that effect.

I stand corrected - it looks like a fresh client->exp is created per
client, as evidenced by:

diff --git i/nbd/client.c w/nbd/client.c
index b9dc829175f9..9e05f1a0e2a3 100644
--- i/nbd/client.c
+++ w/nbd/client.c
@@ -1011,6 +1011,8 @@ int nbd_receive_negotiate(AioContext *aio_context, QIOChannel *ioc,

     assert(info->name);
     trace_nbd_receive_negotiate_name(info->name);
+    if (getenv ("MY_HACK"))
+        info->structured_reply = false;
     result = nbd_start_negotiate(aio_context, ioc, tlscreds, hostname,
                                  outioc, info->structured_reply, &zeroes, errp);
diff --git i/nbd/server.c w/nbd/server.c
index d5078f7468af..6f3a83704fb3 100644
--- i/nbd/server.c
+++ w/nbd/server.c
@@ -457,6 +457,7 @@ static int nbd_negotiate_handle_export_name(NBDClient *client, bool no_zeroes,
     myflags = client->exp->nbdflags;
     if (client->structured_reply) {
         myflags |= NBD_FLAG_SEND_DF;
+        client->exp->nbdflags |= NBD_FLAG_SEND_DF;
     }
     trace_nbd_negotiate_new_style_size_flags(client->exp->size, myflags);
     stq_be_p(buf, client->exp->size);

$ ./qemu-nbd -r -f raw file -t &
$ ~/qemu/qemu-io -r -f raw --trace=nbd_\*size_flags nbd://localhost:10809 -c quit
32145@1567207628.519883:nbd_receive_negotiate_size_flags Size is 1049088, export flags 0x48f
$ MY_HACK=1 ~/qemu/qemu-io -r -f raw --trace=nbd_\*size_flags nbd://localhost:10809 -c quit
32156@1567207630.417815:nbd_receive_negotiate_size_flags Size is 1049088, export flags 0x40f
$ ~/qemu/qemu-io -r -f raw --trace=nbd_\*size_flags nbd://localhost:10809 -c quit
32167@1567207635.202940:nbd_receive_negotiate_size_flags Size is 1049088, export flags 0x48f

The export flags change per client, so I _can_ store into
client->exp->nbdflags. Will do that for v2.

Meanwhile, this points out a missing feature in libnbd - for testing
purposes, it would be really nice to be able to purposefully cripple the
client to NOT request structured replies automatically (default enabled,
but with the ability to turn it off for interop testing, as in this
thread). I already recently added a --no-sr flag to nbdkit for a similar
reason (but that's creating a server which refuses to advertise, where
here I want a client that refuses to ask). Guess I'll be adding a patch
for that, too :)
Eric Blake
2019-Aug-30 23:37 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 2/5] nbd: Prepare for NBD_CMD_FLAG_FAST_ZERO
On 8/30/19 1:07 PM, Vladimir Sementsov-Ogievskiy wrote:

> 23.08.2019 17:37, Eric Blake wrote:
>> Commit fe0480d6 and friends added BDRV_REQ_NO_FALLBACK as a way to
>> avoid wasting time on a preliminary write-zero request that will later
>> be rewritten by actual data, if it is known that the write-zero
>> request will use a slow fallback; but in doing so, could not optimize
>> for NBD. The NBD specification is now considering an extension that
>> will allow passing on those semantics; this patch updates the new
>> protocol bits and 'qemu-nbd --list' output to recognize the bit, as
>> well as the new errno value possible when using the new flag; while
>> upcoming patches will improve the client to use the feature when
>> present, and the server to advertise support for it.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>>
>> +++ b/nbd/server.c
>> @@ -55,6 +55,8 @@ static int system_errno_to_nbd_errno(int err)
>>          return NBD_ENOSPC;
>>      case EOVERFLOW:
>>          return NBD_EOVERFLOW;
>> +    case ENOTSUP:
>> +        return NBD_ENOTSUP;
>
> This may provoke returning NBD_ENOTSUP in other cases, not only the new
> one we are going to add.

Correct. But the spec only said SHOULD avoid ENOTSUP in those other
cases, not MUST avoid ENOTSUP; and in practice, either the client that
is not expecting it will treat it the same as NBD_EINVAL, or we've
managed to get a slightly better error message across than normal. I
don't see that as a real show-stopper.

But if it bothers you, note that in nbdkit, I actually coded things up
to refuse to send NBD_EOVERFLOW unless NBD_CMD_FLAG_DF was set, and to
refuse to send NBD_ENOTSUP unless NBD_CMD_FLAG_FAST_ZERO was set. I
could copy that behavior here, if we want qemu to be that much stricter
as a server.
(Note to self: also check whether, when using structured replies to send
error messages, in addition to the error code, the string contains
strerror(errno) from BEFORE the mapping, rather than after we've lost
information to the more-limited NBD_E* values)
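The stricter behavior described above (modeled on what nbdkit does) can be sketched as an errno-to-wire mapping that degrades the newer error codes to NBD_EINVAL when the client never sent the flag that makes them meaningful. Constants follow the NBD specification; the function name is illustrative, not qemu's actual `system_errno_to_nbd_errno`.

```python
import errno

# NBD error values and command flags, per the NBD protocol specification.
NBD_EPERM, NBD_EIO, NBD_EINVAL = 1, 5, 22
NBD_ENOSPC, NBD_EOVERFLOW, NBD_ENOTSUP = 28, 75, 95
NBD_CMD_FLAG_DF = 1 << 2
NBD_CMD_FLAG_FAST_ZERO = 1 << 4


def nbd_errno(err, cmd_flags):
    """Map a system errno to an NBD error, refusing to surface
    flag-specific errors unless the client opted in via the flag."""
    mapping = {
        errno.EPERM: NBD_EPERM,
        errno.EIO: NBD_EIO,
        errno.ENOSPC: NBD_ENOSPC,
        errno.EOVERFLOW: NBD_EOVERFLOW,
        errno.ENOTSUP: NBD_ENOTSUP,
        errno.EINVAL: NBD_EINVAL,
    }
    code = mapping.get(err, NBD_EINVAL)
    # EOVERFLOW is only defined for NBD_CMD_FLAG_DF, and ENOTSUP only
    # for NBD_CMD_FLAG_FAST_ZERO; degrade both otherwise.
    if code == NBD_EOVERFLOW and not (cmd_flags & NBD_CMD_FLAG_DF):
        code = NBD_EINVAL
    if code == NBD_ENOTSUP and not (cmd_flags & NBD_CMD_FLAG_FAST_ZERO):
        code = NBD_EINVAL
    return code
```

Gating on the request's flags keeps the server within the spec's SHOULD-NOT advice without losing the better error code for clients that actually asked for it.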
Eric Blake
2019-Sep-03 20:53 UTC
Re: [Libguestfs] [PATCH 0/1] NBD protocol change to add fast zero support
On 8/23/19 9:34 AM, Eric Blake wrote:

> See the cross-post cover letter for details:
> https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html
>
> Eric Blake (1):
>   protocol: Add NBD_CMD_FLAG_FAST_ZERO
>
>  doc/proto.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 49 insertions(+), 1 deletion(-)

I think we've had enough review time with no major objections to this.
Therefore, I've gone ahead and incorporated the wording changes that
were mentioned in discussion on this patch, as well as Rich's URI
specification, into the NBD project. We can still amend the
specifications if any problems turn up.