Eric Blake
2019-Aug-23 14:30 UTC
[Libguestfs] cross-project patches: Add NBD Fast Zero support
This is a cover letter to a series of patches being proposed in tandem to four different projects: - nbd: Document a new NBD_CMD_FLAG_FAST_ZERO command flag - qemu: Implement the flag for both clients and server - libnbd: Implement the flag for clients - nbdkit: Implement the flag for servers, including the nbd passthrough client If you want to test the patches together, I've pushed a 'fast-zero' branch to each of: https://repo.or.cz/nbd/ericb.git/shortlog/refs/heads/fast-zero https://repo.or.cz/qemu/ericb.git/shortlog/refs/heads/fast-zero https://repo.or.cz/libnbd/ericb.git/shortlog/refs/heads/fast-zero https://repo.or.cz/nbdkit/ericb.git/shortlog/refs/heads/fast-zero I've run several tests to demonstrate why this is useful, as well as prove that because I have multiple interoperable projects, it is worth including in the NBD standard. The original proposal was here: https://lists.debian.org/nbd/2019/03/msg00004.html where I stated:> I will not push this without both: > - a positive review (for example, we may decide that burning another > NBD_FLAG_* is undesirable, and that we should instead have some sort > of NBD_OPT_ handshake for determining when the server supports > NBD_CMF_FLAG_FAST_ZERO) > - a reference client and server implementation (probably both via qemu, > since it was qemu that raised the problem in the first place)Consensus on that thread seemed to be that a new NBD_FLAG was okay; and this thread solves the second bullet of having reference implementations. Here's what I did for testing full-path interoperability: nbdkit memory -> qemu-nbd -> nbdkit nbd -> nbdsh $ nbdkit -p 10810 --filter=nozero --filter=delay memory 1m delay-write=3 zeromode=emulate $ qemu-nbd -p 10811 -f raw nbd://localhost:10810 $ nbdkit -p 10812 nbd nbd://localhost:10811 $ time nbdsh --connect nbd://localhost:10812 -c 'buf = h.zero(512, 0)' # takes more than 3 seconds, but succeeds $ time nbdsh --connect nbd://localhost:10812 -c 'buf = h.zero(512, 0, nbd.CMD_FLAG_FAST_ZERO)' # takes less than 1 second to fail with ENOTSUP And here's some demonstrations on why the feature matters, starting with this qemu thread as justification: https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html First, I had to create a scenario where falling back to writes is noticeably slower than performing a zero operation, and where pre-zeroing also shows an effect. My choice: let's test 'qemu-img convert' on an image that is half-sparse (every other megabyte is a hole) to an in-memory nbd destination. Then I use a series of nbdkit filters to force the destination to behave in various manners: log logfile=>(sed ...|uniq -c) (track how many normal/fast zero requests the client makes) nozero $params (fine-tune how zero requests behave - the parameters zeromode and fastzeromode are the real drivers of my various tests) blocksize maxdata=256k (allows large zero requests, but forces large writes into smaller chunks, to magnify the effects of write delays and allow testing to provide obvious results with a smaller image) delay delay-write=20ms delay-zero=5ms (also to magnify the effects on a smaller image, with writes penalized more than zeroing) stats statsfile=/dev/stderr (to track overall time and a decent summary of how much I/O occurred). noextents (forces the entire image to report that it is allocated, which eliminates any testing variability based on whether qemu-img uses that to bypass a zeroing operation [1]) So here's my one-time setup, followed by repetitions of the nbdkit command with different parameters to the nozero filter to explore different behaviors. $ qemu-img create -f qcow2 src 100m $ for i in `seq 0 2 99`; do qemu-io -f qcow2 -c "w ${i}m 1m" src; done $ nbdkit -U - --filter=log --filter=nozero --filter=blocksize \ --filter=delay --filter=stats --filter=noextents memory 100m \ logfile=>(sed -n '/Zero.*\.\./ s/.*\(fast=.\).*/\1/p' |sort|uniq -c) \ statsfile=/dev/stderr delay-write=20ms delay-zero=5s maxdata=256k \ --run 'qemu-img convert -n -f qcow2 -O raw src $nbd' $params Establish a baseline: when qemu-img does not see write zero support at all (such as when talking to /dev/nbd0, because the kernel NBD implementation still does not support write zeroes), qemu is forced to write the entire disk, including the holes, but doesn't waste any time pre-zeroing or checking block status for whether the disk is zero (the default of the nozero filter is to turn off write zero advertisement): paramselapsed time: 8.54488 s write: 400 ops, 104857600 bytes, 9.81712e+07 bits/s Next, let's emulate what qemu 3.1 was like, with a blind pre-zeroing pass of the entire image without regards to whether that pass is fast or slow. For this test, it was easier to use modern qemu and merely ignore the fast zero bit in nbdkit, but the numbers should be similar when actually using older qemu. If qemu guessed right that pre-zeroing is fast, we see: params='zeromode=plugin fastzeromode=ignore' elapsed time: 4.30183 s write: 200 ops, 52428800 bytes, 9.75005e+07 bits/s zero: 4 ops, 104857600 bytes, 1.95001e+08 bits/s 4 fast=1 which is definite win - instead of having to write the half of the image that was zero on the source, the fast pre-zeroing pass already cleared it (qemu-img currently breaks write zeroes into 32M chunks [1], and thus requires 4 zero requests to pre-zero the image). But if qemu guesses wrong: params='zeromode=emulate fastzeromode=ignore' elapsed time: 12.5065 s write: 600 ops, 157286400 bytes, 1.00611e+08 bits/s 4 fast=1 Ouch - that is actually slower than the case when zeroing is not used at all, because the zeroes turned into writes result in performing double the I/O over the data portions of the file (once during the pre-zero pass, then again during the data). The qemu 3.1 behavior is very bi-polar in nature, and we don't like that. So qemu 4.0 introduced BDRV_REQ_NO_FALLBACK, which qemu uses during the pre-zero request to fail quickly if pre-zeroing is not viable. At the time, NBD did not have a way to support fast zero requests, so qemu blindly assumes that pre-zeroing is not viable over NBD: params='zeromode=emulate fastzeromode=none' elapsed time: 8.32433 s write: 400 ops, 104857600 bytes, 1.00772e+08 bits/s 50 fast=0 When zeroing is slow, our time actually beats the baseline by about 0.2 seconds (although zeroing still turned into writes, the use of zero requests results in less network traffic; you also see that there are 50 zero requests, one per hole, rather than 4 requests for pre-zeroing the image). So we've avoided the pre-zeroing penalty. However: params='zeromode=plugin fastzeromode=none' elapsed time: 4.53951 s write: 200 ops, 52428800 bytes, 9.23955e+07 bits/s zero: 50 ops, 52428800 bytes, 9.23955e+07 bits/s 50 fast=0 when zeroing is fast, we're still 0.2 seconds slower than the pre-zeroing behavior (zeroing runs fast, but one request per hole is still more transactions than pre-zeroing used to use). The qemu 4.0 decision thus regained the worst degradation seen in 3.1 when zeroing is slow, but at a penalty to the case when zeroing is fast. Since guessing is never as nice as knowing, let's repeat the test, but now exploiting the new NBD fast zero: params='zeromode=emulate' elapsed time: 8.41174 s write: 400 ops, 104857600 bytes, 9.9725e+07 bits/s 50 fast=0 1 fast=1 Good: when zeroes are not fast, qemu-img's initial fast-zero request immediately fails, and then it switches back to writing the entire image using regular zeroing for the holes; performance is comparable to the baseline and to the qemu 4.0 behavior. params='zeromode=plugin' elapsed time: 4.31356 s write: 200 ops, 52428800 bytes, 9.72354e+07 bits/s zero: 4 ops, 104857600 bytes, 1.94471e+08 bits/s 4 fast=1 Good: when zeroes are fast, qemu-img is able to use pre-zeroing on the entire image, resulting in fewer zero transactions overall, getting us back to the qemu 3.1 maximum performance (and better than the 4.0 behavior). I hope you enjoyed reading this far, and agree with my interpretation of the numbers about why this feature is useful! [1] Orthogonal to these patches are other ideas I have for improving the NBD protocol in its effects to qemu-img convert, which will result in later cross-project patches: - NBD should have a way to advertise (probably via NBD_INFO_ during NBD_OPT_GO) if the initial image is known to begin life with all zeroes (if that is the case, qemu-img can skip the extents calls and pre-zeroing pass altogether) - improving support to allow NBD to pass larger zero requests (qemu is currently capping zero requests at 32m based on NBD_INFO_BLOCK_SIZE, but could easily go up to ~4G with proper info advertisement of maximum zero request sizing, or if we introduce 64-bit commands to the NBD protocol) Given that NBD extensions need not be present in every server, each orthogonal improvement should be tested in isolation to show that it helps, even though qemu-img will probably use all of the extensions at once when the server supports all of them. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-Aug-23 14:34 UTC
[Libguestfs] [PATCH 0/1] NBD protocol change to add fast zero support
See the cross-post cover letter for details: https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html Eric Blake (1): protocol: Add NBD_CMD_FLAG_FAST_ZERO doc/proto.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 49 insertions(+), 1 deletion(-) -- 2.21.0
Eric Blake
2019-Aug-23 14:34 UTC
[Libguestfs] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO
While it may be counterintuitive at first, the introduction of NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance regression in qemu [1], when copying a sparse file. When the destination file must contain the same contents as the source, but it is not known in advance whether the destination started life with all zero content, then there are cases where it is faster to request a bulk zero of the entire device followed by writing only the portions of the device that are to contain data, as that results in fewer I/O transactions overall. In fact, there are even situations where trimming the entire device prior to writing zeroes may be faster than bare write zero request [2]. However, if a bulk zero request ever falls back to the same speed as a normal write, a bulk pre-zeroing algorithm is actually a pessimization, as it ends up writing portions of the disk twice. [1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html [2] https://github.com/libguestfs/nbdkit/commit/407f8dde Hence, it is desirable to have a way for clients to specify that a particular write zero request is being attempted for a fast wipe, and get an immediate failure if the zero request would otherwise take the same time as a write. Conversely, if the client is not performing a pre-initialization pass, it is still more efficient in terms of networking traffic to send NBD_CMD_WRITE_ZERO requests where the server implements the fallback to the slower write, than it is for the client to have to perform the fallback to send NBD_CMD_WRITE with a zeroed buffer. Add a protocol flag and corresponding transmission advertisement flag to make it easier for clients to inform the server of their intent. If the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two things: to perform a fallback to write when the client does not request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the lower network overhead); and to fail quickly with ENOTSUP, preferably without modifying the export, if the client requested the flag but the server cannot write zeroes more efficiently than a normal write (so that the client is not penalized with the time of writing data areas of the disk twice). Note that the semantics are chosen so that servers should advertise the new flag whether or not they have fast zeroing (that is, this is NOT the server advertising that it has fast zeroes, but rather advertising that the client can get fast feedback as needed on whether zeroing is fast). It is also intentional that the new advertisement includes a new errno value, ENOTSUP, with rules that this error should not be returned for any pre-existing behaviors, must not happen when the client does not request a fast zero, and must be returned quickly if the client requested fast zero but anything other than the error would not be fast; while leaving it possible for clients to distinguish other errors like EINVAL if alignment constraints are not met. Clients should not send the flag unless the server advertised support, but well-behaved servers should already be reporting EINVAL to unrecognized flags. If the server does not advertise the new feature, clients can safely fall back to assuming that writing zeroes is no faster than normal writes (whether or not the assumption actually holds). Note that the Linux fallocate(2) interface may or may not be powerful enough to easily determine if zeroing will be efficient - in particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that insight; likewise, for block devices, it is known that ioctl(BLKZEROOUT) does NOT have a way for userspace to probe if it is efficient or slow. But with enough demand, the kernel may add another FALLOC_FL_ flag to use with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed ENOTSUP failures if a fast path cannot be taken. If a server cannot easily determine if write zeroes will be efficient, the server should either fail all NBD_CMD_FLAG_FAST_ZERO with ENOTSUP, or else choose to not advertise NBD_FLAG_SEND_FAST_ZERO. Signed-off-by: Eric Blake <eblake@redhat.com> --- doc/proto.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 49 insertions(+), 1 deletion(-) diff --git a/doc/proto.md b/doc/proto.md index 52d3e7b..702688b 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -1070,6 +1070,18 @@ The field has the following format: which support the command without advertising this bit, and conversely that this bit does not guarantee that the command will succeed or have an impact. +- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether + `NBD_CMD_WRITE_ZEROES` is faster than a corresponding write. The + server MUST set this transmission flag to 1 if the + `NBD_CMD_WRITE_ZEROES` request supports the `NBD_CMD_FLAG_FAST_ZERO` + flag, and MUST set this transmission flag to 0 if + `NBD_FLAG_SEND_WRITE_ZEROES` is not set. Servers MAY set this this + transmission flag even if it will always use `NBD_ENOTSUP` failures for + requests with `NBD_CMD_FLAG_FAST_ZERO` set (such as if the server + cannot quickly determine whether a particular write zeroes request + will be faster than a regular write). Clients MUST NOT set the + `NBD_CMD_FLAG_FAST_ZERO` request flag unless this transmission flag + is set. Clients SHOULD ignore unknown flags. @@ -1647,6 +1659,12 @@ valid may depend on negotiation during the handshake phase. MUST NOT send metadata on more than one extent in the reply. Client implementors should note that using this flag on multiple contiguous requests is likely to be inefficient. +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during + `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the + write zeroes any faster than it would for an equivalent + `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of + `NBD_ENOTSUP`. The client MUST NOT set this unless the server advertised + `NBD_FLAG_SEND_FAST_ZERO`. ##### Structured reply flags @@ -2015,7 +2033,10 @@ The following request types exist: reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use. A client MUST NOT send a write zeroes request unless - `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field. + `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags + field. Additionally, a client MUST NOT send the + `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was + set in the transimssion flags field. By default, the server MAY use trimming to zero out the area, even if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure @@ -2025,6 +2046,28 @@ The following request types exist: same area will not cause fragmentation or cause failure due to insufficient space. + If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but + `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail + with `NBD_ENOTSUP`, even if the operation is no faster than a + corresponding `NBD_CMD_WRITE`. Conversely, if + `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with + `NBD_ENOTSUP` unless the request can be serviced in less time than + a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents + of the export when returning this failure. The server's + determination of a fast request MAY depend on a number of factors, + such as whether the request was suitably aligned, on whether the + `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a + previous `NBD_CMD_TRIM` had been performed on the region. If the + server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD + NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing + a request, and SHOULD fail with `NBD_EINVAL` if the + `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise + `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast + zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when + the flag is set if the server cannot quickly determine in advance + whether that request would have been fast, even if it turns out + that the same request without the flag would be fast after all. + If an error occurs, the server MUST set the appropriate error code in the error field. @@ -2125,6 +2168,7 @@ The following error values are defined: * `NBD_EINVAL` (22), Invalid argument. * `NBD_ENOSPC` (28), No space left on device. * `NBD_EOVERFLOW` (75), Value too large. +* `NBD_ENOTSUP` (95), Operation not supported. * `NBD_ESHUTDOWN` (108), Server is in the process of being shut down. The server SHOULD return `NBD_ENOSPC` if it receives a write request @@ -2139,6 +2183,10 @@ read-only export. The server SHOULD NOT return `NBD_EOVERFLOW` except as documented in response to `NBD_CMD_READ` when `NBD_CMD_FLAG_DF` is supported. +The server SHOULD NOT return `NBD_ENOTSUP` except as documented in +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is +supported. + The server SHOULD return `NBD_EINVAL` if it receives an unknown command. The server SHOULD return `NBD_EINVAL` if it receives an unknown -- 2.21.0
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 0/5] Add NBD fast zero support to qemu client and server
See the cross-post cover letter for more details: https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html Based-on: https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg04805.html [Andrey Shinkevich block: workaround for unaligned byte range in fallocate()] Eric Blake (5): nbd: Improve per-export flag handling in server nbd: Prepare for NBD_CMD_FLAG_FAST_ZERO nbd: Implement client use of NBD FAST_ZERO nbd: Implement server use of NBD FAST_ZERO nbd: Tolerate more errors to structured reply request docs/interop/nbd.txt | 3 +- include/block/nbd.h | 6 +++- block/nbd.c | 7 +++++ blockdev-nbd.c | 3 +- nbd/client.c | 39 ++++++++++++++---------- nbd/common.c | 5 ++++ nbd/server.c | 70 ++++++++++++++++++++++++++------------------ qemu-nbd.c | 7 +++-- 8 files changed, 88 insertions(+), 52 deletions(-) -- 2.21.0
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 1/5] nbd: Improve per-export flag handling in server
When creating a read-only image, we are still advertising support for TRIM and WRITE_ZEROES to the client, even though the client should not be issuing those commands. But seeing this requires looking across multiple functions: All callers to nbd_export_new() passed a single flag based solely on whether the export allows writes. Later, we then pass a constant set of flags to nbd_negotiate_options() (namely, the set of flags which we always support, at least for writable images), which is then further dynamically modified based on client requests for structured options. Finally, when processing NBD_OPT_EXPORT_NAME or NBD_OPT_EXPORT_GO we bitwise-or the original caller's flag with the runtime set of flags we've built up over several functions. Let's refactor things to instead compute a baseline of flags as soon as possible, in nbd_export_new(), and changing the signature for the callers to pass in a simpler bool rather than having to figure out flags. We can then get rid of the 'myflags' parameter to various functions, and instead refer to client for everything we need (we still have to perform a bitwise-OR during NBD_OPT_EXPORT_NAME and NBD_OPT_EXPORT_GO, but it's easier to see what is being computed). This lets us quit advertising senseless flags for read-only images, as well as making the next patch for exposing FAST_ZERO support easier to write. Signed-off-by: Eric Blake <eblake@redhat.com> --- include/block/nbd.h | 2 +- blockdev-nbd.c | 3 +-- nbd/server.c | 62 +++++++++++++++++++++++++-------------------- qemu-nbd.c | 6 ++--- 4 files changed, 39 insertions(+), 34 deletions(-) diff --git a/include/block/nbd.h b/include/block/nbd.h index 991fd52a5134..2c87b42dfd48 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -326,7 +326,7 @@ typedef struct NBDClient NBDClient; NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, uint64_t size, const char *name, const char *desc, - const char *bitmap, uint16_t nbdflags, bool shared, + const char *bitmap, bool readonly, bool shared, void (*close)(NBDExport *), bool writethrough, BlockBackend *on_eject_blk, Error **errp); void nbd_export_close(NBDExport *exp); diff --git a/blockdev-nbd.c b/blockdev-nbd.c index ecfa2ef0adb5..1cdffab4fce1 100644 --- a/blockdev-nbd.c +++ b/blockdev-nbd.c @@ -187,8 +187,7 @@ void qmp_nbd_server_add(const char *device, bool has_name, const char *name, writable = false; } - exp = nbd_export_new(bs, 0, len, name, NULL, bitmap, - writable ? 0 : NBD_FLAG_READ_ONLY, !writable, + exp = nbd_export_new(bs, 0, len, name, NULL, bitmap, !writable, !writable, NULL, false, on_eject_blk, errp); if (!exp) { return; diff --git a/nbd/server.c b/nbd/server.c index 0fb41c6c50ea..b5577828aa44 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -423,14 +423,14 @@ static void nbd_check_meta_export(NBDClient *client) /* Send a reply to NBD_OPT_EXPORT_NAME. * Return -errno on error, 0 on success. */ -static int nbd_negotiate_handle_export_name(NBDClient *client, - uint16_t myflags, bool no_zeroes, +static int nbd_negotiate_handle_export_name(NBDClient *client, bool no_zeroes, Error **errp) { char name[NBD_MAX_NAME_SIZE + 1]; char buf[NBD_REPLY_EXPORT_NAME_SIZE] = ""; size_t len; int ret; + uint16_t myflags; /* Client sends: [20 .. xx] export name (length bytes) @@ -458,10 +458,13 @@ static int nbd_negotiate_handle_export_name(NBDClient *client, return -EINVAL; } - trace_nbd_negotiate_new_style_size_flags(client->exp->size, - client->exp->nbdflags | myflags); + myflags = client->exp->nbdflags; + if (client->structured_reply) { + myflags |= NBD_FLAG_SEND_DF; + } + trace_nbd_negotiate_new_style_size_flags(client->exp->size, myflags); stq_be_p(buf, client->exp->size); - stw_be_p(buf + 8, client->exp->nbdflags | myflags); + stw_be_p(buf + 8, myflags); len = no_zeroes ? 10 : sizeof(buf); ret = nbd_write(client->ioc, buf, len, errp); if (ret < 0) { @@ -526,8 +529,7 @@ static int nbd_reject_length(NBDClient *client, bool fatal, Error **errp) /* Handle NBD_OPT_INFO and NBD_OPT_GO. * Return -errno on error, 0 if ready for next option, and 1 to move * into transmission phase. */ -static int nbd_negotiate_handle_info(NBDClient *client, uint16_t myflags, - Error **errp) +static int nbd_negotiate_handle_info(NBDClient *client, Error **errp) { int rc; char name[NBD_MAX_NAME_SIZE + 1]; @@ -540,6 +542,7 @@ static int nbd_negotiate_handle_info(NBDClient *client, uint16_t myflags, uint32_t sizes[3]; char buf[sizeof(uint64_t) + sizeof(uint16_t)]; uint32_t check_align = 0; + uint16_t myflags; /* Client sends: 4 bytes: L, name length (can be 0) @@ -637,10 +640,13 @@ static int nbd_negotiate_handle_info(NBDClient *client, uint16_t myflags, } /* Send NBD_INFO_EXPORT always */ - trace_nbd_negotiate_new_style_size_flags(exp->size, - exp->nbdflags | myflags); + myflags = exp->nbdflags; + if (client->structured_reply) { + myflags |= NBD_FLAG_SEND_DF; + } + trace_nbd_negotiate_new_style_size_flags(exp->size, myflags); stq_be_p(buf, exp->size); - stw_be_p(buf + 8, exp->nbdflags | myflags); + stw_be_p(buf + 8, myflags); rc = nbd_negotiate_send_info(client, NBD_INFO_EXPORT, sizeof(buf), buf, errp); if (rc < 0) { @@ -1037,8 +1043,7 @@ static int nbd_negotiate_meta_queries(NBDClient *client, * 1 if client sent NBD_OPT_ABORT, i.e. on valid disconnect, * errp is not set */ -static int nbd_negotiate_options(NBDClient *client, uint16_t myflags, - Error **errp) +static int nbd_negotiate_options(NBDClient *client, Error **errp) { uint32_t flags; bool fixedNewstyle = false; @@ -1172,13 +1177,12 @@ static int nbd_negotiate_options(NBDClient *client, uint16_t myflags, return 1; case NBD_OPT_EXPORT_NAME: - return nbd_negotiate_handle_export_name(client, - myflags, no_zeroes, + return nbd_negotiate_handle_export_name(client, no_zeroes, errp); case NBD_OPT_INFO: case NBD_OPT_GO: - ret = nbd_negotiate_handle_info(client, myflags, errp); + ret = nbd_negotiate_handle_info(client, errp); if (ret == 1) { assert(option == NBD_OPT_GO); return 0; @@ -1209,7 +1213,6 @@ static int nbd_negotiate_options(NBDClient *client, uint16_t myflags, } else { ret = nbd_negotiate_send_rep(client, NBD_REP_ACK, errp); client->structured_reply = true; - myflags |= NBD_FLAG_SEND_DF; } break; @@ -1232,8 +1235,7 @@ static int nbd_negotiate_options(NBDClient *client, uint16_t myflags, */ switch (option) { case NBD_OPT_EXPORT_NAME: - return nbd_negotiate_handle_export_name(client, - myflags, no_zeroes, + return nbd_negotiate_handle_export_name(client, no_zeroes, errp); default: @@ -1259,9 +1261,6 @@ static coroutine_fn int nbd_negotiate(NBDClient *client, Error **errp) { char buf[NBD_OLDSTYLE_NEGOTIATE_SIZE] = ""; int ret; - const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM | - NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA | - NBD_FLAG_SEND_WRITE_ZEROES | NBD_FLAG_SEND_CACHE); /* Old style negotiation header, no room for options [ 0 .. 7] passwd ("NBDMAGIC") @@ -1289,7 +1288,7 @@ static coroutine_fn int nbd_negotiate(NBDClient *client, Error **errp) error_prepend(errp, "write failed: "); return -EINVAL; } - ret = nbd_negotiate_options(client, myflags, errp); + ret = nbd_negotiate_options(client, errp); if (ret != 0) { if (ret < 0) { error_prepend(errp, "option negotiation failed: "); @@ -1461,7 +1460,7 @@ static void nbd_eject_notifier(Notifier *n, void *data) NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, uint64_t size, const char *name, const char *desc, - const char *bitmap, uint16_t nbdflags, bool shared, + const char *bitmap, bool readonly, bool shared, void (*close)(NBDExport *), bool writethrough, BlockBackend *on_eject_blk, Error **errp) { @@ -1485,10 +1484,8 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, /* Don't allow resize while the NBD server is running, otherwise we don't * care what happens with the node. */ perm = BLK_PERM_CONSISTENT_READ; - if ((nbdflags & NBD_FLAG_READ_ONLY) == 0) { + if (!readonly) { perm |= BLK_PERM_WRITE; - } else if (shared) { - nbdflags |= NBD_FLAG_CAN_MULTI_CONN; } blk = blk_new(bdrv_get_aio_context(bs), perm, BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED | @@ -1507,7 +1504,16 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, exp->dev_offset = dev_offset; exp->name = g_strdup(name); exp->description = g_strdup(desc); - exp->nbdflags = nbdflags; + exp->nbdflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_FLUSH | + NBD_FLAG_SEND_FUA | NBD_FLAG_SEND_CACHE); + if (readonly) { + exp->nbdflags |= NBD_FLAG_READ_ONLY; + if (shared) { + exp->nbdflags |= NBD_FLAG_CAN_MULTI_CONN; + } + } else { + exp->nbdflags |= NBD_FLAG_SEND_TRIM | NBD_FLAG_SEND_WRITE_ZEROES; + } assert(size <= INT64_MAX - dev_offset); exp->size = QEMU_ALIGN_DOWN(size, BDRV_SECTOR_SIZE); @@ -1532,7 +1538,7 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, goto fail; } - if ((nbdflags & NBD_FLAG_READ_ONLY) && bdrv_is_writable(bs) && + if (readonly && bdrv_is_writable(bs) && bdrv_dirty_bitmap_enabled(bm)) { error_setg(errp, "Enabled bitmap '%s' incompatible with readonly export", diff --git a/qemu-nbd.c b/qemu-nbd.c index 55f5ceaf5c92..079702bb837f 100644 --- a/qemu-nbd.c +++ b/qemu-nbd.c @@ -600,7 +600,7 @@ int main(int argc, char **argv) BlockBackend *blk; BlockDriverState *bs; uint64_t dev_offset = 0; - uint16_t nbdflags = 0; + bool readonly = false; bool disconnect = false; const char *bindto = NULL; const char *port = NULL; @@ -782,7 +782,7 @@ int main(int argc, char **argv) } /* fall through */ case 'r': - nbdflags |= NBD_FLAG_READ_ONLY; + readonly = true; flags &= ~BDRV_O_RDWR; break; case 'P': @@ -1173,7 +1173,7 @@ int main(int argc, char **argv) } export = nbd_export_new(bs, dev_offset, fd_size, export_name, - export_description, bitmap, nbdflags, shared > 1, + export_description, bitmap, readonly, shared > 1, nbd_export_closed, writethrough, NULL, &error_fatal); -- 2.21.0
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 2/5] nbd: Prepare for NBD_CMD_FLAG_FAST_ZERO
Commit fe0480d6 and friends added BDRV_REQ_NO_FALLBACK as a way to avoid wasting time on a preliminary write-zero request that will later be rewritten by actual data, if it is known that the write-zero request will use a slow fallback; but in doing so, could not optimize for NBD. The NBD specification is now considering an extension that will allow passing on those semantics; this patch updates the new protocol bits and 'qemu-nbd --list' output to recognize the bit, as well as the new errno value possible when using the new flag; while upcoming patches will improve the client to use the feature when present, and the server to advertise support for it. Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/interop/nbd.txt | 3 ++- include/block/nbd.h | 4 ++++ nbd/common.c | 5 +++++ nbd/server.c | 2 ++ qemu-nbd.c | 1 + 5 files changed, 14 insertions(+), 1 deletion(-) diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt index 6dfec7f47647..45118809618e 100644 --- a/docs/interop/nbd.txt +++ b/docs/interop/nbd.txt @@ -53,4 +53,5 @@ the operation of that feature. * 2.12: NBD_CMD_BLOCK_STATUS for "base:allocation" * 3.0: NBD_OPT_STARTTLS with TLS Pre-Shared Keys (PSK), NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE -* 4.2: NBD_FLAG_CAN_MULTI_CONN for sharable read-only exports +* 4.2: NBD_FLAG_CAN_MULTI_CONN for sharable read-only exports, +NBD_CMD_FLAG_FAST_ZERO diff --git a/include/block/nbd.h b/include/block/nbd.h index 2c87b42dfd48..21550747cf35 100644 --- a/include/block/nbd.h +++ b/include/block/nbd.h @@ -140,6 +140,7 @@ enum { NBD_FLAG_CAN_MULTI_CONN_BIT = 8, /* Multi-client cache consistent */ NBD_FLAG_SEND_RESIZE_BIT = 9, /* Send resize */ NBD_FLAG_SEND_CACHE_BIT = 10, /* Send CACHE (prefetch) */ + NBD_FLAG_SEND_FAST_ZERO_BIT = 11, /* FAST_ZERO flag for WRITE_ZEROES */ }; #define NBD_FLAG_HAS_FLAGS (1 << NBD_FLAG_HAS_FLAGS_BIT) @@ -153,6 +154,7 @@ enum { #define NBD_FLAG_CAN_MULTI_CONN (1 << NBD_FLAG_CAN_MULTI_CONN_BIT) #define NBD_FLAG_SEND_RESIZE (1 << NBD_FLAG_SEND_RESIZE_BIT) #define NBD_FLAG_SEND_CACHE (1 << NBD_FLAG_SEND_CACHE_BIT) +#define NBD_FLAG_SEND_FAST_ZERO (1 << NBD_FLAG_SEND_FAST_ZERO_BIT) /* New-style handshake (global) flags, sent from server to client, and control what will happen during handshake phase. */ @@ -205,6 +207,7 @@ enum { #define NBD_CMD_FLAG_DF (1 << 2) /* don't fragment structured read */ #define NBD_CMD_FLAG_REQ_ONE (1 << 3) /* only one extent in BLOCK_STATUS * reply chunk */ +#define NBD_CMD_FLAG_FAST_ZERO (1 << 4) /* fail if WRITE_ZEROES is not fast */ /* Supported request types */ enum { @@ -270,6 +273,7 @@ static inline bool nbd_reply_type_is_error(int type) #define NBD_EINVAL 22 #define NBD_ENOSPC 28 #define NBD_EOVERFLOW 75 +#define NBD_ENOTSUP 95 #define NBD_ESHUTDOWN 108 /* Details collected by NBD_OPT_EXPORT_NAME and NBD_OPT_GO */ diff --git a/nbd/common.c b/nbd/common.c index cc8b278e541d..ddfe7d118371 100644 --- a/nbd/common.c +++ b/nbd/common.c @@ -201,6 +201,8 @@ const char *nbd_err_lookup(int err) return "ENOSPC"; case NBD_EOVERFLOW: return "EOVERFLOW"; + case NBD_ENOTSUP: + return "ENOTSUP"; case NBD_ESHUTDOWN: return "ESHUTDOWN"; default: @@ -231,6 +233,9 @@ int nbd_errno_to_system_errno(int err) case NBD_EOVERFLOW: ret = EOVERFLOW; break; + case NBD_ENOTSUP: + ret = ENOTSUP; + break; case NBD_ESHUTDOWN: ret = ESHUTDOWN; break; diff --git a/nbd/server.c b/nbd/server.c index b5577828aa44..981bc3cb1151 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -55,6 +55,8 @@ static int system_errno_to_nbd_errno(int err) return NBD_ENOSPC; case EOVERFLOW: return NBD_EOVERFLOW; + case ENOTSUP: + return NBD_ENOTSUP; case ESHUTDOWN: return NBD_ESHUTDOWN; case EINVAL: diff --git a/qemu-nbd.c b/qemu-nbd.c index 079702bb837f..dce52f564b5a 100644 --- a/qemu-nbd.c +++ b/qemu-nbd.c @@ -294,6 +294,7 @@ static int qemu_nbd_client_list(SocketAddress *saddr, QCryptoTLSCreds *tls, [NBD_FLAG_CAN_MULTI_CONN_BIT] = "multi", [NBD_FLAG_SEND_RESIZE_BIT] = "resize", [NBD_FLAG_SEND_CACHE_BIT] = "cache", + [NBD_FLAG_SEND_FAST_ZERO_BIT] = "fast-zero", }; printf(" size: %" PRIu64 "\n", list[i].size); -- 2.21.0
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 3/5] nbd: Implement client use of NBD FAST_ZERO
The client side is fairly straightforward: if the server advertised fast zero support, then we can map that to BDRV_REQ_NO_FALLBACK support. A server that advertises FAST_ZERO but not WRITE_ZEROES is technically broken, but we can ignore that situation as it does not change our behavior. Signed-off-by: Eric Blake <eblake@redhat.com> --- Perhaps this is worth merging with the previous patch. --- block/nbd.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/block/nbd.c b/block/nbd.c index beed46fb3414..8339d7106366 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -1044,6 +1044,10 @@ static int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset, if (!(flags & BDRV_REQ_MAY_UNMAP)) { request.flags |= NBD_CMD_FLAG_NO_HOLE; } + if (flags & BDRV_REQ_NO_FALLBACK) { + assert(s->info.flags & NBD_FLAG_SEND_FAST_ZERO); + request.flags |= NBD_CMD_FLAG_FAST_ZERO; + } if (!bytes) { return 0; @@ -1239,6 +1243,9 @@ static int nbd_client_connect(BlockDriverState *bs, Error **errp) } if (s->info.flags & NBD_FLAG_SEND_WRITE_ZEROES) { bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP; + if (s->info.flags & NBD_FLAG_SEND_FAST_ZERO) { + bs->supported_zero_flags |= BDRV_REQ_NO_FALLBACK; + } } s->sioc = sioc; -- 2.21.0
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 4/5] nbd: Implement server use of NBD FAST_ZERO
The server side is fairly straightforward: we can always advertise support for detection of fast zero, and implement it by mapping the request to the block layer BDRV_REQ_NO_FALLBACK. Signed-off-by: Eric Blake <eblake@redhat.com> --- Again, this may be worth merging with the previous patch. --- nbd/server.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/nbd/server.c b/nbd/server.c index 981bc3cb1151..3c0799eda87f 100644 --- a/nbd/server.c +++ b/nbd/server.c @@ -1514,7 +1514,8 @@ NBDExport *nbd_export_new(BlockDriverState *bs, uint64_t dev_offset, exp->nbdflags |= NBD_FLAG_CAN_MULTI_CONN; } } else { - exp->nbdflags |= NBD_FLAG_SEND_TRIM | NBD_FLAG_SEND_WRITE_ZEROES; + exp->nbdflags |= (NBD_FLAG_SEND_TRIM | NBD_FLAG_SEND_WRITE_ZEROES | + NBD_FLAG_SEND_FAST_ZERO); } assert(size <= INT64_MAX - dev_offset); exp->size = QEMU_ALIGN_DOWN(size, BDRV_SECTOR_SIZE); @@ -2167,7 +2168,7 @@ static int nbd_co_receive_request(NBDRequestData *req, NBDRequest *request, if (request->type == NBD_CMD_READ && client->structured_reply) { valid_flags |= NBD_CMD_FLAG_DF; } else if (request->type == NBD_CMD_WRITE_ZEROES) { - valid_flags |= NBD_CMD_FLAG_NO_HOLE; + valid_flags |= NBD_CMD_FLAG_NO_HOLE | NBD_CMD_FLAG_FAST_ZERO; } else if (request->type == NBD_CMD_BLOCK_STATUS) { valid_flags |= NBD_CMD_FLAG_REQ_ONE; } @@ -2306,6 +2307,9 @@ static coroutine_fn int nbd_handle_request(NBDClient *client, if (!(request->flags & NBD_CMD_FLAG_NO_HOLE)) { flags |= BDRV_REQ_MAY_UNMAP; } + if (request->flags & NBD_CMD_FLAG_FAST_ZERO) { + flags |= BDRV_REQ_NO_FALLBACK; + } ret = blk_pwrite_zeroes(exp->blk, request->from + exp->dev_offset, request->len, flags); return nbd_send_generic_reply(client, request->handle, ret, -- 2.21.0
Eric Blake
2019-Aug-23 14:37 UTC
[Libguestfs] [PATCH 5/5] nbd: Tolerate more errors to structured reply request
A server may have a reason to reject a request for structured replies, beyond just not recognizing them as a valid request. It doesn't hurt us to continue talking to such a server; otherwise 'qemu-nbd --list' of such a server fails to display all possible details about the export. Encountered when temporarily tweaking nbdkit to reply with NBD_REP_ERR_POLICY. Present since structured reply support was first added (commit d795299b reused starttls handling, but starttls has to reject all errors). Signed-off-by: Eric Blake <eblake@redhat.com> --- nbd/client.c | 39 +++++++++++++++++++++++---------------- 1 file changed, 23 insertions(+), 16 deletions(-) diff --git a/nbd/client.c b/nbd/client.c index 49bf9906f94b..92444f5e9a62 100644 --- a/nbd/client.c +++ b/nbd/client.c @@ -1,5 +1,5 @@ /* - * Copyright (C) 2016-2018 Red Hat, Inc. + * Copyright (C) 2016-2019 Red Hat, Inc. * Copyright (C) 2005 Anthony Liguori <anthony@codemonkey.ws> * * Network Block Device Client Side @@ -142,17 +142,19 @@ static int nbd_receive_option_reply(QIOChannel *ioc, uint32_t opt, return 0; } -/* If reply represents success, return 1 without further action. - * If reply represents an error, consume the optional payload of - * the packet on ioc. Then return 0 for unsupported (so the client - * can fall back to other approaches), or -1 with errp set for other - * errors. +/* + * If reply represents success, return 1 without further action. If + * reply represents an error, consume the optional payload of the + * packet on ioc. Then return 0 for unsupported (so the client can + * fall back to other approaches), where @strict determines if only + * ERR_UNSUP or all errors fit that category, or -1 with errp set for + * other errors. */ static int nbd_handle_reply_err(QIOChannel *ioc, NBDOptionReply *reply, - Error **errp) + bool strict, Error **errp) { char *msg = NULL; - int result = -1; + int result = strict ? -1 : 0; if (!(reply->type & (1 << 31))) { return 1; @@ -163,6 +165,7 @@ static int nbd_handle_reply_err(QIOChannel *ioc, NBDOptionReply *reply, error_setg(errp, "server error %" PRIu32 " (%s) message is too long", reply->type, nbd_rep_lookup(reply->type)); + result = -1; goto cleanup; } msg = g_malloc(reply->length + 1); @@ -170,6 +173,7 @@ static int nbd_handle_reply_err(QIOChannel *ioc, NBDOptionReply *reply, error_prepend(errp, "Failed to read option error %" PRIu32 " (%s) message: ", reply->type, nbd_rep_lookup(reply->type)); + result = -1; goto cleanup; } msg[reply->length] = '\0'; @@ -258,7 +262,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, char **description, if (nbd_receive_option_reply(ioc, NBD_OPT_LIST, &reply, errp) < 0) { return -1; } - error = nbd_handle_reply_err(ioc, &reply, errp); + error = nbd_handle_reply_err(ioc, &reply, true, errp); if (error <= 0) { return error; } @@ -371,7 +375,7 @@ static int nbd_opt_info_or_go(QIOChannel *ioc, uint32_t opt, if (nbd_receive_option_reply(ioc, opt, &reply, errp) < 0) { return -1; } - error = nbd_handle_reply_err(ioc, &reply, errp); + error = nbd_handle_reply_err(ioc, &reply, true, errp); if (error <= 0) { return error; } @@ -546,12 +550,15 @@ static int nbd_receive_query_exports(QIOChannel *ioc, } } -/* nbd_request_simple_option: Send an option request, and parse the reply +/* + * nbd_request_simple_option: Send an option request, and parse the reply. + * @strict controls whether ERR_UNSUP or all errors produce 0 status. * return 1 for successful negotiation, * 0 if operation is unsupported, * -1 with errp set for any other error */ -static int nbd_request_simple_option(QIOChannel *ioc, int opt, Error **errp) +static int nbd_request_simple_option(QIOChannel *ioc, int opt, bool strict, + Error **errp) { NBDOptionReply reply; int error; @@ -563,7 +570,7 @@ static int nbd_request_simple_option(QIOChannel *ioc, int opt, Error **errp) if (nbd_receive_option_reply(ioc, opt, &reply, errp) < 0) { return -1; } - error = nbd_handle_reply_err(ioc, &reply, errp); + error = nbd_handle_reply_err(ioc, &reply, strict, errp); if (error <= 0) { return error; } @@ -595,7 +602,7 @@ static QIOChannel *nbd_receive_starttls(QIOChannel *ioc, QIOChannelTLS *tioc; struct NBDTLSHandshakeData data = { 0 }; - ret = nbd_request_simple_option(ioc, NBD_OPT_STARTTLS, errp); + ret = nbd_request_simple_option(ioc, NBD_OPT_STARTTLS, true, errp); if (ret <= 0) { if (ret == 0) { error_setg(errp, "Server don't support STARTTLS option"); @@ -695,7 +702,7 @@ static int nbd_receive_one_meta_context(QIOChannel *ioc, return -1; } - ret = nbd_handle_reply_err(ioc, &reply, errp); + ret = nbd_handle_reply_err(ioc, &reply, true, errp); if (ret <= 0) { return ret; } @@ -951,7 +958,7 @@ static int nbd_start_negotiate(AioContext *aio_context, QIOChannel *ioc, if (structured_reply) { result = nbd_request_simple_option(ioc, NBD_OPT_STRUCTURED_REPLY, - errp); + false, errp); if (result < 0) { return -EINVAL; } -- 2.21.0
Eric Blake
2019-Aug-23 14:38 UTC
[Libguestfs] [libnbd PATCH 0/1] libnbd support for new fast zero
See the cross-post cover letter for details: https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html Eric Blake (1): api: Add support for FAST_ZERO flag lib/nbd-protocol.h | 2 ++ generator/generator | 29 +++++++++++++++++++++++------ lib/flags.c | 12 ++++++++++++ lib/protocol.c | 1 + lib/rw.c | 9 ++++++++- tests/Makefile.am | 20 ++++++++++++++++++++ tests/eflags-plugin.sh | 5 +++++ 7 files changed, 71 insertions(+), 7 deletions(-) -- 2.21.0
Eric Blake
2019-Aug-23 14:38 UTC
[Libguestfs] [libnbd PATCH 1/1] api: Add support for FAST_ZERO flag
Qemu was able to demonstrate that knowing whether a zero operation is fast is useful when copying from one image to another: there is a choice between bulk pre-zeroing and then revisiting the data sections (fewer transactions, but depends on the zeroing to be fast), vs. visiting every portion of the disk only once (more transactions, but no time lost to duplicated I/O due to slow zeroes). As such, the NBD protocol is adding an extension to allow clients to request fast failure when zero is not efficient, from servers that advertise support for the new flag. In libnbd, this results in the addition of a new flag, a new functoin nbd_can_fast_zero, and the ability to use the new flag in nbd_zero variants. It also enhances the testsuite to test the flag against new enough nbdkit, which is being patched at the same time. Signed-off-by: Eric Blake <eblake@redhat.com> --- lib/nbd-protocol.h | 2 ++ generator/generator | 29 +++++++++++++++++++++++------ lib/flags.c | 12 ++++++++++++ lib/protocol.c | 1 + lib/rw.c | 9 ++++++++- tests/Makefile.am | 20 ++++++++++++++++++++ tests/eflags-plugin.sh | 5 +++++ 7 files changed, 71 insertions(+), 7 deletions(-) diff --git a/lib/nbd-protocol.h b/lib/nbd-protocol.h index 3e3fb4e..04e93d3 100644 --- a/lib/nbd-protocol.h +++ b/lib/nbd-protocol.h @@ -108,6 +108,7 @@ struct nbd_fixed_new_option_reply { #define NBD_FLAG_SEND_DF (1 << 7) #define NBD_FLAG_CAN_MULTI_CONN (1 << 8) #define NBD_FLAG_SEND_CACHE (1 << 10) +#define NBD_FLAG_SEND_FAST_ZERO (1 << 11) /* NBD options (new style handshake only). */ #define NBD_OPT_EXPORT_NAME 1 @@ -250,6 +251,7 @@ struct nbd_structured_reply_error { #define NBD_EINVAL 22 #define NBD_ENOSPC 28 #define NBD_EOVERFLOW 75 +#define NBD_ENOTSUP 95 #define NBD_ESHUTDOWN 108 #endif /* NBD_PROTOCOL_H */ diff --git a/generator/generator b/generator/generator index c509573..9b1f5d8 100755 --- a/generator/generator +++ b/generator/generator @@ -958,10 +958,11 @@ let all_enums = [ tls_enum ] let cmd_flags = { flag_prefix = "CMD_FLAG"; flags = [ - "FUA", 1 lsl 0; - "NO_HOLE", 1 lsl 1; - "DF", 1 lsl 2; - "REQ_ONE", 1 lsl 3; + "FUA", 1 lsl 0; + "NO_HOLE", 1 lsl 1; + "DF", 1 lsl 2; + "REQ_ONE", 1 lsl 3; + "FAST_ZERO", 1 lsl 4; ] } let all_flags = [ cmd_flags ] @@ -1389,6 +1390,19 @@ the server does not." ^ non_blocking_test_call_description; }; + "can_fast_zero", { + default_call with + args = []; ret = RBool; + permitted_states = [ Connected; Closed ]; + shortdesc = "does the server support the fast zero flag?"; + longdesc = "\ +Returns true if the server supports the use of the +C<LIBNBD_CMD_FLAG_FAST_ZERO> flag to the zero command +(see C<nbd_zero>, C<nbd_aio_zero>). Returns false if +the server does not." +^ non_blocking_test_call_description; + }; + "can_df", { default_call with args = []; ret = RBool; @@ -1668,9 +1682,12 @@ The C<flags> parameter may be C<0> for no flags, or may contain C<LIBNBD_CMD_FLAG_FUA> meaning that the server should not return until the data has been committed to permanent storage (if that is supported - some servers cannot do this, see -L<nbd_can_fua(3)>), and/or C<LIBNBD_CMD_FLAG_NO_HOLE> meaning that +L<nbd_can_fua(3)>), C<LIBNBD_CMD_FLAG_NO_HOLE> meaning that the server should favor writing actual allocated zeroes over -punching a hole."; +punching a hole, and/or C<LIBNBD_CMD_FLAG_FAST_ZERO> meaning +that the server must fail quickly if writing zeroes is no +faster than a normal write (if that is supported - some servers +cannot do this, see L<nbd_can_fast_zero(3)>)."; }; "block_status", { diff --git a/lib/flags.c b/lib/flags.c index 2bcacb8..d55d10a 100644 --- a/lib/flags.c +++ b/lib/flags.c @@ -47,6 +47,12 @@ nbd_internal_set_size_and_flags (struct nbd_handle *h, eflags &= ~NBD_FLAG_SEND_DF; } + if (eflags & NBD_FLAG_SEND_FAST_ZERO && + !(eflags & NBD_FLAG_SEND_WRITE_ZEROES)) { + debug (h, "server lacks write zeroes, ignoring claim of fast zero"); + eflags &= ~NBD_FLAG_SEND_FAST_ZERO; + } + h->exportsize = exportsize; h->eflags = eflags; return 0; @@ -100,6 +106,12 @@ nbd_unlocked_can_zero (struct nbd_handle *h) return get_flag (h, NBD_FLAG_SEND_WRITE_ZEROES); } +int +nbd_unlocked_can_fast_zero (struct nbd_handle *h) +{ + return get_flag (h, NBD_FLAG_SEND_FAST_ZERO); +} + int nbd_unlocked_can_df (struct nbd_handle *h) { diff --git a/lib/protocol.c b/lib/protocol.c index 6087887..acee203 100644 --- a/lib/protocol.c +++ b/lib/protocol.c @@ -36,6 +36,7 @@ nbd_internal_errno_of_nbd_error (uint32_t error) case NBD_EINVAL: return EINVAL; case NBD_ENOSPC: return ENOSPC; case NBD_EOVERFLOW: return EOVERFLOW; + case NBD_ENOTSUP: return ENOTSUP; case NBD_ESHUTDOWN: return ESHUTDOWN; default: return EINVAL; } diff --git a/lib/rw.c b/lib/rw.c index d427681..adb6bc2 100644 --- a/lib/rw.c +++ b/lib/rw.c @@ -426,7 +426,8 @@ nbd_unlocked_aio_zero (struct nbd_handle *h, return -1; } - if ((flags & ~(LIBNBD_CMD_FLAG_FUA | LIBNBD_CMD_FLAG_NO_HOLE)) != 0) { + if ((flags & ~(LIBNBD_CMD_FLAG_FUA | LIBNBD_CMD_FLAG_NO_HOLE | + LIBNBD_CMD_FLAG_FAST_ZERO)) != 0) { set_error (EINVAL, "invalid flag: %" PRIu32, flags); return -1; } @@ -437,6 +438,12 @@ nbd_unlocked_aio_zero (struct nbd_handle *h, return -1; } + if ((flags & LIBNBD_CMD_FLAG_FAST_ZERO) != 0 && + nbd_unlocked_can_fast_zero (h) != 1) { + set_error (EINVAL, "server does not support the fast zero flag"); + return -1; + } + if (count == 0) { /* NBD protocol forbids this. */ set_error (EINVAL, "count cannot be 0"); return -1; diff --git a/tests/Makefile.am b/tests/Makefile.am index 2b4ea93..a7bc1b5 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -145,6 +145,8 @@ check_PROGRAMS += \ can-not-trim-flag \ can-zero-flag \ can-not-zero-flag \ + can-fast-zero-flag \ + can-not-fast-zero-flag \ can-multi-conn-flag \ can-not-multi-conn-flag \ can-cache-flag \ @@ -177,6 +179,8 @@ TESTS += \ can-not-trim-flag \ can-zero-flag \ can-not-zero-flag \ + can-fast-zero-flag \ + can-not-fast-zero-flag \ can-multi-conn-flag \ can-not-multi-conn-flag \ can-cache-flag \ @@ -289,6 +293,22 @@ can_not_zero_flag_CPPFLAGS = \ can_not_zero_flag_CFLAGS = $(WARNINGS_CFLAGS) can_not_zero_flag_LDADD = $(top_builddir)/lib/libnbd.la +can_fast_zero_flag_SOURCES = eflags.c +can_fast_zero_flag_CPPFLAGS = \ + -I$(top_srcdir)/include -Dflag=can_fast_zero \ + -Drequire='"has_can_fast_zero=1"' \ + $(NULL) +can_fast_zero_flag_CFLAGS = $(WARNINGS_CFLAGS) +can_fast_zero_flag_LDADD = $(top_builddir)/lib/libnbd.la + +can_not_fast_zero_flag_SOURCES = eflags.c +can_not_fast_zero_flag_CPPFLAGS = \ + -I$(top_srcdir)/include -Dflag=can_fast_zero -Dvalue=false \ + -Drequire='"has_can_fast_zero=1"' \ + $(NULL) +can_not_fast_zero_flag_CFLAGS = $(WARNINGS_CFLAGS) +can_not_fast_zero_flag_LDADD = $(top_builddir)/lib/libnbd.la + can_multi_conn_flag_SOURCES = eflags.c can_multi_conn_flag_CPPFLAGS = \ -I$(top_srcdir)/include -Dflag=can_multi_conn \ diff --git a/tests/eflags-plugin.sh b/tests/eflags-plugin.sh index 8fccff8..3c4cc65 100755 --- a/tests/eflags-plugin.sh +++ b/tests/eflags-plugin.sh @@ -52,6 +52,11 @@ case "$1" in # be read-only and methods like can_trim will never be called. exit 0 ;; + can_zero) + # We have to default to answering this with true before + # can_fast_zero has an effect. + exit 0 + ;; *) exit 2 ;; -- 2.21.0
Eric Blake
2019-Aug-23 14:40 UTC
[Libguestfs] [nbdkit PATCH 0/3] nbdkit support for new NBD fast zero
See the cross-post cover letter for details: https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html Notably, this series did NOT add fast zero support to the file plugin yet; there, I probably need to do more testing and/or kernel source code reading to learn whether to mark fallocate() as potentially slow, as well as to definitely mark ioctl(BLKZEROOUT) as definitely slow. That will be a followup patch. Eric Blake (3): server: Add internal support for NBDKIT_FLAG_FAST_ZERO filters: Add .can_fast_zero hook plugins: Add .can_fast_zero hook docs/nbdkit-filter.pod | 27 ++++-- docs/nbdkit-plugin.pod | 74 +++++++++++--- docs/nbdkit-protocol.pod | 11 +++ filters/delay/nbdkit-delay-filter.pod | 15 ++- filters/log/nbdkit-log-filter.pod | 2 +- filters/nozero/nbdkit-nozero-filter.pod | 41 ++++++-- plugins/sh/nbdkit-sh-plugin.pod | 13 ++- server/internal.h | 2 + common/protocol/protocol.h | 11 ++- server/filters.c | 33 ++++++- server/plugins.c | 47 ++++++++- server/protocol-handshake.c | 7 ++ server/protocol.c | 46 +++++++-- include/nbdkit-common.h | 7 +- include/nbdkit-filter.h | 3 + include/nbdkit-plugin.h | 2 + filters/blocksize/blocksize.c | 12 +++ filters/cache/cache.c | 20 ++++ filters/cow/cow.c | 20 ++++ filters/delay/delay.c | 28 +++++- filters/log/log.c | 16 ++-- filters/nozero/nozero.c | 62 +++++++++++- filters/truncate/truncate.c | 15 +++ plugins/data/data.c | 14 ++- plugins/full/full.c | 12 +-- plugins/memory/memory.c | 14 ++- plugins/nbd/nbd.c | 28 +++++- plugins/null/null.c | 8 ++ plugins/ocaml/ocaml.c | 26 ++++- plugins/sh/call.c | 1 - plugins/sh/sh.c | 39 ++++++-- plugins/ocaml/NBDKit.ml | 10 +- plugins/ocaml/NBDKit.mli | 2 + plugins/rust/src/lib.rs | 3 + tests/test-eflags.sh | 122 +++++++++++++++++++++--- tests/test-fua.sh | 4 +- 36 files changed, 686 insertions(+), 111 deletions(-) -- 2.21.0
Eric Blake
2019-Aug-23 14:40 UTC
[Libguestfs] [nbdkit PATCH 1/3] server: Add internal support for NBDKIT_FLAG_FAST_ZERO
Qemu was able to demonstrate that knowing whether a zero operation is fast is useful when copying from one image to another: there is a choice between bulk pre-zeroing and then revisiting the data sections (fewer transactions, but depends on the zeroing to be fast), vs. visiting every portion of the disk only once (more transactions, but no time lost to duplicated I/O due to slow zeroes). As such, the NBD protocol is adding an extension to allow clients to request fast failure when zero is not efficient, from servers that advertise support for the new flag. In nbdkit, start by plumbing a new .can_fast_zero through the backends (although it stays 0 in this patch, later patches will actually expose it to plugins and filters to override). Advertise the flag to the client when the plugin provides a .can_fast_zero, and wire in passing the flag down to .zero in the plugin. In turn, if the flag is set and the implementation .zero fails with ENOTSUP/EOPNOTSUPP, we no longer attempt the .pwrite fallback. Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/nbdkit-protocol.pod | 11 +++++++++ server/internal.h | 2 ++ common/protocol/protocol.h | 11 +++++---- server/filters.c | 20 +++++++++++++--- server/plugins.c | 28 +++++++++++++++++++--- server/protocol-handshake.c | 7 ++++++ server/protocol.c | 46 +++++++++++++++++++++++++++++-------- include/nbdkit-common.h | 7 +++--- plugins/ocaml/ocaml.c | 1 - plugins/sh/call.c | 1 - 10 files changed, 110 insertions(+), 24 deletions(-) diff --git a/docs/nbdkit-protocol.pod b/docs/nbdkit-protocol.pod index 35db07b3..76c50bb8 100644 --- a/docs/nbdkit-protocol.pod +++ b/docs/nbdkit-protocol.pod @@ -173,6 +173,17 @@ This protocol extension allows a client to inform the server about intent to access a portion of the export, to allow the server an opportunity to cache things appropriately. +=item C<NBD_CMD_FLAG_FAST_ZERO> + +Supported in nbdkit E<ge> 1.13.9. + +This protocol extension allows a server to advertise that it can rank +all zero requests as fast or slow, at which point the client can make +fast zero requests which fail immediately with C<ENOTSUP> if the +request is no faster than a counterpart write would be, while normal +zero requests still benefit from compressed network traffic regardless +of the time taken. + =item Resize Extension I<Not supported>. diff --git a/server/internal.h b/server/internal.h index 22e13b6d..784c8b5c 100644 --- a/server/internal.h +++ b/server/internal.h @@ -168,6 +168,7 @@ struct connection { bool is_rotational; bool can_trim; bool can_zero; + bool can_fast_zero; bool can_fua; bool can_multi_conn; bool can_cache; @@ -275,6 +276,7 @@ struct backend { int (*is_rotational) (struct backend *, struct connection *conn); int (*can_trim) (struct backend *, struct connection *conn); int (*can_zero) (struct backend *, struct connection *conn); + int (*can_fast_zero) (struct backend *, struct connection *conn); int (*can_extents) (struct backend *, struct connection *conn); int (*can_fua) (struct backend *, struct connection *conn); int (*can_multi_conn) (struct backend *, struct connection *conn); diff --git a/common/protocol/protocol.h b/common/protocol/protocol.h index e9386430..bf548390 100644 --- a/common/protocol/protocol.h +++ b/common/protocol/protocol.h @@ -96,6 +96,7 @@ extern const char *name_of_nbd_flag (int); #define NBD_FLAG_SEND_DF (1 << 7) #define NBD_FLAG_CAN_MULTI_CONN (1 << 8) #define NBD_FLAG_SEND_CACHE (1 << 10) +#define NBD_FLAG_SEND_FAST_ZERO (1 << 11) /* NBD options (new style handshake only). */ extern const char *name_of_nbd_opt (int); @@ -223,10 +224,11 @@ extern const char *name_of_nbd_cmd (int); #define NBD_CMD_BLOCK_STATUS 7 extern const char *name_of_nbd_cmd_flag (int); -#define NBD_CMD_FLAG_FUA (1<<0) -#define NBD_CMD_FLAG_NO_HOLE (1<<1) -#define NBD_CMD_FLAG_DF (1<<2) -#define NBD_CMD_FLAG_REQ_ONE (1<<3) +#define NBD_CMD_FLAG_FUA (1<<0) +#define NBD_CMD_FLAG_NO_HOLE (1<<1) +#define NBD_CMD_FLAG_DF (1<<2) +#define NBD_CMD_FLAG_REQ_ONE (1<<3) +#define NBD_CMD_FLAG_FAST_ZERO (1<<4) /* Error codes (previously errno). * See http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ca4414804114fd0095b317785bc0b51862e62ebb @@ -239,6 +241,7 @@ extern const char *name_of_nbd_error (int); #define NBD_EINVAL 22 #define NBD_ENOSPC 28 #define NBD_EOVERFLOW 75 +#define NBD_ENOTSUP 95 #define NBD_ESHUTDOWN 108 #endif /* NBDKIT_PROTOCOL_H */ diff --git a/server/filters.c b/server/filters.c index 14ca0cc6..0dd2393e 100644 --- a/server/filters.c +++ b/server/filters.c @@ -403,8 +403,11 @@ next_zero (void *nxdata, uint32_t count, uint64_t offset, uint32_t flags, int r; r = b_conn->b->zero (b_conn->b, b_conn->conn, count, offset, flags, err); - if (r == -1) - assert (*err && *err != ENOTSUP && *err != EOPNOTSUPP); + if (r == -1) { + assert (*err); + if (!(flags & NBDKIT_FLAG_FAST_ZERO)) + assert (*err != ENOTSUP && *err != EOPNOTSUPP); + } return r; } @@ -586,6 +589,15 @@ filter_can_zero (struct backend *b, struct connection *conn) return f->backend.next->can_zero (f->backend.next, conn); } +static int +filter_can_fast_zero (struct backend *b, struct connection *conn) +{ + struct backend_filter *f = container_of (b, struct backend_filter, backend); + + debug ("%s: can_fast_zero", f->name); + return 0; /* Next patch will query or pass through */ +} + static int filter_can_extents (struct backend *b, struct connection *conn) { @@ -738,7 +750,8 @@ filter_zero (struct backend *b, struct connection *conn, void *handle = connection_get_handle (conn, f->backend.i); struct b_conn nxdata = { .b = f->backend.next, .conn = conn }; - assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA))); + assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA | + NBDKIT_FLAG_FAST_ZERO))); debug ("%s: zero count=%" PRIu32 " offset=%" PRIu64 " flags=0x%" PRIx32, f->name, count, offset, flags); @@ -818,6 +831,7 @@ static struct backend filter_functions = { .is_rotational = filter_is_rotational, .can_trim = filter_can_trim, .can_zero = filter_can_zero, + .can_fast_zero = filter_can_fast_zero, .can_extents = filter_can_extents, .can_fua = filter_can_fua, .can_multi_conn = filter_can_multi_conn, diff --git a/server/plugins.c b/server/plugins.c index 34cc3cb5..c6dcf408 100644 --- a/server/plugins.c +++ b/server/plugins.c @@ -401,6 +401,16 @@ plugin_can_zero (struct backend *b, struct connection *conn) return plugin_can_write (b, conn); } +static int +plugin_can_fast_zero (struct backend *b, struct connection *conn) +{ + assert (connection_get_handle (conn, 0)); + + debug ("can_fast_zero"); + + return 0; /* Upcoming patch will actually add support. */ +} + static int plugin_can_extents (struct backend *b, struct connection *conn) { @@ -621,15 +631,18 @@ plugin_zero (struct backend *b, struct connection *conn, int r = -1; bool may_trim = flags & NBDKIT_FLAG_MAY_TRIM; bool fua = flags & NBDKIT_FLAG_FUA; + bool fast_zero = flags & NBDKIT_FLAG_FAST_ZERO; bool emulate = false; bool need_flush = false; int can_zero = 1; /* TODO cache this per-connection? */ assert (connection_get_handle (conn, 0)); - assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA))); + assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA | + NBDKIT_FLAG_FAST_ZERO))); - debug ("zero count=%" PRIu32 " offset=%" PRIu64 " may_trim=%d fua=%d", - count, offset, may_trim, fua); + debug ("zero count=%" PRIu32 " offset=%" PRIu64 + " may_trim=%d fua=%d fast_zero=%d", + count, offset, may_trim, fua, fast_zero); if (fua && plugin_can_fua (b, conn) != NBDKIT_FUA_NATIVE) { flags &= ~NBDKIT_FLAG_FUA; @@ -643,6 +656,8 @@ plugin_zero (struct backend *b, struct connection *conn, } if (can_zero) { + /* if (!can_fast_zero) */ + flags &= ~NBDKIT_FLAG_FAST_ZERO; errno = 0; if (p->plugin.zero) r = p->plugin.zero (connection_get_handle (conn, 0), count, offset, @@ -658,6 +673,12 @@ plugin_zero (struct backend *b, struct connection *conn, goto done; } + if (fast_zero) { + assert (r == -1); + *err = EOPNOTSUPP; + goto done; + } + assert (p->plugin.pwrite || p->plugin._pwrite_old); flags &= ~NBDKIT_FLAG_MAY_TRIM; threadlocal_set_error (0); @@ -762,6 +783,7 @@ static struct backend plugin_functions = { .is_rotational = plugin_is_rotational, .can_trim = plugin_can_trim, .can_zero = plugin_can_zero, + .can_fast_zero = plugin_can_fast_zero, .can_extents = plugin_can_extents, .can_fua = plugin_can_fua, .can_multi_conn = plugin_can_multi_conn, diff --git a/server/protocol-handshake.c b/server/protocol-handshake.c index 0f3bd280..84fcacfd 100644 --- a/server/protocol-handshake.c +++ b/server/protocol-handshake.c @@ -66,6 +66,13 @@ protocol_compute_eflags (struct connection *conn, uint16_t *flags) if (fl) { eflags |= NBD_FLAG_SEND_WRITE_ZEROES; conn->can_zero = true; + fl = backend->can_fast_zero (backend, conn); + if (fl == -1) + return -1; + if (fl) { + eflags |= NBD_FLAG_SEND_FAST_ZERO; + conn->can_fast_zero = true; + } } fl = backend->can_trim (backend, conn); diff --git a/server/protocol.c b/server/protocol.c index 72f6f2a8..77a045dc 100644 --- a/server/protocol.c +++ b/server/protocol.c @@ -110,7 +110,8 @@ validate_request (struct connection *conn, /* Validate flags */ if (flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE | - NBD_CMD_FLAG_DF | NBD_CMD_FLAG_REQ_ONE)) { + NBD_CMD_FLAG_DF | NBD_CMD_FLAG_REQ_ONE | + NBD_CMD_FLAG_FAST_ZERO)) { nbdkit_error ("invalid request: unknown flag (0x%x)", flags); *error = EINVAL; return false; @@ -121,6 +122,21 @@ validate_request (struct connection *conn, *error = EINVAL; return false; } + if (flags & NBD_CMD_FLAG_FAST_ZERO) { + if (cmd != NBD_CMD_WRITE_ZEROES) { + nbdkit_error ("invalid request: " + "FAST_ZERO flag needs WRITE_ZEROES request"); + *error = EINVAL; + return false; + } + if (!conn->can_fast_zero) { + nbdkit_error ("invalid request: " + "%s: fast zero support was not advertised", + name_of_nbd_cmd (cmd)); + *error = EINVAL; + return false; + } + } if (flags & NBD_CMD_FLAG_DF) { if (cmd != NBD_CMD_READ) { nbdkit_error ("invalid request: DF flag needs READ request"); @@ -297,8 +313,12 @@ handle_request (struct connection *conn, f |= NBDKIT_FLAG_MAY_TRIM; if (fua) f |= NBDKIT_FLAG_FUA; + if (flags & NBD_CMD_FLAG_FAST_ZERO) + f |= NBDKIT_FLAG_FAST_ZERO; if (backend->zero (backend, conn, count, offset, f, &err) == -1) { - assert (err && err != ENOTSUP && err != EOPNOTSUPP); + assert (err); + if (!(flags & NBD_CMD_FLAG_FAST_ZERO)) + assert (err != ENOTSUP && err != EOPNOTSUPP); return err; } break; @@ -368,7 +388,7 @@ skip_over_write_buffer (int sock, size_t count) /* Convert a system errno to an NBD_E* error code. */ static int -nbd_errno (int error, bool flag_df) +nbd_errno (int error, uint16_t flags) { switch (error) { case 0: @@ -390,10 +410,17 @@ nbd_errno (int error, bool flag_df) case ESHUTDOWN: return NBD_ESHUTDOWN; #endif + case ENOTSUP: +#if ENOTSUP != EOPNOTSUPP + case EOPNOTSUPP: +#endif + if (flags & NBD_CMD_FLAG_FAST_ZERO) + return NBD_ENOTSUP; + return NBD_EINVAL; case EOVERFLOW: - if (flag_df) + if (flags & NBD_CMD_FLAG_DF) return NBD_EOVERFLOW; - /* fallthrough */ + return NBD_EINVAL; case EINVAL: default: return NBD_EINVAL; @@ -402,7 +429,7 @@ nbd_errno (int error, bool flag_df) static int send_simple_reply (struct connection *conn, - uint64_t handle, uint16_t cmd, + uint64_t handle, uint16_t cmd, uint16_t flags, const char *buf, uint32_t count, uint32_t error) { @@ -413,7 +440,7 @@ send_simple_reply (struct connection *conn, reply.magic = htobe32 (NBD_SIMPLE_REPLY_MAGIC); reply.handle = handle; - reply.error = htobe32 (nbd_errno (error, false)); + reply.error = htobe32 (nbd_errno (error, flags)); r = conn->send (conn, &reply, sizeof reply, f); if (r == -1) { @@ -640,7 +667,7 @@ send_structured_reply_error (struct connection *conn, } /* Send the error. */ - error_data.error = htobe32 (nbd_errno (error, flags & NBD_CMD_FLAG_DF)); + error_data.error = htobe32 (nbd_errno (error, flags)); error_data.len = htobe16 (0); r = conn->send (conn, &error_data, sizeof error_data, 0); if (r == -1) { @@ -790,5 +817,6 @@ protocol_recv_request_send_reply (struct connection *conn) error); } else - return send_simple_reply (conn, request.handle, cmd, buf, count, error); + return send_simple_reply (conn, request.handle, cmd, flags, buf, count, + error); } diff --git a/include/nbdkit-common.h b/include/nbdkit-common.h index e004aa18..f33ce133 100644 --- a/include/nbdkit-common.h +++ b/include/nbdkit-common.h @@ -57,9 +57,10 @@ extern "C" { #define NBDKIT_THREAD_MODEL_SERIALIZE_REQUESTS 2 #define NBDKIT_THREAD_MODEL_PARALLEL 3 -#define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */ -#define NBDKIT_FLAG_FUA (1<<1) /* Maps to NBD_CMD_FLAG_FUA */ -#define NBDKIT_FLAG_REQ_ONE (1<<2) /* Maps to NBD_CMD_FLAG_REQ_ONE */ +#define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */ +#define NBDKIT_FLAG_FUA (1<<1) /* Maps to NBD_CMD_FLAG_FUA */ +#define NBDKIT_FLAG_REQ_ONE (1<<2) /* Maps to NBD_CMD_FLAG_REQ_ONE */ +#define NBDKIT_FLAG_FAST_ZERO (1<<3) /* Maps to NBD_CMD_FLAG_FAST_ZERO */ #define NBDKIT_FUA_NONE 0 #define NBDKIT_FUA_EMULATE 1 diff --git a/plugins/ocaml/ocaml.c b/plugins/ocaml/ocaml.c index c472281f..144a449e 100644 --- a/plugins/ocaml/ocaml.c +++ b/plugins/ocaml/ocaml.c @@ -857,7 +857,6 @@ ocaml_nbdkit_set_error (value nv) case 5: err = ENOSPC; break; case 6: err = ESHUTDOWN; break; case 7: err = EOVERFLOW; break; - /* Necessary for .zero support */ case 8: err = EOPNOTSUPP; break; /* Other errno values that server/protocol.c treats specially */ case 9: err = EROFS; break; diff --git a/plugins/sh/call.c b/plugins/sh/call.c index b86e7c9c..ab43e5ea 100644 --- a/plugins/sh/call.c +++ b/plugins/sh/call.c @@ -347,7 +347,6 @@ handle_script_error (char *ebuf, size_t len) err = ESHUTDOWN; skip = 9; } - /* Necessary for .zero support */ else if (strncasecmp (ebuf, "ENOTSUP", 7) == 0) { err = ENOTSUP; skip = 7; -- 2.21.0
Eric Blake
2019-Aug-23 14:40 UTC
[Libguestfs] [nbdkit PATCH 2/3] filters: Add .can_fast_zero hook
Allow filters to affect the handling of the new NBD_CMD_FLAG_FAST_ZERO flag, then update affected filters. In particular, the log filter logs the state of the new flag (requires a tweak to expected output in test-fua.sh), the delay filter gains a bool parameter delay-fast-zero, several filters reject all fast requests because of local writes or splitting a single client request into multiple plugin requests, and the nozero filter gains additional modes for controlling fast zero advertisement and support: zeromode→ none emulate notrim plugin ↓fastzeromode (new) --------------------------------------------- default 0 2 4 4 none 0 1 1 1 slow 0 2 2 2 ignore 0 3 3 3 0 - no zero advertised, thus no fast zero advertised 1 - fast zero not advertised 2 - fast zero advertised, but fast zero requests fail with ENOTSUP (ie. a fast zero was not possible) 3 - fast zero advertised, but fast zero requests are treated the same as normal requests (ignoring the fast zero flag, aids testing at the probable cost of spec non-compliance) 4 - fast zero advertisement/reaction is up to the plugin Mode none/default remains the default for back-compat, and mode plugin/default has no semantic change compared to omitting the nozero filter from the command line. Filters untouched by this patch are fine inheriting whatever fast-zero behavior the underlying plugin uses. Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/nbdkit-filter.pod | 27 +++++++---- filters/delay/nbdkit-delay-filter.pod | 15 +++++- filters/log/nbdkit-log-filter.pod | 2 +- filters/nozero/nbdkit-nozero-filter.pod | 41 +++++++++++++--- server/filters.c | 15 +++++- include/nbdkit-filter.h | 3 ++ filters/blocksize/blocksize.c | 12 +++++ filters/cache/cache.c | 20 ++++++++ filters/cow/cow.c | 20 ++++++++ filters/delay/delay.c | 28 ++++++++++- filters/log/log.c | 16 ++++--- filters/nozero/nozero.c | 62 +++++++++++++++++++++++-- filters/truncate/truncate.c | 15 ++++++ tests/test-fua.sh | 4 +- 14 files changed, 248 insertions(+), 32 deletions(-) diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod index 6e2bea61..ebce8961 100644 --- a/docs/nbdkit-filter.pod +++ b/docs/nbdkit-filter.pod @@ -361,6 +361,8 @@ calls. =head2 C<.can_zero> +=head2 C<.can_fast_zero> + =head2 C<.can_extents> =head2 C<.can_fua> @@ -380,6 +382,8 @@ calls. void *handle); int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle); + int (*can_fast_zero) (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle); int (*can_extents) (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle); int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata, @@ -517,22 +521,25 @@ turn, the filter should not call C<next_ops-E<gt>zero> if C<next_ops-E<gt>can_zero> did not return true. On input, the parameter C<flags> may include C<NBDKIT_FLAG_MAY_TRIM> -unconditionally, and C<NBDKIT_FLAG_FUA> based on the result of -C<.can_fua>. In turn, the filter may pass C<NBDKIT_FLAG_MAY_TRIM> -unconditionally, but should only pass C<NBDKIT_FLAG_FUA> on to -C<next_ops-E<gt>zero> if C<next_ops-E<gt>can_fua> returned a positive -value. +unconditionally, C<NBDKIT_FLAG_FUA> based on the result of +C<.can_fua>, and C<NBDKIT_FLAG_FAST_ZERO> based on the result of +C<.can_fast_zero>. In turn, the filter may pass +C<NBDKIT_FLAG_MAY_TRIM> unconditionally, but should only pass +C<NBDKIT_FLAG_FUA> or C<NBDKIT_FLAG_FAST_ZERO> on to +C<next_ops-E<gt>zero> if the corresponding C<next_ops-E<gt>can_fua> or +C<next_ops-E<gt>can_fast_zero> returned a positive value. Note that unlike the plugin C<.zero> which is permitted to fail with C<ENOTSUP> or C<EOPNOTSUPP> to force a fallback to C<.pwrite>, the -function C<next_ops-E<gt>zero> will never fail with C<err> set to -C<ENOTSUP> or C<EOPNOTSUPP> because the fallback has already taken -place. +function C<next_ops-E<gt>zero> will not fail with C<err> set to +C<ENOTSUP> or C<EOPNOTSUPP> unless C<NBDKIT_FLAG_FAST_ZERO> was used, +because otherwise the fallback has already taken place. If there is an error, C<.zero> should call C<nbdkit_error> with an error message B<and> return -1 with C<err> set to the positive errno -value to return to the client. The filter should never fail with -C<ENOTSUP> or C<EOPNOTSUPP> (while plugins have automatic fallback to +value to return to the client. The filter should not fail with +C<ENOTSUP> or C<EOPNOTSUPP> unless C<flags> includes +C<NBDKIT_FLAG_FAST_ZERO> (while plugins have automatic fallback to C<.pwrite>, filters do not). =head2 C<.extents> diff --git a/filters/delay/nbdkit-delay-filter.pod b/filters/delay/nbdkit-delay-filter.pod index 730cea4c..0a9c77f7 100644 --- a/filters/delay/nbdkit-delay-filter.pod +++ b/filters/delay/nbdkit-delay-filter.pod @@ -12,6 +12,7 @@ nbdkit-delay-filter - nbdkit delay filter delay-read=(SECS|NNms) delay-write=(SECS|NNms) delay-zero=(SECS|NNms) delay-trim=(SECS|NNms) delay-extents=(SECS|NNms) delay-cache=(SECS|NNms) + delay-fast-zero=BOOL =head1 DESCRIPTION @@ -56,7 +57,8 @@ Delay write operations by C<SECS> seconds or C<NN> milliseconds. =item B<delay-zero=>NNB<ms> -Delay zero operations by C<SECS> seconds or C<NN> milliseconds. +Delay zero operations by C<SECS> seconds or C<NN> milliseconds. See +also B<delay-fast-zero>. =item B<delay-trim=>SECS @@ -85,6 +87,17 @@ milliseconds. Delay write, zero and trim operations by C<SECS> seconds or C<NN> milliseconds. +=item B<delay-fast-zero=>BOOL + +The NBD specification documents an extension called fast zero, in +which the client may request that a server should reply with +C<ENOTSUP> as soon as possible if the zero operation offers no real +speedup over a corresponding write. By default, this parameter is +true, and fast zero requests are serviced by the plugin after the same +delay as any other zero request; but setting this parameter to false +instantly fails a fast zero response without waiting for or consulting +the plugin. + =back =head1 SEE ALSO diff --git a/filters/log/nbdkit-log-filter.pod b/filters/log/nbdkit-log-filter.pod index 9e102bc0..5d9625a1 100644 --- a/filters/log/nbdkit-log-filter.pod +++ b/filters/log/nbdkit-log-filter.pod @@ -55,7 +55,7 @@ on the connection). An example logging session of a client that performs a single successful read is: - 2018-01-27 20:38:22.959984 connection=1 Connect size=0x400 write=1 flush=1 rotational=0 trim=0 zero=1 fua=1 + 2018-01-27 20:38:22.959984 connection=1 Connect size=0x400 write=1 flush=1 rotational=0 trim=0 zero=1 fua=1 extents=1 cache=0 fast_zero=0 2018-01-27 20:38:23.001720 connection=1 Read id=1 offset=0x0 count=0x100 ... 2018-01-27 20:38:23.001995 connection=1 ...Read id=1 return=0 (Success) 2018-01-27 20:38:23.044259 connection=1 Disconnect transactions=1 diff --git a/filters/nozero/nbdkit-nozero-filter.pod b/filters/nozero/nbdkit-nozero-filter.pod index 144b8230..4fc7dc63 100644 --- a/filters/nozero/nbdkit-nozero-filter.pod +++ b/filters/nozero/nbdkit-nozero-filter.pod @@ -4,7 +4,8 @@ nbdkit-nozero-filter - nbdkit nozero filter =head1 SYNOPSIS - nbdkit --filter=nozero plugin [zeromode=MODE] [plugin-args...] + nbdkit --filter=nozero plugin [plugin-args...] \ + [zeromode=MODE] [fastzeromode=MODE] =head1 DESCRIPTION @@ -18,7 +19,7 @@ testing client or server fallbacks. =over 4 -=item B<zeromode=none|emulate|notrim> +=item B<zeromode=none|emulate|notrim|plugin> Optional, controls which mode the filter will use. Mode B<none> (default) means that zero support is not advertised to the @@ -29,8 +30,30 @@ efficient way to write zeros. Since nbdkit E<ge> 1.13.4, mode B<notrim> means that zero requests are forwarded on to the plugin, except that the plugin will never see the NBDKIT_MAY_TRIM flag, to determine if the client permitting trimming during zero operations -makes a difference (it is an error to request this mode if the plugin -does not support the C<zero> callback). +makes a difference. Since nbdkit E<ge> 1.13.9, mode B<plugin> leaves +normal zero requests up to the plugin, useful when combined with +C<fastzeromode> for experimenting with the effects of fast zero +requests. It is an error to request B<notrim> or B<plugin> if the +plugin does not support the C<zero> callback. + +=item B<fastzeromode=none|slow|ignore|default> + +Optional since nbdkit E<ge> 1.13.9, controls whether fast zeroes are +advertised to the client, and if so, how the filter will react to a +client fast zero request. Mode B<none> avoids advertising fast zero +support. Mode B<slow> advertises fast zero support unconditionally, +but treats all fast zero requests as an immediate C<ENOTSUP> failure +rather than performing a fallback. Mode B<ignore> advertises fast +zero support, but treats all client fast zero requests as if the flag +had not been used (this behavior is typically contrary to the NBD +specification, but can be useful for comparison against the actual +fast zero implementation to see if fast zeroes make a difference). +Mode B<default> is selected by default; when paired with +C<zeromode=emulate>, fast zeroes are advertised but fast zero requests +always fail (similar to C<slow>); when paired with C<zeromode=notrim> +or C<zeromode=plugin>, fast zero support is left to the plugin +(although in the latter case, the nozero filter could be omitted for +the same behavior). =back @@ -42,11 +65,17 @@ explicitly rather than with C<NBD_CMD_WRITE_ZEROES>: nbdkit --filter=nozero file disk.img Serve the file F<disk.img>, allowing the client to take advantage of -less network traffic via C<NBD_CMD_WRITE_ZEROES>, but still forcing -the data to be written explicitly rather than punching any holes: +less network traffic via C<NBD_CMD_WRITE_ZEROES>, but fail any fast +zero requests up front and force all other zero requests to write data +explicitly rather than punching any holes: nbdkit --filter=nozero file zeromode=emulate disk.img +Serve the file F<disk.img>, but do not advertise fast zero support to +the client even if the plugin supports it: + + nbdkit --filter=nozero file zeromode=plugin fastzeromode=none disk.img + =head1 SEE ALSO L<nbdkit(1)>, diff --git a/server/filters.c b/server/filters.c index 0dd2393e..f2de5e4e 100644 --- a/server/filters.c +++ b/server/filters.c @@ -314,6 +314,13 @@ next_can_zero (void *nxdata) return b_conn->b->can_zero (b_conn->b, b_conn->conn); } +static int +next_can_fast_zero (void *nxdata) +{ + struct b_conn *b_conn = nxdata; + return b_conn->b->can_fast_zero (b_conn->b, b_conn->conn); +} + static int next_can_extents (void *nxdata) { @@ -445,6 +452,7 @@ static struct nbdkit_next_ops next_ops = { .is_rotational = next_is_rotational, .can_trim = next_can_trim, .can_zero = next_can_zero, + .can_fast_zero = next_can_fast_zero, .can_extents = next_can_extents, .can_fua = next_can_fua, .can_multi_conn = next_can_multi_conn, @@ -593,9 +601,14 @@ static int filter_can_fast_zero (struct backend *b, struct connection *conn) { struct backend_filter *f = container_of (b, struct backend_filter, backend); + void *handle = connection_get_handle (conn, f->backend.i); + struct b_conn nxdata = { .b = f->backend.next, .conn = conn }; debug ("%s: can_fast_zero", f->name); - return 0; /* Next patch will query or pass through */ + if (f->filter.can_fast_zero) + return f->filter.can_fast_zero (&next_ops, &nxdata, handle); + else + return f->backend.next->can_fast_zero (f->backend.next, conn); } static int diff --git a/include/nbdkit-filter.h b/include/nbdkit-filter.h index 94f17789..d11cf881 100644 --- a/include/nbdkit-filter.h +++ b/include/nbdkit-filter.h @@ -71,6 +71,7 @@ struct nbdkit_next_ops { int (*is_rotational) (void *nxdata); int (*can_trim) (void *nxdata); int (*can_zero) (void *nxdata); + int (*can_fast_zero) (void *nxdata); int (*can_extents) (void *nxdata); int (*can_fua) (void *nxdata); int (*can_multi_conn) (void *nxdata); @@ -139,6 +140,8 @@ struct nbdkit_filter { void *handle); int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle); + int (*can_fast_zero) (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle); int (*can_extents) (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle); int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata, diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c index 0978887f..47638c74 100644 --- a/filters/blocksize/blocksize.c +++ b/filters/blocksize/blocksize.c @@ -307,6 +307,18 @@ blocksize_zero (struct nbdkit_next_ops *next_ops, void *nxdata, uint32_t drop; bool need_flush = false; + if (flags & NBDKIT_FLAG_FAST_ZERO) { + /* If we have to split the transaction, an ENOTSUP fast failure in + * a later call would be unnecessarily delayed behind earlier + * calls; it's easier to just declare that anything that can't be + * done in one call to the plugin is not fast. + */ + if ((offs | count) & (minblock - 1) || count > maxlen) { + *err = ENOTSUP; + return -1; + } + } + if ((flags & NBDKIT_FLAG_FUA) && next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) { flags &= ~NBDKIT_FLAG_FUA; diff --git a/filters/cache/cache.c b/filters/cache/cache.c index b5dbccd2..7c1d6c4f 100644 --- a/filters/cache/cache.c +++ b/filters/cache/cache.c @@ -250,6 +250,17 @@ cache_can_cache (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) return NBDKIT_CACHE_NATIVE; } +/* Override the plugin's .can_fast_zero, because our .zero is not fast */ +static int +cache_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + /* It is better to advertise support even when we always reject fast + * zero attempts. + */ + return 1; +} + /* Read data. */ static int cache_pread (struct nbdkit_next_ops *next_ops, void *nxdata, @@ -418,6 +429,14 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata, int r; bool need_flush = false; + /* We are purposefully avoiding next_ops->zero, so a zero request is + * never faster than plain writes. + */ + if (flags & NBDKIT_FLAG_FAST_ZERO) { + *err = ENOTSUP; + return -1; + } + block = malloc (blksize); if (block == NULL) { *err = errno; @@ -624,6 +643,7 @@ static struct nbdkit_filter filter = { .prepare = cache_prepare, .get_size = cache_get_size, .can_cache = cache_can_cache, + .can_fast_zero = cache_can_fast_zero, .pread = cache_pread, .pwrite = cache_pwrite, .zero = cache_zero, diff --git a/filters/cow/cow.c b/filters/cow/cow.c index 9d91d432..a5c1f978 100644 --- a/filters/cow/cow.c +++ b/filters/cow/cow.c @@ -179,6 +179,17 @@ cow_can_cache (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) return NBDKIT_FUA_NATIVE; } +/* Override the plugin's .can_fast_zero, because our .zero is not fast */ +static int +cow_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + /* It is better to advertise support even when we always reject fast + * zero attempts. + */ + return 1; +} + static int cow_flush (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle, uint32_t flags, int *err); /* Read data. */ @@ -340,6 +351,14 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata, uint64_t blknum, blkoffs; int r; + /* We are purposefully avoiding next_ops->zero, so a zero request is + * never faster than plain writes. + */ + if (flags & NBDKIT_FLAG_FAST_ZERO) { + *err = ENOTSUP; + return -1; + } + block = malloc (BLKSIZE); if (block == NULL) { *err = errno; @@ -496,6 +515,7 @@ static struct nbdkit_filter filter = { .can_extents = cow_can_extents, .can_fua = cow_can_fua, .can_cache = cow_can_cache, + .can_fast_zero = cow_can_fast_zero, .pread = cow_pread, .pwrite = cow_pwrite, .zero = cow_zero, diff --git a/filters/delay/delay.c b/filters/delay/delay.c index c92a12d7..207d101e 100644 --- a/filters/delay/delay.c +++ b/filters/delay/delay.c @@ -37,6 +37,8 @@ #include <stdlib.h> #include <stdint.h> #include <string.h> +#include <stdbool.h> +#include <errno.h> #include <nbdkit-filter.h> @@ -46,6 +48,7 @@ static int delay_zero_ms = 0; /* zero delay (milliseconds) */ static int delay_trim_ms = 0; /* trim delay (milliseconds) */ static int delay_extents_ms = 0;/* extents delay (milliseconds) */ static int delay_cache_ms = 0; /* cache delay (milliseconds) */ +static int delay_fast_zero = 1; /* whether delaying zero includes fast zero */ static int parse_delay (const char *key, const char *value) @@ -182,6 +185,12 @@ delay_config (nbdkit_next_config *next, void *nxdata, return -1; return 0; } + else if (strcmp (key, "delay-fast-zero") == 0) { + delay_fast_zero = nbdkit_parse_bool (value); + if (delay_fast_zero < 0) + return -1; + return 0; + } else return next (nxdata, key, value); } @@ -194,7 +203,19 @@ delay_config (nbdkit_next_config *next, void *nxdata, "delay-trim=<NN>[ms] Trim delay in seconds/milliseconds.\n" \ "delay-extents=<NN>[ms] Extents delay in seconds/milliseconds.\n" \ "delay-cache=<NN>[ms] Cache delay in seconds/milliseconds.\n" \ - "wdelay=<NN>[ms] Write, zero and trim delay in secs/msecs." + "wdelay=<NN>[ms] Write, zero and trim delay in secs/msecs.\n" \ + "delay-fast-zero=<BOOL> Delay fast zero requests (default true).\n" + +/* Override the plugin's .can_fast_zero if needed */ +static int +delay_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + /* Advertise if we are handling fast zero requests locally */ + if (delay_zero_ms && !delay_fast_zero) + return 1; + return next_ops->can_fast_zero (nxdata); +} /* Read data. */ static int @@ -225,6 +246,10 @@ delay_zero (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle, uint32_t count, uint64_t offset, uint32_t flags, int *err) { + if ((flags & NBDKIT_FLAG_FAST_ZERO) && delay_zero_ms && !delay_fast_zero) { + *err = ENOTSUP; + return -1; + } if (zero_delay (err) == -1) return -1; return next_ops->zero (nxdata, count, offset, flags, err); @@ -269,6 +294,7 @@ static struct nbdkit_filter filter = { .version = PACKAGE_VERSION, .config = delay_config, .config_help = delay_config_help, + .can_fast_zero = delay_can_fast_zero, .pread = delay_pread, .pwrite = delay_pwrite, .zero = delay_zero, diff --git a/filters/log/log.c b/filters/log/log.c index 7cf741e1..95667c61 100644 --- a/filters/log/log.c +++ b/filters/log/log.c @@ -260,14 +260,15 @@ log_prepare (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) int F = next_ops->can_fua (nxdata); int e = next_ops->can_extents (nxdata); int c = next_ops->can_cache (nxdata); + int Z = next_ops->can_fast_zero (nxdata); if (size < 0 || w < 0 || f < 0 || r < 0 || t < 0 || z < 0 || F < 0 || - e < 0 || c < 0) + e < 0 || c < 0 || Z < 0) return -1; output (h, "Connect", 0, "size=0x%" PRIx64 " write=%d flush=%d " - "rotational=%d trim=%d zero=%d fua=%d extents=%d cache=%d", - size, w, f, r, t, z, F, e, c); + "rotational=%d trim=%d zero=%d fua=%d extents=%d cache=%d " + "fast_zero=%d", size, w, f, r, t, z, F, e, c, Z); return 0; } @@ -360,10 +361,13 @@ log_zero (struct nbdkit_next_ops *next_ops, void *nxdata, uint64_t id = get_id (h); int r; - assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM))); - output (h, "Zero", id, "offset=0x%" PRIx64 " count=0x%x trim=%d fua=%d ...", + assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM | + NBDKIT_FLAG_FAST_ZERO))); + output (h, "Zero", id, + "offset=0x%" PRIx64 " count=0x%x trim=%d fua=%d fast=%d...", offs, count, !!(flags & NBDKIT_FLAG_MAY_TRIM), - !!(flags & NBDKIT_FLAG_FUA)); + !!(flags & NBDKIT_FLAG_FUA), + !!(flags & NBDKIT_FLAG_FAST_ZERO)); r = next_ops->zero (nxdata, count, offs, flags, err); output_return (h, "...Zero", id, r, err); return r; diff --git a/filters/nozero/nozero.c b/filters/nozero/nozero.c index 964cce9f..e54f7c62 100644 --- a/filters/nozero/nozero.c +++ b/filters/nozero/nozero.c @@ -38,6 +38,7 @@ #include <string.h> #include <stdbool.h> #include <assert.h> +#include <errno.h> #include <nbdkit-filter.h> @@ -49,8 +50,16 @@ static enum ZeroMode { NONE, EMULATE, NOTRIM, + PLUGIN, } zeromode; +static enum FastZeroMode { + DEFAULT, + SLOW, + IGNORE, + NOFAST, +} fastzeromode; + static int nozero_config (nbdkit_next_config *next, void *nxdata, const char *key, const char *value) @@ -60,17 +69,35 @@ nozero_config (nbdkit_next_config *next, void *nxdata, zeromode = EMULATE; else if (strcmp (value, "notrim") == 0) zeromode = NOTRIM; + else if (strcmp (value, "plugin") == 0) + zeromode = PLUGIN; else if (strcmp (value, "none") != 0) { nbdkit_error ("unknown zeromode '%s'", value); return -1; } return 0; } + + if (strcmp (key, "fastzeromode") == 0) { + if (strcmp (value, "none") == 0) + fastzeromode = NOFAST; + else if (strcmp (value, "ignore") == 0) + fastzeromode = IGNORE; + else if (strcmp (value, "slow") == 0) + fastzeromode = SLOW; + else if (strcmp (value, "default") != 0) { + nbdkit_error ("unknown fastzeromode '%s'", value); + return -1; + } + return 0; + } + return next (nxdata, key, value); } #define nozero_config_help \ - "zeromode=<MODE> Either 'none' (default), 'emulate', or 'notrim'.\n" \ + "zeromode=<MODE> One of 'none' (default), 'emulate', 'notrim', 'plugin'.\n" \ + "fastzeromode=<MODE> One of 'default', 'none', 'slow', 'ignore'.\n" /* Check that desired mode is supported by plugin. */ static int @@ -78,12 +105,13 @@ nozero_prepare (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) { int r; - if (zeromode == NOTRIM) { + if (zeromode == NOTRIM || zeromode == PLUGIN) { r = next_ops->can_zero (nxdata); if (r == -1) return -1; if (!r) { - nbdkit_error ("zeromode 'notrim' requires plugin zero support"); + nbdkit_error ("zeromode '%s' requires plugin zero support", + zeromode == NOTRIM ? "notrim" : "plugin"); return -1; } } @@ -94,9 +122,22 @@ nozero_prepare (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) static int nozero_can_zero (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle) { + /* For NOTRIM and PLUGIN modes, we've already verified next_ops->can_zero */ return zeromode != NONE; } +/* Advertise desired FAST_ZERO mode. */ +static int +nozero_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + if (zeromode == NONE) + return 0; + if (zeromode != EMULATE && fastzeromode == DEFAULT) + return next_ops->can_fast_zero (nxdata); + return fastzeromode != NOFAST; +} + static int nozero_zero (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle, uint32_t count, uint64_t offs, uint32_t flags, @@ -106,9 +147,21 @@ nozero_zero (struct nbdkit_next_ops *next_ops, void *nxdata, bool need_flush = false; assert (zeromode != NONE); - flags &= ~NBDKIT_FLAG_MAY_TRIM; + if (flags & NBDKIT_FLAG_FAST_ZERO) { + assert (fastzeromode != NOFAST); + if (fastzeromode == SLOW || + (fastzeromode == DEFAULT && zeromode == EMULATE)) { + *err = ENOTSUP; + return -1; + } + if (fastzeromode == IGNORE) + flags &= ~NBDKIT_FLAG_FAST_ZERO; + } if (zeromode == NOTRIM) + flags &= ~NBDKIT_FLAG_MAY_TRIM; + + if (zeromode != EMULATE) return next_ops->zero (nxdata, count, offs, flags, err); if (flags & NBDKIT_FLAG_FUA) { @@ -144,6 +197,7 @@ static struct nbdkit_filter filter = { .config_help = nozero_config_help, .prepare = nozero_prepare, .can_zero = nozero_can_zero, + .can_fast_zero = nozero_can_fast_zero, .zero = nozero_zero, }; diff --git a/filters/truncate/truncate.c b/filters/truncate/truncate.c index 93d8f074..47d70b31 100644 --- a/filters/truncate/truncate.c +++ b/filters/truncate/truncate.c @@ -201,6 +201,14 @@ truncate_get_size (struct nbdkit_next_ops *next_ops, void *nxdata, return h->size; } +/* Override the plugin's .can_fast_zero, because zeroing a tail is fast. */ +static int +truncate_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata, + void *handle) +{ + return 1; +} + /* Read data. */ static int truncate_pread (struct nbdkit_next_ops *next_ops, void *nxdata, @@ -297,6 +305,12 @@ truncate_zero (struct nbdkit_next_ops *next_ops, void *nxdata, n = count; else n = h->real_size - offset; + if (flags & NBDKIT_FLAG_FAST_ZERO && + next_ops->can_fast_zero (nxdata) <= 0) { + /* TODO: Cache per connection? */ + *err = ENOTSUP; + return -1; + } return next_ops->zero (nxdata, n, offset, flags, err); } return 0; @@ -392,6 +406,7 @@ static struct nbdkit_filter filter = { .close = truncate_close, .prepare = truncate_prepare, .get_size = truncate_get_size, + .can_fast_zero = truncate_can_fast_zero, .pread = truncate_pread, .pwrite = truncate_pwrite, .trim = truncate_trim, diff --git a/tests/test-fua.sh b/tests/test-fua.sh index c0d82db7..1c869e96 100755 --- a/tests/test-fua.sh +++ b/tests/test-fua.sh @@ -106,14 +106,14 @@ test $(grep -c 'connection=1 Flush' fua1.log) -lt \ # all earlier parts of the transaction do not have fua flush1=$(grep -c 'connection=1 Flush' fua2.log || :) flush2=$(grep -c 'connection=2 Flush' fua2.log || :) -fua=$(grep -c 'connection=2.*fua=1 \.' fua2.log || :) +fua=$(grep -c 'connection=2.*fua=1 .*\.' fua2.log || :) test $(( $flush2 - $flush1 + $fua )) = 2 # Test 3: every part of split has fua, and no flushes are added flush1=$(grep -c 'connection=1 Flush' fua3.log || :) flush2=$(grep -c 'connection=2 Flush' fua3.log || :) test $flush1 = $flush2 -test $(grep -c 'connection=2.*fua=1 \.' fua3.log) = 32 +test $(grep -c 'connection=2.*fua=1 .*\.' fua3.log) = 32 # Test 4: flush is no-op, and every transaction has fua if grep 'fua=0' fua4.log; then -- 2.21.0
Eric Blake
2019-Aug-23 14:40 UTC
[Libguestfs] [nbdkit PATCH 3/3] plugins: Add .can_fast_zero hook
Allow plugins to affect the handling of the new NBD_CMD_FLAG_FAST_ZERO flag, then update affected plugins. In particular, in-memory plugins are always fast; the full plugin is better served by omitting .zero and relying on .pwrite fallback; nbd takes advantage of libnbd extensions proposed in parallel to pass through support; and v2 language bindings expose the choice to their scripts. The testsuite is updated thanks to the sh plugin to cover this. In turn, the sh plugin has to be a bit smarter about handling missing can_fast_zero to get decent fallback support from nbdkit. Plugins untouched by this patch either do not support .zero with flags (including v1 plugins; these are all okay with the default behavior of advertising but always failing fast zeroes), or are too difficult to analyze in this patch (so not advertising is easier than having to decide - in particular, the file plugin is tricky, since BLKZEROOUT is not reliably fast). The nozero filter can be used to adjust fast zero settings for a plugin that has not yet updated. Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/nbdkit-plugin.pod | 74 +++++++++++++++---- plugins/sh/nbdkit-sh-plugin.pod | 13 +++- server/plugins.c | 25 +++++-- include/nbdkit-plugin.h | 2 + plugins/data/data.c | 14 +++- plugins/full/full.c | 12 ++-- plugins/memory/memory.c | 14 +++- plugins/nbd/nbd.c | 28 +++++++- plugins/null/null.c | 8 +++ plugins/ocaml/ocaml.c | 25 +++++++ plugins/sh/sh.c | 39 +++++++--- plugins/ocaml/NBDKit.ml | 10 ++- plugins/ocaml/NBDKit.mli | 2 + plugins/rust/src/lib.rs | 3 + tests/test-eflags.sh | 122 ++++++++++++++++++++++++++++---- 15 files changed, 332 insertions(+), 59 deletions(-) diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod index bc3d9749..f3793e7a 100644 --- a/docs/nbdkit-plugin.pod +++ b/docs/nbdkit-plugin.pod @@ -609,19 +609,47 @@ C<.trim> callback has been defined. This is called during the option negotiation phase to find out if the plugin wants the C<.zero> callback to be utilized. Support for -writing zeroes is still advertised to the client (unless the nbdkit -filter nozero is also used), so returning false merely serves as a way -to avoid complicating the C<.zero> callback to have to fail with -C<ENOTSUP> or C<EOPNOTSUPP> on the connections where it will never be -more efficient than using C<.pwrite> up front. +writing zeroes is still advertised to the client (unless the +L<nbdkit-nozero-filter(1)> is also used), so returning false merely +serves as a way to avoid complicating the C<.zero> callback to have to +fail with C<ENOTSUP> or C<EOPNOTSUPP> on the connections where it will +never be more efficient than using C<.pwrite> up front. If there is an error, C<.can_zero> should call C<nbdkit_error> with an error message and return C<-1>. -This callback is not required. If omitted, then nbdkit always tries -C<.zero> first if it is present, and gracefully falls back to -C<.pwrite> if C<.zero> was absent or failed with C<ENOTSUP> or -C<EOPNOTSUPP>. +This callback is not required. If omitted, then for a normal zero +request, nbdkit always tries C<.zero> first if it is present, and +gracefully falls back to C<.pwrite> if C<.zero> was absent or failed +with C<ENOTSUP> or C<EOPNOTSUPP>. + +=head2 C<.can_fast_zero> + + int can_fast_zero (void *handle); + +This is called during the option negotiation phase to find out if the +plugin wants to advertise support for fast zero requests. If this +support is not advertised, a client cannot attempt fast zero requests, +and has no way to tell if writing zeroes offers any speedups compared +to using C<.pwrite> (other than compressed network traffic). If +support is advertised, then C<.zero> will have +C<NBDKIT_FLAG_FAST_ZERO> set when the client has requested a fast +zero, in which case the plugin must fail with C<ENOTSUP> or +C<EOPNOTSUPP> up front if the request would not offer any benefits +over C<.pwrite>. Advertising support for fast zero requests does not +require that writing zeroes be fast, only that the result (whether +success or failure) is fast, so this should be advertised when +feasible. + +If there is an error, C<.can_fast_zero> should call C<nbdkit_error> +with an error message and return C<-1>. + +This callback is not required. If omitted, then nbdkit returns true +if C<.zero> is absent or C<.can_zero> returns false (in those cases, +nbdkit fails all fast zero requests, as its fallback to C<.pwrite> is +not inherently faster), otherwise false (since it cannot be determined +in advance if the plugin's C<.zero> will properly honor the semantics +of C<NBDKIT_FLAG_FAST_ZERO>). =head2 C<.can_extents> @@ -804,15 +832,25 @@ bytes of zeroes at C<offset> in the backing store. This function will not be called if C<.can_zero> returned false. On input, the parameter C<flags> may include C<NBDKIT_FLAG_MAY_TRIM> -unconditionally, and C<NBDKIT_FLAG_FUA> based on the result of -C<.can_fua>. +unconditionally, C<NBDKIT_FLAG_FUA> based on the result of +C<.can_fua>, and C<NBDKIT_FLAG_FAST_ZERO> based on the result of +C<.can_fast_zero>. If C<NBDKIT_FLAG_MAY_TRIM> is requested, the operation can punch a hole instead of writing actual zero bytes, but only if subsequent -reads from the hole read as zeroes. If this callback is omitted, or -if it fails with C<ENOTSUP> or C<EOPNOTSUPP> (whether by -C<nbdkit_set_error> or C<errno>), then C<.pwrite> will be used -instead. +reads from the hole read as zeroes. + +If C<NBDKIT_FLAG_FAST_ZERO> is requested, the plugin must decide up +front if the implementation is likely to be faster than a +corresponding C<.pwrite>; if not, then it must immediately fail with +C<ENOTSUP> or C<EOPNOTSUPP> (whether by C<nbdkit_set_error> or +C<errno>) and preferably without modifying the exported image. It is +acceptable to always fail a fast zero request (as a fast failure is +better than attempting the write only to find out after the fact that +it was not fast after all). Note that on Linux, support for +C<ioctl(BLKZEROOUT)> is insufficient for determining whether a zero +request to a block device will be fast (because the kernel will +perform a slow fallback when needed). The callback must write the whole C<count> bytes if it can. The NBD protocol doesn't allow partial writes (instead, these would be @@ -823,6 +861,11 @@ If there is an error, C<.zero> should call C<nbdkit_error> with an error message, and C<nbdkit_set_error> to record an appropriate error (unless C<errno> is sufficient), then return C<-1>. +If this callback is omitted, or if it fails with C<ENOTSUP> or +C<EOPNOTSUPP> (whether by C<nbdkit_set_error> or C<errno>), then +C<.pwrite> will be used as an automatic fallback except when the +client requested a fast zero. + =head2 C<.extents> int extents (void *handle, uint32_t count, uint64_t offset, @@ -1221,6 +1264,7 @@ and then users will be able to run it like this: =head1 SEE ALSO L<nbdkit(1)>, +L<nbdkit-nozero-filter(3)>, L<nbdkit-filter(3)>. Standard plugins provided by nbdkit: diff --git a/plugins/sh/nbdkit-sh-plugin.pod b/plugins/sh/nbdkit-sh-plugin.pod index 9e9a133e..adb8a0db 100644 --- a/plugins/sh/nbdkit-sh-plugin.pod +++ b/plugins/sh/nbdkit-sh-plugin.pod @@ -289,7 +289,10 @@ The script should exit with code C<0> for true or code C<3> for false. =item C<is_rotational> +=item C<can_fast_zero> + /path/to/script is_rotational <handle> + /path/to/script can_fast_zero <handle> The script should exit with code C<0> for true or code C<3> for false. @@ -361,12 +364,18 @@ also provide a C<can_trim> method which exits with code C<0> (true). /path/to/script zero <handle> <count> <offset> <flags> The C<flags> parameter can be an empty string or a comma-separated -list of the flags: C<"fua"> and C<"may_trim"> (eg. C<"">, C<"fua">, -C<"fua,may_trim"> are all possible values). +list of the flags: C<"fua">, C<"may_trim">, and C<"fast"> (eg. C<"">, +C<"fua">, C<"fua,may_trim,fast"> are some of the 8 possible values). Unlike in other languages, if you provide a C<zero> method you B<must> also provide a C<can_zero> method which exits with code C<0> (true). +To trigger a fallback to <pwrite> on a normal zero request, or to +respond quickly to the C<"fast"> flag that a specific zero request is +no faster than a corresponding write, the script must output +C<ENOTSUP> or C<EOPNOTSUPP> to stderr (possibly followed by a +description of the problem) before exiting with code C<1> (failure). + =item C<extents> /path/to/script extents <handle> <count> <offset> <flags> diff --git a/server/plugins.c b/server/plugins.c index c6dcf408..84329cf4 100644 --- a/server/plugins.c +++ b/server/plugins.c @@ -404,11 +404,25 @@ plugin_can_zero (struct backend *b, struct connection *conn) static int plugin_can_fast_zero (struct backend *b, struct connection *conn) { + struct backend_plugin *p = container_of (b, struct backend_plugin, backend); + int r; + assert (connection_get_handle (conn, 0)); debug ("can_fast_zero"); - return 0; /* Upcoming patch will actually add support. */ + if (p->plugin.can_fast_zero) + return p->plugin.can_fast_zero (connection_get_handle (conn, 0)); + /* Advertise support for fast zeroes if no .zero or .can_zero is + * false: in those cases, we fail fast instead of using .pwrite. + * This also works when v1 plugin has only ._zero_old. + */ + if (p->plugin.zero == NULL) + return 1; + r = plugin_can_zero (b, conn); + if (r == -1) + return -1; + return !r; } static int @@ -656,15 +670,18 @@ plugin_zero (struct backend *b, struct connection *conn, } if (can_zero) { - /* if (!can_fast_zero) */ - flags &= ~NBDKIT_FLAG_FAST_ZERO; errno = 0; if (p->plugin.zero) r = p->plugin.zero (connection_get_handle (conn, 0), count, offset, flags); - else if (p->plugin._zero_old) + else if (p->plugin._zero_old) { + if (fast_zero) { + *err = EOPNOTSUPP; + return -1; + } r = p->plugin._zero_old (connection_get_handle (conn, 0), count, offset, may_trim); + } else emulate = true; if (r == -1) diff --git a/include/nbdkit-plugin.h b/include/nbdkit-plugin.h index 632df867..45ae7053 100644 --- a/include/nbdkit-plugin.h +++ b/include/nbdkit-plugin.h @@ -132,6 +132,8 @@ struct nbdkit_plugin { int (*cache) (void *handle, uint32_t count, uint64_t offset, uint32_t flags); int (*thread_model) (void); + + int (*can_fast_zero) (void *handle); }; extern void nbdkit_set_error (int err); diff --git a/plugins/data/data.c b/plugins/data/data.c index 14fb8997..9004a487 100644 --- a/plugins/data/data.c +++ b/plugins/data/data.c @@ -349,6 +349,13 @@ data_can_cache (void *handle) return NBDKIT_CACHE_NATIVE; } +/* Fast zero. */ +static int +data_can_fast_zero (void *handle) +{ + return 1; +} + /* Read data. */ static int data_pread (void *handle, void *buf, uint32_t count, uint64_t offset, @@ -375,8 +382,10 @@ data_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, static int data_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags) { - /* Flushing, and thus FUA flag, is a no-op */ - assert ((flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM)) == 0); + /* Flushing, and thus FUA flag, is a no-op. Assume that + * sparse_array_zero generally beats writes, so FAST_ZERO is a no-op. */ + assert ((flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM | + NBDKIT_FLAG_FAST_ZERO)) == 0); ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock); sparse_array_zero (sa, count, offset); return 0; @@ -423,6 +432,7 @@ static struct nbdkit_plugin plugin = { .can_multi_conn = data_can_multi_conn, .can_fua = data_can_fua, .can_cache = data_can_cache, + .can_fast_zero = data_can_fast_zero, .pread = data_pread, .pwrite = data_pwrite, .zero = data_zero, diff --git a/plugins/full/full.c b/plugins/full/full.c index 9cfbcfcd..0b69a8c9 100644 --- a/plugins/full/full.c +++ b/plugins/full/full.c @@ -129,13 +129,10 @@ full_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, return -1; } -/* Write zeroes. */ -static int -full_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags) -{ - errno = ENOSPC; - return -1; -} +/* Omitting full_zero is intentional: that way, nbdkit defaults to + * permitting fast zeroes which respond with ENOTSUP, while normal + * zeroes fall back to pwrite and respond with ENOSPC. + */ /* Trim. */ static int @@ -172,7 +169,6 @@ static struct nbdkit_plugin plugin = { .can_cache = full_can_cache, .pread = full_pread, .pwrite = full_pwrite, - .zero = full_zero, .trim = full_trim, .extents = full_extents, /* In this plugin, errno is preserved properly along error return diff --git a/plugins/memory/memory.c b/plugins/memory/memory.c index 09162ea2..e831a7b5 100644 --- a/plugins/memory/memory.c +++ b/plugins/memory/memory.c @@ -147,6 +147,13 @@ memory_can_cache (void *handle) return NBDKIT_CACHE_NATIVE; } +/* Fast zero. */ +static int +memory_can_fast_zero (void *handle) +{ + return 1; +} + /* Read data. */ static int memory_pread (void *handle, void *buf, uint32_t count, uint64_t offset, @@ -173,8 +180,10 @@ memory_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, static int memory_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags) { - /* Flushing, and thus FUA flag, is a no-op */ - assert ((flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM)) == 0); + /* Flushing, and thus FUA flag, is a no-op. Assume that + * sparse_array_zero generally beats writes, so FAST_ZERO is a no-op. */ + assert ((flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM | + NBDKIT_FLAG_FAST_ZERO)) == 0); ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock); sparse_array_zero (sa, count, offset); return 0; @@ -221,6 +230,7 @@ static struct nbdkit_plugin plugin = { .can_fua = memory_can_fua, .can_multi_conn = memory_can_multi_conn, .can_cache = memory_can_cache, + .can_fast_zero = memory_can_fast_zero, .pread = memory_pread, .pwrite = memory_pwrite, .zero = memory_zero, diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c index 09c8891e..cddcfde2 100644 --- a/plugins/nbd/nbd.c +++ b/plugins/nbd/nbd.c @@ -633,6 +633,24 @@ nbdplug_can_zero (void *handle) return i; } +static int +nbdplug_can_fast_zero (void *handle) +{ +#if LIBNBD_HAVE_NBD_CAN_FAST_ZERO + struct handle *h = handle; + int i = nbd_can_fast_zero (h->nbd); + + if (i == -1) { + nbdkit_error ("failure to check fast zero flag: %s", nbd_get_error ()); + return -1; + } + return i; +#else + /* libnbd 0.9.8 lacks fast zero support */ + return 0; +#endif +} + static int nbdplug_can_fua (void *handle) { @@ -724,12 +742,19 @@ nbdplug_zero (void *handle, uint32_t count, uint64_t offset, uint32_t flags) struct transaction s; uint32_t f = 0; - assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM))); + assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM | + NBDKIT_FLAG_FAST_ZERO))); if (!(flags & NBDKIT_FLAG_MAY_TRIM)) f |= LIBNBD_CMD_FLAG_NO_HOLE; if (flags & NBDKIT_FLAG_FUA) f |= LIBNBD_CMD_FLAG_FUA; +#if LIBNBD_HAVE_NBD_CAN_FAST_ZERO + if (flags & NBDKIT_FLAG_FAST_ZERO) + f |= LIBNBD_CMD_FLAG_FAST_ZERO; +#else + assert (!(flags & NBDKIT_FLAG_FAST_ZERO)); +#endif nbdplug_prepare (&s); nbdplug_register (h, &s, nbd_aio_zero (h->nbd, count, offset, s.cb, f)); return nbdplug_reply (h, &s); @@ -831,6 +856,7 @@ static struct nbdkit_plugin plugin = { .is_rotational = nbdplug_is_rotational, .can_trim = nbdplug_can_trim, .can_zero = nbdplug_can_zero, + .can_fast_zero = nbdplug_can_fast_zero, .can_fua = nbdplug_can_fua, .can_multi_conn = nbdplug_can_multi_conn, .can_extents = nbdplug_can_extents, diff --git a/plugins/null/null.c b/plugins/null/null.c index 647624ba..559cb815 100644 --- a/plugins/null/null.c +++ b/plugins/null/null.c @@ -100,6 +100,13 @@ null_can_cache (void *handle) return NBDKIT_CACHE_NATIVE; } +/* Fast zero. */ +static int +null_can_fast_zero (void *handle) +{ + return 1; +} + /* Read data. */ static int null_pread (void *handle, void *buf, uint32_t count, uint64_t offset, @@ -167,6 +174,7 @@ static struct nbdkit_plugin plugin = { .get_size = null_get_size, .can_multi_conn = null_can_multi_conn, .can_cache = null_can_cache, + .can_fast_zero = null_can_fast_zero, .pread = null_pread, .pwrite = null_pwrite, .zero = null_zero, diff --git a/plugins/ocaml/ocaml.c b/plugins/ocaml/ocaml.c index 144a449e..a655f9ca 100644 --- a/plugins/ocaml/ocaml.c +++ b/plugins/ocaml/ocaml.c @@ -134,6 +134,8 @@ static value cache_fn; static value thread_model_fn; +static value can_fast_zero_fn; + /*----------------------------------------------------------------------*/ /* Wrapper functions that translate calls from C (ie. nbdkit) to OCaml. */ @@ -705,6 +707,25 @@ thread_model_wrapper (void) CAMLreturnT (int, Int_val (rv)); } +static int +can_fast_zero_wrapper (void *h) +{ + CAMLparam0 (); + CAMLlocal1 (rv); + + caml_leave_blocking_section (); + + rv = caml_callback_exn (can_fast_zero_fn, *(value *) h); + if (Is_exception_result (rv)) { + nbdkit_error ("%s", caml_format_exception (Extract_exception (rv))); + caml_enter_blocking_section (); + CAMLreturnT (int, -1); + } + + caml_enter_blocking_section (); + CAMLreturnT (int, Bool_val (rv)); +} + /*----------------------------------------------------------------------*/ /* set_* functions called from OCaml code at load time to initialize * fields in the plugin struct. @@ -792,6 +813,8 @@ SET(cache) SET(thread_model) +SET(can_fast_zero) + #undef SET static void @@ -836,6 +859,8 @@ remove_roots (void) REMOVE (thread_model); + REMOVE (can_fast_zero); + #undef REMOVE } diff --git a/plugins/sh/sh.c b/plugins/sh/sh.c index c73b08b7..d5db8b59 100644 --- a/plugins/sh/sh.c +++ b/plugins/sh/sh.c @@ -478,6 +478,9 @@ flags_string (uint32_t flags, char *buf, size_t len) if (flags & NBDKIT_FLAG_REQ_ONE) flag_append ("req_one", &comma, &buf, &len); + + if (flags & NBDKIT_FLAG_FAST_ZERO) + flag_append("fast", &comma, &buf, &len); } static void @@ -536,7 +539,7 @@ sh_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset, /* Common code for handling all boolean methods like can_write etc. */ static int -boolean_method (void *handle, const char *method_name) +boolean_method (void *handle, const char *method_name, int def) { char *h = handle; const char *args[] = { script, method_name, h, NULL }; @@ -546,8 +549,8 @@ boolean_method (void *handle, const char *method_name) return 1; case RET_FALSE: /* false */ return 0; - case MISSING: /* missing => assume false */ - return 0; + case MISSING: /* missing => caller chooses default */ + return def; case ERROR: /* error cases */ return -1; default: abort (); @@ -557,37 +560,37 @@ boolean_method (void *handle, const char *method_name) static int sh_can_write (void *handle) { - return boolean_method (handle, "can_write"); + return boolean_method (handle, "can_write", 0); } static int sh_can_flush (void *handle) { - return boolean_method (handle, "can_flush"); + return boolean_method (handle, "can_flush", 0); } static int sh_is_rotational (void *handle) { - return boolean_method (handle, "is_rotational"); + return boolean_method (handle, "is_rotational", 0); } static int sh_can_trim (void *handle) { - return boolean_method (handle, "can_trim"); + return boolean_method (handle, "can_trim", 0); } static int sh_can_zero (void *handle) { - return boolean_method (handle, "can_zero"); + return boolean_method (handle, "can_zero", 0); } static int sh_can_extents (void *handle) { - return boolean_method (handle, "can_extents"); + return boolean_method (handle, "can_extents", 0); } /* Not a boolean method, the method prints "none", "emulate" or "native". */ @@ -646,7 +649,7 @@ sh_can_fua (void *handle) static int sh_can_multi_conn (void *handle) { - return boolean_method (handle, "can_multi_conn"); + return boolean_method (handle, "can_multi_conn", 0); } /* Not a boolean method, the method prints "none", "emulate" or "native". */ @@ -696,6 +699,21 @@ sh_can_cache (void *handle) } } +static int +sh_can_fast_zero (void *handle) +{ + int r = boolean_method (handle, "can_fast_zero", 2); + if (r < 2) + return r; + /* We need to duplicate the logic of plugins.c: if can_fast_zero is + * missing, we advertise fast fail support when can_zero is false. + */ + r = sh_can_zero (handle); + if (r == -1) + return -1; + return !r; +} + static int sh_flush (void *handle, uint32_t flags) { @@ -962,6 +980,7 @@ static struct nbdkit_plugin plugin = { .can_fua = sh_can_fua, .can_multi_conn = sh_can_multi_conn, .can_cache = sh_can_cache, + .can_fast_zero = sh_can_fast_zero, .pread = sh_pread, .pwrite = sh_pwrite, diff --git a/plugins/ocaml/NBDKit.ml b/plugins/ocaml/NBDKit.ml index e54a7705..7002ac03 100644 --- a/plugins/ocaml/NBDKit.ml +++ b/plugins/ocaml/NBDKit.ml @@ -96,6 +96,8 @@ type 'a plugin = { cache : ('a -> int32 -> int64 -> flags -> unit) option; thread_model : (unit -> thread_model) option; + + can_fast_zero : ('a -> bool) option; } let default_callbacks = { @@ -141,6 +143,8 @@ let default_callbacks = { cache = None; thread_model = None; + + can_fast_zero = None; } external set_name : string -> unit = "ocaml_nbdkit_set_name" "noalloc" @@ -186,6 +190,8 @@ external set_cache : ('a -> int32 -> int64 -> flags -> unit) -> unit = "ocaml_nb external set_thread_model : (unit -> thread_model) -> unit = "ocaml_nbdkit_set_thread_model" +external set_can_fast_zero : ('a -> bool) -> unit = "ocaml_nbdkit_set_can_fast_zero" + let may f = function None -> () | Some a -> f a let register_plugin plugin @@ -249,7 +255,9 @@ let register_plugin plugin may set_can_cache plugin.can_cache; may set_cache plugin.cache; - may set_thread_model plugin.thread_model + may set_thread_model plugin.thread_model; + + may set_can_fast_zero plugin.can_fast_zero external _set_error : int -> unit = "ocaml_nbdkit_set_error" "noalloc" diff --git a/plugins/ocaml/NBDKit.mli b/plugins/ocaml/NBDKit.mli index 778250ef..06648b7f 100644 --- a/plugins/ocaml/NBDKit.mli +++ b/plugins/ocaml/NBDKit.mli @@ -101,6 +101,8 @@ type 'a plugin = { cache : ('a -> int32 -> int64 -> flags -> unit) option; thread_model : (unit -> thread_model) option; + + can_fast_zero : ('a -> bool) option; } (** The plugin fields and callbacks. ['a] is the handle type. *) diff --git a/plugins/rust/src/lib.rs b/plugins/rust/src/lib.rs index 53619dd9..313b4ca6 100644 --- a/plugins/rust/src/lib.rs +++ b/plugins/rust/src/lib.rs @@ -105,6 +105,8 @@ pub struct Plugin { flags: u32) -> c_int>, pub thread_model: Option<extern fn () -> ThreadModel>, + + pub can_fast_zero: Option<extern fn (h: *mut c_void) -> c_int>, } #[repr(C)] @@ -163,6 +165,7 @@ impl Plugin { can_cache: None, cache: None, thread_model: None, + can_fast_zero: None, } } } diff --git a/tests/test-eflags.sh b/tests/test-eflags.sh index f5cd43ed..9b3a6a3a 100755 --- a/tests/test-eflags.sh +++ b/tests/test-eflags.sh @@ -68,6 +68,7 @@ SEND_DF=$(( 1 << 7 )) CAN_MULTI_CONN=$(( 1 << 8 )) SEND_RESIZE=$(( 1 << 9 )) SEND_CACHE=$(( 1 << 10 )) +SEND_FAST_ZERO=$(( 1 << 11 )) do_nbdkit () { @@ -133,8 +134,8 @@ EOF #---------------------------------------------------------------------- # can_write=true # -# NBD_FLAG_SEND_WRITE_ZEROES is set on writable connections -# even when can_zero returns false, because nbdkit reckons it +# NBD_FLAG_SEND_WRITE_ZEROES and NBD_FLAG_SEND_FAST_ZERO are set on writable +# connections even when can_zero returns false, because nbdkit reckons it # can emulate zeroing using pwrite. do_nbdkit <<'EOF' @@ -145,8 +146,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # --filter=nozero @@ -255,8 +256,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_TRIM|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_TRIM|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_TRIM|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_TRIM|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # can_write=true @@ -271,8 +272,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|ROTATIONAL|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|ROTATIONAL|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|ROTATIONAL|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|ROTATIONAL|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # -r @@ -304,8 +305,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # -r @@ -341,8 +342,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_FLUSH|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_FLUSH|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_FLUSH|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_FLUSH|SEND_FUA|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # can_write=true @@ -361,8 +362,8 @@ case "$1" in esac EOF -[ $eflags -eq $(( HAS_FLAGS|SEND_FLUSH|SEND_WRITE_ZEROES|SEND_DF )) ] || - fail "$LINENO: expected HAS_FLAGS|SEND_FLUSH|SEND_WRITE_ZEROES|SEND_DF" +[ $eflags -eq $(( HAS_FLAGS|SEND_FLUSH|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_FLUSH|SEND_WRITE_ZEROES|SEND_DF|SEND_FAST_ZERO" #---------------------------------------------------------------------- # -r @@ -448,3 +449,96 @@ EOF [ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" + +#---------------------------------------------------------------------- +# -r +# can_fast_zero=true +# +# Fast zero support isn't advertised without regular zero support + +do_nbdkit -r <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_fast_zero) exit 0 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" + +#---------------------------------------------------------------------- +# --filter=nozero +# can_write=true +# can_fast_zero=true +# +# Fast zero support isn't advertised without regular zero support + +do_nbdkit --filter=nozero <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_write) exit 0 ;; + can_fast_zero) exit 0 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|SEND_DF" + +#---------------------------------------------------------------------- +# can_write=true +# can_zero=true +# +# Fast zero support is omitted for a plugin that has .zero but did not opt in + +do_nbdkit -r <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_write) exit 0 ;; + can_zero) exit 0 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" + +#---------------------------------------------------------------------- +# can_write=true +# can_zero=true +# can_fast_zero=false +# +# Fast zero support is omitted if the plugin says so + +do_nbdkit -r <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_write) exit 0 ;; + can_zero) exit 0 ;; + can_fast_zero) exit 3 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" + +#---------------------------------------------------------------------- +# can_write=true +# can_zero=false +# can_fast_zero=false +# +# Fast zero support is omitted if the plugin says so + +do_nbdkit -r <<'EOF' +case "$1" in + get_size) echo 1M ;; + can_write) exit 0 ;; + can_fast_zero) exit 3 ;; + *) exit 2 ;; +esac +EOF + +[ $eflags -eq $(( HAS_FLAGS|READ_ONLY|SEND_DF )) ] || + fail "$LINENO: expected HAS_FLAGS|READ_ONLY|SEND_DF" -- 2.21.0
Richard W.M. Jones
2019-Aug-27 12:14 UTC
Re: [Libguestfs] cross-project patches: Add NBD Fast Zero support
On Fri, Aug 23, 2019 at 09:30:36AM -0500, Eric Blake wrote:> I've run several tests to demonstrate why this is useful, as well as > prove that because I have multiple interoperable projects, it is worth > including in the NBD standard. The original proposal was here: > https://lists.debian.org/nbd/2019/03/msg00004.html > where I stated: > > > I will not push this without both: > > - a positive review (for example, we may decide that burning another > > NBD_FLAG_* is undesirable, and that we should instead have some sort > > of NBD_OPT_ handshake for determining when the server supports > > NBD_CMF_FLAG_FAST_ZERO) > > - a reference client and server implementation (probably both via qemu, > > since it was qemu that raised the problem in the first place)Is the plan to wait until NBD_CMF_FLAG_FAST_ZERO gets into the NBD protocol doc before doing the rest? Also I would like to release both libnbd 1.0 and nbdkit 1.14 before we introduce any large new features. Both should be released this week, in fact maybe even today or tomorrow. [...]> First, I had to create a scenario where falling back to writes is > noticeably slower than performing a zero operation, and where > pre-zeroing also shows an effect. My choice: let's test 'qemu-img > convert' on an image that is half-sparse (every other megabyte is a > hole) to an in-memory nbd destination. Then I use a series of nbdkit > filters to force the destination to behave in various manners: > log logfile=>(sed ...|uniq -c) (track how many normal/fast zero > requests the client makes) > nozero $params (fine-tune how zero requests behave - the parameters > zeromode and fastzeromode are the real drivers of my various tests) > blocksize maxdata=256k (allows large zero requests, but forces large > writes into smaller chunks, to magnify the effects of write delays and > allow testing to provide obvious results with a smaller image) > delay delay-write=20ms delay-zero=5ms (also to magnify the effects on a > smaller image, with writes penalized more than zeroing) > stats statsfile=/dev/stderr (to track overall time and a decent summary > of how much I/O occurred). > noextents (forces the entire image to report that it is allocated, > which eliminates any testing variability based on whether qemu-img uses > that to bypass a zeroing operation [1])I can't help thinking that a sh plugin might have been simpler ...> I hope you enjoyed reading this far, and agree with my interpretation of > the numbers about why this feature is useful!Yes it seems reasonable. The only thought I had is whether the qemu block layer does or should combine requests in flight so that a write-zero (offset) followed by a write-data (same offset) would erase the earlier request. In some circumstances that might provide a performance improvement without needing any changes to protocols.> - NBD should have a way to advertise (probably via NBD_INFO_ during > NBD_OPT_GO) if the initial image is known to begin life with all zeroes > (if that is the case, qemu-img can skip the extents calls and > pre-zeroing pass altogether)Yes, I really think we should do this one as well. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Eric Blake
2019-Aug-27 13:23 UTC
Re: [Libguestfs] cross-project patches: Add NBD Fast Zero support
On 8/27/19 7:14 AM, Richard W.M. Jones wrote:> > Is the plan to wait until NBD_CMF_FLAG_FAST_ZERO gets into the NBD > protocol doc before doing the rest? Also I would like to release both > libnbd 1.0 and nbdkit 1.14 before we introduce any large new features. > Both should be released this week, in fact maybe even today or > tomorrow.Sure, I don't mind this being the first feature for the eventual libnbd 1.2 and nbdkit 1.16.> > [...] >> First, I had to create a scenario where falling back to writes is >> noticeably slower than performing a zero operation, and where >> pre-zeroing also shows an effect. My choice: let's test 'qemu-img >> convert' on an image that is half-sparse (every other megabyte is a >> hole) to an in-memory nbd destination. Then I use a series of nbdkit >> filters to force the destination to behave in various manners: >> log logfile=>(sed ...|uniq -c) (track how many normal/fast zero >> requests the client makes) >> nozero $params (fine-tune how zero requests behave - the parameters >> zeromode and fastzeromode are the real drivers of my various tests) >> blocksize maxdata=256k (allows large zero requests, but forces large >> writes into smaller chunks, to magnify the effects of write delays and >> allow testing to provide obvious results with a smaller image) >> delay delay-write=20ms delay-zero=5ms (also to magnify the effects on a >> smaller image, with writes penalized more than zeroing) >> stats statsfile=/dev/stderr (to track overall time and a decent summary >> of how much I/O occurred). >> noextents (forces the entire image to report that it is allocated, >> which eliminates any testing variability based on whether qemu-img uses >> that to bypass a zeroing operation [1]) > > I can't help thinking that a sh plugin might have been simpler ...Maybe, but the extra cost of forking per request may have also made obvious timing comparisons harder. I'm just glad that nbdkit's filtering system was flexible enough to do what I wanted, even if I did have fun stringing together 6 filters :)> >> I hope you enjoyed reading this far, and agree with my interpretation of >> the numbers about why this feature is useful! > > Yes it seems reasonable. > > The only thought I had is whether the qemu block layer does or should > combine requests in flight so that a write-zero (offset) followed by a > write-data (same offset) would erase the earlier request. In some > circumstances that might provide a performance improvement without > needing any changes to protocols.As in, maintain a backlog of requests that are needed but have not yet been sent over the wire because of backlog, and merge those requests (by splitting an existing large zero request into smaller pieces) if write requests come in that window before actually transmitting to the NBD server? I know qemu has some write coalescing when servicing guest behaviors; but I was testing on 'qemu-img convert' which does not depend on guest behavior and therefore has already sent the zero request to the NBD server before sending any data writes, so coalescing wouldn't see anything to combine. Or are you worried about qemu as the NBD server, performing coalescing of incoming requests from the client? But you are right that some smarts about I/O coalescing at various points in the data path may show some slight optimizations.> >> - NBD should have a way to advertise (probably via NBD_INFO_ during >> NBD_OPT_GO) if the initial image is known to begin life with all zeroes >> (if that is the case, qemu-img can skip the extents calls and >> pre-zeroing pass altogether) > > Yes, I really think we should do this one as well.Stay tuned for my next cross-project post ;) Hopefully in the next week or so. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-Aug-28 13:04 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO
On 8/28/19 4:57 AM, Vladimir Sementsov-Ogievskiy wrote:>> Hence, it is desirable to have a way for clients to specify that a >> particular write zero request is being attempted for a fast wipe, and >> get an immediate failure if the zero request would otherwise take the >> same time as a write. Conversely, if the client is not performing a >> pre-initialization pass, it is still more efficient in terms of >> networking traffic to send NBD_CMD_WRITE_ZERO requests where the >> server implements the fallback to the slower write, than it is for the >> client to have to perform the fallback to send NBD_CMD_WRITE with a >> zeroed buffer. > > How are you going to finally use it in qemu-img convert?It's already in use there (in fact, the cover letter shows actual timing examples of how qemu-img's use of BDRV_REQ_NO_FALLBACK, which translates to NBD_CMD_FLAG_FAST_ZERO, observably affects timing).> Ok, we have a loop > of sending write-zero requests. And on first ENOTSUP we'll assume that there > is no benefit to continue? But what if actually server returns ENOTSUP only > once when we have 1000 iterations? Seems we should still do zeroing if we > have only a few ENOTSUPs...When attempting a bulk zero, you try to wipe the entire device, presumably with something that is large and aligned. Yes, if you have to split the write zero request due to size limitations, then you risk that the first write zero succeeds but later ones fail, then you didn't wipe the entire disk, but you also don't need to redo the work on the first half of the image. But it is much more likely that the first write of the bulk zero is representative of the overall operation (and so in practice, it only takes one fast zero attempt to learn if bulk zeroing is worthwhile, then continue with fast zeroes without issues).> > I understand that fail-on-first ENOTSUP is OK for raw-without-fallocte vs qcow2, > as first will always return ENOTSUP and second will never fail.. But in such way > we'll OK with simpler extension, which only have one server-advirtised negotiation > flag NBD_FLAG_ZERO_IS_FAST.Advertising that a server's zero behavior is always going to be successfully fast is a much harder flag to implement. The justification for the semantics I chose (advertising that the server can quickly report failure if success is not fast, but not requiring fast zero) covers the case when the decision of whether a zero is fast may also depend on other factors - for example, if the server knows the image starts in an all-zero state, then it can track a boolean: all write zero requests while the boolean is set return immediate success (nothing to do), but after the first regular write, the boolean is set to false, and all further write zero requests fail as being potentially slow; and such an implementation is still useful for the qemu-img convert case.> > There is not such problem if we have only one iteration, so may be new command > FILL_ZERO, filling the whole device by zeros?Or better yet, implement support for 64-bit commands. Yes, my cover letter called out further orthogonal extensions, and implementing 64-bit zeroing (so that you can issue a write zero request over the entire image in one command), as well as a way for a server to advertise when the image begins life in an all-zero state, are also further extensions coming down the pipeline. But as not all servers have to implement all of the extensions, each extension that can be orthogonally implemented and show an improvement on its own is still worthwhile; and my cover letter has shown that fast zeroes on their own make a measurable difference to certain workloads.>> + If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but >> + `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail >> + with `NBD_ENOTSUP`, even if the operation is no faster than a >> + corresponding `NBD_CMD_WRITE`. Conversely, if >> + `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with >> + `NBD_ENOTSUP` unless the request can be serviced in less time than >> + a corresponding `NBD_CMD_WRITE`, and SHOULD NOT alter the contents >> + of the export when returning this failure. The server's >> + determination of a fast request MAY depend on a number of factors, >> + such as whether the request was suitably aligned, on whether the >> + `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on whether a >> + previous `NBD_CMD_TRIM` had been performed on the region. If the >> + server did not advertise `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD >> + NOT fail with `NBD_ENOTSUP`, regardless of the speed of servicing >> + a request, and SHOULD fail with `NBD_EINVAL` if the >> + `NBD_CMD_FLAG_FAST_ZERO` flag was set. A server MAY advertise >> + `NBD_FLAG_SEND_FAST_ZERO` whether or not it can perform fast >> + zeroing; similarly, a server SHOULD fail with `NBD_ENOTSUP` when >> + the flag is set if the server cannot quickly determine in advance >> + whether that request would have been fast, even if it turns out >> + that the same request without the flag would be fast after all. >> + > > What if WRITE_ZERO in the average is faster than WRITE (for example by 20%), > but server never can guarantee performance for one WRITE_ZERO operation, do > you restrict such case? Hmm, OK, SHOULD is not MUST actually..I think my followup mail, based on Wouter's questions, covers this: the goal is to document the use case of optimizing the copy of a sparse image, by probing whether a bulk pre-zeroing pass is worthwhile. That should be the measuring rod - if the implementation can perform a faster sparse copy because of write zeroes that are sometimes, but not always, faster than writes, in spite of the duplicated I/O that happens to the data portions of the image that were touched twice by the pre-zero pass then the actual data pass, then succeeding on fast zero requests is okay. But if it makes the overall image copy slower, then failing with ENOTSUP is probably better. And at the end of the day, it is really just a heuristic - if the server guessed wrong, the worst that happens is slower performance (and not data corruption). -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-Aug-28 14:05 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 0/5] Add NBD fast zero support to qemu client and server
On 8/28/19 8:55 AM, Vladimir Sementsov-Ogievskiy wrote:> 23.08.2019 17:37, Eric Blake wrote: >> See the cross-post cover letter for more details: >> https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html >> >> Based-on: https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg04805.html >> [Andrey Shinkevich block: workaround for unaligned byte range in fallocate()] > > I assume, I can look at git://repo.or.cz/qemu/ericb.git fast-zero branch?Yes, I posted a fast-zero branch for all four projects that I touched (with the obvious similar URLs). They might have non-fast-forward changes as I rebase things (for example, the nbdkit stuff needs to s/1.13.9/1.15.0/ in docs about which version introduced things), but should be sufficient to reproduce experiments with the feature supported. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-Aug-30 23:10 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 1/5] nbd: Improve per-export flag handling in server
On 8/30/19 1:00 PM, Vladimir Sementsov-Ogievskiy wrote:> 23.08.2019 17:37, Eric Blake wrote: >> When creating a read-only image, we are still advertising support for >> TRIM and WRITE_ZEROES to the client, even though the client should not >> be issuing those commands. But seeing this requires looking across >> multiple functions: >>>> @@ -458,10 +458,13 @@ static int nbd_negotiate_handle_export_name(NBDClient *client, >> return -EINVAL; >> } >> >> - trace_nbd_negotiate_new_style_size_flags(client->exp->size, >> - client->exp->nbdflags | myflags); >> + myflags = client->exp->nbdflags; >> + if (client->structured_reply) { >> + myflags |= NBD_FLAG_SEND_DF; >> + } > > > why we cant do just > client->exp->nbdflags |= NBD_FLAG_SEND_DF ?Because myflags is the runtime flags for _this_ client, while client->exp->nbdflags are the base flags shared by _all_ clients. If client A requests structured reply, but client B does not, then we don't want to advertise DF to client B; but amending client->exp->nbdflags would have that effect. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-Aug-30 23:32 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 1/5] nbd: Improve per-export flag handling in server
On 8/30/19 6:10 PM, Eric Blake wrote:> On 8/30/19 1:00 PM, Vladimir Sementsov-Ogievskiy wrote: >> 23.08.2019 17:37, Eric Blake wrote: >>> When creating a read-only image, we are still advertising support for >>> TRIM and WRITE_ZEROES to the client, even though the client should not >>> be issuing those commands. But seeing this requires looking across >>> multiple functions: >>> > >>> @@ -458,10 +458,13 @@ static int nbd_negotiate_handle_export_name(NBDClient *client, >>> return -EINVAL; >>> } >>> >>> - trace_nbd_negotiate_new_style_size_flags(client->exp->size, >>> - client->exp->nbdflags | myflags); >>> + myflags = client->exp->nbdflags; >>> + if (client->structured_reply) { >>> + myflags |= NBD_FLAG_SEND_DF; >>> + } >> >> >> why we cant do just >> client->exp->nbdflags |= NBD_FLAG_SEND_DF ? > > Because myflags is the runtime flags for _this_ client, while > client->exp->nbdflags are the base flags shared by _all_ clients. If > client A requests structured reply, but client B does not, then we don't > want to advertise DF to client B; but amending client->exp->nbdflags > would have that effect.I stand corrected - it looks like a fresh client->exp is created per client, as evidenced by: diff --git i/nbd/client.c w/nbd/client.c index b9dc829175f9..9e05f1a0e2a3 100644 --- i/nbd/client.c +++ w/nbd/client.c @@ -1011,6 +1011,8 @@ int nbd_receive_negotiate(AioContext *aio_context, QIOChannel *ioc, assert(info->name); trace_nbd_receive_negotiate_name(info->name); + if (getenv ("MY_HACK")) + info->structured_reply = false; result = nbd_start_negotiate(aio_context, ioc, tlscreds, hostname, outioc, info->structured_reply, &zeroes, errp); diff --git i/nbd/server.c w/nbd/server.c index d5078f7468af..6f3a83704fb3 100644 --- i/nbd/server.c +++ w/nbd/server.c @@ -457,6 +457,7 @@ static int nbd_negotiate_handle_export_name(NBDClient *client, bool no_zeroes, myflags = client->exp->nbdflags; if (client->structured_reply) { myflags |= NBD_FLAG_SEND_DF; + client->exp->nbdflags |= NBD_FLAG_SEND_DF; } trace_nbd_negotiate_new_style_size_flags(client->exp->size, myflags); stq_be_p(buf, client->exp->size); $ ./qemu-nbd -r -f raw file -t & $ ~/qemu/qemu-io -r -f raw --trace=nbd_\*size_flags nbd://localhost:10809 -c quit 32145@1567207628.519883:nbd_receive_negotiate_size_flags Size is 1049088, export flags 0x48f $ MY_HACK=1 ~/qemu/qemu-io -r -f raw --trace=nbd_\*size_flags nbd://localhost:10809 -c quit 32156@1567207630.417815:nbd_receive_negotiate_size_flags Size is 1049088, export flags 0x40f $ ~/qemu/qemu-io -r -f raw --trace=nbd_\*size_flags nbd://localhost:10809 -c quit 32167@1567207635.202940:nbd_receive_negotiate_size_flags Size is 1049088, export flags 0x48f The export flags change per client, so I _can_ store into client->exp->nbdflags. Will do that for v2. Meanwhile, this points out a missing feature in libnbd - for testing purposes, it would be really nice to be able to purposefully cripple the client to NOT request structured replies automatically (default enabled, but the ability to turn it off is useful for interop testing, as in this thread). I already recently added a --no-sr flag to nbdkit for a similar reason (but that's creating a server which refuses to advertise, where here I want a guest that refuses to ask). Guess I'll be adding a patch for that, too :) -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-Aug-30 23:37 UTC
Re: [Libguestfs] [Qemu-devel] [PATCH 2/5] nbd: Prepare for NBD_CMD_FLAG_FAST_ZERO
On 8/30/19 1:07 PM, Vladimir Sementsov-Ogievskiy wrote:> 23.08.2019 17:37, Eric Blake wrote: >> Commit fe0480d6 and friends added BDRV_REQ_NO_FALLBACK as a way to >> avoid wasting time on a preliminary write-zero request that will later >> be rewritten by actual data, if it is known that the write-zero >> request will use a slow fallback; but in doing so, could not optimize >> for NBD. The NBD specification is now considering an extension that >> will allow passing on those semantics; this patch updates the new >> protocol bits and 'qemu-nbd --list' output to recognize the bit, as >> well as the new errno value possible when using the new flag; while >> upcoming patches will improve the client to use the feature when >> present, and the server to advertise support for it. >> >> Signed-off-by: Eric Blake <eblake@redhat.com>>> +++ b/nbd/server.c >> @@ -55,6 +55,8 @@ static int system_errno_to_nbd_errno(int err) >> return NBD_ENOSPC; >> case EOVERFLOW: >> return NBD_EOVERFLOW; >> + case ENOTSUP: >> + return NBD_ENOTSUP; > > This may provoke returning NBD_ENOTSUP in other cases, not only new one we are going to add.Correct. But the spec only said SHOULD avoid ENOTSUP in those other cases, not MUST avoid ENOTSUP; and in practice, either the client that is not suspecting it will treat it the same as NBD_EINVAL, or we've managed to get a slightly better error message across than normal. I don't see that as a real show-stopper. But if it bothers you, note that in nbdkit, I actually coded things up to refuse to send NBD_EOVERFLOW unless NBD_CMD_FLAG_DF was set, and to refuse to send NBD_ENOTSUP unless NBD_CMD_FLAG_FAST_ZERO was set. I could copy that behavior here, if we want qemu to be that much stricter as a server. (Note to self: also check if, when using structured replies to send error messages, in addition to the code, if the string contains strerror(errno) from BEFORE the mapping, rather than after we've lost information to the more-limited NBD_E* values) -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Eric Blake
2019-Sep-03 20:53 UTC
Re: [Libguestfs] [PATCH 0/1] NBD protocol change to add fast zero support
On 8/23/19 9:34 AM, Eric Blake wrote:> See the cross-post cover letter for details: > https://www.redhat.com/archives/libguestfs/2019-August/msg00322.html > > Eric Blake (1): > protocol: Add NBD_CMD_FLAG_FAST_ZERO > > doc/proto.md | 50 +++++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 49 insertions(+), 1 deletion(-)I think we've had enough review time with no major objections to this. Therefore, I've gone ahead and incorporated the wording changes that were mentioned in discussion on this patch, as well as Rich's URI specification, into the NBD project. We can still amend the specifications if any problems turn up. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Reasonably Related Threads
- Re: [Qemu-devel] [PATCH 0/5] Add NBD fast zero support to qemu client and server
- [nbdkit PATCH 0/3] nbdkit support for new NBD fast zero
- [PATCH 0/1] NBD protocol change to add fast zero support
- [libnbd PATCH 0/1] libnbd support for new fast zero
- Re: [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO