thr3ads.net - Libguestfs - Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST

Eric Blake

2019-Mar-22 16:42 UTC

[Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO

While it may be counterintuitive at first, the introduction of
NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance
regression in qemu [1], when copying a sparse file. When the
destination file must contain the same contents as the source, but it
is not known in advance whether the destination started life with all
zero content, then there are cases where it is faster to request a
bulk zero of the entire device followed by writing only the portions
of the device that are to contain data, as that results in fewer I/O
transactions overall. In fact, there are even situations where
trimming the entire device prior to writing zeroes may be faster than
a bare write zero request [2]. However, if a bulk zero request ever
falls back to the same speed as a normal write, a bulk pre-zeroing
algorithm is actually a pessimization, as it ends up writing portions
of the disk twice.

[1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html
[2] https://github.com/libguestfs/nbdkit/commit/407f8dde

Hence, it is desirable to have a way for clients to specify that a
particular write zero request is being attempted for a fast wipe, and
get an immediate failure if the zero request would otherwise take the
same time as a write.  Conversely, if the client is not performing a
pre-initialization pass, it is still more efficient in terms of
networking traffic to send NBD_CMD_WRITE_ZEROES requests where the
server implements the fallback to the slower write, than it is for the
client to have to perform the fallback to send NBD_CMD_WRITE with a
zeroed buffer.

Add a protocol flag and corresponding transmission advertisement flag
to make it easier for clients to inform the server of their intent. If
the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two
things: to perform a fallback to write when the client does not
request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the
lower network overhead); and to fail quickly with ENOTSUP if the
client requested the flag but the server cannot write zeroes more
efficiently than a normal write (so that the client is not penalized
with the time of writing data areas of the disk twice).
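
As a rough illustration of those two promises from the server's side, the following Python sketch decides how to answer a single NBD_CMD_WRITE_ZEROES request. The flag bit and error numbers are the ones proposed in the doc/proto.md diff of this patch; the function name, its parameters, and the returned action strings are hypothetical, and `fast_path_available` stands in for whatever means a real server has of knowing that zeroing will be fast.

```python
# Values proposed in this patch (see the doc/proto.md diff).
NBD_CMD_FLAG_FAST_ZERO = 1 << 4
NBD_EINVAL = 22
NBD_ENOTSUP = 95


def handle_write_zeroes(cmd_flags, advertised_fast_zero, fast_path_available):
    """Return (nbd_error, action) for one NBD_CMD_WRITE_ZEROES request.

    Hypothetical helper: `action` only names what the server would do next.
    """
    wants_fast = bool(cmd_flags & NBD_CMD_FLAG_FAST_ZERO)

    if wants_fast and not advertised_fast_zero:
        # The client sent a flag the server never offered.
        return NBD_EINVAL, None
    if fast_path_available:
        return 0, "fast_zero"
    if wants_fast:
        # Fail quickly rather than silently doing a slow write.
        return NBD_ENOTSUP, None
    # No fast-zero request: the server performs the slow fallback itself,
    # so the client never has to send a zeroed buffer over the network.
    return 0, "fallback_write"
```

The ordering of the checks is a design choice of this sketch: ENOTSUP is only reachable when the client explicitly set the flag, which matches the rule that it must not be returned otherwise.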

Note that the semantics are chosen so that servers should advertise
the new flag whether or not they have fast zeroing (that is, this is
NOT the server advertising that it has fast zeroes, but rather
advertising that the client can get feedback as needed on whether
zeroing is fast).  It is also intentional that the new advertisement
includes a new errno value, ENOTSUP, with rules that this error should
not be returned for any pre-existing behaviors, must not happen when
the client does not request a fast zero, and must be returned quickly
if the client requested fast zero but anything other than the error
would not be fast; while leaving it possible for clients to
distinguish other errors like EINVAL if alignment constraints are not
met.  Clients should not send the flag unless the server advertised
support, but well-behaved servers should already be reporting EINVAL
to unrecognized flags. If the server does not advertise the new
feature, clients can safely fall back to assuming that writing zeroes
is no faster than normal writes.
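
Seen from a copying client, the rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: `try_bulk_fast_zero` is a hypothetical callable that sends one NBD_CMD_WRITE_ZEROES covering the whole export with NBD_CMD_FLAG_FAST_ZERO set and returns the NBD error value, and the strategy names are invented for the example. The numeric values are those proposed in this patch.

```python
NBD_FLAG_SEND_FAST_ZERO = 1 << 11  # transmission flag proposed in this patch
NBD_ENOTSUP = 95                   # error value proposed in this patch


def choose_zero_strategy(transmission_flags, try_bulk_fast_zero):
    """Pick how a copying client should zero the destination.

    `try_bulk_fast_zero` is a hypothetical callable; it returns 0 on
    success or an NBD error value on failure.
    """
    if not transmission_flags & NBD_FLAG_SEND_FAST_ZERO:
        # Not advertised: never send the flag, assume zeroing is slow.
        return "zero_holes_only"

    err = try_bulk_fast_zero()
    if err == 0:
        return "bulk_zero_done"   # only the data areas remain to be written
    if err == NBD_ENOTSUP:
        return "zero_holes_only"  # the server has no fast path
    # Other errors (for example EINVAL for alignment) stay distinguishable.
    raise OSError(err, "NBD_CMD_WRITE_ZEROES failed")
```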

Note that the Linux fallocate(2) interface may or may not be powerful
enough to easily determine if zeroing will be efficient - in
particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that
insight; for block devices, it is known that ioctl(BLKZEROOUT) does
NOT have a way for userspace to probe if it is efficient or slow.  But
with enough demand, the kernel may add another FALLOC_FL_ flag to use
with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed
ENOTSUP failures if a fast path cannot be taken.  If a server cannot
easily determine if write zeroes will be efficient, it is better off
not advertising NBD_FLAG_SEND_FAST_ZERO.

Signed-off-by: Eric Blake <eblake@redhat.com>
---

I will not push this without both:
- a positive review (for example, we may decide that burning another
NBD_FLAG_* is undesirable, and that we should instead have some sort
of NBD_OPT_ handshake for determining when the server supports
NBD_CMD_FLAG_FAST_ZERO)
- a reference client and server implementation (probably both via qemu,
since it was qemu that raised the problem in the first place)

 doc/proto.md | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/doc/proto.md b/doc/proto.md
index 8aaad96..1107766 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -1059,6 +1059,17 @@ The field has the following format:
   which support the command without advertising this bit, and
   conversely that this bit does not guarantee that the command will
   succeed or have an impact.
+- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether
+  `NBD_CMD_WRITE_ZEROES` is efficient. The server MUST set this
+  transmission flag to 1 if the `NBD_CMD_WRITE_ZEROES` request
+  supports the `NBD_CMD_FLAG_FAST_ZERO` flag, and MUST set this
+  transmission flag to 0 if `NBD_FLAG_SEND_WRITE_ZEROES` is not
+  set. Servers SHOULD NOT set this transmission flag if there is no
+  quick way to determine whether a particular write zeroes request
+  will be efficient, but the lack of an efficient write zero
+  implementation SHOULD NOT prevent a server from setting this
+  flag. Clients MUST NOT set the `NBD_CMD_FLAG_FAST_ZERO` request flag
+  unless this transmission flag is set.

 Clients SHOULD ignore unknown flags.

@@ -1636,6 +1647,12 @@ valid may depend on negotiation during the handshake phase.
   MUST NOT send metadata on more than one extent in the reply. Client
   implementors should note that using this flag on multiple contiguous
   requests is likely to be inefficient.
+- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during
+  `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the
+  write zeroes any faster than it would for an equivalent
+  `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of
+  `ENOTSUP`. The client MUST NOT set this unless the server advertised
+  `NBD_FLAG_SEND_FAST_ZERO`.

 ##### Structured reply flags

@@ -2004,7 +2021,10 @@ The following request types exist:
     reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use.

     A client MUST NOT send a write zeroes request unless
-    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
+    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags
+    field. Additionally, a client MUST NOT send the
+    `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was
+    set in the transmission flags field.

     By default, the server MAY use trimming to zero out the area, even
     if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure
@@ -2014,6 +2034,23 @@ The following request types exist:
     same area will not cause fragmentation or cause failure due to
     insufficient space.

+    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
+    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
+    with `ENOTSUP`, even if the operation is no faster than a
+    corresponding `NBD_CMD_WRITE`. Conversely, if
+    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
+    `ENOTSUP` unless the request can be serviced more efficiently than
+    a corresponding `NBD_CMD_WRITE`. The server's determination of
+    efficiency MAY depend on whether the request was suitably aligned,
+    on whether the `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on
+    whether a previous `NBD_CMD_TRIM` had been performed on the
+    region.  If the server did not advertise
+    `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD NOT fail with `ENOTSUP`,
+    regardless of the speed of servicing a request, and SHOULD fail
+    with `EINVAL` if the `NBD_CMD_FLAG_FAST_ZERO` flag was set. A
+    server MAY advertise `NBD_FLAG_SEND_FAST_ZERO` whether or not it
+    can perform efficient zeroing.
+
     If an error occurs, the server MUST set the appropriate error code
     in the error field.

@@ -2114,6 +2151,7 @@ The following error values are defined:
 * `EINVAL` (22), Invalid argument.
 * `ENOSPC` (28), No space left on device.
 * `EOVERFLOW` (75), Value too large.
+* `ENOTSUP` (95), Operation not supported.
 * `ESHUTDOWN` (108), Server is in the process of being shut down.

 The server SHOULD return `ENOSPC` if it receives a write request
@@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block sizes. Finally, it
 SHOULD return `EPERM` if it receives a write or trim request on a
 read-only export.

+The server SHOULD NOT return `ENOTSUP` except as documented in
+response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is
+supported.
+
 The server SHOULD return `EINVAL` if it receives an unknown command.

 The server SHOULD return `EINVAL` if it receives an unknown command flag. It
-- 
2.20.1

Eric Blake

2019-Mar-22 17:17 UTC

Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO

On 3/22/19 11:42 AM, Eric Blake wrote:
> 
> Hence, it is desirable to have a way for clients to specify that a
> particular write zero request is being attempted for a fast wipe, and
> get an immediate failure if the zero request would otherwise take the
> same time as a write.  Conversely, if the client is not performing a
> pre-initialization pass, it is still more efficient in terms of
> networking traffic to send NBD_CMD_WRITE_ZEROES requests where the
> server implements the fallback to the slower write, than it is for the
> client to have to perform the fallback to send NBD_CMD_WRITE with a
> zeroed buffer.
> 
> Add a protocol flag and corresponding transmission advertisement flag
> to make it easier for clients to inform the server of their intent. If
Note that this is independent of proposals made on the NBD list in the
past [1] of having a way for the server to advertise that a particular
export starts in an all-zeroes state (faster than a series of 32-bit
NBD_CMD_BLOCK_STATUS would be able to do), although I may _also_ try to
revive proposed documentation and a reference implementation of that
optimization as well (as qemu-img convert can completely skip the
zeroing, whether the bulk wipe or per-hole writing, when it knows the
destination is already zero).

[1] https://lists.debian.org/nbd/2016/12/msg00015.html and following
(doc: Propose NBD_FLAG_INIT_ZEROES extension)
> 
> I will not push this without both:
> - a positive review (for example, we may decide that burning another
> NBD_FLAG_* is undesirable, and that we should instead have some sort
> of NBD_OPT_ handshake for determining when the server supports
> NBD_CMD_FLAG_FAST_ZERO)
> - a reference client and server implementation (probably both via qemu,
> since it was qemu that raised the problem in the first place)
The last time we mentioned the possibility of advertising an initial
zero state, we debated whether burning one of our 16 NBD_FLAG_*
transmission bits for that purpose was wise [2], but discussion stalled
after I never developed a counterproposal with NBD_OPT_* handshaking and
never produced a reference implementation.

[2] https://lists.debian.org/nbd/2016/12/msg00048.html

Also, keep in mind that knowing that something started as all zeroes
(which only affects startup; once you do any write, that early status
bit no longer means anything to current operation, so less important to
hand to the kernel during transmission phase, especially if the kernel
can ever learn to utilize NBD_CMD_BLOCK_STATUS) is indeed different from
knowing if probing for fast zeroing is supported (where it is easy to
conceive of the kernel needing to know if it can send
NBD_CMD_FLAG_FAST_ZERO).  So we may still want to use NBD_OPT_* to get
the initial zero extension, but NBD_FLAG to advertise the fast zero
extension.

On the other hand, it's also worth thinking about which extensions are
easy for servers to implement - NBD_FLAG_INIT_ZEROES and
NBD_FLAG_SEND_FAST_ZERO are orthogonal enough that I could see a full
2x2 mix of servers (unsupported, either one of the two supported, or
both supported), and where clients may make optimization choices based
on any of those four combinations.
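
To make the 2x2 mix concrete, here is a small sketch of how a client might choose between the four combinations. The function and strategy names are hypothetical, and NBD_FLAG_INIT_ZEROES is only a proposal at this point, so the sketch takes plain booleans rather than assuming any bit assignment.

```python
def pick_copy_strategy(starts_zero, can_probe_fast_zero):
    """Choose a zeroing strategy from the two (orthogonal) advertisements.

    starts_zero:         server said the export begins as all zeroes
    can_probe_fast_zero: server advertised NBD_FLAG_SEND_FAST_ZERO
    """
    if starts_zero:
        # Nothing to zero at all, so fast-zero probing is irrelevant.
        return "skip_zeroing"
    if can_probe_fast_zero:
        # Try one bulk zero with the fast flag, fall back on ENOTSUP.
        return "probe_bulk_zero"
    # Neither extension: zero each hole individually.
    return "zero_holes_only"


# The four combinations of the 2x2 mix.
for starts_zero in (False, True):
    for can_probe in (False, True):
        print(starts_zero, can_probe, pick_copy_strategy(starts_zero, can_probe))
```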

[and if we're keeping score, other extension proposals that I want to
revisit, in no particular order, include:
- 64-bit operations
- NBD_CMD_RESIZE
- more precision on TRIM/WRITE_ZERO alignment constraints
]

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Nir Soffer

2019-Mar-22 19:42 UTC

Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO

On Fri, Mar 22, 2019 at 6:43 PM Eric Blake <eblake@redhat.com> wrote:
> While it may be counterintuitive at first, the introduction of
> NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance
> regression in qemu [1], when copying a sparse file. When the
> destination file must contain the same contents as the source, but it
> is not known in advance whether the destination started life with all
> zero content, then there are cases where it is faster to request a
> bulk zero of the entire device followed by writing only the portions
> of the device that are to contain data, as that results in fewer I/O
> transactions overall. In fact, there are even situations where
> trimming the entire device prior to writing zeroes may be faster than
> a bare write zero request [2]. However, if a bulk zero request ever
> falls back to the same speed as a normal write, a bulk pre-zeroing
> algorithm is actually a pessimization, as it ends up writing portions
> of the disk twice.
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html
> [2] https://github.com/libguestfs/nbdkit/commit/407f8dde
>
> Hence, it is desirable to have a way for clients to specify that a
> particular write zero request is being attempted for a fast wipe, and
> get an immediate failure if the zero request would otherwise take the
> same time as a write.  Conversely, if the client is not performing a
> pre-initialization pass, it is still more efficient in terms of
> networking traffic to send NBD_CMD_WRITE_ZEROES requests where the
> server implements the fallback to the slower write, than it is for the
> client to have to perform the fallback to send NBD_CMD_WRITE with a
> zeroed buffer.
>
> Add a protocol flag and corresponding transmission advertisement flag
> to make it easier for clients to inform the server of their intent. If
> the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two
> things: to perform a fallback to write when the client does not
> request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the
> lower network overhead); and to fail quickly with ENOTSUP if the
> client requested the flag but the server cannot write zeroes more
> efficiently than a normal write (so that the client is not penalized
> with the time of writing data areas of the disk twice).
>
I think the issue is not that zeroing is as slow as a normal write, but that
it is not fast enough to make zeroing the entire disk before writing data
worthwhile.

For example, on a storage server we had in the past, the BLKZEROOUT rate was
50G/s. On another server, it can run anywhere from 1G/s to 100G/s, depending
on the allocation status of the zeroed range.

> Note that the semantics are chosen so that servers should advertise
> the new flag whether or not they have fast zeroing (that is, this is
> NOT the server advertising that it has fast zeroes, but rather
> advertising that the client can get feedback as needed on whether
> zeroing is fast).  It is also intentional that the new advertisement
> includes a new errno value, ENOTSUP, with rules that this error should
> not be returned for any pre-existing behaviors, must not happen when
> the client does not request a fast zero, and must be returned quickly
> if the client requested fast zero but anything other than the error
> would not be fast; while leaving it possible for clients to
> distinguish other errors like EINVAL if alignment constraints are not
> met.  Clients should not send the flag unless the server advertised
> support, but well-behaved servers should already be reporting EINVAL
> to unrecognized flags. If the server does not advertise the new
> feature, clients can safely fall back to assuming that writing zeroes
> is no faster than normal writes.
> Note that the Linux fallocate(2) interface may or may not be powerful
> enough to easily determine if zeroing will be efficient - in
> particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that
> insight; for block devices, it is known that ioctl(BLKZEROOUT) does
> NOT have a way for userspace to probe if it is efficient or slow.  But
> with enough demand, the kernel may add another FALLOC_FL_ flag to use
> with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed
> ENOTSUP failures if a fast path cannot be taken.  If a server cannot
> easily determine if write zeroes will be efficient, it is better off
> not advertising NBD_FLAG_SEND_FAST_ZERO.
>
I think this can work for file-based images. If fallocate() fails, the client
will quickly get ENOTSUP after the first call.

For block devices we don't have any way to know whether fallocate() or
BLKZEROOUT will be fast, so I guess servers will never advertise FAST_ZERO.

Generally, the usefulness of this new flag is limited. It will only help
qemu-img convert faster to file-based images.

Do we have performance measurements showing a significant speedup when
zeroing the entire image before copying data, compared with zeroing only the
unallocated ranges?

For example, if the best speedup we can get in a real-world scenario is 2%,
is it worth complicating the protocol and using another bit?
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>
> I will not push this without both:
> - a positive review (for example, we may decide that burning another
> NBD_FLAG_* is undesirable, and that we should instead have some sort
> of NBD_OPT_ handshake for determining when the server supports
> NBD_CMD_FLAG_FAST_ZERO)
> - a reference client and server implementation (probably both via qemu,
> since it was qemu that raised the problem in the first place)
>
>  doc/proto.md | 44 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index 8aaad96..1107766 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -1059,6 +1059,17 @@ The field has the following format:
>    which support the command without advertising this bit, and
>    conversely that this bit does not guarantee that the command will
>    succeed or have an impact.
> +- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether
> +  `NBD_CMD_WRITE_ZEROES` is efficient. The server MUST set this
> +  transmission flag to 1 if the `NBD_CMD_WRITE_ZEROES` request
> +  supports the `NBD_CMD_FLAG_FAST_ZERO` flag, and MUST set this
> +  transmission flag to 0 if `NBD_FLAG_SEND_WRITE_ZEROES` is not
> +  set. Servers SHOULD NOT set this transmission flag if there is no
> +  quick way to determine whether a particular write zeroes request
> +  will be efficient, but the lack of an efficient write zero
>
I think we should use "fast" instead of "efficient". For example, when the
kernel falls back to manual zeroing, that is probably the most efficient way
it can be done, but it is not fast.

> +  implementation SHOULD NOT prevent a server from setting this
> +  flag. Clients MUST NOT set the `NBD_CMD_FLAG_FAST_ZERO` request flag
> +  unless this transmission flag is set.
>
>  Clients SHOULD ignore unknown flags.
>
> @@ -1636,6 +1647,12 @@ valid may depend on negotiation during the handshake phase.
>    MUST NOT send metadata on more than one extent in the reply. Client
>    implementors should note that using this flag on multiple contiguous
>    requests is likely to be inefficient.
> +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during
> +  `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the
> +  write zeroes any faster than it would for an equivalent
> +  `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of
> +  `ENOTSUP`. The client MUST NOT set this unless the server advertised
> +  `NBD_FLAG_SEND_FAST_ZERO`.
>
>  ##### Structured reply flags
>
> @@ -2004,7 +2021,10 @@ The following request types exist:
>      reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use.
>
>      A client MUST NOT send a write zeroes request unless
> -    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field.
> +    `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags
> +    field. Additionally, a client MUST NOT send the
> +    `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was
> +    set in the transmission flags field.
>
>      By default, the server MAY use trimming to zero out the area, even
>      if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure
> @@ -2014,6 +2034,23 @@ The following request types exist:
>      same area will not cause fragmentation or cause failure due to
>      insufficient space.
>
> +    If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but
> +    `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail
> +    with `ENOTSUP`, even if the operation is no faster than a
> +    corresponding `NBD_CMD_WRITE`. Conversely, if
> +    `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with
> +    `ENOTSUP` unless the request can be serviced more efficiently than
> +    a corresponding `NBD_CMD_WRITE`. The server's determination of
> +    efficiency MAY depend on whether the request was suitably aligned,
> +    on whether the `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on
> +    whether a previous `NBD_CMD_TRIM` had been performed on the
> +    region.  If the server did not advertise
> +    `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD NOT fail with `ENOTSUP`,
> +    regardless of the speed of servicing a request, and SHOULD fail
> +    with `EINVAL` if the `NBD_CMD_FLAG_FAST_ZERO` flag was set. A
> +    server MAY advertise `NBD_FLAG_SEND_FAST_ZERO` whether or not it
> +    can perform efficient zeroing.
> +
>      If an error occurs, the server MUST set the appropriate error code
>      in the error field.
>
> @@ -2114,6 +2151,7 @@ The following error values are defined:
>  * `EINVAL` (22), Invalid argument.
>  * `ENOSPC` (28), No space left on device.
>  * `EOVERFLOW` (75), Value too large.
> +* `ENOTSUP` (95), Operation not supported.
>  * `ESHUTDOWN` (108), Server is in the process of being shut down.
>
>  The server SHOULD return `ENOSPC` if it receives a write request
> @@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block sizes. Finally, it
>  SHOULD return `EPERM` if it receives a write or trim request on a
>  read-only export.
>
> +The server SHOULD NOT return `ENOTSUP` except as documented in
> +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is
> +supported.
>
This makes ENOTSUP less useful. I think it should be allowed to return
ENOTSUP in response to other commands if needed.

> +
>  The server SHOULD return `EINVAL` if it receives an unknown command.
>
>  The server SHOULD return `EINVAL` if it receives an unknown command flag. It
> --
> 2.20.1
>
I think this makes sense, and should work, but we need more data supporting
that this is useful in practice.

Nir

Eric Blake

2019-Mar-22 20:42 UTC

Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO

On 3/22/19 2:42 PM, Nir Soffer wrote:
>> Add a protocol flag and corresponding transmission advertisement flag
>> to make it easier for clients to inform the server of their intent. If
>> the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two
>> things: to perform a fallback to write when the client does not
>> request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the
>> lower network overhead); and to fail quickly with ENOTSUP if the
>> client requested the flag but the server cannot write zeroes more
>> efficiently than a normal write (so that the client is not penalized
>> with the time of writing data areas of the disk twice).
>>
> 
> I think the issue is not that zeroing is as slow as a normal write, but that
> it is not fast enough to make zeroing the entire disk before writing data
> worthwhile.
In an image copy, where you don't know if the destination already
started life with all zero, then you HAVE to copy zeros into the image
for the holes; the only question is whether also pre-filling the entire
image (with fewer calls) and then overwriting the prefill is faster than
just writing the data areas once.  So the tradeoff is how
much time you add with the overhead of lots of small-length
WRITE_ZEROES for the holes, vs. the overhead of one large-length
WRITE_ZEROES for the entire image.  There's ALSO a factor of how much of
the image is holes vs. data - a pre-fill of only 10% of the image (which
is mostly sparse) is less wasteful than a pre-fill of 90% of the image
(which is mostly dense) - but that waste doesn't cost anything if
prefill is O(1) regardless of size; vs. being painful if it is O(n)
based on size.  There are definitely heuristics at play, and I don't
know that the NBD spec can go into any strong advice on what type of
speedups are in play, only whether the write zero is on par with normal
writes.
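
A toy cost model makes the tradeoff easier to see. Everything here is an assumption for illustration: the function names, the parameters, and the numbers in the usage note are invented, and the per-request overhead of the data writes is left out because it is the same in both strategies.

```python
def per_hole_time(image, data_fraction, n_holes, overhead, write_rate, zero_rate):
    """Zero each hole with its own WRITE_ZEROES, then write the data once."""
    data = image * data_fraction
    holes = image - data
    return n_holes * overhead + holes / zero_rate + data / write_rate


def prefill_time(image, data_fraction, overhead, write_rate, zero_rate):
    """One WRITE_ZEROES over the whole image, then overwrite the data areas."""
    data = image * data_fraction
    return overhead + image / zero_rate + data / write_rate
```

With a mostly dense image (90% data), 1000 holes, and zeroing no faster than writing (`zero_rate == write_rate`), the pre-fill rewrites the data areas and comes out slower. With an O(1) zero, modelled as `zero_rate=float("inf")`, the pre-fill wins because it replaces 1000 small requests with one.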

And, given the uncertainties on what speedups (or slowdowns) a pre-fill
might cause, it DOES show that knowing if an image started life all zero
is an even better optimization, because then you don't have to waste any
time on overwriting holes. But having another way to speed things up
does not necessarily render this proposal useless.
>> Note that the Linux fallocate(2) interface may or may not be powerful
>> enough to easily determine if zeroing will be efficient - in
>> particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that
>> insight; for block devices, it is known that ioctl(BLKZEROOUT) does
>> NOT have a way for userspace to probe if it is efficient or slow.  But
>> with enough demand, the kernel may add another FALLOC_FL_ flag to use
>> with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed
>> ENOTSUP failures if a fast path cannot be taken.  If a server cannot
>> easily determine if write zeroes will be efficient, it is better off
>> not advertising NBD_FLAG_SEND_FAST_ZERO.
>>
> 
> I think this can work for file-based images. If fallocate() fails, the client
> will quickly get ENOTSUP after the first call.
The negative case is fast, but that doesn't say anything about the
positive case.  Unless Linux adds a new FALLOC_FL_ bit, you have no
guarantee whether fallocate() reporting success may still have happened
because the kernel did a fallback to a slow write.  If fallocate() comes
back quickly, you got lucky; but if it takes the full time of a write(),
you lost your window of opportunity to report ENOTSUP quickly.  Hence,
my hope that the kernel folks add a new FALLOC_FL_ flag to give us the
semantics we want (of a guaranteed way to avoid slow fallbacks).
> 
> For block devices we don't have any way to know whether fallocate() or
> BLKZEROOUT will be fast, so I guess servers will never advertise FAST_ZERO.
> 
As I said, you don't know that with BLKZEROOUT, but the kernel might
give us another ioctl that DOES know.
> Generally, the usefulness of this new flag is limited. It will only help
> qemu-img convert faster to file-based images.
Limited use case is still a use case. If there are cases where you can
optimize by a simple extension to the protocol, and where either side
lacking the extension is not fatal to the protocol, then it is worth
doing. And so far, that is what this feels like to me.
> 
> Do we have performance measurements showing a significant speedup when
> zeroing the entire image before copying data, compared with zeroing only the
> unallocated ranges?
Kevin may have more of an idea based on the patches he wrote for
qemu-img, and which spurred me into proposing this email; maybe he can
share numbers for his testing on regular files and/or block devices to
at least get a feel for whether a speedup is likely with a sufficient
NBD server.
> 
> For example, if the best speedup we can get in a real-world scenario is 2%,
> is it worth complicating the protocol and using another bit?
Gaining 2% of an hour may still be worth it.
>> +  set. Servers SHOULD NOT set this transmission flag if there is no
>> +  quick way to determine whether a particular write zeroes request
>> +  will be efficient, but the lack of an efficient write zero
>>
> 
> I think we should use "fast" instead of "efficient". For example, when the
> kernel falls back to manual zeroing, that is probably the most efficient way
> it can be done, but it is not fast.
Seems like a simple enough wording change.

>> @@ -2114,6 +2151,7 @@ The following error values are defined:
>>  * `EINVAL` (22), Invalid argument.
>>  * `ENOSPC` (28), No space left on device.
>>  * `EOVERFLOW` (75), Value too large.
>> +* `ENOTSUP` (95), Operation not supported.
>>  * `ESHUTDOWN` (108), Server is in the process of being shut down.
>>
>>  The server SHOULD return `ENOSPC` if it receives a write request
>> @@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block sizes. Finally, it
>>  SHOULD return `EPERM` if it receives a write or trim request on a
>>  read-only export.
>>
>> +The server SHOULD NOT return `ENOTSUP` except as documented in
>> +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is
>> +supported.
>>
> 
> This makes ENOTSUP less useful. I think it should be allowed to return
> ENOTSUP in response to other commands if needed.
Sorry, but we have the problem of back-compat to worry about. Remember,
the error values permitted in the NBD protocol are system-agnostic (they
_happen_ to match Linux errno values, but not all the world uses the
same values for those errors in their libc, so portable implementations
HAVE to map between NBD_EINVAL sent over the wire and libc EINVAL used
internally, even if the mapping is 1:1 on Linux).  Since the NBD
protocol has documented only a finite subset of valid errors, and
portable clients have to implement a mapping, it's very probable that
there exist clients written against the current NBD spec that will choke
hard (and probably hang up the connection) on receiving an unexpected
error number from the server which was not pre-compiled into their
mapping.  ANY server that replies with ENOTSUP at the moment is in
violation of the existing server requirements, whether or not clients
have a high quality of implementation and manage to tolerate the
server's noncompliance.
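
For readers who have not written such a mapping, a portable client's translation table might look roughly like the sketch below. The wire numbers are the ones the NBD spec documents, plus the ENOTSUP (95) proposed in this patch; the function name is hypothetical, and folding unknown values into EINVAL is an assumption of this sketch (a tolerant choice), not something every existing client does.

```python
import errno

# NBD wire value -> local libc errno. On Linux the two sides happen to be
# numerically equal, but a portable client cannot rely on that.
NBD_TO_LOCAL = {
    1: errno.EPERM,
    5: errno.EIO,
    12: errno.ENOMEM,
    22: errno.EINVAL,
    28: errno.ENOSPC,
    75: errno.EOVERFLOW,
    95: errno.ENOTSUP,     # only valid once the client has opted in
    108: errno.ESHUTDOWN,
}


def wire_to_local(nbd_error):
    """Translate an NBD error field into a local errno value."""
    if nbd_error == 0:
        return 0
    # Assumption of this sketch: unknown wire values degrade to EINVAL
    # instead of aborting the connection.
    return NBD_TO_LOCAL.get(nbd_error, errno.EINVAL)
```

A client compiled before the `95` entry existed would take the unknown-value path for ENOTSUP, which is exactly the back-compat concern described above.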

Thus, when we add new errno values as being valid returns, we have to
take care that servers SHOULD NOT send the new errno except to clients
that are prepared for the error - a server merely advertising
NBD_FLAG_SEND_FAST_ZERO is _still_ insufficient to give the server
rights to send ENOTSUP (since the server can't know if the client
recognized the advertisement, at least until the client finally sends a
NBD_CMD_FLAG_FAST_ZERO flag).  (Note, I said SHOULD NOT, not MUST NOT -
if your server goofs and leaks ENOTSUP to a client on any other command,
most clients will still be okay, and so you probably won't have people
complaining that your server is broken. The only MUST NOT send ENOTSUP
is for the case where the server advertised FAST_ZERO probing and the
client did not request FAST_ZERO, because then the server has to assume the
client is relying on the server to do fallback handling for reduced
network traffic.)
> 
> I think this makes sense, and should work, but we need more data supporting
> that this is useful in practice.
Fair enough - since Kevin has already got patches proposed against qemu
to wire up a qemu flag BDRV_REQ_NO_FALLBACK, which should map in a
rather straightforward manner to my NBD proposal (any qemu request sent
with the BDRV_REQ_NO_FALLBACK bit set turns into an NBD_CMD_WRITE_ZEROES
with the NBD_CMD_FLAG_FAST_ZERO set), it should be pretty easy for me to
demonstrate a timing analysis of the proposed reference implementation,
to prove that it either makes a noticeable difference or was in the
noise.  But it may be a couple of weeks before I work on a reference
implementation - even if Kevin's patches are qemu 4.0 material to fix a
speed regression, getting a new NBD protocol extension included during
feature freeze is too much of a stretch.
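
As a rough sketch of that mapping (not qemu's actual code): the numeric values below are local stand-ins, and zero_request_flags() is a hypothetical helper name. It assumes the client refuses to send a no-fallback request to a server that never advertised NBD_FLAG_SEND_FAST_ZERO.

```c
#include <stdint.h>

/* Stand-in values for illustration; check the real headers before use. */
#define BDRV_REQ_NO_FALLBACK    0x100u
#define NBD_FLAG_SEND_FAST_ZERO (1u << 11)
#define NBD_CMD_FLAG_FAST_ZERO  (1u << 4)

/* Hypothetical client helper: compute the NBD command flags for a
 * write-zeroes request. Returns 0 and stores the flags in *out, or -1
 * when the caller demanded "no fallback" but the server did not
 * advertise fast-zero support, in which case the request should fail
 * locally without touching the wire. */
static int zero_request_flags(unsigned bdrv_flags, uint16_t export_flags,
                              uint16_t *out)
{
    uint16_t flags = 0;

    if (bdrv_flags & BDRV_REQ_NO_FALLBACK) {
        if (!(export_flags & NBD_FLAG_SEND_FAST_ZERO)) {
            return -1;
        }
        flags |= NBD_CMD_FLAG_FAST_ZERO;
    }
    *out = flags;
    return 0;
}
```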

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Richard W.M. Jones

2019-Mar-22 22:06 UTC

Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO

On Fri, Mar 22, 2019 at 12:17:59PM -0500, Eric Blake wrote:
> On 3/22/19 11:42 AM, Eric Blake wrote:
> 
> > 
> > Hence, it is desirable to have a way for clients to specify that a
> > particular write zero request is being attempted for a fast wipe, and
> > get an immediate failure if the zero request would otherwise take the
> > same time as a write.  Conversely, if the client is not performing a
> > pre-initialization pass, it is still more efficient in terms of
> > networking traffic to send NBD_CMD_WRITE_ZERO requests where the
> > server implements the fallback to the slower write, than it is for the
> > client to have to perform the fallback to send NBD_CMD_WRITE with a
> > zeroed buffer.
> > 
> > Add a protocol flag and corresponding transmission advertisement flag
> > to make it easier for clients to inform the server of their intent. If
>
> Note that this is independent of proposals made on the NBD list in the
> past [1] of having a way for the server to advertise that a particular
> export starts in an all-zeroes state (faster than a series of 32-bit
> NBD_CMD_BLOCK_STATUS would be able to do), although I may _also_ try to
> revive proposed documentation and a reference implementation of that
> optimization as well (as qemu-img convert can completely skip the
> zeroing, whether the bulk wipe or per-hole writing, when it knows the
> destination is already zero).
It has to be said that this would be a lot easier to implement, and
for our purposes (optimizing qemu-img convert) it does everything we
need.

However the original proposal you put here seems reasonable.  I have
only one comment about it: Should the new error (ENOTSUP) be submitted
as a separate patch to the spec?
> [1] https://lists.debian.org/nbd/2016/12/msg00015.html and following
> (doc: Propose NBD_FLAG_INIT_ZEROES extension)
> 
> > 
> > I will not push this without both:
> > - a positive review (for example, we may decide that burning another
> > NBD_FLAG_* is undesirable, and that we should instead have some sort
> > of NBD_OPT_ handshake for determining when the server supports
> > NBD_CMD_FLAG_FAST_ZERO)
From an implementation point of view I prefer simple flags over having
to implement a brand new option.

We can always work out how to extend the flags field if we run out of
flags.  For example, by implementing NBD_OPT_INFO2 with a much bigger
flags field.
> > - a reference client and server implementation (probably both via qemu,
> > since it was qemu that raised the problem in the first place)
> 
> The last time we mentioned the possibility of advertising an initial
> zero state, we debated whether burning one of our 16 NBD_FLAG_*
> transmission bits for that purpose was wise [2], but discussion stalled
> after I never developed a counterproposal with NBD_OPT_* handshaking and
> never produced a reference implementation.
> 
> [2] https://lists.debian.org/nbd/2016/12/msg00048.html
> 
> Also, keep in mind that knowing that something started as all zeroes
> (which only affects startup; once you do any write, that early status
> bit no longer means anything to current operation, so less important to
> hand to the kernel during transmission phase, especially if the kernel
> can ever learn to utilize NBD_CMD_BLOCK_STATUS) is indeed different from
> knowing if probing for fast zeroing is supported (where it is easy to
> conceive of the kernel needing to know if it can send
> NBD_CMD_FLAG_FAST_ZERO).  So we may still want to use NBD_OPT_* to get
> the initial zero extension, but NBD_FLAG to advertise the fast zero
> extension.
> 
> On the other hand, it's also worth thinking about which extensions are
> easy for servers to implement - NBD_FLAG_INIT_ZEROES and
> NBD_FLAG_SEND_FAST_ZERO are orthogonal enough that I could see a full
> 2x2 mix of servers (unsupported, either one of the two supported, or
> both supported), and where clients may make optimization choices based
> on any of those four combinations.
> 
> [and if we're keeping score, other extension proposals that I want to
> revisit, in no particular order, include:
> - 64-bit operations
> - NBD_CMD_RESIZE
> - more precision on TRIM/WRITE_ZERO alignment constraints
> ]
Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Richard W.M. Jones

2019-Apr-12 07:44 UTC

Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO

So I had a think about this.

Isn't this easier/better solved by lifting the 32 bit restriction on
the size of certain non-data requests (i.e. NBD_CMD_BLOCK_STATUS,
NBD_CMD_TRIM, NBD_CMD_WRITE_ZEROES)?  The client can then query the
disk efficiently to see if it starts as zeroes, and can decide on the
basis of that whether it needs to "infill" zeroes as it goes along, or
can ignore zeroes because they are already zero.

While at the same time lifting this restriction also solves other
problems we have, notably the 32 bit limitation on trims which affects
large mkfs greatly.
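
To put a number on the 32 bit limitation, here is a small sketch that counts how many requests a client needs to cover a range when each request carries a 32-bit length. The helper name requests_needed() is made up for illustration, and real clients usually pick a smaller, aligned chunk size, which only increases the count.

```c
#include <stdint.h>

/* Number of requests needed to cover `length` bytes when one request
 * can describe at most `max_len` bytes. `max_len` must be nonzero. */
static uint64_t requests_needed(uint64_t length, uint32_t max_len)
{
    return length / max_len + (length % max_len != 0);
}
```

For a 16 TiB device (2^44 bytes) and the largest 32-bit length, requests_needed(1ULL << 44, UINT32_MAX) comes to 4097 round trips for a trim that a 64-bit length field could express in one.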

Previously discussed here:
https://lists.debian.org/nbd/2018/09/msg00001.html
& continuing the next month here:
https://lists.debian.org/nbd/2018/10/msg00000.html

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Kevin Wolf

2019-Apr-12 11:04 UTC

Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO

On 12.04.2019 at 09:44, Richard W.M. Jones wrote:
> So I had a think about this.
> 
> Isn't this easier/better solved by lifting the 32 bit restriction on
> the size of certain non-data requests (i.e. NBD_CMD_BLOCK_STATUS,
> NBD_CMD_TRIM, NBD_CMD_WRITE_ZEROES)?  The client can then query the
> disk efficiently to see if it starts as zeroes, and can decide on the
> basis of that whether it needs to "infill" zeroes as it goes along, or
> can ignore zeroes because they are already zero.
> 
> While at the same time lifting this restriction also solves other
> problems we have, notably the 32 bit limitation on trims which affects
> large mkfs greatly.
> 
> Previously discussed here:
> https://lists.debian.org/nbd/2018/09/msg00001.html
> & continuing the next month here:
> https://lists.debian.org/nbd/2018/10/msg00000.html
Actually, I think having both is useful.

Detecting that an image is already completely zeroed is useful because
then you don't need to do any preparation (but can we actually query
that e.g. for Nir's block device in question?).

But if you can't decide whether it's zeroed or you know it contains
non-zero data, you still need a way to choose the most efficient way to
write the image to it.

So having one of the features doesn't make the other one irrelevant.
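
A minimal sketch of how a copying client could combine the two features, assuming it has some way to learn the destination's initial state and has already attempted one bulk zero with the fast-zero flag. All type and function names here are hypothetical.

```c
#include <stdbool.h>

/* What the client knows about the destination before copying. */
enum dest_state { DEST_UNKNOWN, DEST_ALL_ZERO, DEST_HAS_DATA };

/* How the client will handle the zero regions of the source. */
enum copy_plan {
    PLAN_SKIP_ZEROES,          /* destination already reads as zero */
    PLAN_BULK_ZERO_THEN_DATA,  /* one fast wipe, then write data only */
    PLAN_ZERO_EACH_HOLE        /* zero every hole individually */
};

/* Hypothetical decision helper. `fast_zero_ok` reports whether a bulk
 * zero request sent with the fast-zero flag succeeded. */
static enum copy_plan choose_plan(enum dest_state state, bool fast_zero_ok)
{
    if (state == DEST_ALL_ZERO) {
        return PLAN_SKIP_ZEROES;
    }
    /* Unknown or dirty destination: a bulk wipe only pays off when the
     * server can do it quickly, otherwise parts get written twice. */
    return fast_zero_ok ? PLAN_BULK_ZERO_THEN_DATA : PLAN_ZERO_EACH_HOLE;
}
```

The first branch is what the initial-zero or 64-bit block status extension would enable; the second is where the fast-zero probe earns its keep.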

Kevin
