Eric Blake
2019-Mar-22 16:42 UTC
[Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
While it may be counterintuitive at first, the introduction of NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance regression in qemu [1], when copying a sparse file. When the destination file must contain the same contents as the source, but it is not known in advance whether the destination started life with all zero content, then there are cases where it is faster to request a bulk zero of the entire device followed by writing only the portions of the device that are to contain data, as that results in fewer I/O transactions overall. In fact, there are even situations where trimming the entire device prior to writing zeroes may be faster than bare write zero request [2]. However, if a bulk zero request ever falls back to the same speed as a normal write, a bulk pre-zeroing algorithm is actually a pessimization, as it ends up writing portions of the disk twice. [1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html [2] https://github.com/libguestfs/nbdkit/commit/407f8dde Hence, it is desirable to have a way for clients to specify that a particular write zero request is being attempted for a fast wipe, and get an immediate failure if the zero request would otherwise take the same time as a write. Conversely, if the client is not performing a pre-initialization pass, it is still more efficient in terms of networking traffic to send NBD_CMD_WRITE_ZERO requests where the server implements the fallback to the slower write, than it is for the client to have to perform the fallback to send NBD_CMD_WRITE with a zeroed buffer. Add a protocol flag and corresponding transmission advertisement flag to make it easier for clients to inform the server of their intent. If the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two things: to perform a fallback to write when the client does not request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the lower network overhead); and to fail quickly with ENOTSUP if the client requested the flag but the server cannot write zeroes more efficiently than a normal write (so that the client is not penalized with the time of writing data areas of the disk twice). Note that the semantics are chosen so that servers should advertise the new flag whether or not they have fast zeroing (that is, this is NOT the server advertising that it has fast zeroes, but rather advertising that the client can get feedback as needed on whether zeroing is fast). It is also intentional that the new advertisement includes a new errno value, ENOTSUP, with rules that this error should not be returned for any pre-existing behaviors, must not happen when the client does not request a fast zero, and must be returned quickly if the client requested fast zero but anything other than the error would not be fast; while leaving it possible for clients to distinguish other errors like EINVAL if alignment constraints are not met. Clients should not send the flag unless the server advertised support, but well-behaved servers should already be reporting EINVAL to unrecognized flags. If the server does not advertise the new feature, clients can safely fall back to assuming that writing zeroes is no faster than normal writes. Note that the Linux fallocate(2) interface may or may not be powerful enough to easily determine if zeroing will be efficient - in particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that insight; for block devices, it is known that ioctl(BLKZEROOUT) does NOT have a way for userspace to probe if it is efficient or slow. But with enough demand, the kernel may add another FALLOC_FL_ flag to use with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed ENOTSUP failures if a fast path cannot be taken. If a server cannot easily determine if write zeroes will be efficient, it is better off not advertising NBD_FLAG_SEND_FAST_ZERO. Signed-off-by: Eric Blake <eblake@redhat.com> --- I will not push this without both: - a positive review (for example, we may decide that burning another NBD_FLAG_* is undesirable, and that we should instead have some sort of NBD_OPT_ handshake for determining when the server supports NBD_CMF_FLAG_FAST_ZERO) - a reference client and server implementation (probably both via qemu, since it was qemu that raised the problem in the first place) doc/proto.md | 44 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 deletion(-) diff --git a/doc/proto.md b/doc/proto.md index 8aaad96..1107766 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -1059,6 +1059,17 @@ The field has the following format: which support the command without advertising this bit, and conversely that this bit does not guarantee that the command will succeed or have an impact. +- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether + `NBD_CMD_WRITE_ZEROES` is efficient. The server MUST set this + transmission flag to 1 if the `NBD_CMD_WRITE_ZEROES` request + supports the `NBD_CMD_FLAG_FAST_ZERO` flag, and MUST set this + transmission flag to 0 if `NBD_FLAG_SEND_WRITE_ZEROES` is not + set. Servers SHOULD NOT set this transmission flag if there is no + quick way to determine whether a particular write zeroes request + will be efficient, but the lack of an efficient write zero + implementation SHOULD NOT prevent a server from setting this + flag. Clients MUST NOT set the `NBD_CMD_FLAG_FAST_ZERO` request flag + unless this transmission flag is set. Clients SHOULD ignore unknown flags. @@ -1636,6 +1647,12 @@ valid may depend on negotiation during the handshake phase. MUST NOT send metadata on more than one extent in the reply. Client implementors should note that using this flag on multiple contiguous requests is likely to be inefficient. +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during + `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the + write zeroes any faster than it would for an equivalent + `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of + `ENOTSUP`. The client MUST NOT set this unless the server advertised + `NBD_FLAG_SEND_FAST_ZERO`. ##### Structured reply flags @@ -2004,7 +2021,10 @@ The following request types exist: reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use. A client MUST NOT send a write zeroes request unless - `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field. + `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags + field. Additionally, a client MUST NOT send the + `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was + set in the transimssion flags field. By default, the server MAY use trimming to zero out the area, even if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure @@ -2014,6 +2034,23 @@ The following request types exist: same area will not cause fragmentation or cause failure due to insufficient space. + If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but + `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail + with `ENOTSUP`, even if the operation is no faster than a + corresponding `NBD_CMD_WRITE`. Conversely, if + `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with + `ENOTSUP` unless the request can be serviced more efficiently than + a corresponding `NBD_CMD_WRITE`. The server's determination of + efficiency MAY depend on whether the request was suitably aligned, + on whether the `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on + whether a previous `NBD_CMD_TRIM` had been performed on the + region. If the server did not advertise + `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD NOT fail with `ENOTSUP`, + regardless of the speed of servicing a request, and SHOULD fail + with `EINVAL` if the `NBD_CMD_FLAG_FAST_ZERO` flag was set. A + server MAY advertise `NBD_FLAG_SEND_FAST_ZERO` whether or not it + can perform efficient zeroing. + If an error occurs, the server MUST set the appropriate error code in the error field. @@ -2114,6 +2151,7 @@ The following error values are defined: * `EINVAL` (22), Invalid argument. * `ENOSPC` (28), No space left on device. * `EOVERFLOW` (75), Value too large. +* `ENOTSUP` (95), Operation not supported. * `ESHUTDOWN` (108), Server is in the process of being shut down. The server SHOULD return `ENOSPC` if it receives a write request @@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block sizes. Finally, it SHOULD return `EPERM` if it receives a write or trim request on a read-only export. +The server SHOULD NOT return `ENOTSUP` except as documented in +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is +supported. + The server SHOULD return `EINVAL` if it receives an unknown command. The server SHOULD return `EINVAL` if it receives an unknown command flag. It -- 2.20.1
Eric Blake
2019-Mar-22 17:17 UTC
Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
On 3/22/19 11:42 AM, Eric Blake wrote:> > Hence, it is desirable to have a way for clients to specify that a > particular write zero request is being attempted for a fast wipe, and > get an immediate failure if the zero request would otherwise take the > same time as a write. Conversely, if the client is not performing a > pre-initialization pass, it is still more efficient in terms of > networking traffic to send NBD_CMD_WRITE_ZERO requests where the > server implements the fallback to the slower write, than it is for the > client to have to perform the fallback to send NBD_CMD_WRITE with a > zeroed buffer. > > Add a protocol flag and corresponding transmission advertisement flag > to make it easier for clients to inform the server of their intent. IfNote that this is independent of proposals made on the NBD list in the past [1] of having a way for the server to advertise that a particular export starts in an all-zeroes state (faster than a series of 32-bit NBD_CMD_BLOCK_STATUS would be able to do), although I may _also_ try to revive proposed documentation and a reference implementation of that optimization as well (as qemu-img convert can completely skip the zeroing, whether the bulk wipe or per-hole writing, when it knows the destination is already zero). [1] https://lists.debian.org/nbd/2016/12/msg00015.html and following (doc: Propose NBD_FLAG_INIT_ZEROES extension)> > I will not push this without both: > - a positive review (for example, we may decide that burning another > NBD_FLAG_* is undesirable, and that we should instead have some sort > of NBD_OPT_ handshake for determining when the server supports > NBD_CMF_FLAG_FAST_ZERO) > - a reference client and server implementation (probably both via qemu, > since it was qemu that raised the problem in the first place)The last time we mentioned the possibility of advertising an initial zero state, we debated whether burning one of our 16 NBD_FLAG_* transmission bits for that purpose was wise [2], but discussion stalled after I never developed a counterproposal with NBD_OPT_* handshaking and never produced a reference implementation. [2] https://lists.debian.org/nbd/2016/12/msg00048.html Also, keep in mind that knowing that something started as all zeroes (which only affects startup; once you do any write, that early status bit no longer means anything to current operation, so less important to hand to the kernel during transmission phase, especially if the kernel can ever learn to utilize NBD_CMD_BLOCK_STATUS) is indeed different from knowing if probing for fast zeroing is supported (where it is easy to conceive of the kernel needing to know if it can send NBD_CMD_FLAG_FAST_ZERO). So we may still want to use NBD_OPT_* to get the initial zero extension, but NBD_FLAG to advertise the fast zero extension. On the other hand, it's also worth thinking about which extensions are easy for servers to implement - NBD_FLAG_INIT_ZEROES and NBD_FLAG_SEND_FAST_ZERO are orthogonal enough that I could see a full 2x2 mix of servers (unsupported, either one of the two supported, or both supported), and where clients may make optimization choices based on any of those four combinations. [and if we're keeping score, other extension proposals that I want revisit, in no particular order, include: - 64-bit operations - NBD_CMD_RESIZE - more precision on TRIM/WRITE_ZERO alignment constraints ] -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Nir Soffer
2019-Mar-22 19:42 UTC
Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
On Fri, Mar 22, 2019 at 6:43 PM Eric Blake <eblake@redhat.com> wrote:> While it may be counterintuitive at first, the introduction of > NBD_CMD_WRITE_ZEROES and NBD_CMD_BLOCK_STATUS has caused a performance > regression in qemu [1], when copying a sparse file. When the > destination file must contain the same contents as the source, but it > is not known in advance whether the destination started life with all > zero content, then there are cases where it is faster to request a > bulk zero of the entire device followed by writing only the portions > of the device that are to contain data, as that results in fewer I/O > transactions overall. In fact, there are even situations where > trimming the entire device prior to writing zeroes may be faster than > bare write zero request [2]. However, if a bulk zero request ever > falls back to the same speed as a normal write, a bulk pre-zeroing > algorithm is actually a pessimization, as it ends up writing portions > of the disk twice. > > [1] https://lists.gnu.org/archive/html/qemu-devel/2019-03/msg06389.html > [2] https://github.com/libguestfs/nbdkit/commit/407f8dde > > Hence, it is desirable to have a way for clients to specify that a > particular write zero request is being attempted for a fast wipe, and > get an immediate failure if the zero request would otherwise take the > same time as a write. Conversely, if the client is not performing a > pre-initialization pass, it is still more efficient in terms of > networking traffic to send NBD_CMD_WRITE_ZERO requests where the > server implements the fallback to the slower write, than it is for the > client to have to perform the fallback to send NBD_CMD_WRITE with a > zeroed buffer. > > Add a protocol flag and corresponding transmission advertisement flag > to make it easier for clients to inform the server of their intent. If > the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two > things: to perform a fallback to write when the client does not > request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the > lower network overhead); and to fail quickly with ENOTSUP if the > client requested the flag but the server cannot write zeroes more > efficiently than a normal write (so that the client is not penalized > with the time of writing data areas of the disk twice). >I think the issue is not that zero is slow as normal write, but that it is not fast enough so it worth the zero entire disk before writing data. For example, on storage server we had in the past BLKZEROOUT rate was 50G/s. On another server, it can run anywhere from 1G/s to 100G/s, depending on the allocation status of the zeroed range. Note that the semantics are chosen so that servers should advertise> the new flag whether or not they have fast zeroing (that is, this is > NOT the server advertising that it has fast zeroes, but rather > advertising that the client can get feedback as needed on whether > zeroing is fast). It is also intentional that the new advertisement > includes a new errno value, ENOTSUP, with rules that this error should > not be returned for any pre-existing behaviors, must not happen when > the client does not request a fast zero, and must be returned quickly > if the client requested fast zero but anything other than the error > would not be fast; while leaving it possible for clients to > distinguish other errors like EINVAL if alignment constraints are not > met. Clients should not send the flag unless the server advertised > support, but well-behaved servers should already be reporting EINVAL > to unrecognized flags. If the server does not advertise the new > feature, clients can safely fall back to assuming that writing zeroes > is no faster than normal writes.> Note that the Linux fallocate(2) interface may or may not be powerful > enough to easily determine if zeroing will be efficient - in > particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that > insight; for block devices, it is known that ioctl(BLKZEROOUT) does > NOT have a way for userspace to probe if it is efficient or slow. But > with enough demand, the kernel may add another FALLOC_FL_ flag to use > with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed > ENOTSUP failures if a fast path cannot be taken. If a server cannot > easily determine if write zeroes will be efficient, it is better off > not advertising NBD_FLAG_SEND_FAST_ZERO. >I think this can work for file based images. If fallocate() fails, the client will get ENOTSUP after the first call quickly. For block device we don't have any way to know if a fallocate() or BLKZEROOUT will be fast, so I guess servers will never advertise FAST_ZERO. Generally this new flag usefulness is limited. It will only help qemu-img to convert faster to file based images. Do we have performance measurements showing significant speed up when zeroing the entire image before coping data, compared with zeroing only the unallocated ranges? For example if the best speedup we can get in real world scenario is 2%, is ti worth complicating the protocol and using another bit?> Signed-off-by: Eric Blake <eblake@redhat.com> > --- > > I will not push this without both: > - a positive review (for example, we may decide that burning another > NBD_FLAG_* is undesirable, and that we should instead have some sort > of NBD_OPT_ handshake for determining when the server supports > NBD_CMF_FLAG_FAST_ZERO) > - a reference client and server implementation (probably both via qemu, > since it was qemu that raised the problem in the first place) > > doc/proto.md | 44 +++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 43 insertions(+), 1 deletion(-) > > diff --git a/doc/proto.md b/doc/proto.md > index 8aaad96..1107766 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -1059,6 +1059,17 @@ The field has the following format: > which support the command without advertising this bit, and > conversely that this bit does not guarantee that the command will > succeed or have an impact. > +- bit 11, `NBD_FLAG_SEND_FAST_ZERO`: allow clients to detect whether > + `NBD_CMD_WRITE_ZEROES` is efficient. The server MUST set this > + transmission flag to 1 if the `NBD_CMD_WRITE_ZEROES` request > + supports the `NBD_CMD_FLAG_FAST_ZERO` flag, and MUST set this > + transmission flag to 0 if `NBD_FLAG_SEND_WRITE_ZEROES` is not > + set. Servers SHOULD NOT set this transmission flag if there is no > + quick way to determine whether a particular write zeroes request > + will be efficient, but the lack of an efficient write zero >I think we should use "fast" instead of "efficient". For example when the kernel fallback to manual zeroing it is probably the most efficient way it can be done, but it is not fast.> + implementation SHOULD NOT prevent a server from setting this > + flag. Clients MUST NOT set the `NBD_CMD_FLAG_FAST_ZERO` request flag > + unless this transmission flag is set. > > Clients SHOULD ignore unknown flags. > > @@ -1636,6 +1647,12 @@ valid may depend on negotiation during the > handshake phase. > MUST NOT send metadata on more than one extent in the reply. Client > implementors should note that using this flag on multiple contiguous > requests is likely to be inefficient. > +- bit 4, `NBD_CMD_FLAG_FAST_ZERO`; valid during > + `NBD_CMD_WRITE_ZEROES`. If set, but the server cannot perform the > + write zeroes any faster than it would for an equivalent > + `NBD_CMD_WRITE`, then the server MUST fail quickly with an error of > + `ENOTSUP`. The client MUST NOT set this unless the server advertised > + `NBD_FLAG_SEND_FAST_ZERO`. > > ##### Structured reply flags > > @@ -2004,7 +2021,10 @@ The following request types exist: > reached permanent storage, unless `NBD_CMD_FLAG_FUA` is in use. > > A client MUST NOT send a write zeroes request unless > - `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags field. > + `NBD_FLAG_SEND_WRITE_ZEROES` was set in the transmission flags > + field. Additionally, a client MUST NOT send the > + `NBD_CMD_FLAG_FAST_ZERO` flag unless `NBD_FLAG_SEND_FAST_ZERO` was > + set in the transimssion flags field. > > By default, the server MAY use trimming to zero out the area, even > if it did not advertise `NBD_FLAG_SEND_TRIM`; but it MUST ensure > @@ -2014,6 +2034,23 @@ The following request types exist: > same area will not cause fragmentation or cause failure due to > insufficient space. > > + If the server advertised `NBD_FLAG_SEND_FAST_ZERO` but > + `NBD_CMD_FLAG_FAST_ZERO` is not set, then the server MUST NOT fail > + with `ENOTSUP`, even if the operation is no faster than a > + corresponding `NBD_CMD_WRITE`. Conversely, if > + `NBD_CMD_FLAG_FAST_ZERO` is set, the server MUST fail quickly with > + `ENOTSUP` unless the request can be serviced more efficiently than > + a corresponding `NBD_CMD_WRITE`. The server's determination of > + efficiency MAY depend on whether the request was suitably aligned, > + on whether the `NBD_CMD_FLAG_NO_HOLE` flag was present, or even on > + whether a previous `NBD_CMD_TRIM` had been performed on the > + region. If the server did not advertise > + `NBD_FLAG_SEND_FAST_ZERO`, then it SHOULD NOT fail with `ENOTSUP`, > + regardless of the speed of servicing a request, and SHOULD fail > + with `EINVAL` if the `NBD_CMD_FLAG_FAST_ZERO` flag was set. A > + server MAY advertise `NBD_FLAG_SEND_FAST_ZERO` whether or not it > + can perform efficient zeroing. > + > If an error occurs, the server MUST set the appropriate error code > in the error field. > > @@ -2114,6 +2151,7 @@ The following error values are defined: > * `EINVAL` (22), Invalid argument. > * `ENOSPC` (28), No space left on device. > * `EOVERFLOW` (75), Value too large. > +* `ENOTSUP` (95), Operation not supported. > * `ESHUTDOWN` (108), Server is in the process of being shut down. > > The server SHOULD return `ENOSPC` if it receives a write request > @@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block > sizes. Finally, it > SHOULD return `EPERM` if it receives a write or trim request on a > read-only export. > > +The server SHOULD NOT return `ENOTSUP` except as documented in > +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is > +supported. >This makes ENOTSUP less useful. I think it should be allowed to return ENOTSUP as response for other commands if needed.> + > The server SHOULD return `EINVAL` if it receives an unknown command. > > The server SHOULD return `EINVAL` if it receives an unknown command flag. > It > -- > 2.20.1 >I think this makes sense, and should work, but we need more data supporting that this is useful in practice. Nir
Eric Blake
2019-Mar-22 20:42 UTC
Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
On 3/22/19 2:42 PM, Nir Soffer wrote:>> Add a protocol flag and corresponding transmission advertisement flag >> to make it easier for clients to inform the server of their intent. If >> the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two >> things: to perform a fallback to write when the client does not >> request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the >> lower network overhead); and to fail quickly with ENOTSUP if the >> client requested the flag but the server cannot write zeroes more >> efficiently than a normal write (so that the client is not penalized >> with the time of writing data areas of the disk twice). >> > > I think the issue is not that zero is slow as normal write, but that it is > not fast > enough so it worth the zero entire disk before writing data.In an image copy, where you don't know if the destination already started life with all zero, then you HAVE to copy zeros into the image for the holes; the only question is whether also pre-filling the entire image (with fewer calls) and then overwriting the prefill is faster than just writing the data areas once. So there is a tradeoff to see how much time do you add with the overhead of lots of small-length WRITE_ZEROES for the holes, vs. the overhead of one large-length WRITE_ZEROES for the entire image. There's ALSO a factor of how much of the image is holes vs. data - a pre-fill of only 10% of the image (which is mostly sparse) is less wasteful than a pre-fill of 90% of the image (which is mostly dense) - but that waste doesn't cost anything if prefill is O(1) regardless of size; vs. being painful if it is O(n) based on size. There are definitely heuristics at play, and I don't know that the NBD spec can go into any strong advice on what type of speedups are in play, only whether the write zero is on par with normal writes. And, given the uncertainties on what speedups (or slowdowns) a pre-fill might cause, it DOES show that knowing if an image started life all zero is an even better optimization, because then you don't have to waste any time on overwriting holes. But having another way to speed things up does not necessarily render this proposal as useless.>> Note that the Linux fallocate(2) interface may or may not be powerful >> enough to easily determine if zeroing will be efficient - in >> particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that >> insight; for block devices, it is known that ioctl(BLKZEROOUT) does >> NOT have a way for userspace to probe if it is efficient or slow. But >> with enough demand, the kernel may add another FALLOC_FL_ flag to use >> with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed >> ENOTSUP failures if a fast path cannot be taken. If a server cannot >> easily determine if write zeroes will be efficient, it is better off >> not advertising NBD_FLAG_SEND_FAST_ZERO. >> > > I think this can work for file based images. If fallocate() fails, the > client > will get ENOTSUP after the first call quickly.The negative case is fast, but that doesn't say anything about the positive case. Unless Linux adds a new FALLOC_FL_ bit, you have no guarantee whether fallocate() reporting success may still have happened because the kernel did a fallback to a slow write. If fallocate() comes back quickly, you got lucky; but if it takes the full time of a write(), you lost your window of opportunity to report ENOTSUP quickly. Hence, my hope that the kernel folks add a new FALLOC_FL_ flag to give us the semantics we want (of a guaranteed way to avoid slow fallbacks).> > For block device we don't have any way to know if a fallocate() or > BLKZEROOUT > will be fast, so I guess servers will never advertise FAST_ZERO. >As I said, you don't know that with BLKZEROOUT, but the kernel might give us another ioctl that DOES know.> Generally this new flag usefulness is limited. It will only help qemu-img > to convert > faster to file based images.Limited use case is still a use case. If there are cases where you can optimize by a simple extension to the protocol, and where either side lacking the extension is not fatal to the protocol, then it is worth doing. And so far, that is what this feels like to me.> > Do we have performance measurements showing significant speed up when > zeroing the entire image before coping data, compared with zeroing only the > unallocated ranges?Kevin may have more of an idea based on the patches he wrote for qemu-img, and which spurred me into proposing this email; maybe he can share numbers for his testing on regular files and/or block devices to at least get a feel for whether a speedup is likely with a sufficient NBD server.> > For example if the best speedup we can get in real world scenario is 2%, is > ti > worth complicating the protocol and using another bit?Gaining 2% of an hour may still be worth it.>> + set. Servers SHOULD NOT set this transmission flag if there is no >> + quick way to determine whether a particular write zeroes request >> + will be efficient, but the lack of an efficient write zero >> > > I think we should use "fast" instead of "efficient". For example when the > kernel > fallback to manual zeroing it is probably the most efficient way it can be > done, > but it is not fast.Seems like a simple enough wording change.>> @@ -2114,6 +2151,7 @@ The following error values are defined: >> * `EINVAL` (22), Invalid argument. >> * `ENOSPC` (28), No space left on device. >> * `EOVERFLOW` (75), Value too large. >> +* `ENOTSUP` (95), Operation not supported. >> * `ESHUTDOWN` (108), Server is in the process of being shut down. >> >> The server SHOULD return `ENOSPC` if it receives a write request >> @@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block >> sizes. Finally, it >> SHOULD return `EPERM` if it receives a write or trim request on a >> read-only export. >> >> +The server SHOULD NOT return `ENOTSUP` except as documented in >> +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is >> +supported. >> > > This makes ENOTSUP less useful. I think it should be allowed to return > ENOTSUP > as response for other commands if needed.Sorry, but we have the problem of back-compat to worry about. Remember, the error values permitted in the NBD protocol are system-agnostic (they _happen_ to match Linux errno values, but not all the world uses the same values for those errors in their libc, so portable implementations HAVE to map between NBD_EINVAL sent over the wire and libc EINVAL used internally, even if the mapping is 1:1 on Linux). Since the NBD protocol has documented only a finite subset of valid errors, and portable clients have to implement a mapping, it's very probably that there exist clients written against the current NBD spec that will choke hard (and probably hang up the connection) on receiving an unexpected error number from the server which was not pre-compiled into their mapping. ANY server that replies with ENOTSUP at the moment is in violation of the existing server requirements, whether or not clients have a high quality of implementation and manage to tolerate the server's noncompliance. Thus, when we add new errno values as being valid returns, we have to take care that servers SHOULD NOT send the new errno except to clients that are prepared for the error - a server merely advertising NBD_FLAG_SEND_FAST_ZERO is _still_ insufficient to give the server rights to send ENOTSUP (since the server can't know if the client recognized the advertisement, at least until the client finally sends a NBD_CMD_FLAG_FAST_ZERO flag). (Note, I said SHOULD NOT, not MUST NOT - if your server goofs and leaks ENOTSUP to a client on any other command, most clients will still be okay, and so you probably won't have people complaining that your server is broken. The only MUST NOT send ENOTSUP is for the case where the server advertised FAST_ZERO probing and the client did not request FAST_ZERO, because then server has to assume the client is relying on the server to do fallback handling for reduced network traffic.)> > I think this makes sense, and should work, but we need more data supporting > that this is > useful in practice.Fair enough - since Kevin has already got patches proposed against qemu to wire up a qemu flag BDRV_REQ_NO_FALLBACK, which should map in a rather straightforward manner to my NBD proposal (any qemu request sent with the BDRV_REQ_NO_FALLBACK bit set turns into an NBD_CMD_WRITE_ZEROES with the NBD_CMD_FLAG_FAST_ZERO set), it should be pretty easy for me to demonstrate a timing analysis of the proposed reference implementation, to prove that it either makes a noticeable difference or was in the noise. But it may be a couple of weeks before I work on a reference implementation - even if Kevin's patches are qemu 4.0 material to fix a speed regression, getting a new NBD protocol extension included during feature freeze is too much of a stretch. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Richard W.M. Jones
2019-Mar-22 22:06 UTC
Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
On Fri, Mar 22, 2019 at 12:17:59PM -0500, Eric Blake wrote:> On 3/22/19 11:42 AM, Eric Blake wrote: > > > > > Hence, it is desirable to have a way for clients to specify that a > > particular write zero request is being attempted for a fast wipe, and > > get an immediate failure if the zero request would otherwise take the > > same time as a write. Conversely, if the client is not performing a > > pre-initialization pass, it is still more efficient in terms of > > networking traffic to send NBD_CMD_WRITE_ZERO requests where the > > server implements the fallback to the slower write, than it is for the > > client to have to perform the fallback to send NBD_CMD_WRITE with a > > zeroed buffer. > > > > Add a protocol flag and corresponding transmission advertisement flag > > to make it easier for clients to inform the server of their intent. If > > Note that this is independent of proposals made on the NBD list in the > past [1] of having a way for the server to advertise that a particular > export starts in an all-zeroes state (faster than a series of 32-bit > NBD_CMD_BLOCK_STATUS would be able to do), although I may _also_ try to > revive proposed documentation and a reference implementation of that > optimization as well (as qemu-img convert can completely skip the > zeroing, whether the bulk wipe or per-hole writing, when it knows the > destination is already zero).It has to be said that this would be a lot easier to implement, and for our purposes (optimizing qemu-img convert) it does everything we need. However the original proposal you put here seems reasonable. I have only one comment about it: Should the new error (ENOTSUP) be submitted as a separate patch to the spec?> [1] https://lists.debian.org/nbd/2016/12/msg00015.html and following > (doc: Propose NBD_FLAG_INIT_ZEROES extension) > > > > > I will not push this without both: > > - a positive review (for example, we may decide that burning another > > NBD_FLAG_* is undesirable, and that we should instead have some sort > > of NBD_OPT_ handshake for determining when the server supports > > NBD_CMD_FLAG_FAST_ZERO)>From an implementation point of view I prefer simple flags over havingto implement a brand new option. We can always work out how to extend the flags field if we run out of flags. For example, by implementing NBD_OPT_INFO2 with a much bigger flags field.> > - a reference client and server implementation (probably both via qemu, > > since it was qemu that raised the problem in the first place) > > The last time we mentioned the possibility of advertising an initial > zero state, we debated whether burning one of our 16 NBD_FLAG_* > transmission bits for that purpose was wise [2], but discussion stalled > after I never developed a counterproposal with NBD_OPT_* handshaking and > never produced a reference implementation. > > [2] https://lists.debian.org/nbd/2016/12/msg00048.html > > Also, keep in mind that knowing that something started as all zeroes > (which only affects startup; once you do any write, that early status > bit no longer means anything to current operation, so less important to > hand to the kernel during transmission phase, especially if the kernel > can ever learn to utilize NBD_CMD_BLOCK_STATUS) is indeed different from > knowing if probing for fast zeroing is supported (where it is easy to > conceive of the kernel needing to know if it can send > NBD_CMD_FLAG_FAST_ZERO). So we may still want to use NBD_OPT_* to get > the initial zero extension, but NBD_FLAG to advertise the fast zero > extension. > > On the other hand, it's also worth thinking about which extensions are > easy for servers to implement - NBD_FLAG_INIT_ZEROES and > NBD_FLAG_SEND_FAST_ZERO are orthogonal enough that I could see a full > 2x2 mix of servers (unsupported, either one of the two supported, or > both supported), and where clients may make optimization choices based > on any of those four combinations. > > [and if we're keeping score, other extension proposals that I want > revisit, in no particular order, include: > - 64-bit operations > - NBD_CMD_RESIZE > - more precision on TRIM/WRITE_ZERO alignment constraints > ]Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com Fedora Windows cross-compiler. Compile Windows programs, test, and build Windows installers. Over 100 libraries supported. http://fedoraproject.org/wiki/MinGW
Richard W.M. Jones
2019-Apr-12 07:44 UTC
Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
So I had a think about this. Isn't this easier/better solved by lifting the 32 bit restriction on the size of certain non-data requests (ie. NBD_CMD_BLOCK_STATUS, NBD_CMD_TRIM, NBD_CMD_WRITE_ZEROES). The client can then query the disk efficiently to see if it starts as zeroes, and can decide on the basis of that whether it needs to "infill" zeroes as it goes along, or can ignore zeroes because they are already zero. While at the same time lifting this restriction also solves other problems we have, notably the 32 bit limitation on trims which affects large mkfs greatly. Previously discussed here: https://lists.debian.org/nbd/2018/09/msg00001.html & continuing the next month here: https://lists.debian.org/nbd/2018/10/msg00000.html Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
Kevin Wolf
2019-Apr-12 11:04 UTC
Re: [Libguestfs] [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
Am 12.04.2019 um 09:44 hat Richard W.M. Jones geschrieben:> So I had a think about this. > > Isn't this easier/better solved by lifting the 32 bit restriction on > the size of certain non-data requests (ie. NBD_CMD_BLOCK_STATUS, > NBD_CMD_TRIM, NBD_CMD_WRITE_ZEROES). The client can then query the > disk efficiently to see if it starts as zeroes, and can decide on the > basis of that whether it needs to "infill" zeroes as it goes along, or > can ignore zeroes because they are already zero. > > While at the same time lifting this restriction also solves other > problems we have, notably the 32 bit limitation on trims which affects > large mkfs greatly. > > Previously discussed here: > https://lists.debian.org/nbd/2018/09/msg00001.html > & continuing the next month here: > https://lists.debian.org/nbd/2018/10/msg00000.htmlActually, I think having both is useful. Detecting that an image is already completely zeroed is useful because then you don't need to do any preparation (but can we actually query that e.g. for Nir's block device in question?). But if you can't decide whether it's zeroed or you know it contains non-zero data, you still need a way to choose the most efficient way to write the image to it. So having one of the features doesn't make the other one irrelevant. Kevin
Reasonably Related Threads
- [PATCH 1/1] protocol: Add NBD_CMD_FLAG_FAST_ZERO
- Re: [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
- Re: [RFC PATCH] protocol: Add NBD_CMD_FLAG_FAST_ZERO
- cross-project patches: Add NBD Fast Zero support
- [nbdkit PATCH 1/3] server: Add internal support for NBDKIT_FLAG_FAST_ZERO