The upstream protocol recently promoted NBD_CMD_WRITE_ZEROES from experimental to a documented extension. Exposing support for this allows plugin writers to create sparse files when driven by a client that knows how to use the extension; meanwhile, even if a plugin does not support this extension, the server benefits from less network traffic from the client. Eric Blake (5): protocol: Support NBD_FLAG_NO_ZEROES protocol: Validate request flags plugins: Add callback for writing zeroes protocol: Implement NBD_CMD_WRITE_ZEROES file: Support punching holes for write zero docs/nbdkit-plugin.pod | 19 +++++++++++++++++++ include/nbdkit-plugin.h | 1 + plugins/file/file.c | 32 +++++++++++++++++++++++++++++++ src/connections.c | 47 +++++++++++++++++++++++++++++++++++++++++----- src/internal.h | 1 + src/plugins.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++ src/protocol.h | 18 +++++++++++------- 7 files changed, 156 insertions(+), 12 deletions(-) -- 2.9.3
Eric Blake
2017-Jan-20 20:16 UTC
[Libguestfs] [nbdkit PATCH 1/5] protocol: Support NBD_FLAG_NO_ZEROES
The upstream NBD protocol allows a reduction in the length of the handshake by both sides agreeing to skip the 124 bytes of padding in the final server reply. Signed-off-by: Eric Blake <eblake@redhat.com> --- src/connections.c | 14 +++++++++++--- src/protocol.h | 1 + 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/src/connections.c b/src/connections.c index 840e315..16c6584 100644 --- a/src/connections.c +++ b/src/connections.c @@ -42,6 +42,7 @@ #include <errno.h> #include <endian.h> #include <sys/types.h> +#include <stddef.h> #include <pthread.h> @@ -390,7 +391,7 @@ _negotiate_handshake_newstyle (struct connection *conn) uint16_t eflags; int fl; - gflags = NBD_FLAG_FIXED_NEWSTYLE; + gflags = NBD_FLAG_FIXED_NEWSTYLE | NBD_FLAG_NO_ZEROES; debug ("newstyle negotiation: flags: global 0x%x", gflags); @@ -409,8 +410,12 @@ _negotiate_handshake_newstyle (struct connection *conn) return -1; } cflags = be32toh (cflags); - /* ... which other than printing out, we ignore. */ + /* ... which we check for accuracy. */ debug ("newstyle negotiation: client flags: 0x%x", cflags); + if (cflags & ~gflags) { + nbdkit_error ("client requested unknown flags 0x%x", cflags); + return -1; + } /* Receive newstyle options. */ if (_negotiate_handshake_newstyle_options (conn) == -1) @@ -469,7 +474,10 @@ _negotiate_handshake_newstyle (struct connection *conn) handshake_finish.eflags = htobe16 (eflags); if (xwrite (conn->sockout, - &handshake_finish, sizeof handshake_finish) == -1) { + &handshake_finish, + (cflags & NBD_FLAG_NO_ZEROES) + ? offsetof (struct new_handshake_finish, zeroes) + : sizeof handshake_finish) == -1) { nbdkit_error ("write: %m"); return -1; } diff --git a/src/protocol.h b/src/protocol.h index c885d9e..23630a9 100644 --- a/src/protocol.h +++ b/src/protocol.h @@ -84,6 +84,7 @@ struct new_handshake_finish { /* Global flags. */ #define NBD_FLAG_FIXED_NEWSTYLE 1 +#define NBD_FLAG_NO_ZEROES 2 /* Per-export flags. */ #define NBD_FLAG_HAS_FLAGS 1 -- 2.9.3
Eric Blake
2017-Jan-20 20:16 UTC
[Libguestfs] [nbdkit PATCH 2/5] protocol: Validate request flags
Reject rather than silently ignoring unknown client request flags. Signed-off-by: Eric Blake <eblake@redhat.com> --- src/connections.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/src/connections.c b/src/connections.c index 16c6584..44b7530 100644 --- a/src/connections.c +++ b/src/connections.c @@ -546,6 +546,13 @@ validate_request (struct connection *conn, return 0; } + /* Validate flags */ + if (flags & ~NBD_CMD_FLAG_FUA) { + nbdkit_error ("invalid request: unknown flag (0x%x)", flags); + *error = EINVAL; + return 0; + } + /* Refuse over-large read and write requests. */ if ((cmd == NBD_CMD_WRITE || cmd == NBD_CMD_READ) && count > MAX_REQUEST_SIZE) { @@ -741,7 +748,7 @@ recv_request_send_reply (struct connection *conn) } cmd = be32toh (request.type); - flags = cmd; + flags = cmd & ~NBD_CMD_MASK_COMMAND; cmd &= NBD_CMD_MASK_COMMAND; offset = be64toh (request.offset); -- 2.9.3
Eric Blake
2017-Jan-20 20:16 UTC
[Libguestfs] [nbdkit PATCH 3/5] plugins: Add callback for writing zeroes
Similar to .trim, except that it guarantees that zeroes are read back, and also clients must obey the may_trim argument with regards to whether a hole may be used or whether the file must remain allocated with actual zeroes written. If the callback is not implemented, or if the callback fails with EOPNOTSUPP, fall back to fragmenting the request and calling .pwrite with a known-zero buffer. The handling of EOPNOTSUPP allows callbacks to avoid the need to reimplement the work of allocating an all-zero buffer; at least the file driver on Linux will benefit from these semantics as it means we can try to use fallocate(), then gracefully use normal writes if the underlying file system doesn't support what we need. Signed-off-by: Eric Blake <eblake@redhat.com> --- docs/nbdkit-plugin.pod | 19 +++++++++++++++++++ include/nbdkit-plugin.h | 1 + src/internal.h | 1 + src/plugins.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 71 insertions(+) diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod index 30e1f86..ee6bbd5 100644 --- a/docs/nbdkit-plugin.pod +++ b/docs/nbdkit-plugin.pod @@ -432,6 +432,25 @@ callback. If there is an error, C<.trim> should call C<nbdkit_error> with an error message and return C<-1>. +=head2 C<.zero> + + int zero (void *handle, uint32_count, uint64_t offset, int may_trim); + +During the data serving phase, this callback is used to write C<count> +bytes of zeroes at C<offset> in the backing store. If C<may_trim> is +non-zero, the operation can punch a hole instead of writing actual +zero bytes, but only if subsequent reads from the hole read as zeroes. +If this callback is omitted, or if it fails with errno set to +EOPNOTSUPP, then C<.pwrite> will be used instead. + +The callback must write the whole C<count> bytes if it can. The NBD +protocol doesn't allow partial writes (instead, these would be +errors). If the whole C<count> bytes was written successfully, the +callback should return C<0> to indicate there was I<no> error. + +If there is an error, C<.zero> should call C<nbdkit_error> with an +error message and return C<-1>. + =head1 THREADS Each nbdkit plugin must declare its thread safety model by defining diff --git a/include/nbdkit-plugin.h b/include/nbdkit-plugin.h index bc9794e..3d25642 100644 --- a/include/nbdkit-plugin.h +++ b/include/nbdkit-plugin.h @@ -77,6 +77,7 @@ struct nbdkit_plugin { int (*pwrite) (void *handle, const void *buf, uint32_t count, uint64_t offset); int (*flush) (void *handle); int (*trim) (void *handle, uint32_t count, uint64_t offset); + int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim); /* int (*set_exportname) (void *handle, const char *exportname); */ }; diff --git a/src/internal.h b/src/internal.h index bc4fa12..cb15bab 100644 --- a/src/internal.h +++ b/src/internal.h @@ -107,6 +107,7 @@ extern int plugin_pread (struct connection *conn, void *buf, uint32_t count, uin extern int plugin_pwrite (struct connection *conn, void *buf, uint32_t count, uint64_t offset); extern int plugin_flush (struct connection *conn); extern int plugin_trim (struct connection *conn, uint32_t count, uint64_t offset); +extern int plugin_zero (struct connection *conn, uint32_t count, uint64_t offset, int may_trim); /* sockets.c */ extern int *bind_unix_socket (size_t *); diff --git a/src/plugins.c b/src/plugins.c index 8f38761..574be03 100644 --- a/src/plugins.c +++ b/src/plugins.c @@ -48,6 +48,9 @@ static pthread_mutex_t connection_lock = PTHREAD_MUTEX_INITIALIZER; static pthread_mutex_t all_requests_lock = PTHREAD_MUTEX_INITIALIZER; +/* Maximum read or write request that we will handle. */ +#define MAX_REQUEST_SIZE (64 * 1024 * 1024) + /* Currently the server can only load one plugin (see TODO). Hence we * can just use globals to store these. */ @@ -252,6 +255,7 @@ plugin_dump_fields (void) HAS (pwrite); HAS (flush); HAS (trim); + HAS (zero); #undef HAS } @@ -506,3 +510,49 @@ plugin_trim (struct connection *conn, uint32_t count, uint64_t offset) return -1; } } + +int +plugin_zero (struct connection *conn, + uint32_t count, uint64_t offset, int may_trim) +{ + assert (dl); + assert (conn->handle); + char *buf; + uint32_t limit; + int result; + int err; + + debug ("zero count=%" PRIu32 " offset=%" PRIu64 " may_trim=%d", + count, offset, may_trim); + + if (!count) + return 0; + if (plugin.zero) { + errno = 0; + result = plugin.zero (conn->handle, count, offset, may_trim); + if (result == 0 || errno != EOPNOTSUPP) + return result; + } + + assert (plugin.pwrite); + limit = count < MAX_REQUEST_SIZE ? count : MAX_REQUEST_SIZE; + buf = calloc (limit, 1); + if (!buf) { + errno = ENOMEM; + return -1; + } + + while (count) { + result = plugin.pwrite (conn->handle, buf, limit, offset); + if (result < 0) + break; + count -= limit; + if (count < limit) + limit = count; + } + + err = errno; + free (buf); + errno = err; + return result; +} -- 2.9.3
Eric Blake
2017-Jan-20 20:16 UTC
[Libguestfs] [nbdkit PATCH 4/5] protocol: Implement NBD_CMD_WRITE_ZEROES
We always advertise this to the client (for writable exports), even when the plugin does not have any optimized implementation, because it allows for more efficient network traffic. Signed-off-by: Eric Blake <eblake@redhat.com> --- src/connections.c | 26 ++++++++++++++++++++++++-- src/protocol.h | 17 ++++++++++------- 2 files changed, 34 insertions(+), 9 deletions(-) diff --git a/src/connections.c b/src/connections.c index 44b7530..1b39547 100644 --- a/src/connections.c +++ b/src/connections.c @@ -180,6 +180,10 @@ _negotiate_handshake_oldstyle (struct connection *conn) eflags |= NBD_FLAG_READ_ONLY; conn->readonly = 1; } + if (!conn->readonly) { + eflags |= NBD_FLAG_SEND_WRITE_ZEROES; + } + fl = plugin_can_flush (conn); if (fl == -1) @@ -442,6 +446,9 @@ _negotiate_handshake_newstyle (struct connection *conn) eflags |= NBD_FLAG_READ_ONLY; conn->readonly = 1; } + if (!conn->readonly) { + eflags |= NBD_FLAG_SEND_WRITE_ZEROES; + } fl = plugin_can_flush (conn); if (fl == -1) @@ -520,6 +527,7 @@ validate_request (struct connection *conn, case NBD_CMD_READ: case NBD_CMD_WRITE: case NBD_CMD_TRIM: + case NBD_CMD_WRITE_ZEROES: r = valid_range (conn, offset, count); if (r == -1) return -1; @@ -547,11 +555,17 @@ validate_request (struct connection *conn, } /* Validate flags */ - if (flags & ~NBD_CMD_FLAG_FUA) { + if (flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE)) { nbdkit_error ("invalid request: unknown flag (0x%x)", flags); *error = EINVAL; return 0; } + if ((flags & NBD_CMD_FLAG_NO_HOLE) && + cmd != NBD_CMD_WRITE_ZEROES) { + nbdkit_error ("invalid request: NO_HOLE flag needs WRITE_ZEROES request"); + *error = EINVAL; + return 0; + } /* Refuse over-large read and write requests. */ if ((cmd == NBD_CMD_WRITE || cmd == NBD_CMD_READ) && @@ -565,7 +579,7 @@ validate_request (struct connection *conn, /* Readonly connection? */ if (conn->readonly && (cmd == NBD_CMD_WRITE || cmd == NBD_CMD_FLUSH || - cmd == NBD_CMD_TRIM)) { + cmd == NBD_CMD_TRIM || cmd == NBD_CMD_WRITE_ZEROES)) { nbdkit_error ("invalid request: write request on readonly connection"); *error = EROFS; return 0; @@ -647,6 +661,14 @@ _handle_request (struct connection *conn, } break; + case NBD_CMD_WRITE_ZEROES: + r = plugin_zero (conn, count, offset, !(flags & NBD_CMD_FLAG_NO_HOLE)); + if (r == -1) { + *error = errno ? errno : EIO; + return 0; + } + break; + default: abort (); } diff --git a/src/protocol.h b/src/protocol.h index 23630a9..4571a3a 100644 --- a/src/protocol.h +++ b/src/protocol.h @@ -87,12 +87,13 @@ struct new_handshake_finish { #define NBD_FLAG_NO_ZEROES 2 /* Per-export flags. */ -#define NBD_FLAG_HAS_FLAGS 1 -#define NBD_FLAG_READ_ONLY 2 -#define NBD_FLAG_SEND_FLUSH 4 -#define NBD_FLAG_SEND_FUA 8 -#define NBD_FLAG_ROTATIONAL 16 -#define NBD_FLAG_SEND_TRIM 32 +#define NBD_FLAG_HAS_FLAGS (1 << 0) +#define NBD_FLAG_READ_ONLY (1 << 1) +#define NBD_FLAG_SEND_FLUSH (1 << 2) +#define NBD_FLAG_SEND_FUA (1 << 3) +#define NBD_FLAG_ROTATIONAL (1 << 4) +#define NBD_FLAG_SEND_TRIM (1 << 5) +#define NBD_FLAG_SEND_WRITE_ZEROES (1 << 6) /* NBD options (new style handshake only). */ #define NBD_OPT_EXPORT_NAME 1 @@ -130,8 +131,10 @@ struct reply { #define NBD_CMD_DISC 2 /* Disconnect. */ #define NBD_CMD_FLUSH 3 #define NBD_CMD_TRIM 4 +#define NBD_CMD_WRITE_ZEROES 6 #define NBD_CMD_MASK_COMMAND 0xffff -#define NBD_CMD_FLAG_FUA (1<<16) +#define NBD_CMD_FLAG_FUA (1<<16) +#define NBD_CMD_FLAG_NO_HOLE (2<<16) /* Error codes (previously errno). * See http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ca4414804114fd0095b317785bc0b51862e62ebb -- 2.9.3
Eric Blake
2017-Jan-20 20:16 UTC
[Libguestfs] [nbdkit PATCH 5/5] file: Support punching holes for write zero
On Linux, use fallocate() to punch holes as a more efficient way of writing zeroes. If hole punching is not allowed, or if we can't use fallocate (whether because this is not Linux, or because the file system on Linux doesn't support it), gracefully fall back to the write method. If wdelayms is set and a fallback occurs, we end up sleeping at least twice as long; oh well. Signed-off-by: Eric Blake <eblake@redhat.com> --- plugins/file/file.c | 32 ++++++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) diff --git a/plugins/file/file.c b/plugins/file/file.c index 1368e4e..bd7a9c1 100644 --- a/plugins/file/file.c +++ b/plugins/file/file.c @@ -41,6 +41,7 @@ #include <sys/types.h> #include <sys/stat.h> #include <time.h> +#include <errno.h> #include <nbdkit-plugin.h> @@ -248,6 +249,36 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset) return 0; } +/* Write data to the file. */ +static int +file_zero (void *handle, uint32_t count, uint64_t offset, int may_trim) +{ + struct handle *h = handle; + + if (wdelayms > 0) { + const struct timespec ts = { + .tv_sec = wdelayms / 1000, + .tv_nsec = (wdelayms * 1000000) % 1000000000 + }; + nanosleep (&ts, NULL); + } + +#ifdef FALLOC_FL_PUNCH_HOLE + if (may_trim) { + int r = fallocate (h->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + offset, count); + if (r == -1 && errno != EOPNOTSUPP) { + nbdkit_error ("pwrite: %m"); + } + return r; + } +#endif + + /* Trigger a fall back to writing */ + errno = EOPNOTSUPP; + return -1; +} + /* Flush the file to disk. */ static int file_flush (void *handle) @@ -275,6 +306,7 @@ static struct nbdkit_plugin plugin = { .get_size = file_get_size, .pread = file_pread, .pwrite = file_pwrite, + .zero = file_zero, .flush = file_flush, }; -- 2.9.3
Richard W.M. Jones
2017-Jan-21 14:47 UTC
Re: [Libguestfs] [nbdkit PATCH 0/5] Add WRITE_ZEROES support
On Fri, Jan 20, 2017 at 02:16:17PM -0600, Eric Blake wrote:> The upstream protocol recently promoted NBD_CMD_WRITE_ZEROES from > experimental to a documented extension. Exposing support for this > allows plugin writers to create sparse files when driven by a > client that knows how to use the extension; meanwhile, even if a > plugin does not support this extension, the server benefits from > less network traffic from the client.Thanks - pushed the series. Could you do a follow-up to add the .zero method to the various language plugins? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
Eric Blake
2017-Jan-26 02:55 UTC
Re: [Libguestfs] [nbdkit PATCH 2/5] protocol: Validate request flags
On 01/20/2017 02:16 PM, Eric Blake wrote:> Reject rather than silently ignoring unknown client request flags. >> > + /* Validate flags */ > + if (flags & ~NBD_CMD_FLAG_FUA) { > + nbdkit_error ("invalid request: unknown flag (0x%x)", flags); > + *error = EINVAL; > + return 0; > + }Right now, our NBD_CMD_FLAG_FUA implementation causes a full flush action from the plugin, even if it is possible to write a client that knows how to preserve FUA semantics in a lighter-weight manner than a full fdatasync(). In other words, it obeys the semantics required by the NBD protocol, but not necessarily in the most optimum way. Unfortunately, the callback interfaces for a plugin do not have any way to pass flags from the client to the plugin (other than my new .zero callback, but right now it only supports a single may_trim argument used as a boolean, rather than an actual int flags argument). Do we want or need to enhance the set of callback interfaces to allow plugins that can act on flag values, rather than always implementing fua semantics ourselves by the heavy-weight .flush call? -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
Possibly Parallel Threads
- [PATCH nbdkit 0/2] server: Split out NBD protocol code from connections code.
- [nbdkit PATCH 0/7] Initial implementation of FUA flag passthrough
- [PATCH nbdkit 0/5] protocol: Generate map functions from NBD protocol flags to printable strings.
- [PATCH nbdkit v2] protocol: Implement NBD_OPT_GO.
- [nbdkit PATCH] connections: Improve error responses