thr3ads.net - Libguestfs - [Libguestfs] [PATCH nbdkit 0/9] [mainly for discussion and early review] Implement extents. [Mar 2019]

If this information is useful, please help other people find it:
Share via:

Richard W.M. Jones

2019-Mar-19 16:34 UTC

[Libguestfs] [PATCH nbdkit 0/9] [mainly for discussion and early review] Implement extents.

I want to post this but mainly for discussion and early review.  It's
not safe for these patches to all go upstream yet (because not all
filters have been checked/adjusted), but if any patches were to go
upstream then probably 1 & 2 only are safe.

File, VDDK, memory and data plugins all work, although I have only
done minimal testing on them.

The current tests, such as they are, all pass.  We probably do need
more testing of the extents feature specifically.

I have only tested against qemu (3.1.0) not against any other clients
(if any exist that support this feature?)

For reference the original plan which I'm working to is here:
https://www.redhat.com/archives/libguestfs/2019-March/msg00017.html
https://www.redhat.com/archives/libguestfs/2019-March/msg00021.html

Rich.

Richard W.M. Jones

2019-Mar-19 16:34 UTC

head link

[Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

This pair of calls allows plugins to describe which extents in the
virtual disk are allocated, holes or zeroes.
---
 docs/nbdkit-filter.pod  | 107 ++++++++
 docs/nbdkit-plugin.pod  | 103 +++++++
 include/nbdkit-common.h |  10 +-
 include/nbdkit-filter.h |  25 +-
 include/nbdkit-plugin.h |   6 +-
 server/internal.h       |   5 +
 server/extents.c        | 574 ++++++++++++++++++++++++++++++++++++++++
 server/filters.c        |  61 +++++
 server/plugins.c        |  56 +++-
 server/test-extents.c   | 151 +++++++++++
 .gitignore              |   1 +
 server/Makefile.am      |  18 +-
 12 files changed, 1111 insertions(+), 6 deletions(-)

diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index dc9a262..7b25c8e 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -350,6 +350,8 @@ calls.
 
 =head2 C<.can_zero>
 
+=head2 C<.can_extents>
+
 =head2 C<.can_fua>
 
 =head2 C<.can_multi_conn>
@@ -365,6 +367,8 @@ calls.
                   void *handle);
  int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
                   void *handle);
+ int (*can_extents) (struct nbdkit_next_ops *next_ops, void *nxdata,
+                     void *handle);
  int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata,
                  void *handle);
  int (*can_multi_conn) (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -513,6 +517,109 @@ value to return to the client.  The filter should never
fail with
 C<EOPNOTSUPP> (while plugins have automatic fallback to C<.pwrite>,
 filters do not).
 
+=head2 C<.extents>
+
+ int (*extents) (struct nbdkit_next_ops *next_ops, void *nxdata,
+                 void *handle, uint32_t count, uint64_t offset, uint32_t flags,
+                 struct nbdkit_extents_map *extents_map,
+                 int *err);
+
+This intercepts the plugin C<.extents> method and can be used to
+modify extent requests.
+
+This function will not be called if C<.can_extents> returned false; in
+turn, the filter should not call C<next_ops-E<gt>extents> if
+C<next_ops-E<gt>can_extents> did not return true.
+
+The C<extents_map> parameter passed to this function is empty
+(representing a fully allocated disk).  If the filter does not need to
+adjust extents from the underlying plugin it can simply pass this
+through to the layer below:
+
+ myfilter_extents (...)
+ {
+   return next_ops->extents (nxdata, count, offset, flags,
+                             extents_map, err);
+ }
+
+It is also possible for filters to transform the extents map received
+back from the layer below.  Usually this must be done by allocating a
+new, empty map which is passed to the layer below.  Without error
+handling it would look like this:
+
+ myfilter_extents (...)
+ {
+   struct nbdkit_extents_map *map2 = nbdkit_extents_new ();
+   next_ops->extents (nxdata, count, offset, flags, map2, err);
+   /* transform map2 and return results in extents_map */
+   nbdkit_extents_foreach (map2, transform_offset, extents_map, /*..*/);
+   nbdkit_extents_free (map2);
+ }
+
+If there is an error, C<.extents> should call C<nbdkit_error> with
an
+error message B<and> return -1 with C<err> set to the positive
errno
+value to return to the client.
+
+=head3 Allocating and freeing nbdkit_extents_map
+
+Two functions are provided to filters only for allocating and freeing
+the map:
+
+ struct nbdkit_extents_map *nbdkit_extents_new ();
+
+Allocates and returns a new, empty extents map.  The initial state is
+that the whole disk is allocated with data.
+
+On error this function can return C<NULL>.  In this case it calls
+C<nbdkit_error> and/or C<nbdkit_set_error> as required.
+
+ void nbdkit_extents_free (struct nbdkit_extents_map *);
+
+Frees an existing extents map.
+
+=head3 Iterating over nbdkit_extents_map
+
+One function is provided to filters only to iterate over the extents
+map:
+
+ typedef int (*nbdkit_extents_callback) (uint64_t offset, uint64_t length,
+                                         uint32_t type, void *opaque);
+
+ int nbdkit_extents_foreach (
+            const struct nbdkit_extents_map *extents_map,
+            nbdkit_extents_callback fn, void *opaque,
+            uint32_t flags,
+            uint64_t range_offset, uint64_t range_length);
+
+Iterate over the extents in ascending order, calling C<fn> for each.
+C<fn> should return 0 on success or -1 on failure.  If a call to
C<fn>
+fails then C<nbdkit_extents_foreach> also fails immediately.
+
+C<nbdkit_extents_foreach> returns C<0> on success or C<-1> on
failure.
+On internal failures it calls C<nbdkit_error> or
C<nbdkit_set_error>.
+If C<fn> fails it is expected that C<fn> calls those functions as
+necessary.
+
+The C<flags> parameter can contain:
+
+=over 4
+
+=item C<NBDKIT_EXTENTS_FOREACH_FLAG_ONE>
+
+Only call C<fn> for the first extent, and then return.  (This is used
+to implement C<NBD_CMD_FLAG_REQ_ONE> in the NBD protocol.)
+
+=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
+
+If this flag is included then the C<range_offset> and
C<range_length>
+parameters are used, so only extents overlapping
+C<[range_offset...range_offset+range_length-1]> are returned.
+
+If this flag is not included in C<flags> then the range parameters are
+I<ignored>.
+
+=back
+
 =head1 ERROR HANDLING
 
 If there is an error in the filter itself, the filter should call
diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod
index 47545f3..7b7b38b 100644
--- a/docs/nbdkit-plugin.pod
+++ b/docs/nbdkit-plugin.pod
@@ -555,6 +555,20 @@ This callback is not required.  If omitted, then nbdkit
always tries
 C<.zero> first if it is present, and gracefully falls back to
 C<.pwrite> if C<.zero> was absent or failed with
C<EOPNOTSUPP>.
 
+=head2 C<.can_extents>
+
+ int can_extents (void *handle);
+
+This is called during the option negotiation phase to find out if the
+plugin supports detecting allocated (non-sparse) regions of the disk
+with the C<.extents> callback.
+
+If there is an error, C<.can_extents> should call C<nbdkit_error>
with
+an error message and return C<-1>.
+
+This callback is not required.  If omitted, then we return true iff a
+C<.extents> callback has been defined.
+
 =head2 C<.can_fua>
 
  int can_fua (void *handle);
@@ -717,6 +731,95 @@ If there is an error, C<.zero> should call
C<nbdkit_error> with an
 error message, and C<nbdkit_set_error> to record an appropriate error
 (unless C<errno> is sufficient), then return C<-1>.
 
+=head2 C<.extents>
+
+ int extents (void *handle, uint32_t count, uint64_t offset,
+              uint32_t flags,
+              struct nbdkit_extents_map *extents_map);
+
+During the data serving phase, this callback is used to detect
+allocated, sparse and zeroed regions of the disk.
+
+This function will not be called if C<.can_extents> returned false.
+nbdkit's default behaviour in this case is to treat the whole virtual
+disk as if it was allocated.
+
+The callback should detect and return the list of extents overlapping
+the range C<[offset...offset+count-1]>.  The C<extents_map>
parameter
+points to an opaque object which the callback should fill in by
+calling C<nbdkit_extent_add> etc.  See L</Extents map> below.
+
+The C<flags> parameter may contain the flag C<NBDKIT_FLAG_REQ_ONE>
+which means that the client is only requesting information about the
+extent overlapping C<offset>.  The plugin may ignore this flag, or as
+an optimization it may return just a single extent for C<offset>.
+(Note that implementing this optimization correctly is difficult and
+subtle.  You must ensure that the upper limit of the single extent
+returned is correct.  If unsure then the safest option is to ignore
+this flag.)
+
+If there is an error, C<.extents> should call C<nbdkit_error> with
an
+error message, and C<nbdkit_set_error> to record an appropriate error
+(unless C<errno> is sufficient), then return C<-1>.
+
+=head3 Extents map
+
+The plugin C<extents> callback is passed an opaque pointer C<struct
+nbdkit_extents_map *extents_map>.  This structure represents a list of
+L<filesystem extents|https://en.wikipedia.org/wiki/Extent_(file_systems)>
+describing which areas of the disk are allocated, which are sparse
+(“holes”), and, if supported, which are zeroes.  Initially the map
+describes a single allocated data extent covering the entire virtual
+disk.
+
+The C<extents> callback has to fill in any holes or zeroes areas by
+calling C<nbdkit_extent*> functions, to describe all extents
+overlapping the range C<[offset...offset+count-1]>.  (nbdkit will
+ignore extents outside this range so plugins may, if it is easier to
+implement, return all extent information for the whole disk.)
+
+The following functions are available for plugins or filters to call:
+
+ int nbdkit_extent_add (struct nbdkit_extents_map *extents_map,
+                        uint64_t offset, uint64_t length, uint32_t type);
+
+Add an extent covering C<[offset...offset+length-1]> of one of
+the following four types:
+
+=over 4
+
+=item C<type = 0>
+
+A normal, allocated data extent.
+
+=item C<type = NBDKIT_EXTENT_HOLE|NBDKIT_EXTENT_ZERO>
+
+An unallocated extent, a.k.a. a “hole”, which reads back as zeroes.
+This is the normal type of hole applicable to most disks.
+
+=item C<type = NBDKIT_EXTENT_ZERO>
+
+An allocated extent which is known to contain only zeroes.
+
+=item C<type = NBDKIT_EXTENT_HOLE>
+
+An unallocated extent (hole) which does not read back as zeroes.  Note
+this should only be used in specialized circumstances such as when
+writing a plugin for (or to emulate) certain SCSI drives which do not
+guarantee that trimmed blocks read back as zeroes.
+
+=back
+
+If the new extent overlaps some previous extent in the map then the
+overlapping part(s) are overwritten with the new type.  Adjacent
+extents of the same type may also be coalesced.  Thus you shouldn't
+assume that C<nbdkit_extent_add> always adds exactly one more extent
+to the map.
+
+C<nbdkit_extent_add> returns C<0> on success or C<-1> on
failure.  On
+failure C<nbdkit_error> and/or C<nbdkit_set_error> has already been
+called.
+
 =head1 THREADS
 
 Each nbdkit plugin must declare its thread safety model by defining
diff --git a/include/nbdkit-common.h b/include/nbdkit-common.h
index cb9954e..8526171 100644
--- a/include/nbdkit-common.h
+++ b/include/nbdkit-common.h
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2018 Red Hat Inc.
+ * Copyright (C) 2013-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -60,11 +60,15 @@ extern "C" {
 
 #define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */
 #define NBDKIT_FLAG_FUA      (1<<1) /* Maps to NBD_CMD_FLAG_FUA */
+#define NBDKIT_FLAG_REQ_ONE  (1<<2) /* Maps to NBD_CMD_FLAG_REQ_ONE */
 
 #define NBDKIT_FUA_NONE       0
 #define NBDKIT_FUA_EMULATE    1
 #define NBDKIT_FUA_NATIVE     2
 
+#define NBDKIT_EXTENT_HOLE    (1<<0) /* Same as NBD_STATE_HOLE */
+#define NBDKIT_EXTENT_ZERO    (1<<1) /* Same as NBD_STATE_ZERO */
+
 extern void nbdkit_error (const char *msg, ...) ATTRIBUTE_FORMAT_PRINTF (1, 2);
 extern void nbdkit_verror (const char *msg, va_list args);
 extern void nbdkit_debug (const char *msg, ...) ATTRIBUTE_FORMAT_PRINTF (1, 2);
@@ -76,6 +80,10 @@ extern int nbdkit_parse_bool (const char *str);
 extern int nbdkit_read_password (const char *value, char **password);
 extern char *nbdkit_realpath (const char *path);
 
+struct nbdkit_extents_map;
+extern int nbdkit_extent_add (struct nbdkit_extents_map *,
+                              uint64_t offset, uint64_t length, uint32_t type);
+
 /* A static non-NULL pointer which can be used when you don't need a
  * per-connection handle.
  */
diff --git a/include/nbdkit-filter.h b/include/nbdkit-filter.h
index 71c06c8..4efb98e 100644
--- a/include/nbdkit-filter.h
+++ b/include/nbdkit-filter.h
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2018 Red Hat Inc.
+ * Copyright (C) 2013-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -44,6 +44,19 @@ extern "C" {
 
 #define NBDKIT_FILTER_API_VERSION 2
 
+#define NBDKIT_EXTENTS_FOREACH_FLAG_ONE   1
+#define NBDKIT_EXTENTS_FOREACH_FLAG_RANGE 2
+
+extern struct nbdkit_extents_map *nbdkit_extents_new (void);
+extern void nbdkit_extents_free (struct nbdkit_extents_map *);
+typedef int (*nbdkit_extents_callback) (uint64_t offset, uint64_t length,
+                                        uint32_t type, void *opaque);
+extern int nbdkit_extents_foreach (const struct nbdkit_extents_map *,
+                                   nbdkit_extents_callback fn, void *opaque,
+                                   uint32_t flags,
+                                   uint64_t range_offset,
+                                   uint64_t range_length);
+
 typedef int nbdkit_next_config (void *nxdata,
                                 const char *key, const char *value);
 typedef int nbdkit_next_config_complete (void *nxdata);
@@ -58,6 +71,7 @@ struct nbdkit_next_ops {
   int (*is_rotational) (void *nxdata);
   int (*can_trim) (void *nxdata);
   int (*can_zero) (void *nxdata);
+  int (*can_extents) (void *nxdata);
   int (*can_fua) (void *nxdata);
   int (*can_multi_conn) (void *nxdata);
 
@@ -71,6 +85,9 @@ struct nbdkit_next_ops {
                int *err);
   int (*zero) (void *nxdata, uint32_t count, uint64_t offset, uint32_t flags,
                int *err);
+  int (*extents) (void *nxdata, uint32_t count, uint64_t offset, uint32_t
flags,
+                  struct nbdkit_extents_map *extents_map,
+                  int *err);
 };
 
 struct nbdkit_filter {
@@ -120,6 +137,8 @@ struct nbdkit_filter {
                    void *handle);
   int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
                    void *handle);
+  int (*can_extents) (struct nbdkit_next_ops *next_ops, void *nxdata,
+                      void *handle);
   int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata,
                   void *handle);
   int (*can_multi_conn) (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -140,6 +159,10 @@ struct nbdkit_filter {
   int (*zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
                void *handle, uint32_t count, uint64_t offset, uint32_t flags,
                int *err);
+  int (*extents) (struct nbdkit_next_ops *next_ops, void *nxdata,
+                  void *handle, uint32_t count, uint64_t offset, uint32_t
flags,
+                  struct nbdkit_extents_map *extents_map,
+                  int *err);
 };
 
 #define NBDKIT_REGISTER_FILTER(filter)                                  \
diff --git a/include/nbdkit-plugin.h b/include/nbdkit-plugin.h
index d43b2f5..76cec54 100644
--- a/include/nbdkit-plugin.h
+++ b/include/nbdkit-plugin.h
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2018 Red Hat Inc.
+ * Copyright (C) 2013-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -125,6 +125,10 @@ struct nbdkit_plugin {
   const char *magic_config_key;
 
   int (*can_multi_conn) (void *handle);
+
+  int (*can_extents) (void *handle);
+  int (*extents) (void *handle, uint32_t count, uint64_t offset, uint32_t
flags,
+                  struct nbdkit_extents_map *extents_map);
 };
 
 extern void nbdkit_set_error (int err);
diff --git a/server/internal.h b/server/internal.h
index d40a82d..d0da213 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -274,6 +274,7 @@ struct backend {
   int (*is_rotational) (struct backend *, struct connection *conn);
   int (*can_trim) (struct backend *, struct connection *conn);
   int (*can_zero) (struct backend *, struct connection *conn);
+  int (*can_extents) (struct backend *, struct connection *conn);
   int (*can_fua) (struct backend *, struct connection *conn);
   int (*can_multi_conn) (struct backend *, struct connection *conn);
 
@@ -287,6 +288,10 @@ struct backend {
                uint64_t offset, uint32_t flags, int *err);
   int (*zero) (struct backend *, struct connection *conn, uint32_t count,
                uint64_t offset, uint32_t flags, int *err);
+  int (*extents) (struct backend *, struct connection *conn, uint32_t count,
+                  uint64_t offset, uint32_t flags,
+                  struct nbdkit_extents_map *extents_map,
+                  int *err);
 };
 
 /* plugins.c */
diff --git a/server/extents.c b/server/extents.c
new file mode 100644
index 0000000..f17ba05
--- /dev/null
+++ b/server/extents.c
@@ -0,0 +1,574 @@
+/* nbdkit
+ * Copyright (C) 2019 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <string.h>
+#include <inttypes.h>
+#include <assert.h>
+
+#include "minmax.h"
+
+#include "internal.h"
+
+/* Trivial implementation as an ordered list of extents.  We will
+ * probably have to change this to a more efficient data structure if
+ * we have plugins with lots of extents.
+ *
+ * Invariants:
+ * - extents are stored in ascending order
+ * - extents must not overlap
+ * - adjacent extents must not have the same type
+ *
+ * Gaps are treated as type 0 (allocated data).
+ */
+struct nbdkit_extents_map {
+  struct extent *extents;
+  size_t nr_extents, allocated;
+};
+
+struct extent {
+  uint64_t offset;
+  uint64_t length;
+  uint32_t type;
+};
+
+struct nbdkit_extents_map *
+nbdkit_extents_new (void)
+{
+  struct nbdkit_extents_map *r;
+
+  r = malloc (sizeof *r);
+  if (r == NULL) {
+    nbdkit_error ("nbdkit_extents_new: malloc: %m");
+    return NULL;
+  }
+  r->extents = NULL;
+  r->nr_extents = r->allocated = 0;
+  return r;
+}
+
+void
+nbdkit_extents_free (struct nbdkit_extents_map *map)
+{
+  if (map) {
+    free (map->extents);
+    free (map);
+  }
+}
+
+static ssize_t
+find_extent_recursive (const struct nbdkit_extents_map *map,
+                       uint64_t offset, size_t first, size_t last)
+{
+  size_t i, h;
+
+  assert (first <= last);
+  assert (offset >= map->extents[first].offset);
+
+  if (last - first <= 8) {
+    /* Do a linear search for the last bit. */
+    for (i = first+1; i <= last; ++i) {
+      if (offset < map->extents[i].offset)
+        return i-1;
+    }
+    return last;
+  }
+
+  h = first + (last - first) / 2;
+
+  if (offset < map->extents[h].offset)
+    return find_extent_recursive (map, offset, first, h-1);
+  else
+    return find_extent_recursive (map, offset, h, last);
+}
+
+/* Find the extent before or containing offset.  If offset is inside
+ * the extent, the extent index is returned.  If offset is not in any
+ * extent, but there is an extent with a lower offset, then the index
+ * of the nearest lower extent is returned.  If offset is before the
+ * zeroth extent then -1 is returned.
+ */
+static ssize_t
+find_extent (const struct nbdkit_extents_map *map, uint64_t offset)
+{
+  if (map->nr_extents == 0 || offset < map->extents[0].offset)
+    return -1;
+
+  return find_extent_recursive (map, offset, 0, map->nr_extents-1);
+}
+
+/* Erase extents [i...i+count-1] from the array.  Extents after are
+ * renumbered.
+ */
+static void
+erase_extents (struct nbdkit_extents_map *map,
+                size_t i, size_t count)
+{
+  assert (i+count <= map->nr_extents);
+
+  memmove (&map->extents[i],
+           &map->extents[i+count],
+           (map->nr_extents - (i+count)) * sizeof (struct extent));
+  map->nr_extents -= count;
+}
+
+/* Insert new_extent in the map at index i.  Existing extents at index
+ * [i...] are moved up by one.
+ */
+static int
+insert_extent (struct nbdkit_extents_map *map,
+               struct extent new_extent, size_t i)
+{
+  assert (i <= map->nr_extents);
+
+  if (map->nr_extents >= map->allocated) {
+    size_t new_allocated;
+    struct extent *new_extents;
+
+    new_allocated = map->allocated;
+    if (new_allocated == 0)
+      new_allocated = 1;
+    new_allocated *= 2;
+    new_extents +      realloc (map->extents, new_allocated * sizeof (struct
extent));
+    if (new_extents == NULL) {
+      nbdkit_error ("extent_add: realloc: %m");
+      return -1;
+    }
+    map->allocated = new_allocated;
+    map->extents = new_extents;
+  }
+
+  memmove (&map->extents[i+1], &map->extents[i],
+           (map->nr_extents - i) * sizeof (struct extent));
+  map->nr_extents++;
+  map->extents[i] = new_extent;
+  return 0;
+}
+
+int
+nbdkit_extent_add (struct nbdkit_extents_map *map,
+                   uint64_t offset, uint64_t length, uint32_t type)
+{
+  ssize_t lo, hi;
+  bool coalesce_below, coalesce_above;
+  struct extent new_extent;
+
+  /* This might be considered an error, but be flexible for badly
+   * written plugins.
+   */
+  if (length == 0)
+    return 0;
+
+  /* Don't allow extents to exceeed the _signed_ 64 bit limit that is
+   * used in the rest of nbdkit.
+   */
+  if (offset > INT64_MAX ||
+      offset + length < offset ||
+      offset + length > INT64_MAX) {
+    nbdkit_error ("nbdkit_extent_add: "
+                  "extent cannot exceed signed 64 bit limit: "
+                  "offset=%" PRIu64 " length=%" PRIu64,
+                  offset, length);
+    return -1;
+  }
+
+ again:
+  lo = find_extent (map, offset);
+  hi = find_extent (map, offset + length - 1);
+
+  /* "lo" and "hi" are -1 or they point to the extent
+   * overlapping-or-below the corresponding endpoints of the new
+   * extent.
+   *
+   * The algorithm here has two parts: Firstly we remove any
+   * overlapping extents (this may involve erasing extents completely
+   * or shrinking them).  Secondly we insert the new extent at the
+   * right place in the array, if necessary allowing it to coalesce
+   * with the extent above or below.
+   *
+   * To remove overlapping extents we consider 7 cases:
+   *
+   * A.0   lo == -1 (there is no old extent before the new extent)
+   *     => do nothing
+   *
+   * A.1   ┌─────┐     offset of new extent exactly at
+   *       │ lo  │     beginning of lo and lo smaller/equal
+   *       └─────┘     to new extent
+   *       ↑
+   *       ▐▓▓▓▓▓▓▓▓▓
+   *     => erase lo
+   *     => creates situation A.0 or A.5
+   *
+   * A.2   ┌──────────┐   offset of new extent exactly at
+   *       │ lo       │   beginning of lo and lo bigger than
+   *       └──────────┘   new extent
+   *       ↑
+   *       ▐▓▓▓▓▓▓▓▓▓
+   *     => shrink lo “upwards”
+   *     => creates situation A.0 or A.5
+   *
+   * A.3   ┌──────────────┐  new extent in the middle
+   *       │ lo           │  of lo
+   *       └──────────────┘
+   *         ↑
+   *         ▓▓▓▓▓▓▓▓▓
+   *     => split lo
+   *     => creates situation A.5
+   *
+   * A.4   ┌─────┐     offset of new extent in the middle
+   *       │ lo  │     of lo
+   *       └─────┘
+   *         ↑
+   *         ▓▓▓▓▓▓▓▓▓
+   *     => shrink lo
+   *     => creates situation A.5
+   *
+   * A.5   ┌─────┐     offset of new extent after lo
+   *       │ lo  │
+   *       └─────┘
+   *               ↑
+   *               ▓▓▓▓▓▓▓▓▓
+   *     => do nothing
+   *
+   * B.0   lo == hi or hi == -1 (there is no higher extent)
+   *     => do nothing
+   *
+   * B.1          ┌─────┐     extent hi is fully covered
+   *              │ hi  │     by new extent
+   *              └─────┘
+   *                       ↑
+   *       ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
+   *     => erase hi
+   *     => creates situation B.0
+   *
+   * B.2          ┌─────┐     extent hi is partly overlapped
+   *              │ hi  │     by new extent
+   *              └─────┘
+   *                 ↑
+   *       ▓▓▓▓▓▓▓▓▓▓▓
+   *     => shrink hi
+   *     => creates situation B.0
+   *
+   * In addition extents [lo+1...hi-1] (if they exist) are erased
+   * since they must be fully covered by the new extent.
+   *
+   * At the end we're in situation A.0/A.5 and B.0, and so no old
+   * extents overlap the new extent.
+   *
+   * The insertion can then be done with optional coalescing (see code
+   * below for details - this is a lot simpler).
+   */
+
+  /* A.0 */
+  if (lo == -1)
+    goto erase_middle_extents;
+  /* A.5 */
+  if (offset >= map->extents[lo].offset + map->extents[lo].length)
+    goto erase_middle_extents;
+
+  /* A.1 or A.2 */
+  if (offset == map->extents[lo].offset) {
+    if (map->extents[lo].offset + map->extents[lo].length <= offset +
length) {
+      /* A.1 */
+      erase_extents (map, lo, 1);
+    }
+    else {
+      /* A.2 */
+      uint64_t end = map->extents[lo].offset + map->extents[lo].length;
+      map->extents[lo].offset = offset + length;
+      map->extents[lo].length = end - map->extents[lo].offset;
+    }
+    goto again;
+  }
+
+  /* A.3 */
+  if (map->extents[lo].offset < offset &&
+      map->extents[lo].offset + map->extents[lo].length > offset +
length) {
+    new_extent.offset = offset + length;
+    new_extent.length +      map->extents[lo].offset +
map->extents[lo].length - new_extent.offset;
+    new_extent.type = map->extents[lo].type;
+    map->extents[lo].length = offset - map->extents[lo].offset;
+    if (insert_extent (map, new_extent, lo+1) == -1)
+      return -1;
+    goto again;
+  }
+
+  /* We must be in situation A.4 assuming analysis above is exhaustive. */
+  assert (offset > map->extents[lo].offset);
+  assert (offset < map->extents[lo].offset + map->extents[lo].length);
+  assert (map->extents[lo].offset + map->extents[lo].length <= offset
+ length);
+
+  map->extents[lo].length = offset - map->extents[lo].offset;
+
+ erase_middle_extents:
+  /* Erase extents [lo+1..hi-1] if they exist as they must be
+   * completely covered by the new extent.
+   */
+  if (hi >= 0 && lo+1 < hi) {
+    erase_extents (map, lo+1, hi-lo-1);
+    goto again;
+  }
+
+  /* B.0 */
+  if (lo == hi || hi == -1)
+    goto insert;
+
+  /* B.1 */
+  if (map->extents[hi].offset + map->extents[hi].length <= offset +
length) {
+    erase_extents (map, hi, 1);
+    goto again;
+  }
+
+  /* We must be in situation B.2 if the analysis above is exhaustive. */
+  assert (map->extents[hi].offset < offset + length);
+
+  uint64_t end = map->extents[hi].offset + map->extents[hi].length;
+  map->extents[hi].offset = offset + length;
+  map->extents[hi].length = end - map->extents[hi].offset;
+
+ insert:
+  /* Now we know there is no extent which overlaps with the new
+   * extent, and the new extent must be inserted after lo.
+   */
+
+  /* However we have to deal with possibly coalescing the new extent
+   * with lo and/or lo+1.
+   */
+  coalesce_below +    lo >= 0 &&
+    map->extents[lo].type == type &&
+    map->extents[lo].offset + map->extents[lo].length == offset;
+  coalesce_above +    lo+1 < map->nr_extents &&
+    map->extents[lo+1].type == type &&
+    map->extents[lo+1].offset == offset + length;
+
+  if (coalesce_below && coalesce_above) {
+    /* Coalesce lo + new extent + lo+1
+     * => Extend lo to cover the whole range and remove lo+1.
+     */
+    map->extents[lo].length +      map->extents[lo+1].offset +
map->extents[lo+1].length
+      - map->extents[lo].offset;
+    erase_extents (map, lo+1, 1);
+  }
+  else if (coalesce_below) {
+    /* Coalesce lo + new extent.
+     * => Extend lo to cover new extent.
+     */
+    map->extents[lo].length = offset + length - map->extents[lo].offset;
+  }
+  else if (coalesce_above) {
+    /* Coalesce new extent + lo+1
+     * => Extend lo+1 to cover new extent.
+     */
+    map->extents[lo+1].length +      map->extents[lo+1].offset +
map->extents[lo+1].length - offset;
+    map->extents[lo+1].offset = offset;
+  }
+  else {
+    /* Normal case: Allocate a new extent and insert it after lo. */
+    new_extent.offset = offset;
+    new_extent.length = length;
+    new_extent.type = type;
+    if (insert_extent (map, new_extent, lo+1) == -1)
+      return -1;
+  }
+
+  return 0;
+}
+
+int
+nbdkit_extents_foreach (const struct nbdkit_extents_map *map,
+                        nbdkit_extents_callback fn, void *opaque,
+                        uint32_t flags,
+                        uint64_t range_offset, uint64_t range_length)
+{
+  const bool flag_range = flags & NBDKIT_EXTENTS_FOREACH_FLAG_RANGE;
+  const bool flag_one = flags & NBDKIT_EXTENTS_FOREACH_FLAG_ONE;
+  uint64_t offset, next;
+  size_t i;
+
+  if (flag_range) {
+    if (range_offset + range_length < range_offset) {
+      nbdkit_error ("nbdkit_extents_foreach: invalid range: "
+                    "offset=%" PRIu64 " length=%" PRIu64,
+                    range_offset, range_length);
+      return -1;
+    }
+  }
+  else {
+    range_offset = 0;
+    range_length = INT64_MAX + UINT64_C(1);
+  }
+
+  for (i = 0, offset = range_offset;
+       offset < range_offset + range_length;
+       offset = next) {
+    next = offset;
+
+    if (i >= map->nr_extents) {
+      /* Beyond last extent => Type 0 to end of range. */
+      next = range_offset + range_length;
+      if (fn (offset, next - offset, 0, opaque) == -1)
+        return -1;
+      if (flag_one)
+        return 0;
+      continue;
+    }
+
+    if (offset < map->extents[i].offset) {
+      /* In a gap before the i'th extent => Type 0 to next offset. */
+      next = MIN (map->extents[i].offset, range_offset + range_length);
+      if (fn (offset, next - offset, 0, opaque) == -1)
+        return -1;
+      if (flag_one)
+        return 0;
+      continue;
+    }
+
+    if (offset < map->extents[i].offset + map->extents[i].length) {
+      /* In an extent. */
+      next = MIN (map->extents[i].offset + map->extents[i].length,
+                  range_offset + range_length);
+      if (fn (offset, next - offset, map->extents[i].type, opaque) == -1)
+        return -1;
+      if (flag_one)
+        return 0;
+      continue;
+    }
+
+    /* After an extent, increment i and iterate. */
+    i++;
+  }
+
+  return 0;
+}
+
+#ifdef IN_TEST_EXTENTS
+/* These functions are only compiled for and called from the unit tests. */
+
+void
+dump_extents_map (const struct nbdkit_extents_map *map)
+{
+  size_t i;
+
+  fprintf (stderr, "map %p:\n", map);
+  fprintf (stderr, "\tnr_extents=%zu allocated=%zu\n",
+           map->nr_extents, map->allocated);
+  for (i = 0; i < map->nr_extents; ++i) {
+    fprintf (stderr, "\textent %zu:", i);
+    fprintf (stderr,
+             "\toffset=%" PRIu64 " length=%" PRIu64 "
"
+             "last=%" PRIu64 " type=%" PRIu32
"\n",
+             map->extents[i].offset, map->extents[i].length,
+             map->extents[i].offset + map->extents[i].length - 1,
+             map->extents[i].type);
+  }
+}
+
+void
+validate_extents_map (const struct nbdkit_extents_map *map)
+{
+  size_t i;
+  bool fail = false;
+
+  if (map->nr_extents > map->allocated) {
+    fprintf (stderr, "validate_extents_map: nr_extents >
allocated\n");
+    fail = true;
+  }
+
+  for (i = 0; i < map->nr_extents; ++i) {
+    /* Extent length must be > 0. */
+    if (map->extents[i].length == 0) {
+      fprintf (stderr, "validate_extents_map: %zu: extent length must be
> 0\n",
+               i);
+      fail = true;
+    }
+
+    /* Extent length must not wrap around, indicating that we have
+     * computed a negative length (as unsigned) somewhere.
+     */
+    if (map->extents[i].offset + map->extents[i].length <
+        map->extents[i].offset) {
+      fprintf (stderr, "validate_extents_map: %zu: negative
length\n",
+               i);
+      fail = true;
+    }
+
+    if (i > 0) {
+      /* Extents are stored in strictly ascending order. */
+      if (map->extents[i-1].offset >= map->extents[i].offset) {
+        fprintf (stderr,
+                 "validate_extents_map: %zu: "
+                 "not strictly ascending order\n", i);
+        fail = true;
+      }
+
+      /* Adjacent extents must not overlap. */
+      if (map->extents[i-1].offset + map->extents[i-1].length >
+          map->extents[i].offset) {
+        fprintf (stderr,
+                 "validate_extents_map: %zu: "
+                 "overlapping extents\n", i);
+        fail = true;
+      }
+
+      /* Adjacent extents either have a different type or are
+       * separated by a gap.
+       */
+      if (map->extents[i-1].type == map->extents[i].type &&
+          map->extents[i-1].offset + map->extents[i-1].length =+         
map->extents[i].offset) {
+        fprintf (stderr,
+                 "validate_extents_map: %zu: "
+                 "adjacent extents with same type not coalesced\n",
i);
+        fail = true;
+      }
+    }
+  }
+
+  if (fail) {
+    dump_extents_map (map);
+    abort ();
+  }
+}
+
+#endif /* IN_TEST_EXTENTS */
diff --git a/server/filters.c b/server/filters.c
index 5b7abc4..e284080 100644
--- a/server/filters.c
+++ b/server/filters.c
@@ -309,6 +309,13 @@ next_can_zero (void *nxdata)
   return b_conn->b->can_zero (b_conn->b, b_conn->conn);
 }
 
+static int
+next_can_extents (void *nxdata)
+{
+  struct b_conn *b_conn = nxdata;
+  return b_conn->b->can_extents (b_conn->b, b_conn->conn);
+}
+
 static int
 next_can_fua (void *nxdata)
 {
@@ -364,6 +371,16 @@ next_zero (void *nxdata, uint32_t count, uint64_t offset,
uint32_t flags,
   return b_conn->b->zero (b_conn->b, b_conn->conn, count, offset,
flags, err);
 }
 
+static int
+next_extents (void *nxdata, uint32_t count, uint64_t offset, uint32_t flags,
+              struct nbdkit_extents_map *extents_map,
+              int *err)
+{
+  struct b_conn *b_conn = nxdata;
+  return b_conn->b->extents (b_conn->b, b_conn->conn, count,
offset, flags,
+                             extents_map, err);
+}
+
 static struct nbdkit_next_ops next_ops = {
   .get_size = next_get_size,
   .can_write = next_can_write,
@@ -371,6 +388,7 @@ static struct nbdkit_next_ops next_ops = {
   .is_rotational = next_is_rotational,
   .can_trim = next_can_trim,
   .can_zero = next_can_zero,
+  .can_extents = next_can_extents,
   .can_fua = next_can_fua,
   .can_multi_conn = next_can_multi_conn,
   .pread = next_pread,
@@ -378,6 +396,7 @@ static struct nbdkit_next_ops next_ops = {
   .flush = next_flush,
   .trim = next_trim,
   .zero = next_zero,
+  .extents = next_extents,
 };
 
 static int
@@ -511,6 +530,21 @@ filter_can_zero (struct backend *b, struct connection
*conn)
     return f->backend.next->can_zero (f->backend.next, conn);
 }
 
+static int
+filter_can_extents (struct backend *b, struct connection *conn)
+{
+  struct backend_filter *f = container_of (b, struct backend_filter, backend);
+  void *handle = connection_get_handle (conn, f->backend.i);
+  struct b_conn nxdata = { .b = f->backend.next, .conn = conn };
+
+  debug ("%s: can_extents", f->name);
+
+  if (f->filter.can_extents)
+    return f->filter.can_extents (&next_ops, &nxdata, handle);
+  else
+    return f->backend.next->can_extents (f->backend.next, conn);
+}
+
 static int
 filter_can_fua (struct backend *b, struct connection *conn)
 {
@@ -646,6 +680,31 @@ filter_zero (struct backend *b, struct connection *conn,
                                   count, offset, flags, err);
 }
 
+static int
+filter_extents (struct backend *b, struct connection *conn,
+                uint32_t count, uint64_t offset, uint32_t flags,
+                struct nbdkit_extents_map *extents_map,
+                int *err)
+{
+  struct backend_filter *f = container_of (b, struct backend_filter, backend);
+  void *handle = connection_get_handle (conn, f->backend.i);
+  struct b_conn nxdata = { .b = f->backend.next, .conn = conn };
+
+  assert (!(flags & ~NBDKIT_FLAG_REQ_ONE));
+
+  debug ("%s: extents count=%" PRIu32 " offset=%" PRIu64
" flags=0x%" PRIx32,
+         f->name, count, offset, flags);
+
+  if (f->filter.extents)
+    return f->filter.extents (&next_ops, &nxdata, handle,
+                              count, offset, flags,
+                              extents_map, err);
+  else
+    return f->backend.next->extents (f->backend.next, conn,
+                                     count, offset, flags,
+                                     extents_map, err);
+}
+
 static struct backend filter_functions = {
   .free = filter_free,
   .thread_model = filter_thread_model,
@@ -667,6 +726,7 @@ static struct backend filter_functions = {
   .is_rotational = filter_is_rotational,
   .can_trim = filter_can_trim,
   .can_zero = filter_can_zero,
+  .can_extents = filter_can_extents,
   .can_fua = filter_can_fua,
   .can_multi_conn = filter_can_multi_conn,
   .pread = filter_pread,
@@ -674,6 +734,7 @@ static struct backend filter_functions = {
   .flush = filter_flush,
   .trim = filter_trim,
   .zero = filter_zero,
+  .extents = filter_extents,
 };
 
 /* Register and load a filter. */
diff --git a/server/plugins.c b/server/plugins.c
index 28b96ad..0bd523b 100644
--- a/server/plugins.c
+++ b/server/plugins.c
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2018 Red Hat Inc.
+ * Copyright (C) 2013-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -199,6 +199,8 @@ plugin_dump_fields (struct backend *b)
   HAS (trim);
   HAS (zero);
   HAS (can_multi_conn);
+  HAS (can_extents);
+  HAS (extents);
 #undef HAS
 
   /* Custom fields. */
@@ -392,6 +394,21 @@ plugin_can_zero (struct backend *b, struct connection
*conn)
   return plugin_can_write (b, conn);
 }
 
+static int
+plugin_can_extents (struct backend *b, struct connection *conn)
+{
+  struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
+
+  assert (connection_get_handle (conn, 0));
+
+  debug ("can_extents");
+
+  if (p->plugin.can_extents)
+    return p->plugin.can_extents (connection_get_handle (conn, 0));
+  else
+    return p->plugin.extents != NULL;
+}
+
 static int
 plugin_can_fua (struct backend *b, struct connection *conn)
 {
@@ -651,6 +668,41 @@ plugin_zero (struct backend *b, struct connection *conn,
   return r;
 }
 
+static int
+plugin_extents (struct backend *b, struct connection *conn,
+                uint32_t count, uint64_t offset, uint32_t flags,
+                struct nbdkit_extents_map *extents_map,
+                int *err)
+{
+  struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
+  bool req_one = flags & NBDKIT_FLAG_REQ_ONE;
+  int r;
+
+  assert (connection_get_handle (conn, 0));
+  assert (!(flags & ~NBDKIT_FLAG_REQ_ONE));
+
+  debug ("extents count=%" PRIu32 " offset=%" PRIu64 "
req_one=%d",
+         count, offset, req_one);
+
+  if (!count)
+    return 0;
+
+  if (p->plugin.extents) {
+    r = p->plugin.extents (connection_get_handle (conn, 0), count, offset,
+                           flags, extents_map);
+    if (r == -1)
+      *err = get_error (p);
+    return r;
+  }
+  else {
+    /* By default we assume that everything in the range is allocated.
+     * However as that is the default state for the empty extents_map
+     * we don't need to do anything here.
+     */
+    return 0;
+  }
+}
+
 static struct backend plugin_functions = {
   .free = plugin_free,
   .thread_model = plugin_thread_model,
@@ -672,6 +724,7 @@ static struct backend plugin_functions = {
   .is_rotational = plugin_is_rotational,
   .can_trim = plugin_can_trim,
   .can_zero = plugin_can_zero,
+  .can_extents = plugin_can_extents,
   .can_fua = plugin_can_fua,
   .can_multi_conn = plugin_can_multi_conn,
   .pread = plugin_pread,
@@ -679,6 +732,7 @@ static struct backend plugin_functions = {
   .flush = plugin_flush,
   .trim = plugin_trim,
   .zero = plugin_zero,
+  .extents = plugin_extents,
 };
 
 /* Register and load a plugin. */
diff --git a/server/test-extents.c b/server/test-extents.c
new file mode 100644
index 0000000..e07ecda
--- /dev/null
+++ b/server/test-extents.c
@@ -0,0 +1,151 @@
+/* nbdkit
+ * Copyright (C) 2018-2019 Red Hat Inc.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS
IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/* Test the extents handling code (server/extents.c).  This stochastic
+ * test works by allocating an array representing a disk, allocating
+ * extents randomly while writing the same information into the disk
+ * array, and then reading the extents back and comparing it to what's
+ * in the disk array.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <inttypes.h>
+#include <time.h>
+#include <assert.h>
+
+#include "minmax.h"
+#include "random.h"
+
+#include "internal.h"
+
+extern void dump_extents_map (const struct nbdkit_extents_map *map);
+extern void validate_extents_map (const struct nbdkit_extents_map *map);
+
+#define DISK_SIZE (1<<10)
+static uint8_t disk[DISK_SIZE];
+
+static bool error_flagged = false;
+
+/* Stub for linking against server/extents.c. */
+void
+nbdkit_error (const char *fs, ...)
+{
+  error_flagged = true;
+}
+
+static int
+compare (uint64_t offset, uint64_t length, uint32_t type, void *mapv)
+{
+  const struct nbdkit_extents_map *map = mapv;
+  size_t j;
+
+  /* nbdkit_extents_foreach_range should ensure this is true even if
+   * there are extents which go below or above the range.
+   */
+  assert (offset >= 0);
+  assert (offset + length <= DISK_SIZE);
+
+  fprintf (stderr,
+           "compare: offset=%" PRIu64 " length=%" PRIu64
" "
+           "type=%" PRIu32 "\n",
+           offset, length, type);
+
+  for (j = offset; j < offset+length; ++j) {
+    if (disk[j] != type) {
+      fprintf (stderr, "error: disk[%zu] != expected type %" PRIu32
"\n",
+               j, type);
+      dump_extents_map (map);
+      exit (EXIT_FAILURE);
+    }
+  }
+
+  return 0;
+}
+
+int
+main (int argc, char *argv[])
+{
+  unsigned i;
+  uint64_t offset, length;
+  uint32_t type;
+  size_t j;
+  struct random_state random_state;
+  struct nbdkit_extents_map *map;
+
+  xsrandom (time (NULL), &random_state);
+
+  map = nbdkit_extents_new ();
+  assert (map != NULL);
+
+  for (i = 0; i < 1000; ++i) {
+    type = xrandom (&random_state);
+    type &= 3;
+    offset = xrandom (&random_state);
+    offset &= DISK_SIZE - 1;
+    /* XXX Change the distribution so we mainly choose small lengths
+     * but with a small possibility of choosing large lengths.
+     */
+    length = xrandom (&random_state);
+    length &= DISK_SIZE/8 - 1;
+    length++;
+
+    fprintf (stderr,
+             "insert: extent offset=%" PRIu64 " length=%"
PRIu64 " "
+             "type=%" PRIu32 "\n",
+             offset, length, type);
+
+    assert (nbdkit_extent_add (map, offset, length, type) == 0);
+
+    for (j = offset; j < MIN (offset+length, DISK_SIZE); ++j)
+      disk[j] = type;
+
+    /* Ask the extents code to validate that the invariants are still
+     * true.  This calls abort(3) if not.
+     */
+    validate_extents_map (map);
+    dump_extents_map (map);
+  }
+
+  /* Read out the extents and compare them to the disk. */
+  assert (nbdkit_extents_foreach (map, compare, map,
+                                  NBDKIT_EXTENTS_FOREACH_FLAG_RANGE,
+                                  0, DISK_SIZE-1) == 0);
+
+  nbdkit_extents_free (map);
+
+  exit (EXIT_SUCCESS);
+}
diff --git a/.gitignore b/.gitignore
index dc42abd..1f8b2e0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -63,6 +63,7 @@ Makefile.in
 /server/nbdkit.pc
 /server/protostrings.c
 /server/synopsis.c
+/server/test-extents
 /server/test-utils
 /stamp-h1
 /tests/disk
diff --git a/server/Makefile.am b/server/Makefile.am
index 5eb575e..5722f7b 100644
--- a/server/Makefile.am
+++ b/server/Makefile.am
@@ -43,6 +43,7 @@ nbdkit_SOURCES = \
 	connections.c \
 	crypto.c \
 	debug.c \
+	extents.c \
 	filters.c \
 	internal.h \
 	locks.c \
@@ -129,9 +130,22 @@ pkgconfig_DATA = nbdkit.pc
 
 # Unit testing
 
-TESTS = test-utils
+TESTS = \
+	test-extents \
+	test-utils
 
-check_PROGRAMS = test-utils
+check_PROGRAMS = \
+	test-extents \
+	test-utils
+
+test_extents_SOURCES = \
+	test-extents.c \
+	extents.c
+test_extents_CPPFLAGS = \
+	-DIN_TEST_EXTENTS=1 \
+	-I$(top_srcdir)/include \
+	-I$(top_srcdir)/common/include
+test_extents_CFLAGS = $(WARNINGS_CFLAGS) $(VALGRIND_CFLAGS)
 
 test_utils_SOURCES = \
 	test-utils.c \
-- 
2.20.1

Richard W.M. Jones

2019-Mar-19 16:34 UTC

head link

[Libguestfs] [PATCH nbdkit 2/9] server: Add CLEANUP_EXTENTS_FREE macro.

Provides automatic cleanup of ‘struct nbdkit_extents_map *’ on exit
from a scope or function.
---
 server/internal.h  | 2 ++
 server/cleanup.c   | 8 +++++++-
 server/Makefile.am | 3 ++-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/server/internal.h b/server/internal.h
index d0da213..825dd3e 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -139,6 +139,8 @@ extern void change_user (void);
 /* cleanup.c */
 extern void cleanup_free (void *ptr);
 #define CLEANUP_FREE __attribute__((cleanup (cleanup_free)))
+extern void cleanup_extents_free (void *ptr);
+#define CLEANUP_EXTENTS_FREE __attribute__((cleanup (cleanup_extents_free)))
 extern void cleanup_unlock (pthread_mutex_t **ptr);
 #define CLEANUP_UNLOCK __attribute__((cleanup (cleanup_unlock)))
 #define ACQUIRE_LOCK_FOR_CURRENT_SCOPE(mutex) \
diff --git a/server/cleanup.c b/server/cleanup.c
index c29ecec..1eec613 100644
--- a/server/cleanup.c
+++ b/server/cleanup.c
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2018 Red Hat Inc.
+ * Copyright (C) 2013-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -46,6 +46,12 @@ cleanup_free (void *ptr)
   free (* (void **) ptr);
 }
 
+void
+cleanup_extents_free (void *ptr)
+{
+  nbdkit_extents_free (* (void **) ptr);
+}
+
 void
 cleanup_unlock (pthread_mutex_t **ptr)
 {
diff --git a/server/Makefile.am b/server/Makefile.am
index 5722f7b..c139c05 100644
--- a/server/Makefile.am
+++ b/server/Makefile.am
@@ -150,7 +150,8 @@ test_extents_CFLAGS = $(WARNINGS_CFLAGS) $(VALGRIND_CFLAGS)
 test_utils_SOURCES = \
 	test-utils.c \
 	utils.c \
-	cleanup.c
+	cleanup.c \
+	extents.c
 test_utils_CPPFLAGS = \
 	-I$(top_srcdir)/include \
 	-I$(top_srcdir)/common/include
-- 
2.20.1

Richard W.M. Jones

2019-Mar-19 16:35 UTC

head link

[Libguestfs] [PATCH nbdkit 3/9] server: Implement Block Status requests to read allocation status.

This commit implements the core NBD protocol for the "base:allocation"
Block Status replies.
---
 server/internal.h                    |   7 +
 server/protocol.h                    |  17 +-
 server/protocol-handshake-newstyle.c |  79 ++++++++-
 server/protocol.c                    | 248 +++++++++++++++++++++++++--
 4 files changed, 335 insertions(+), 16 deletions(-)

diff --git a/server/internal.h b/server/internal.h
index 825dd3e..03d6119 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -183,6 +183,7 @@ struct connection {
   bool can_multi_conn;
   bool using_tls;
   bool structured_replies;
+  bool meta_context_base_allocation;
 
   int sockin, sockout;
   connection_recv_function recv;
@@ -219,6 +220,12 @@ extern int protocol_handshake_newstyle (struct connection
*conn)
 extern int protocol_recv_request_send_reply (struct connection *conn)
   __attribute__((__nonnull__ (1)));
 
+/* The context ID of base:allocation.  As far as I can tell it doesn't
+ * matter what this is as long as nbdkit always returns the same
+ * number.
+ */
+#define base_allocation_id 1
+
 /* crypto.c */
 #define root_tls_certificates_dir sysconfdir "/pki/" PACKAGE_NAME
 extern void crypto_init (bool tls_set_on_cli);
diff --git a/server/protocol.h b/server/protocol.h
index 4fe3c75..a7de2f0 100644
--- a/server/protocol.h
+++ b/server/protocol.h
@@ -112,6 +112,7 @@ extern const char *name_of_nbd_rep (int);
 #define NBD_REP_ACK          1
 #define NBD_REP_SERVER       2
 #define NBD_REP_INFO         3
+#define NBD_REP_META_CONTEXT 4
 #define NBD_REP_ERR_UNSUP    0x80000001
 #define NBD_REP_ERR_POLICY   0x80000002
 #define NBD_REP_ERR_INVALID  0x80000003
@@ -128,6 +129,18 @@ struct fixed_new_option_reply_info_export {
   uint16_t eflags;              /* per-export flags */
 } __attribute__((packed));
 
+/* NBD_REP_META_CONTEXT reply (follows fixed_new_option_reply). */
+struct fixed_new_option_reply_meta_context {
+  uint32_t context_id;          /* metadata context ID */
+  /* followed by a string */
+} __attribute__((packed));
+
+/* NBD_REPLY_TYPE_BLOCK_STATUS block descriptor. */
+struct block_descriptor {
+  uint32_t length;              /* length of block */
+  uint32_t status_flags;        /* block type (hole etc) */
+} __attribute__((packed));
+
 /* New-style handshake server reply when using NBD_OPT_EXPORT_NAME.
  * Modern clients use NBD_OPT_GO instead of this.
  */
@@ -187,7 +200,7 @@ extern const char *name_of_nbd_reply_type (int);
 #define NBD_REPLY_TYPE_NONE         0
 #define NBD_REPLY_TYPE_OFFSET_DATA  1
 #define NBD_REPLY_TYPE_OFFSET_HOLE  2
-#define NBD_REPLY_TYPE_BLOCK_STATUS 3
+#define NBD_REPLY_TYPE_BLOCK_STATUS 5
 #define NBD_REPLY_TYPE_ERROR        ((1<<15) + 1)
 #define NBD_REPLY_TYPE_ERROR_OFFSET ((1<<15) + 2)
 
@@ -199,10 +212,12 @@ extern const char *name_of_nbd_cmd (int);
 #define NBD_CMD_FLUSH             3
 #define NBD_CMD_TRIM              4
 #define NBD_CMD_WRITE_ZEROES      6
+#define NBD_CMD_BLOCK_STATUS      7
 
 extern const char *name_of_nbd_cmd_flag (int);
 #define NBD_CMD_FLAG_FUA      (1<<0)
 #define NBD_CMD_FLAG_NO_HOLE  (1<<1)
+#define NBD_CMD_FLAG_REQ_ONE  (1<<3)
 
 /* Error codes (previously errno).
  * See
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=ca4414804114fd0095b317785bc0b51862e62ebb
diff --git a/server/protocol-handshake-newstyle.c
b/server/protocol-handshake-newstyle.c
index db01f7b..5edeaf3 100644
--- a/server/protocol-handshake-newstyle.c
+++ b/server/protocol-handshake-newstyle.c
@@ -133,6 +133,34 @@ send_newstyle_option_reply_info_export (struct connection
*conn,
   return 0;
 }
 
+static int
+send_newstyle_option_reply_meta_context (struct connection *conn,
+                                         uint32_t option, uint32_t reply,
+                                         uint32_t context_id,
+                                         const char *name)
+{
+  struct fixed_new_option_reply fixed_new_option_reply;
+  struct fixed_new_option_reply_meta_context context;
+  const size_t namelen = strlen (name);
+
+  fixed_new_option_reply.magic = htobe64 (NBD_REP_MAGIC);
+  fixed_new_option_reply.option = htobe32 (option);
+  fixed_new_option_reply.reply = htobe32 (reply);
+  fixed_new_option_reply.replylen = htobe32 (sizeof context + namelen);
+  context.context_id = htobe32 (context_id);
+
+  if (conn->send (conn,
+                  &fixed_new_option_reply,
+                  sizeof fixed_new_option_reply) == -1 ||
+      conn->send (conn, &context, sizeof context) == -1 ||
+      conn->send (conn, name, namelen) == -1) {
+    nbdkit_error ("write: %m");
+    return -1;
+  }
+
+  return 0;
+}
+
 /* Sub-function during negotiate_handshake_newstyle, to uniformly handle
  * a client hanging up on a message boundary.
  */
@@ -452,6 +480,8 @@ negotiate_handshake_newstyle_options (struct connection
*conn)
     case NBD_OPT_LIST_META_CONTEXT:
     case NBD_OPT_SET_META_CONTEXT:
       {
+        int r;
+        bool can_extents;
         uint32_t opt_index;
         uint32_t exportnamelen;
         uint32_t nr_queries;
@@ -469,6 +499,16 @@ negotiate_handshake_newstyle_options (struct connection
*conn)
           continue;
         }
 
+        /* Work out if the server supports base:allocation. */
+        r = backend->can_extents (backend, conn);
+        if (r == -1) {
+          if (send_newstyle_option_reply (conn, option, NBD_REP_ERR_INVALID)
+              == -1)
+            return -1;
+          continue;
+        }
+        can_extents = r;
+
         /* Minimum length of the option payload is:
          *   32 bit export name length followed by empty export name
          * + 32 bit number of queries followed by no queries
@@ -503,7 +543,17 @@ negotiate_handshake_newstyle_options (struct connection
*conn)
          * for SET: nr_queries == 0 means reset all contexts
          */
         if (nr_queries == 0) {
-          /* Nothing is supported now. */
+          if (option == NBD_OPT_SET_META_CONTEXT)
+            conn->meta_context_base_allocation = false;
+          else /* LIST */ {
+            if (can_extents) {
+              if (send_newstyle_option_reply_meta_context
+                  (conn, option, NBD_REP_META_CONTEXT,
+                   base_allocation_id, "base:allocation") == -1)
+                return -1;
+            }
+          }
+
           if (send_newstyle_option_reply (conn, option, NBD_REP_ACK) == -1)
             return -1;
         }
@@ -525,7 +575,32 @@ negotiate_handshake_newstyle_options (struct connection
*conn)
                    option == NBD_OPT_LIST_META_CONTEXT ? "query" :
"set",
                    (int) querylen, &data[opt_index]);
 
-            /* Ignore query - nothing is supported. */
+            /* For LIST, "base:" returns all supported contexts in
the
+             * base namespace.  We only support "base:allocation".
+             */
+            if (option == NBD_OPT_LIST_META_CONTEXT &&
+                querylen == 5 &&
+                strncmp (&data[opt_index], "base:", 5) == 0) {
+              if (can_extents) {
+                if (send_newstyle_option_reply_meta_context
+                    (conn, option, NBD_REP_META_CONTEXT,
+                     base_allocation_id, "base:allocation") == -1)
+                  return -1;
+              }
+            }
+            /* "base:allocation" requested by name. */
+            else if (querylen == 15 &&
+                     strncmp (&data[opt_index],
"base:allocation", 15) == 0) {
+              if (can_extents) {
+                if (send_newstyle_option_reply_meta_context
+                    (conn, option, NBD_REP_META_CONTEXT,
+                     base_allocation_id, "base:allocation") == -1)
+                  return -1;
+                if (option == NBD_OPT_SET_META_CONTEXT)
+                  conn->meta_context_base_allocation = true;
+              }
+            }
+            /* Every other query must be ignored. */
 
             opt_index += querylen;
             nr_queries--;
diff --git a/server/protocol.c b/server/protocol.c
index f117d42..c713e12 100644
--- a/server/protocol.c
+++ b/server/protocol.c
@@ -36,6 +36,7 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
+#include <stdbool.h>
 #include <inttypes.h>
 #include <string.h>
 #include <unistd.h>
@@ -78,6 +79,7 @@ validate_request (struct connection *conn,
   case NBD_CMD_WRITE:
   case NBD_CMD_TRIM:
   case NBD_CMD_WRITE_ZEROES:
+  case NBD_CMD_BLOCK_STATUS:
     if (!valid_range (conn, offset, count)) {
       /* XXX Allow writes to extend the disk? */
       nbdkit_error ("invalid request: %s: offset and count are out of
range: "
@@ -106,7 +108,8 @@ validate_request (struct connection *conn,
   }
 
   /* Validate flags */
-  if (flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE)) {
+  if (flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE |
+                NBD_CMD_FLAG_REQ_ONE)) {
     nbdkit_error ("invalid request: unknown flag (0x%x)", flags);
     *error = EINVAL;
     return false;
@@ -117,6 +120,12 @@ validate_request (struct connection *conn,
     *error = EINVAL;
     return false;
   }
+  if ((flags & NBD_CMD_FLAG_REQ_ONE) &&
+      cmd != NBD_CMD_BLOCK_STATUS) {
+    nbdkit_error ("invalid request: REQ_ONE flag needs BLOCK_STATUS
request");
+    *error = EINVAL;
+    return false;
+  }
   if (!conn->can_fua && (flags & NBD_CMD_FLAG_FUA)) {
     nbdkit_error ("invalid request: FUA flag not supported");
     *error = EINVAL;
@@ -157,14 +166,37 @@ validate_request (struct connection *conn,
     return false;
   }
 
+  /* Block status allowed? */
+  if (cmd == NBD_CMD_BLOCK_STATUS) {
+    if (!conn->structured_replies) {
+      nbdkit_error ("invalid request: "
+                    "%s: structured replies was not negotiated",
+                    name_of_nbd_cmd (cmd));
+      *error = EINVAL;
+      return false;
+    }
+    if (!conn->meta_context_base_allocation) {
+      nbdkit_error ("invalid request: "
+                    "%s: base:allocation was not negotiated",
+                    name_of_nbd_cmd (cmd));
+      *error = EINVAL;
+      return false;
+    }
+  }
+
   return true;                     /* Command validates. */
 }
 
 /* This is called with the request lock held to actually execute the
  * request (by calling the plugin).  Note that the request fields have
  * been validated already in 'validate_request' so we don't have to
- * check them again.  'buf' is either the data to be written or the
- * data to be returned, and points to a buffer of size 'count' bytes.
+ * check them again.
+ *
+ * 'buf' is either the data to be written or the data to be returned,
+ * and points to a buffer of size 'count' bytes.
+ *
+ * 'extents_map' is an empty extents map used for block status
+ * requests only.
  *
  * In all cases, the return value is the system errno value that will
  * later be converted to the nbd error to send back to the client (0
@@ -173,7 +205,7 @@ validate_request (struct connection *conn,
 static uint32_t
 handle_request (struct connection *conn,
                 uint16_t cmd, uint16_t flags, uint64_t offset, uint32_t count,
-                void *buf)
+                void *buf, struct nbdkit_extents_map *extents_map)
 {
   uint32_t f = 0;
   bool fua = conn->can_fua && (flags & NBD_CMD_FLAG_FUA);
@@ -217,6 +249,14 @@ handle_request (struct connection *conn,
       return err;
     break;
 
+  case NBD_CMD_BLOCK_STATUS:
+    if (flags & NBD_CMD_FLAG_REQ_ONE)
+      f |= NBDKIT_FLAG_REQ_ONE;
+    if (backend->extents (backend, conn, count, offset, f,
+                          extents_map, &err) == -1)
+      return err;
+    break;
+
   default:
     abort ();
   }
@@ -224,6 +264,97 @@ handle_request (struct connection *conn,
   return 0;
 }
 
+static int
+count_extents (uint64_t offset, uint64_t length, uint32_t type,
+               void *rv)
+{
+  size_t *rp = rv;
+
+  (*rp)++;
+  return 0;
+}
+
+struct copy_extents_data {
+  size_t i;
+  struct block_descriptor *blocks;
+  size_t nr_blocks;
+};
+
+static int
+copy_extents (uint64_t offset, uint64_t length, uint32_t type,
+              void *dv)
+{
+  struct copy_extents_data *data = dv;
+  uint32_t type_flags;
+
+  assert (data->i < data->nr_blocks);
+
+  /* Because the original request is limited to a 32 bit count, length
+   * can never be > 32 bits in size.
+   */
+  assert (length <= UINT32_MAX);
+
+  /* Convert NBDKIT_EXTENT_* flags to NBD_STATE_* flags.  However
+   * since these are deliberately chosen to be the same binary values
+   * we only have to mask here.
+   */
+  type_flags = type & 3;
+
+  data->blocks[data->i].length = length;
+  data->blocks[data->i].status_flags = type_flags;
+
+  data->i++;
+
+  return 0;
+}
+
+/* See note in protocol_recv_request_send_reply below.  This returns 0
+ * on success or a positive errno.
+ */
+static int
+block_status_final_map (uint16_t flags,
+                        uint32_t count, uint64_t offset,
+                        struct nbdkit_extents_map *extents_map,
+                        struct block_descriptor **blocks,
+                        size_t *nr_blocks)
+{
+  const bool req_one = flags & NBD_CMD_FLAG_REQ_ONE;
+  uint32_t foreach_flags;
+  struct copy_extents_data data;
+
+  foreach_flags = NBDKIT_EXTENTS_FOREACH_FLAG_RANGE;
+  if (req_one)
+    foreach_flags |= NBDKIT_EXTENTS_FOREACH_FLAG_ONE;
+
+  /* Calculate the number of blocks we will be returning. */
+  *nr_blocks = 0;
+  if (nbdkit_extents_foreach (extents_map,
+                              count_extents, nr_blocks,
+                              foreach_flags,
+                              offset, (uint64_t) count) == -1)
+      return errno;
+  assert (!req_one || *nr_blocks == 1);
+
+  /* Allocate the final array. */
+  *blocks = malloc (sizeof (struct block_descriptor) * *nr_blocks);
+  if (*blocks == NULL) {
+    nbdkit_error ("malloc: %m");
+    return errno;
+  }
+
+  /* Copy the extents into the array. */
+  data.i = 0;
+  data.blocks = *blocks;
+  data.nr_blocks = *nr_blocks;
+  if (nbdkit_extents_foreach (extents_map,
+                              copy_extents, &data,
+                              foreach_flags,
+                              offset, (uint64_t) count) == -1)
+    return errno;
+
+  return 0;
+}
+
 static int
 skip_over_write_buffer (int sock, size_t count)
 {
@@ -359,6 +490,60 @@ send_structured_reply_read (struct connection *conn,
   return 1;                     /* command processed ok */
 }
 
+static int
+send_structured_reply_block_status (struct connection *conn,
+                                    uint64_t handle,
+                                    uint16_t cmd, uint16_t flags,
+                                    uint32_t count, uint64_t offset,
+                                    const struct block_descriptor *blocks,
+                                    size_t nr_blocks)
+{
+  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&conn->write_lock);
+  struct structured_reply reply;
+  uint32_t context_id;
+  size_t i;
+  int r;
+
+  assert (cmd == NBD_CMD_BLOCK_STATUS);
+
+  reply.magic = htobe32 (NBD_STRUCTURED_REPLY_MAGIC);
+  reply.handle = handle;
+  reply.flags = htobe16 (NBD_REPLY_FLAG_DONE);
+  reply.type = htobe16 (NBD_REPLY_TYPE_BLOCK_STATUS);
+  reply.length = htobe32 (sizeof context_id +
+                          nr_blocks * sizeof (struct block_descriptor));
+
+  r = conn->send (conn, &reply, sizeof reply);
+  if (r == -1) {
+    nbdkit_error ("write reply: %s: %m", name_of_nbd_cmd (cmd));
+    return connection_set_status (conn, -1);
+  }
+
+  /* Send the base:allocation context ID. */
+  context_id = htobe32 (base_allocation_id);
+  r = conn->send (conn, &context_id, sizeof context_id);
+  if (r == -1) {
+    nbdkit_error ("write reply: %s: %m", name_of_nbd_cmd (cmd));
+    return connection_set_status (conn, -1);
+  }
+
+  /* Send each block descriptor. */
+  for (i = 0; i < nr_blocks; ++i) {
+    struct block_descriptor bd = blocks[i];
+
+    bd.length = htobe32 (bd.length);
+    bd.status_flags = htobe32 (bd.status_flags);
+
+    r = conn->send (conn, &bd, sizeof bd);
+    if (r == -1) {
+      nbdkit_error ("write reply: %s: %m", name_of_nbd_cmd (cmd));
+      return connection_set_status (conn, -1);
+    }
+  }
+
+  return 1;                     /* command processed ok */
+}
+
 static int
 send_structured_reply_error (struct connection *conn,
                              uint64_t handle, uint16_t cmd, uint32_t error)
@@ -402,6 +587,9 @@ protocol_recv_request_send_reply (struct connection *conn)
   uint32_t magic, count, error = 0;
   uint64_t offset;
   CLEANUP_FREE char *buf = NULL;
+  CLEANUP_EXTENTS_FREE struct nbdkit_extents_map *extents_map = NULL;
+  CLEANUP_FREE struct block_descriptor *blocks = NULL;
+  size_t nr_blocks = 0;
 
   /* Read the request packet. */
   {
@@ -449,6 +637,7 @@ protocol_recv_request_send_reply (struct connection *conn)
     if (cmd == NBD_CMD_READ || cmd == NBD_CMD_WRITE) {
       buf = malloc (count);
       if (buf == NULL) {
+      out_of_memory:
         perror ("malloc");
         error = ENOMEM;
         if (cmd == NBD_CMD_WRITE &&
@@ -458,6 +647,13 @@ protocol_recv_request_send_reply (struct connection *conn)
       }
     }
 
+    /* Allocate the extents map for block status only. */
+    if (cmd == NBD_CMD_BLOCK_STATUS) {
+      extents_map = nbdkit_extents_new ();
+      if (extents_map == NULL)
+        goto out_of_memory;
+    }
+
     /* Receive the write data buffer. */
     if (cmd == NBD_CMD_WRITE) {
       r = conn->recv (conn, buf, count);
@@ -478,11 +674,29 @@ protocol_recv_request_send_reply (struct connection *conn)
   }
   else {
     lock_request (conn);
-    error = handle_request (conn, cmd, flags, offset, count, buf);
+    error = handle_request (conn, cmd, flags, offset, count, buf, extents_map);
     assert ((int) error >= 0);
     unlock_request (conn);
   }
 
+  /* XXX There are complicated requirements for the block status
+   * reply, such as the offset, length and number of extents returned
+   * in the structured reply.  To allow a simple implementation for
+   * plugins we don't make the plugins obey these requirements.  This
+   * means at some point we need to filter what the plugin gives us to
+   * obey the protocol requirements.  There are several places we
+   * could do that.  Currently we do it here.  Another possibility is
+   * to do it in server/plugins.c.
+   *
+   * Also note this only deals with base:allocation.  If in future we
+   * want to describe other block status metadata then this code will
+   * require an overhaul.
+   */
+  if (error == 0 && cmd == NBD_CMD_BLOCK_STATUS) {
+    error = block_status_final_map (flags, count, offset, extents_map,
+                                    &blocks, &nr_blocks);
+  }
+
   /* Send the reply packet. */
  send_reply:
   if (connection_get_status (conn) < 0)
@@ -498,15 +712,23 @@ protocol_recv_request_send_reply (struct connection *conn)
   }
 
   /* Currently we prefer to send simple replies for everything except
-   * where we have to (ie. NBD_CMD_READ when structured_replies have
-   * been negotiated).  However this prevents us from sending
-   * human-readable error messages to the client, so we should
-   * reconsider this in future.
+   * where we have to (ie. NBD_CMD_READ and NBD_CMD_BLOCK_STATUS when
+   * structured_replies have been negotiated).  However this prevents
+   * us from sending human-readable error messages to the client, so
+   * we should reconsider this in future.
    */
-  if (conn->structured_replies && cmd == NBD_CMD_READ) {
-    if (!error)
-      return send_structured_reply_read (conn, request.handle, cmd,
-                                         buf, count, offset);
+  if (conn->structured_replies &&
+      (cmd == NBD_CMD_READ || cmd == NBD_CMD_BLOCK_STATUS)) {
+    if (!error) {
+      if (cmd == NBD_CMD_READ)
+        return send_structured_reply_read (conn, request.handle, cmd,
+                                           buf, count, offset);
+      else /* NBD_CMD_BLOCK_STATUS */
+        return send_structured_reply_block_status (conn, request.handle,
+                                                   cmd, flags,
+                                                   count, offset,
+                                                   blocks, nr_blocks);
+    }
     else
       return send_structured_reply_error (conn, request.handle, cmd, error);
   }
-- 
2.20.1

Richard W.M. Jones

2019-Mar-19 16:35 UTC

head link

[Libguestfs] [PATCH nbdkit 4/9] server: Export nbdkit_extent* symbols.

Allows them to be called from plugins and filters.
---
 server/nbdkit.syms | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/server/nbdkit.syms b/server/nbdkit.syms
index 672abd2..95ef067 100644
--- a/server/nbdkit.syms
+++ b/server/nbdkit.syms
@@ -42,6 +42,10 @@
     nbdkit_absolute_path;
     nbdkit_debug;
     nbdkit_error;
+    nbdkit_extent_add;
+    nbdkit_extents_free;
+    nbdkit_extents_foreach;
+    nbdkit_extents_new;
     nbdkit_parse_bool;
     nbdkit_parse_size;
     nbdkit_read_password;
-- 
2.20.1

Richard W.M. Jones

2019-Mar-19 16:35 UTC

head link

[Libguestfs] [PATCH nbdkit 5/9] offset: Implement mapping of extents.

Allows you to safely use nbdkit-offset-filter on top of a plugin
supporting extents.
---
 filters/offset/offset.c | 43 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/filters/offset/offset.c b/filters/offset/offset.c
index 058571d..4e3f74d 100644
--- a/filters/offset/offset.c
+++ b/filters/offset/offset.c
@@ -37,6 +37,7 @@
 #include <stdlib.h>
 #include <stdint.h>
 #include <string.h>
+#include <assert.h>
 
 #include <nbdkit-filter.h>
 
@@ -132,6 +133,47 @@ offset_zero (struct nbdkit_next_ops *next_ops, void
*nxdata,
   return next_ops->zero (nxdata, count, offs + offset, flags, err);
 }
 
+/* Extents. */
+static int
+subtract_offset (uint64_t offs, uint64_t length, uint32_t type,
+                 void *vp)
+{
+  struct nbdkit_extents_map *extents_map = vp;
+
+  assert (offs >= offset);
+  offs -= offset;
+  return nbdkit_extent_add (extents_map, offs, length, type);
+}
+
+static int
+offset_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
+                void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+                struct nbdkit_extents_map *extents_map,
+                int *err)
+{
+  struct nbdkit_extents_map *map2;
+
+  map2 = nbdkit_extents_new ();
+  if (map2 == NULL)
+    return -1;
+  if (next_ops->extents (nxdata, count, offs + offset,
+                         flags, map2, err) == -1) {
+    nbdkit_extents_free (map2);
+    return -1;
+  }
+
+  /* Transform offsets in map2, return result in extents_map. */
+  if (nbdkit_extents_foreach (map2, subtract_offset, extents_map,
+                              NBDKIT_EXTENTS_FOREACH_FLAG_RANGE,
+                              offset, range) == -1) {
+    nbdkit_extents_free (map2);
+    return -1;
+  }
+  nbdkit_extents_free (map2);
+
+  return 0;
+}
+
 static struct nbdkit_filter filter = {
   .name              = "offset",
   .longname          = "nbdkit offset filter",
@@ -144,6 +186,7 @@ static struct nbdkit_filter filter = {
   .pwrite            = offset_pwrite,
   .trim              = offset_trim,
   .zero              = offset_zero,
+  .extents           = offset_extents,
 };
 
 NBDKIT_REGISTER_FILTER(filter)
-- 
2.20.1

Richard W.M. Jones

2019-Mar-19 16:35 UTC

head link

[Libguestfs] [PATCH nbdkit 6/9] partition: Implement mapping of extents.

---
 filters/partition/partition.c | 52 +++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/filters/partition/partition.c b/filters/partition/partition.c
index c1fe7a1..c8fcc7f 100644
--- a/filters/partition/partition.c
+++ b/filters/partition/partition.c
@@ -38,6 +38,7 @@
 #include <stdint.h>
 #include <string.h>
 #include <inttypes.h>
+#include <assert.h>
 
 #include <nbdkit-filter.h>
 
@@ -222,6 +223,56 @@ partition_zero (struct nbdkit_next_ops *next_ops, void
*nxdata,
   return next_ops->zero (nxdata, count, offs + h->offset, flags, err);
 }
 
+/* Extents. */
+struct subtract_offset_data {
+  struct handle *h;
+  struct nbdkit_extents_map *extents_map;
+};
+
+static int
+subtract_offset (uint64_t offs, uint64_t length, uint32_t type,
+                 void *dv)
+{
+  struct subtract_offset_data *data = dv;
+
+  assert (offs >= data->h->offset);
+  offs -= data->h->offset;
+  return nbdkit_extent_add (data->extents_map, offs, length, type);
+}
+
+static int
+partition_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
+                   void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+                   struct nbdkit_extents_map *extents_map,
+                   int *err)
+{
+  struct handle *h = handle;
+  struct nbdkit_extents_map *map2;
+  struct subtract_offset_data data;
+
+  map2 = nbdkit_extents_new ();
+  if (map2 == NULL)
+    return -1;
+  if (next_ops->extents (nxdata, count, offs + h->offset,
+                         flags, map2, err) == -1) {
+    nbdkit_extents_free (map2);
+    return -1;
+  }
+
+  /* Transform offsets in map2, return result in extents_map. */
+  data.h = h;
+  data.extents_map = extents_map;
+  if (nbdkit_extents_foreach (map2, subtract_offset, &data,
+                              NBDKIT_EXTENTS_FOREACH_FLAG_RANGE,
+                              h->offset, h->range) == -1) {
+    nbdkit_extents_free (map2);
+    return -1;
+  }
+  nbdkit_extents_free (map2);
+
+  return 0;
+}
+
 static struct nbdkit_filter filter = {
   .name              = "partition",
   .longname          = "nbdkit partition filter",
@@ -237,6 +288,7 @@ static struct nbdkit_filter filter = {
   .pwrite            = partition_pwrite,
   .trim              = partition_trim,
   .zero              = partition_zero,
+  .extents           = partition_extents,
 };
 
 NBDKIT_REGISTER_FILTER(filter)
-- 
2.20.1

Richard W.M. Jones

2019-Mar-19 16:35 UTC

head link

[Libguestfs] [PATCH nbdkit 7/9] data, memory: Implement extents.

These plugins are both based on the same sparse array structure which
supports a simple implementation of extents.
---
 common/sparse/sparse.h  |  7 ++++++-
 common/sparse/sparse.c  | 29 ++++++++++++++++++++++++++++-
 plugins/data/data.c     | 16 +++++++++++++++-
 plugins/memory/memory.c | 16 +++++++++++++++-
 4 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/common/sparse/sparse.h b/common/sparse/sparse.h
index 818d804..23e31b9 100644
--- a/common/sparse/sparse.h
+++ b/common/sparse/sparse.h
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2017-2018 Red Hat Inc.
+ * Copyright (C) 2017-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -87,4 +87,9 @@ extern void sparse_array_zero (struct sparse_array *sa,
                                uint32_t count, uint64_t offset)
   __attribute__((__nonnull__ (1)));
 
+/* Return information about allocated pages and holes. */
+extern int sparse_array_extents (struct sparse_array *sa,
+                                 uint32_t count, uint64_t offset,
+                                 struct nbdkit_extents_map *extents_map);
+
 #endif /* NBDKIT_SPARSE_H */
diff --git a/common/sparse/sparse.c b/common/sparse/sparse.c
index a5ace48..56bfa77 100644
--- a/common/sparse/sparse.c
+++ b/common/sparse/sparse.c
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2017-2018 Red Hat Inc.
+ * Copyright (C) 2017-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -360,3 +360,30 @@ sparse_array_zero (struct sparse_array *sa, uint32_t count,
uint64_t offset)
     offset += n;
   }
 }
+
+int
+sparse_array_extents (struct sparse_array *sa,
+                      uint32_t count, uint64_t offset,
+                      struct nbdkit_extents_map *extents_map)
+{
+  uint32_t n, type;
+  void *p;
+
+  while (count > 0) {
+    p = lookup (sa, offset, false, &n, NULL);
+    if (n > count)
+      n = count;
+
+    if (p == NULL)
+      type = NBDKIT_EXTENT_HOLE | NBDKIT_EXTENT_ZERO;
+    else
+      type = 0; /* allocated data */
+    if (nbdkit_extent_add (extents_map, offset, n, type) == -1)
+      return -1;
+
+    count -= n;
+    offset += n;
+  }
+
+  return 0;
+}
diff --git a/plugins/data/data.c b/plugins/data/data.c
index f9d3881..d580805 100644
--- a/plugins/data/data.c
+++ b/plugins/data/data.c
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2018 Red Hat Inc.
+ * Copyright (C) 2018-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -378,6 +378,19 @@ data_trim (void *handle, uint32_t count, uint64_t offset)
   return 0;
 }
 
+/* Extents. */
+static int
+data_extents (void *handle, uint32_t count, uint64_t offset,
+              uint32_t flags, struct nbdkit_extents_map *extents_map)
+{
+  int r;
+
+  pthread_mutex_lock (&lock);
+  r = sparse_array_extents (sa, count, offset, extents_map);
+  pthread_mutex_unlock (&lock);
+  return r;
+}
+
 static struct nbdkit_plugin plugin = {
   .name              = "data",
   .version           = PACKAGE_VERSION,
@@ -394,6 +407,7 @@ static struct nbdkit_plugin plugin = {
   .pwrite            = data_pwrite,
   .zero              = data_zero,
   .trim              = data_trim,
+  .extents           = data_extents,
   /* In this plugin, errno is preserved properly along error return
    * paths from failed system calls.
    */
diff --git a/plugins/memory/memory.c b/plugins/memory/memory.c
index e27e127..fa09a35 100644
--- a/plugins/memory/memory.c
+++ b/plugins/memory/memory.c
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2017-2018 Red Hat Inc.
+ * Copyright (C) 2017-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -170,6 +170,19 @@ memory_trim (void *handle, uint32_t count, uint64_t offset)
   return 0;
 }
 
+/* Extents. */
+static int
+memory_extents (void *handle, uint32_t count, uint64_t offset,
+                uint32_t flags, struct nbdkit_extents_map *extents_map)
+{
+  int r;
+
+  pthread_mutex_lock (&lock);
+  r = sparse_array_extents (sa, count, offset, extents_map);
+  pthread_mutex_unlock (&lock);
+  return r;
+}
+
 static struct nbdkit_plugin plugin = {
   .name              = "memory",
   .version           = PACKAGE_VERSION,
@@ -185,6 +198,7 @@ static struct nbdkit_plugin plugin = {
   .pwrite            = memory_pwrite,
   .zero              = memory_zero,
   .trim              = memory_trim,
+  .extents           = memory_extents,
   /* In this plugin, errno is preserved properly along error return
    * paths from failed system calls.
    */
-- 
2.20.1

Richard W.M. Jones

2019-Mar-19 16:35 UTC

head link

[Libguestfs] [PATCH nbdkit 8/9] vddk: Implement extents.

This uses a new API VixDiskLib_QueryAllocatedBlocks provided in
VDDK >= 6.7.

Thanks: Martin Kletzander.
---
 plugins/vddk/vddk-structs.h |  15 ++++-
 plugins/vddk/vddk.c         | 127 ++++++++++++++++++++++++++++++++++++
 2 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/plugins/vddk/vddk-structs.h b/plugins/vddk/vddk-structs.h
index dbed94a..df88322 100644
--- a/plugins/vddk/vddk-structs.h
+++ b/plugins/vddk/vddk-structs.h
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2018 Red Hat Inc.
+ * Copyright (C) 2013-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -47,6 +47,9 @@ typedef uint64_t VixError;
 #define VIXDISKLIB_FLAG_OPEN_READ_ONLY 4
 #define VIXDISKLIB_SECTOR_SIZE 512
 
+#define VIXDISKLIB_MIN_CHUNK_SIZE 128
+#define VIXDISKLIB_MAX_CHUNK_NUMBER (512*1024)
+
 typedef void *VixDiskLibConnection;
 typedef void *VixDiskLibHandle;
 
@@ -124,4 +127,14 @@ typedef struct VixDiskLibInfo {
   char *uuid;
 } VixDiskLibInfo;
 
+typedef struct {
+  uint64_t offset;
+  uint64_t length;
+} VixDiskLibBlock;
+
+typedef struct {
+  uint32_t numBlocks;
+  VixDiskLibBlock blocks[1];
+} VixDiskLibBlockList;
+
 #endif /* NBDKIT_VDDK_STRUCTS_H */
diff --git a/plugins/vddk/vddk.c b/plugins/vddk/vddk.c
index 56871cc..ce6633c 100644
--- a/plugins/vddk/vddk.c
+++ b/plugins/vddk/vddk.c
@@ -45,6 +45,8 @@
 #include <nbdkit-plugin.h>
 
 #include "isaligned.h"
+#include "minmax.h"
+#include "rounding.h"
 
 #include "vddk-structs.h"
 
@@ -67,6 +69,8 @@ static VixError (*VixDiskLib_GetInfo) (VixDiskLibHandle
handle, VixDiskLibInfo *
 static void (*VixDiskLib_FreeInfo) (VixDiskLibInfo *info);
 static VixError (*VixDiskLib_Read) (VixDiskLibHandle handle, uint64_t
start_sector, uint64_t nr_sectors, unsigned char *buf);
 static VixError (*VixDiskLib_Write) (VixDiskLibHandle handle, uint64_t
start_sector, uint64_t nr_sectors, const unsigned char *buf);
+static VixError (*VixDiskLib_QueryAllocatedBlocks) (VixDiskLibHandle
diskHandle, uint64_t start_sector, uint64_t nr_sectors, uint64_t chunk_size,
VixDiskLibBlockList **block_list);
+static VixError (*VixDiskLib_FreeBlockList) (VixDiskLibBlockList *block_list);
 
 /* Parameters passed to InitEx. */
 #define VDDK_MAJOR 5
@@ -174,6 +178,11 @@ vddk_load (void)
   VixDiskLib_FreeInfo = dlsym (dl, "VixDiskLib_FreeInfo");
   VixDiskLib_Read = dlsym (dl, "VixDiskLib_Read");
   VixDiskLib_Write = dlsym (dl, "VixDiskLib_Write");
+
+  /* Added in VDDK 6.7, these will be NULL for earlier versions: */
+  VixDiskLib_QueryAllocatedBlocks +    dlsym (dl,
"VixDiskLib_QueryAllocatedBlocks");
+  VixDiskLib_FreeBlockList = dlsym (dl, "VixDiskLib_FreeBlockList");
 }
 
 static void
@@ -570,6 +579,122 @@ vddk_pwrite (void *handle, const void *buf, uint32_t
count, uint64_t offset)
   return 0;
 }
 
+static int
+vddk_can_extents (void *handle)
+{
+  struct vddk_handle *h = handle;
+  VixError err;
+  VixDiskLibBlockList *block_list;
+
+  /* This call was added in VDDK 6.7.  In earlier versions the
+   * function pointer will be NULL and we cannot query extents.
+   */
+  if (VixDiskLib_QueryAllocatedBlocks == NULL) {
+    nbdkit_debug ("can_extents: VixDiskLib_QueryAllocatedBlocks == NULL,
"
+                  "probably this is VDDK < 6.7");
+    return 0;
+  }
+
+  /* However even when the call is available it rarely works well so
+   * the best thing we can do here is to try the call and if it's
+   * non-functional return false.
+   */
+  DEBUG_CALL ("VixDiskLib_QueryAllocatedBlocks",
+              "handle, 0, %d sectors, %d sectors",
+              VIXDISKLIB_MIN_CHUNK_SIZE, VIXDISKLIB_MIN_CHUNK_SIZE);
+  err = VixDiskLib_QueryAllocatedBlocks (h->handle,
+                                         0, VIXDISKLIB_MIN_CHUNK_SIZE,
+                                         VIXDISKLIB_MIN_CHUNK_SIZE,
+                                         &block_list);
+  if (err == VIX_OK) {
+    DEBUG_CALL ("VixDiskLib_FreeBlockList", "block_list");
+    VixDiskLib_FreeBlockList (block_list);
+  }
+  if (err != VIX_OK) {
+    char *errmsg = VixDiskLib_GetErrorText (err, NULL);
+    nbdkit_debug ("can_extents: VixDiskLib_QueryAllocatedBlocks test
failed, "
+                  "extents support will be disabled: "
+                  "original error: %s",
+                  errmsg);
+    VixDiskLib_FreeErrorText (errmsg);
+    return 0;
+  }
+
+  return 1;
+}
+
+static int
+vddk_extents (void *handle, uint32_t count, uint64_t offset, uint32_t flags,
+              struct nbdkit_extents_map *extents_map)
+{
+  struct vddk_handle *h = handle;
+  uint64_t end, start_sector;
+
+  /* nbdkit extents_map defaults to all allocated.  However VDDK gives
+   * us the allocated blocks.  Therefore set the whole region to a
+   * hole first.
+   */
+  if (nbdkit_extent_add (extents_map, offset, count,
+                         NBDKIT_EXTENT_HOLE|NBDKIT_EXTENT_ZERO) == -1)
+    return -1;
+
+  /* We can only query whole chunks.  Therefore start with the first
+   * chunk before offset.
+   */
+  end = offset + count;
+  start_sector +    ROUND_DOWN (offset, VIXDISKLIB_MIN_CHUNK_SIZE *
VIXDISKLIB_SECTOR_SIZE)
+    / VIXDISKLIB_SECTOR_SIZE;
+  while (start_sector * VIXDISKLIB_SECTOR_SIZE < end) {
+    VixError err;
+    uint32_t i;
+    uint64_t nr_chunks, nr_sectors;
+    VixDiskLibBlockList *block_list;
+
+    assert (IS_ALIGNED (start_sector, VIXDISKLIB_MIN_CHUNK_SIZE));
+
+    nr_chunks +      ROUND_UP (end - start_sector * VIXDISKLIB_SECTOR_SIZE,
+                VIXDISKLIB_MIN_CHUNK_SIZE * VIXDISKLIB_SECTOR_SIZE)
+      / (VIXDISKLIB_MIN_CHUNK_SIZE * VIXDISKLIB_SECTOR_SIZE);
+    nr_chunks = MIN (nr_chunks, VIXDISKLIB_MAX_CHUNK_NUMBER);
+    nr_sectors = nr_chunks * VIXDISKLIB_MIN_CHUNK_SIZE;
+
+    DEBUG_CALL ("VixDiskLib_QueryAllocatedBlocks",
+                "handle, %" PRIu64 " sectors, %" PRIu64
" sectors, "
+                "%d sectors",
+                start_sector, nr_sectors, VIXDISKLIB_MIN_CHUNK_SIZE);
+    err = VixDiskLib_QueryAllocatedBlocks (h->handle,
+                                           start_sector, nr_sectors,
+                                           VIXDISKLIB_MIN_CHUNK_SIZE,
+                                           &block_list);
+    if (err != VIX_OK) {
+      VDDK_ERROR (err, "VixDiskLib_QueryAllocatedBlocks");
+      return -1;
+    }
+
+    for (i = 0; i < block_list->numBlocks; ++i) {
+      uint64_t offset, length;
+
+      offset = block_list->blocks[i].offset * VIXDISKLIB_SECTOR_SIZE;
+      length = block_list->blocks[i].length * VIXDISKLIB_SECTOR_SIZE;
+
+      if (nbdkit_extent_add (extents_map,
+                             offset, length, 0 /* allocated data */) == -1) {
+        DEBUG_CALL ("VixDiskLib_FreeBlockList",
"block_list");
+        VixDiskLib_FreeBlockList (block_list);
+        return -1;
+      }
+    }
+    DEBUG_CALL ("VixDiskLib_FreeBlockList", "block_list");
+    VixDiskLib_FreeBlockList (block_list);
+
+    start_sector += nr_sectors;
+  }
+
+  return 0;
+}
+
 static struct nbdkit_plugin plugin = {
   .name              = "vddk",
   .longname          = "VMware VDDK plugin",
@@ -585,6 +710,8 @@ static struct nbdkit_plugin plugin = {
   .get_size          = vddk_get_size,
   .pread             = vddk_pread,
   .pwrite            = vddk_pwrite,
+  .can_extents       = vddk_can_extents,
+  .extents           = vddk_extents,
 };
 
 NBDKIT_REGISTER_PLUGIN(plugin)
-- 
2.20.1

Richard W.M. Jones

2019-Mar-19 16:35 UTC

head link

[Libguestfs] [PATCH nbdkit 9/9] file: Implement extents.

This uses lseek SEEK_DATA/SEEK_HOLE to search for allocated data and
holes in the underlying file.
---
 plugins/file/file.c | 142 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 132 insertions(+), 10 deletions(-)

diff --git a/plugins/file/file.c b/plugins/file/file.c
index 628f8fb..704deaa 100644
--- a/plugins/file/file.c
+++ b/plugins/file/file.c
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2018 Red Hat Inc.
+ * Copyright (C) 2013-2019 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -37,6 +37,7 @@
 #include <stdlib.h>
 #include <stdbool.h>
 #include <string.h>
+#include <inttypes.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <sys/types.h>
@@ -44,6 +45,8 @@
 #include <sys/ioctl.h>
 #include <errno.h>
 
+#include <pthread.h>
+
 #if defined(__linux__) && !defined(FALLOC_FL_PUNCH_HOLE)
 #include <linux/falloc.h>   /* For FALLOC_FL_*, glibc < 2.18 */
 #endif
@@ -68,7 +71,11 @@
 
 static char *filename = NULL;
 
-int file_debug_zero;            /* to enable: -D file.zero=1 */
+/* Any callbacks using lseek must be protected by this lock. */
+static pthread_mutex_t lseek_lock = PTHREAD_MUTEX_INITIALIZER;
+
+/* to enable: -D file.zero=1 */
+int file_debug_zero;
 
 static void
 file_unload (void)
@@ -220,6 +227,21 @@ file_close (void *handle)
 
 #define THREAD_MODEL NBDKIT_THREAD_MODEL_PARALLEL
 
+/* For block devices, stat->st_size is not the true size. */
+static int64_t
+block_device_size (int fd)
+{
+  off_t size;
+
+  size = lseek (fd, 0, SEEK_END);
+  if (size == -1) {
+    nbdkit_error ("lseek (to find device size): %m");
+    return -1;
+  }
+
+  return size;
+}
+
 /* Get the file size. */
 static int64_t
 file_get_size (void *handle)
@@ -227,15 +249,11 @@ file_get_size (void *handle)
   struct handle *h = handle;
 
   if (h->is_block_device) {
-    /* Block device, so st_size will not be the true size. */
-    off_t size;
-
-    size = lseek (h->fd, 0, SEEK_END);
-    if (size == -1) {
-      nbdkit_error ("lseek (to find device size): %m");
-      return -1;
-    }
+    int64_t size;
 
+    pthread_mutex_lock (&lseek_lock);
+    size = block_device_size (h->fd);
+    pthread_mutex_unlock (&lseek_lock);
     return size;
   } else {
     /* Regular file. */
@@ -501,6 +519,106 @@ file_trim (void *handle, uint32_t count, uint64_t offset,
uint32_t flags)
   return 0;
 }
 
+#ifdef SEEK_HOLE
+/* Extents. */
+
+static int
+file_can_extents (void *handle)
+{
+  struct handle *h = handle;
+  off_t r;
+
+  /* A simple test to see whether SEEK_HOLE etc is likely to work on
+   * the current filesystem.
+   */
+  pthread_mutex_lock (&lseek_lock);
+  r = lseek (h->fd, 0, SEEK_HOLE);
+  pthread_mutex_unlock (&lseek_lock);
+  if (r == -1) {
+    nbdkit_debug ("extents disabled: lseek: SEEK_HOLE: %m");
+    return 0;
+  }
+  return 1;
+}
+
+static int
+do_extents (void *handle, uint32_t count, uint64_t offset,
+            uint32_t flags, struct nbdkit_extents_map *extents_map)
+{
+  struct handle *h = handle;
+  const bool req_one = flags & NBDKIT_FLAG_REQ_ONE;
+  size_t nr_extents = 0;
+  uint64_t end = offset + count;
+
+  do {
+    off_t pos;
+
+    pos = lseek (h->fd, offset, SEEK_DATA);
+    if (pos == -1) {
+      if (errno == ENXIO) {
+        /* The current man page does not describe this situation well,
+         * but a proposed change to POSIX adds these words for ENXIO:
+         * "or the whence argument is SEEK_DATA and the offset falls
+         * within the final hole of the file."
+         */
+        pos = end;
+      }
+      else {
+        nbdkit_error ("lseek: SEEK_DATA: %" PRIu64 ": %m",
offset);
+        return -1;
+      }
+    }
+
+    /* We know there is a hole from offset to pos-1. */
+    if (pos > offset) {
+      if (nbdkit_extent_add (extents_map, offset, pos - offset,
+                             NBDKIT_EXTENT_HOLE | NBDKIT_EXTENT_ZERO) == -1)
+        return -1;
+      nr_extents++;
+    }
+
+    offset = pos;
+    if (offset >= end)
+      break;
+
+    pos = lseek (h->fd, offset, SEEK_HOLE);
+    if (pos == -1) {
+      nbdkit_error ("lseek: SEEK_HOLE: %" PRIu64 ": %m",
offset);
+      return -1;
+    }
+
+    /* We know there is data from offset to pos-1, but we don't have
+     * to add this to the map because that is the default.
+     */
+    if (pos > offset)
+      nr_extents++;
+
+    offset = pos;
+
+    /* If the req_one flag is set, we must want until TWO extents have
+     * been added so that we have the limit of the first extent.
+     */
+    if (req_one && nr_extents >= 2)
+      break;
+  } while (offset < end);
+
+  return 0;
+}
+
+static int
+file_extents (void *handle, uint32_t count, uint64_t offset,
+              uint32_t flags, struct nbdkit_extents_map *extents_map)
+{
+  int r;
+
+  pthread_mutex_lock (&lseek_lock);
+  r = do_extents (handle, count, offset, flags, extents_map);
+  pthread_mutex_unlock (&lseek_lock);
+
+  return r;
+}
+#endif /* SEEK_HOLE */
+
 static struct nbdkit_plugin plugin = {
   .name              = "file",
   .longname          = "nbdkit file plugin",
@@ -522,6 +640,10 @@ static struct nbdkit_plugin plugin = {
   .flush             = file_flush,
   .trim              = file_trim,
   .zero              = file_zero,
+#ifdef SEEK_HOLE
+  .can_extents       = file_can_extents,
+  .extents           = file_extents,
+#endif
   .errno_is_preserved = 1,
 };
 
-- 
2.20.1

Eric Blake

2019-Mar-20 05:35 UTC

head link

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

On 3/19/19 11:34 AM, Richard W.M. Jones wrote:> This pair of calls allows plugins to describe which extents in the
> virtual disk are allocated, holes or zeroes.
> ---
> + void nbdkit_extents_free (struct nbdkit_extents_map *);
> +
> +Frees an existing extents map.
Is this like free() where it is safe to call on NULL?
> +
> +=head3 Iterating over nbdkit_extents_map
> +
> +One function is provided to filters only to iterate over the extents
> +map:
> +
> + typedef int (*nbdkit_extents_callback) (uint64_t offset, uint64_t length,
> +                                         uint32_t type, void *opaque);
> +
> + int nbdkit_extents_foreach (
> +            const struct nbdkit_extents_map *extents_map,
> +            nbdkit_extents_callback fn, void *opaque,
> +            uint32_t flags,
> +            uint64_t range_offset, uint64_t range_length);
> +
> +=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
> +
> +If this flag is included then the C<range_offset> and
C<range_length>
> +parameters are used, so only extents overlapping
> +C<[range_offset...range_offset+range_length-1]> are returned.
Must range_offset+range_length-1 lie within the reported
next_opts->get_size(), or can range extend beyond the end of the plugin
(that is, can I pass range == -1 to get from a given offset to the end
of the file, or do I have to compute the end range to avoid internal
errors)?
> +++ b/docs/nbdkit-plugin.pod
> +The callback should detect and return the list of extents overlapping
> +the range C<[offset...offset+count-1]>.  The C<extents_map>
parameter
> +points to an opaque object which the callback should fill in by
> +calling C<nbdkit_extent_add> etc.  See L</Extents map> below.
> +
> +The C<flags> parameter may contain the flag
C<NBDKIT_FLAG_REQ_ONE>
> +which means that the client is only requesting information about the
> +extent overlapping C<offset>.  The plugin may ignore this flag, or
as
> +an optimization it may return just a single extent for C<offset>.
> +(Note that implementing this optimization correctly is difficult and
> +subtle.  You must ensure that the upper limit of the single extent
> +returned is correct.  If unsure then the safest option is to ignore
> +this flag.)
Indeed. Or in other words, since the extent map coalesces adjacent
regions with the same type, merely calling nbdkit_extent_add(0) exactly
once won't work, because that coalesces with the extent map's default of
the entire image already being reported as data; so to report a single
data extent that ends sooner than the client's requested range, it is
necessary to report a different extent starting at the next byte.
> +
> +If there is an error, C<.extents> should call C<nbdkit_error>
with an
> +error message, and C<nbdkit_set_error> to record an appropriate
error
> +(unless C<errno> is sufficient), then return C<-1>.
> +
> +=head3 Extents map
> +
> +The plugin C<extents> callback is passed an opaque pointer
C<struct
> +nbdkit_extents_map *extents_map>.  This structure represents a list of
> +L<filesystem
extents|https://en.wikipedia.org/wiki/Extent_(file_systems)>
> +describing which areas of the disk are allocated, which are sparse
> +(“holes”), and, if supported, which are zeroes.  Initially the map
> +describes a single allocated data extent covering the entire virtual
> +disk.
> +
> +The C<extents> callback has to fill in any holes or zeroes areas by
> +calling C<nbdkit_extent*> functions, to describe all extents
> +overlapping the range C<[offset...offset+count-1]>.  (nbdkit will
> +ignore extents outside this range so plugins may, if it is easier to
> +implement, return all extent information for the whole disk.)
Worth emphasizing that reporting DATA is always a safe fallback, and
that plugins should favor speed over precision?  (If the time it takes
you to report ZERO or SPARSE is not significantly faster than the time
for .pread to just read the extent in question, then reporting DATA is
the better option).

> +++ b/server/extents.c
> +/* Trivial implementation as an ordered list of extents.  We will
> + * probably have to change this to a more efficient data structure if
> + * we have plugins with lots of extents.
> + *
> + * Invariants:
> + * - extents are stored in ascending order
> + * - extents must not overlap
> + * - adjacent extents must not have the same type
> + *
Any invariant about not exceeding .get_size()?  (I guess that's not
easily possible unless nbdkit_extents_new() were to take a parameter, as
the max size may differ for a filter compared to what the filter passes
to the plugin).  I'm slightly worried about a malicious plugin that
purposefully slams us with add() calls way beyond the end of the image.
Maybe you want a magic type for marking EOF when the map is first
created, with different rules on how things coalesce?  (Then again, if
the plugin crashes because it manages memory poorly, that's not really
the end of the world; we're more worried about a client triggering bad
behavior, not a plugin).

Hmm, what if we had the signature nbdkit_extents_new(offset, length) to
bound the extents we bother to track from the user - we still allow the
user to call nbdkit_extents_add() with information outside of the bounds
we care about, but merely return 0 on those cases without allocating any
memory?  Then we are guaranteed that the guest cannot allocate more
memory than proportional to the size of the client's request (which is
currently capped at 32 bits by the protocol).

> +
> +/* Find the extent before or containing offset.  If offset is inside
> + * the extent, the extent index is returned.  If offset is not in any
> + * extent, but there is an extent with a lower offset, then the index
> + * of the nearest lower extent is returned.  If offset is before the
> + * zeroth extent then -1 is returned.
Technically, aren't gaps in the map treated as implicit extents of type
0? Then again, I guess you return the next lower extent (rather than the
gap) to make the coalescing nicer.

> +int
> +nbdkit_extent_add (struct nbdkit_extents_map *map,
> +                   uint64_t offset, uint64_t length, uint32_t type)
> +{
> +  ssize_t lo, hi;
> +  bool coalesce_below, coalesce_above;
> +  struct extent new_extent;
> +
> +  /* This might be considered an error, but be flexible for badly
> +   * written plugins.
> +   */
> +  if (length == 0)
> +    return 0;
> +
> +  /* Don't allow extents to exceeed the _signed_ 64 bit limit that is
exceed
> +   * used in the rest of nbdkit.
> +   */
> +  if (offset > INT64_MAX ||
> +      offset + length < offset ||
> +      offset + length > INT64_MAX) {
> +    nbdkit_error ("nbdkit_extent_add: "
> +                  "extent cannot exceed signed 64 bit limit: "
> +                  "offset=%" PRIu64 " length=%"
PRIu64,
> +                  offset, length);
> +    return -1;
> +  }
> +
> + again:
> +  lo = find_extent (map, offset);
> +  hi = find_extent (map, offset + length - 1);
> +
> +  /* "lo" and "hi" are -1 or they point to the extent
> +   * overlapping-or-below the corresponding endpoints of the new
> +   * extent.
> +   *
> +   * The algorithm here has two parts: Firstly we remove any
> +   * overlapping extents (this may involve erasing extents completely
> +   * or shrinking them).  Secondly we insert the new extent at the
> +   * right place in the array, if necessary allowing it to coalesce
> +   * with the extent above or below.
> +   *
> +   * To remove overlapping extents we consider 7 cases:
> +   *
> +   * A.0   lo == -1 (there is no old extent before the new extent)
> +   *     => do nothing
> +   *
> +   * A.1   ┌─────┐     offset of new extent exactly at
> +   *       │ lo  │     beginning of lo and lo smaller/equal
> +   *       └─────┘     to new extent
> +   *       ↑
> +   *       ▐▓▓▓▓▓▓▓▓▓
> +   *     => erase lo
> +   *     => creates situation A.0 or A.5
> +   *
> +   * A.2   ┌──────────┐   offset of new extent exactly at
> +   *       │ lo       │   beginning of lo and lo bigger than
> +   *       └──────────┘   new extent
> +   *       ↑
> +   *       ▐▓▓▓▓▓▓▓▓▓
> +   *     => shrink lo “upwards”
> +   *     => creates situation A.0 or A.5
> +   *
> +   * A.3   ┌──────────────┐  new extent in the middle
> +   *       │ lo           │  of lo
> +   *       └──────────────┘
> +   *         ↑
> +   *         ▓▓▓▓▓▓▓▓▓
> +   *     => split lo
> +   *     => creates situation A.5
> +   *
> +   * A.4   ┌─────┐     offset of new extent in the middle
> +   *       │ lo  │     of lo
> +   *       └─────┘
> +   *         ↑
> +   *         ▓▓▓▓▓▓▓▓▓
> +   *     => shrink lo
> +   *     => creates situation A.5
> +   *
> +   * A.5   ┌─────┐     offset of new extent after lo
> +   *       │ lo  │
> +   *       └─────┘
> +   *               ↑
> +   *               ▓▓▓▓▓▓▓▓▓
> +   *     => do nothing
> +   *
> +   * B.0   lo == hi or hi == -1 (there is no higher extent)
> +   *     => do nothing
> +   *
> +   * B.1          ┌─────┐     extent hi is fully covered
> +   *              │ hi  │     by new extent
> +   *              └─────┘
> +   *                       ↑
> +   *       ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
> +   *     => erase hi
> +   *     => creates situation B.0
> +   *
> +   * B.2          ┌─────┐     extent hi is partly overlapped
> +   *              │ hi  │     by new extent
> +   *              └─────┘
> +   *                 ↑
> +   *       ▓▓▓▓▓▓▓▓▓▓▓
> +   *     => shrink hi
> +   *     => creates situation B.0
> +   *
> +   * In addition extents [lo+1...hi-1] (if they exist) are erased
> +   * since they must be fully covered by the new extent.
> +   *
> +   * At the end we're in situation A.0/A.5 and B.0, and so no old
> +   * extents overlap the new extent.
> +   *
> +   * The insertion can then be done with optional coalescing (see code
> +   * below for details - this is a lot simpler).
> +   */
> +
> +  /* A.0 */
> +  if (lo == -1)
> +    goto erase_middle_extents;
> +  /* A.5 */
> +  if (offset >= map->extents[lo].offset +
map->extents[lo].length)
> +    goto erase_middle_extents;
> +
> +  /* A.1 or A.2 */
> +  if (offset == map->extents[lo].offset) {
> +    if (map->extents[lo].offset + map->extents[lo].length <=
offset + length) {
> +      /* A.1 */
> +      erase_extents (map, lo, 1);
> +    }
> +    else {
> +      /* A.2 */
> +      uint64_t end = map->extents[lo].offset +
map->extents[lo].length;
> +      map->extents[lo].offset = offset + length;
> +      map->extents[lo].length = end - map->extents[lo].offset;
> +    }
> +    goto again;
> +  }
> +
> +  /* A.3 */
> +  if (map->extents[lo].offset < offset &&
> +      map->extents[lo].offset + map->extents[lo].length > offset
+ length) {
> +    new_extent.offset = offset + length;
> +    new_extent.length > +      map->extents[lo].offset +
map->extents[lo].length - new_extent.offset;
> +    new_extent.type = map->extents[lo].type;
> +    map->extents[lo].length = offset - map->extents[lo].offset;
> +    if (insert_extent (map, new_extent, lo+1) == -1)
> +      return -1;
> +    goto again;
> +  }
> +
> +  /* We must be in situation A.4 assuming analysis above is exhaustive. */
> +  assert (offset > map->extents[lo].offset);
> +  assert (offset < map->extents[lo].offset +
map->extents[lo].length);
> +  assert (map->extents[lo].offset + map->extents[lo].length <=
offset + length);
> +
> +  map->extents[lo].length = offset - map->extents[lo].offset;
> +
> + erase_middle_extents:
> +  /* Erase extents [lo+1..hi-1] if they exist as they must be
> +   * completely covered by the new extent.
> +   */
> +  if (hi >= 0 && lo+1 < hi) {
> +    erase_extents (map, lo+1, hi-lo-1);
> +    goto again;
> +  }
> +
> +  /* B.0 */
> +  if (lo == hi || hi == -1)
> +    goto insert;
> +
> +  /* B.1 */
> +  if (map->extents[hi].offset + map->extents[hi].length <= offset
+ length) {
> +    erase_extents (map, hi, 1);
For worst-case guest behavior, this can result in several memmove()
calls during a single _add(). I guess the way to be more efficient is to
use a different data structure rather than an array; I also suspect that
we want to cater to the typical case of a plugin calling _add() in
ascending order (so actions at the end of the list should be fast, no
matter whether we use array or some other structure).
> +    goto again;
> +  }
> +
> +  /* We must be in situation B.2 if the analysis above is exhaustive. */
> +  assert (map->extents[hi].offset < offset + length);
> +
> +  uint64_t end = map->extents[hi].offset + map->extents[hi].length;
> +  map->extents[hi].offset = offset + length;
> +  map->extents[hi].length = end - map->extents[hi].offset;
> +
> + insert:
> +  /* Now we know there is no extent which overlaps with the new
> +   * extent, and the new extent must be inserted after lo.
> +   */
> +
> +  /* However we have to deal with possibly coalescing the new extent
> +   * with lo and/or lo+1.
> +   */
Can we also take the shortcut that if we are in a gap, AND the _add is
for type 0, we don't have to do anything (because coalescing with the
gap is still the correct behavior)?
> +  coalesce_below > +    lo >= 0 &&
> +    map->extents[lo].type == type &&
> +    map->extents[lo].offset + map->extents[lo].length == offset;
> +  coalesce_above > +    lo+1 < map->nr_extents &&
> +    map->extents[lo+1].type == type &&
> +    map->extents[lo+1].offset == offset + length;
> +
> +  if (coalesce_below && coalesce_above) {
> +    /* Coalesce lo + new extent + lo+1
> +     * => Extend lo to cover the whole range and remove lo+1.
> +     */
> +    map->extents[lo].length > +      map->extents[lo+1].offset +
map->extents[lo+1].length
> +      - map->extents[lo].offset;
> +    erase_extents (map, lo+1, 1);
> +  }
> +  else if (coalesce_below) {
> +    /* Coalesce lo + new extent.
> +     * => Extend lo to cover new extent.
> +     */
> +    map->extents[lo].length = offset + length -
map->extents[lo].offset;
> +  }
> +  else if (coalesce_above) {
> +    /* Coalesce new extent + lo+1
> +     * => Extend lo+1 to cover new extent.
> +     */
> +    map->extents[lo+1].length > +      map->extents[lo+1].offset
+ map->extents[lo+1].length - offset;
> +    map->extents[lo+1].offset = offset;
> +  }
> +  else {
> +    /* Normal case: Allocate a new extent and insert it after lo. */
> +    new_extent.offset = offset;
> +    new_extent.length = length;
> +    new_extent.type = type;
> +    if (insert_extent (map, new_extent, lo+1) == -1)
> +      return -1;
> +  }
> +
> +  return 0;
Hairy, but well-commented and appears to be correct on my late-night
reading.
> +}
> +
> +int
> +nbdkit_extents_foreach (const struct nbdkit_extents_map *map,
> +                        nbdkit_extents_callback fn, void *opaque,
> +                        uint32_t flags,
> +                        uint64_t range_offset, uint64_t range_length)
> +{
> +  const bool flag_range = flags & NBDKIT_EXTENTS_FOREACH_FLAG_RANGE;
> +  const bool flag_one = flags & NBDKIT_EXTENTS_FOREACH_FLAG_ONE;
> +  uint64_t offset, next;
> +  size_t i;
> +
> +  if (flag_range) {
> +    if (range_offset + range_length < range_offset) {
> +      nbdkit_error ("nbdkit_extents_foreach: invalid range: "
> +                    "offset=%" PRIu64 " length=%"
PRIu64,
> +                    range_offset, range_length);
> +      return -1;
> +    }
Do we still need FLAG_RANGE for this call if we instead change _new() to
take a range? (That is, if we ignore _add calls outside of the initial
range, then it is apparant that traversal will automatically be bounded
without having to request different bounds - and I'm having a hard time
seeing where a filter would want to request anything less than the full
range it passed to next_ops, other than when doing short reads with
FLAG_ONE).

The NBD protocol allows for short replies (whether or not FLAG_ONE was
requested) - should we have a fifth type that the plugin can call
_add(END) as a way of specifically marking that they did not provide
extent information beyond that point, and therefore nbdkit must do a
short reply (instead of treating the rest of the range as data)?  (For
example, in qemu with the qcow2 file, a block status operation never
reads more than one L2 cluster; requests that overlap regions of the
disk that would require reading a second L2 cluster are instead
truncated).  The END type would not be transmitted over the NBD wire,
but may make some other operations easier (such as your addressing your
comment about a plugin honoring REQ_ONE having subtle semantics if it
doesn't take coalescing of subsequent offsets into account).

> +  }
> +  else {
> +    range_offset = 0;
> +    range_length = INT64_MAX + UINT64_C(1);
> +  }
> +
> +  for (i = 0, offset = range_offset;
> +       offset < range_offset + range_length;
> +       offset = next) {
> +    next = offset;
> +
> +    if (i >= map->nr_extents) {
> +      /* Beyond last extent => Type 0 to end of range. */
> +      next = range_offset + range_length;
> +      if (fn (offset, next - offset, 0, opaque) == -1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    if (offset < map->extents[i].offset) {
> +      /* In a gap before the i'th extent => Type 0 to next offset.
*/
Did we guarantee that type 0 always appears as a gap, or is it sometimes
a gap and sometimes an actual extent entry?
> +      next = MIN (map->extents[i].offset, range_offset + range_length);
> +      if (fn (offset, next - offset, 0, opaque) == -1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    if (offset < map->extents[i].offset + map->extents[i].length)
{
> +      /* In an extent. */
> +      next = MIN (map->extents[i].offset + map->extents[i].length,
> +                  range_offset + range_length);
> +      if (fn (offset, next - offset, map->extents[i].type, opaque) ==
-1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    /* After an extent, increment i and iterate. */
> +    i++;
> +  }
> +
> +  return 0;
> +}
> +
> +#ifdef IN_TEST_EXTENTS
That's as far as I got today.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Eric Blake

2019-Mar-20 13:49 UTC

head link

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

On 3/19/19 11:34 AM, Richard W.M. Jones wrote:> This pair of calls allows plugins to describe which extents in the
> virtual disk are allocated, holes or zeroes.
> ---
Resuming where I left off, if it helps:

> +#ifdef IN_TEST_EXTENTS
> +/* These functions are only compiled for and called from the unit tests.
*/
> +
> +void
> +dump_extents_map (const struct nbdkit_extents_map *map)
> +{
> +  size_t i;
> +
> +  fprintf (stderr, "map %p:\n", map);
> +  fprintf (stderr, "\tnr_extents=%zu allocated=%zu\n",
> +           map->nr_extents, map->allocated);
> +  for (i = 0; i < map->nr_extents; ++i) {
> +    fprintf (stderr, "\textent %zu:", i);
> +    fprintf (stderr,
> +             "\toffset=%" PRIu64 " length=%" PRIu64
" "
> +             "last=%" PRIu64 " type=%" PRIu32
"\n",
Worth using fixed-width formatting so that consecutive lines are
tabular, and/or using hex offsets (in addition to decimal) for easier
spotting of alignments?
> +             map->extents[i].offset, map->extents[i].length,
> +             map->extents[i].offset + map->extents[i].length - 1,
> +             map->extents[i].type);
> +  }
> +}
> +
> +void
> +validate_extents_map (const struct nbdkit_extents_map *map)
> +{
> +  size_t i;
> +  bool fail = false;
> +
> +  if (map->nr_extents > map->allocated) {
> +    fprintf (stderr, "validate_extents_map: nr_extents >
allocated\n");
> +    fail = true;
> +  }
> +
> +  for (i = 0; i < map->nr_extents; ++i) {
Of course, if we failed the previous check, this now reads beyond the
end of the allocation. But it's debug code.
> +    /* Extent length must be > 0. */
> +    if (map->extents[i].length == 0) {
> +      fprintf (stderr, "validate_extents_map: %zu: extent length must
be > 0\n",
> +               i);
> +      fail = true;
> +    }
> +
> +    /* Extent length must not wrap around, indicating that we have
> +     * computed a negative length (as unsigned) somewhere.
> +     */
> +    if (map->extents[i].offset + map->extents[i].length <
> +        map->extents[i].offset) {
> +      fprintf (stderr, "validate_extents_map: %zu: negative
length\n",
> +               i);
> +      fail = true;
> +    }
> +
No validation that extents don't exceed INT64_MAX?
> +    if (i > 0) {
> +      /* Extents are stored in strictly ascending order. */
> +      if (map->extents[i-1].offset >= map->extents[i].offset) {
> +        fprintf (stderr,
> +                 "validate_extents_map: %zu: "
> +                 "not strictly ascending order\n", i);
> +        fail = true;
> +      }
> +
> +      /* Adjacent extents must not overlap. */
> +      if (map->extents[i-1].offset + map->extents[i-1].length >
> +          map->extents[i].offset) {
> +        fprintf (stderr,
> +                 "validate_extents_map: %zu: "
> +                 "overlapping extents\n", i);
> +        fail = true;
> +      }
> +
> +      /* Adjacent extents either have a different type or are
> +       * separated by a gap.
This changes if you don't allow gaps (it does sound like the
implementation will be easier if it is gap-free...)
> +       */
> +      if (map->extents[i-1].type == map->extents[i].type &&
> +          map->extents[i-1].offset + map->extents[i-1].length =>
+          map->extents[i].offset) {
> +        fprintf (stderr,
> +                 "validate_extents_map: %zu: "
> +                 "adjacent extents with same type not
coalesced\n", i);
> +        fail = true;
> +      }
> +    }
> +  }
> +
> +  if (fail) {
> +    dump_extents_map (map);
> +    abort ();
> +  }
> +}
> @@ -511,6 +530,21 @@ filter_can_zero (struct backend *b, struct connection
*conn)
>      return f->backend.next->can_zero (f->backend.next, conn);
>  }
>  
> +static int
> +filter_can_extents (struct backend *b, struct connection *conn)
> +{
> +  struct backend_filter *f = container_of (b, struct backend_filter,
backend);
> +  void *handle = connection_get_handle (conn, f->backend.i);
> +  struct b_conn nxdata = { .b = f->backend.next, .conn = conn };
> +
> +  debug ("%s: can_extents", f->name);
> +
> +  if (f->filter.can_extents)
> +    return f->filter.can_extents (&next_ops, &nxdata, handle);
> +  else
> +    return f->backend.next->can_extents (f->backend.next, conn);
For now, should we default filter.can_extents to false until we
completed the work for all filters?
> +++ b/server/test-extents.c
> +
> +static int
> +compare (uint64_t offset, uint64_t length, uint32_t type, void *mapv)
> +{
> +  const struct nbdkit_extents_map *map = mapv;
> +  size_t j;
> +
> +  /* nbdkit_extents_foreach_range should ensure this is true even if
> +   * there are extents which go below or above the range.
> +   */
> +  assert (offset >= 0);
Dead assertion (always true since offset is unsigned).
> +
> +  /* Read out the extents and compare them to the disk. */
> +  assert (nbdkit_extents_foreach (map, compare, map,
> +                                  NBDKIT_EXTENTS_FOREACH_FLAG_RANGE,
> +                                  0, DISK_SIZE-1) == 0);
> +
The test is rendered less powerful when NDEBUG is defined.  But it
doesn't affect production code (and maybe we could ensure that this file
#undef NDEBUG regardless of what the user decides via -D for assertions
in the overall project)


-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Eric Blake

2019-Mar-20 13:50 UTC

head link

Re: [Libguestfs] [PATCH nbdkit 2/9] server: Add CLEANUP_EXTENTS_FREE macro.

On 3/19/19 11:34 AM, Richard W.M. Jones wrote:> Provides automatic cleanup of ‘struct nbdkit_extents_map *’ on exit
> from a scope or function.
> ---
>  server/internal.h  | 2 ++
>  server/cleanup.c   | 8 +++++++-
>  server/Makefile.am | 3 ++-
>  3 files changed, 11 insertions(+), 2 deletions(-)
ACK

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Eric Blake

2019-Mar-20 14:49 UTC

head link

Re: [Libguestfs] [PATCH nbdkit 3/9] server: Implement Block Status requests to read allocation status.

On 3/19/19 11:35 AM, Richard W.M. Jones wrote:> This commit implements the core NBD protocol for the
"base:allocation"
> Block Status replies.
> ---
>  server/internal.h                    |   7 +
>  server/protocol.h                    |  17 +-
>  server/protocol-handshake-newstyle.c |  79 ++++++++-
>  server/protocol.c                    | 248 +++++++++++++++++++++++++--
>  4 files changed, 335 insertions(+), 16 deletions(-)
> 
>  
> +static int
> +send_newstyle_option_reply_meta_context (struct connection *conn,
> +                                         uint32_t option, uint32_t reply,
> +                                         uint32_t context_id,
> +                                         const char *name)
> +{
> +  struct fixed_new_option_reply fixed_new_option_reply;
> +  struct fixed_new_option_reply_meta_context context;
> +  const size_t namelen = strlen (name);
> +
> +  fixed_new_option_reply.magic = htobe64 (NBD_REP_MAGIC);
> +  fixed_new_option_reply.option = htobe32 (option);
> +  fixed_new_option_reply.reply = htobe32 (reply);
> +  fixed_new_option_reply.replylen = htobe32 (sizeof context + namelen);
> +  context.context_id = htobe32 (context_id);
The NBD protocol recommends that context_id for replies to
OPT_LIST_META_CONTEXT be set to 0, but requires the client to ignore the
context_id.  Here, you are blindly returning a context_id of 1
regardless of _LIST or _SET.
> @@ -469,6 +499,16 @@ negotiate_handshake_newstyle_options (struct
connection *conn)
>            continue;
>          }
>  
> +        /* Work out if the server supports base:allocation. */
> +        r = backend->can_extents (backend, conn);
> +        if (r == -1) {
> +          if (send_newstyle_option_reply (conn, option,
NBD_REP_ERR_INVALID)
> +              == -1)
> +            return -1;
> +          continue;
> +        }
> +        can_extents = r;
Hmm. For NBD_CMD_WRITE_ZEROES, we decided that in the common case,
.can_zero would default to true for plugins even when the plugin lacks
.zero (on the grounds that always advertising it to the client allows
for less network traffic, even if nbdkit has to do the fallback) (see
commits 7ac98831, bde4fc4). We then added the nozero filter to
specifically disable the advertisement as needed.

This seems like another case where we should advertise the feature by
default (even when the plugin lacks .extents) on the grounds that a
client might benefit from knowing that an image can't report sparseness,
then provide a filter comparable to nozero that fine-tunes what we
actually advertise.

But I don't think that fine-tuning affects this patch (which looks right
to me to just query .can_extents), but rather what we default to in
patch 1 for plugins (I also argued that defaulting filters to default to
false until we've audited all filters may be wise).
>  
> +static int
> +count_extents (uint64_t offset, uint64_t length, uint32_t type,
> +               void *rv)
> +{
> +  size_t *rp = rv;
> +
> +  (*rp)++;
> +  return 0;
> +}
If you take my suggestion of an END type, I think that foreach() should
still iterate over any extent marked END, but count_extents should not
count it.  Or maybe foreach gains a flag for whether the caller is
interested in visiting END extents or just stopping right away when one
is encountered (filters would be interested, while this code is probably
not).
> +static int
> +copy_extents (uint64_t offset, uint64_t length, uint32_t type,
> +              void *dv)
> +{
> +  struct copy_extents_data *data = dv;
> +  uint32_t type_flags;
> +
> +  assert (data->i < data->nr_blocks);
> +
> +  /* Because the original request is limited to a 32 bit count, length
> +   * can never be > 32 bits in size.
> +   */
> +  assert (length <= UINT32_MAX);
If we pre-populate a map with data over the client's initial range and
with END beyond the range, it is conceivable that a client will slam the
end of the file to be a data extent which will then coalesce to
something larger than 32 bits. But we can also be clever and call
nbd_extents_add(offset + UINT32_MAX, END) after calling into the client
prior to calling foreach to keep this assertion true, even if we want to
support the NBD protocol's ability for the last extent (when no REQ_ONE
flag) to advertise a length longer that the client's request.
> +  /* Calculate the number of blocks we will be returning. */
> +  *nr_blocks = 0;
> +  if (nbdkit_extents_foreach (extents_map,
> +                              count_extents, nr_blocks,
> +                              foreach_flags,
> +                              offset, (uint64_t) count) == -1)
> +      return errno;
> +  assert (!req_one || *nr_blocks == 1);
> +
You probably ought to return an error if nr_blocks is 0 (possible if a
client registered an END type immediately on offset), as the NBD spec
requires a successful call to make progress.
> +  /* Allocate the final array. */
> +  *blocks = malloc (sizeof (struct block_descriptor) * *nr_blocks);
Hmm. We probably ought to cap the number of extents we are willing to
send to avoid an over-long reply.  Since we reject NBD_CMD_READ calls
requesting more than 64M of data, we probably also ought to guarantee a
truncated reply if it would otherwise generate more than 64M of extent
information in the reply (or maybe pick a smaller limit, maybe 1M?).
Each extent block is 64 bytes, so capping at 1M blocks would be a 64M
reply; a lower cap of 64k blocks or even 1k blocks would still be
reasonable.
> +static int
> +send_structured_reply_block_status (struct connection *conn,
> +                                    uint64_t handle,
> +                                    uint16_t cmd, uint16_t flags,
> +                                    uint32_t count, uint64_t offset,
> +                                    const struct block_descriptor *blocks,
> +                                    size_t nr_blocks)
> +{
Looks good.
> @@ -402,6 +587,9 @@ protocol_recv_request_send_reply (struct connection
*conn)
> @@ -458,6 +647,13 @@ protocol_recv_request_send_reply (struct connection
*conn)
>        }
>      }
>  
> +    /* Allocate the extents map for block status only. */
> +    if (cmd == NBD_CMD_BLOCK_STATUS) {
> +      extents_map = nbdkit_extents_new ();
May need tweaking if you like my idea of having _new() take the client's
initial bounds as parameters.
> +  /* XXX There are complicated requirements for the block status
> +   * reply, such as the offset, length and number of extents returned
> +   * in the structured reply.  To allow a simple implementation for
> +   * plugins we don't make the plugins obey these requirements.  This
> +   * means at some point we need to filter what the plugin gives us to
> +   * obey the protocol requirements.  There are several places we
> +   * could do that.  Currently we do it here.  Another possibility is
> +   * to do it in server/plugins.c.
> +   *
> +   * Also note this only deals with base:allocation.  If in future we
> +   * want to describe other block status metadata then this code will
> +   * require an overhaul.
Correct observations; permitting arbitrary metadata contexts will indeed
be a much more interesting addition, if we ever decide to do so.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Eric Blake

2019-Mar-20 14:49 UTC

head link

Re: [Libguestfs] [PATCH nbdkit 4/9] server: Export nbdkit_extent* symbols.

On 3/19/19 11:35 AM, Richard W.M. Jones wrote:> Allows them to be called from plugins and filters.
> ---
>  server/nbdkit.syms | 4 ++++
>  1 file changed, 4 insertions(+)
Squash this into 1?
> 
> diff --git a/server/nbdkit.syms b/server/nbdkit.syms
> index 672abd2..95ef067 100644
> --- a/server/nbdkit.syms
> +++ b/server/nbdkit.syms
> @@ -42,6 +42,10 @@
>      nbdkit_absolute_path;
>      nbdkit_debug;
>      nbdkit_error;
> +    nbdkit_extent_add;
> +    nbdkit_extents_free;
> +    nbdkit_extents_foreach;
> +    nbdkit_extents_new;
>      nbdkit_parse_bool;
>      nbdkit_parse_size;
>      nbdkit_read_password;
> 
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Eric Blake

2019-Mar-20 14:55 UTC

head link

Re: [Libguestfs] [PATCH nbdkit 5/9] offset: Implement mapping of extents.

On 3/19/19 11:35 AM, Richard W.M. Jones wrote:> Allows you to safely use nbdkit-offset-filter on top of a plugin
> supporting extents.
> ---
>  filters/offset/offset.c | 43 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 43 insertions(+)
> 
ACK.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Reasonably Related Threads

Search for more seemingly similar threads

Libguestfs - Mar 2019 - [PATCH nbdkit 0/9] [mainly for discussion and early review] Implement extents.

[Libguestfs] [PATCH nbdkit 0/9] [mainly for discussion and early review] Implement extents.

[Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

[Libguestfs] [PATCH nbdkit 2/9] server: Add CLEANUP_EXTENTS_FREE macro.

[Libguestfs] [PATCH nbdkit 3/9] server: Implement Block Status requests to read allocation status.

[Libguestfs] [PATCH nbdkit 4/9] server: Export nbdkit_extent* symbols.

[Libguestfs] [PATCH nbdkit 5/9] offset: Implement mapping of extents.

[Libguestfs] [PATCH nbdkit 6/9] partition: Implement mapping of extents.

[Libguestfs] [PATCH nbdkit 7/9] data, memory: Implement extents.

[Libguestfs] [PATCH nbdkit 8/9] vddk: Implement extents.

[Libguestfs] [PATCH nbdkit 9/9] file: Implement extents.

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

Re: [Libguestfs] [PATCH nbdkit 2/9] server: Add CLEANUP_EXTENTS_FREE macro.

Re: [Libguestfs] [PATCH nbdkit 3/9] server: Implement Block Status requests to read allocation status.

Re: [Libguestfs] [PATCH nbdkit 4/9] server: Export nbdkit_extent* symbols.

Re: [Libguestfs] [PATCH nbdkit 5/9] offset: Implement mapping of extents.

Reasonably Related Threads