thr3ads.net - Libguestfs - Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters. [Mar 2019]

If this information is useful, please help other people find it:
Share via:

Eric Blake

2019-Mar-20 05:35 UTC

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

On 3/19/19 11:34 AM, Richard W.M. Jones wrote:> This pair of calls allows plugins to describe which extents in the
> virtual disk are allocated, holes or zeroes.
> ---
> + void nbdkit_extents_free (struct nbdkit_extents_map *);
> +
> +Frees an existing extents map.
Is this like free() where it is safe to call on NULL?
> +
> +=head3 Iterating over nbdkit_extents_map
> +
> +One function is provided to filters only to iterate over the extents
> +map:
> +
> + typedef int (*nbdkit_extents_callback) (uint64_t offset, uint64_t length,
> +                                         uint32_t type, void *opaque);
> +
> + int nbdkit_extents_foreach (
> +            const struct nbdkit_extents_map *extents_map,
> +            nbdkit_extents_callback fn, void *opaque,
> +            uint32_t flags,
> +            uint64_t range_offset, uint64_t range_length);
> +
> +=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
> +
> +If this flag is included then the C<range_offset> and
C<range_length>
> +parameters are used, so only extents overlapping
> +C<[range_offset...range_offset+range_length-1]> are returned.
Must range_offset+range_length-1 lie within the reported
next_opts->get_size(), or can range extend beyond the end of the plugin
(that is, can I pass range == -1 to get from a given offset to the end
of the file, or do I have to compute the end range to avoid internal
errors)?
> +++ b/docs/nbdkit-plugin.pod
> +The callback should detect and return the list of extents overlapping
> +the range C<[offset...offset+count-1]>.  The C<extents_map>
parameter
> +points to an opaque object which the callback should fill in by
> +calling C<nbdkit_extent_add> etc.  See L</Extents map> below.
> +
> +The C<flags> parameter may contain the flag
C<NBDKIT_FLAG_REQ_ONE>
> +which means that the client is only requesting information about the
> +extent overlapping C<offset>.  The plugin may ignore this flag, or
as
> +an optimization it may return just a single extent for C<offset>.
> +(Note that implementing this optimization correctly is difficult and
> +subtle.  You must ensure that the upper limit of the single extent
> +returned is correct.  If unsure then the safest option is to ignore
> +this flag.)
Indeed. Or in other words, since the extent map coalesces adjacent
regions with the same type, merely calling nbdkit_extent_add(0) exactly
once won't work, because that coalesces with the extent map's default of
the entire image already being reported as data; so to report a single
data extent that ends sooner than the client's requested range, it is
necessary to report a different extent starting at the next byte.
> +
> +If there is an error, C<.extents> should call C<nbdkit_error>
with an
> +error message, and C<nbdkit_set_error> to record an appropriate
error
> +(unless C<errno> is sufficient), then return C<-1>.
> +
> +=head3 Extents map
> +
> +The plugin C<extents> callback is passed an opaque pointer
C<struct
> +nbdkit_extents_map *extents_map>.  This structure represents a list of
> +L<filesystem
extents|https://en.wikipedia.org/wiki/Extent_(file_systems)>
> +describing which areas of the disk are allocated, which are sparse
> +(“holes”), and, if supported, which are zeroes.  Initially the map
> +describes a single allocated data extent covering the entire virtual
> +disk.
> +
> +The C<extents> callback has to fill in any holes or zeroes areas by
> +calling C<nbdkit_extent*> functions, to describe all extents
> +overlapping the range C<[offset...offset+count-1]>.  (nbdkit will
> +ignore extents outside this range so plugins may, if it is easier to
> +implement, return all extent information for the whole disk.)
Worth emphasizing that reporting DATA is always a safe fallback, and
that plugins should favor speed over precision?  (If the time it takes
you to report ZERO or SPARSE is not significantly faster than the time
for .pread to just read the extent in question, then reporting DATA is
the better option).

> +++ b/server/extents.c
> +/* Trivial implementation as an ordered list of extents.  We will
> + * probably have to change this to a more efficient data structure if
> + * we have plugins with lots of extents.
> + *
> + * Invariants:
> + * - extents are stored in ascending order
> + * - extents must not overlap
> + * - adjacent extents must not have the same type
> + *
Any invariant about not exceeding .get_size()?  (I guess that's not
easily possible unless nbdkit_extents_new() were to take a parameter, as
the max size may differ for a filter compared to what the filter passes
to the plugin).  I'm slightly worried about a malicious plugin that
purposefully slams us with add() calls way beyond the end of the image.
Maybe you want a magic type for marking EOF when the map is first
created, with different rules on how things coalesce?  (Then again, if
the plugin crashes because it manages memory poorly, that's not really
the end of the world; we're more worried about a client triggering bad
behavior, not a plugin).

Hmm, what if we had the signature nbdkit_extents_new(offset, length) to
bound the extents we bother to track from the user - we still allow the
user to call nbdkit_extents_add() with information outside of the bounds
we care about, but merely return 0 on those cases without allocating any
memory?  Then we are guaranteed that the guest cannot allocate more
memory than proportional to the size of the client's request (which is
currently capped at 32 bits by the protocol).

> +
> +/* Find the extent before or containing offset.  If offset is inside
> + * the extent, the extent index is returned.  If offset is not in any
> + * extent, but there is an extent with a lower offset, then the index
> + * of the nearest lower extent is returned.  If offset is before the
> + * zeroth extent then -1 is returned.
Technically, aren't gaps in the map treated as implicit extents of type
0? Then again, I guess you return the next lower extent (rather than the
gap) to make the coalescing nicer.

> +int
> +nbdkit_extent_add (struct nbdkit_extents_map *map,
> +                   uint64_t offset, uint64_t length, uint32_t type)
> +{
> +  ssize_t lo, hi;
> +  bool coalesce_below, coalesce_above;
> +  struct extent new_extent;
> +
> +  /* This might be considered an error, but be flexible for badly
> +   * written plugins.
> +   */
> +  if (length == 0)
> +    return 0;
> +
> +  /* Don't allow extents to exceeed the _signed_ 64 bit limit that is
exceed
> +   * used in the rest of nbdkit.
> +   */
> +  if (offset > INT64_MAX ||
> +      offset + length < offset ||
> +      offset + length > INT64_MAX) {
> +    nbdkit_error ("nbdkit_extent_add: "
> +                  "extent cannot exceed signed 64 bit limit: "
> +                  "offset=%" PRIu64 " length=%"
PRIu64,
> +                  offset, length);
> +    return -1;
> +  }
> +
> + again:
> +  lo = find_extent (map, offset);
> +  hi = find_extent (map, offset + length - 1);
> +
> +  /* "lo" and "hi" are -1 or they point to the extent
> +   * overlapping-or-below the corresponding endpoints of the new
> +   * extent.
> +   *
> +   * The algorithm here has two parts: Firstly we remove any
> +   * overlapping extents (this may involve erasing extents completely
> +   * or shrinking them).  Secondly we insert the new extent at the
> +   * right place in the array, if necessary allowing it to coalesce
> +   * with the extent above or below.
> +   *
> +   * To remove overlapping extents we consider 7 cases:
> +   *
> +   * A.0   lo == -1 (there is no old extent before the new extent)
> +   *     => do nothing
> +   *
> +   * A.1   ┌─────┐     offset of new extent exactly at
> +   *       │ lo  │     beginning of lo and lo smaller/equal
> +   *       └─────┘     to new extent
> +   *       ↑
> +   *       ▐▓▓▓▓▓▓▓▓▓
> +   *     => erase lo
> +   *     => creates situation A.0 or A.5
> +   *
> +   * A.2   ┌──────────┐   offset of new extent exactly at
> +   *       │ lo       │   beginning of lo and lo bigger than
> +   *       └──────────┘   new extent
> +   *       ↑
> +   *       ▐▓▓▓▓▓▓▓▓▓
> +   *     => shrink lo “upwards”
> +   *     => creates situation A.0 or A.5
> +   *
> +   * A.3   ┌──────────────┐  new extent in the middle
> +   *       │ lo           │  of lo
> +   *       └──────────────┘
> +   *         ↑
> +   *         ▓▓▓▓▓▓▓▓▓
> +   *     => split lo
> +   *     => creates situation A.5
> +   *
> +   * A.4   ┌─────┐     offset of new extent in the middle
> +   *       │ lo  │     of lo
> +   *       └─────┘
> +   *         ↑
> +   *         ▓▓▓▓▓▓▓▓▓
> +   *     => shrink lo
> +   *     => creates situation A.5
> +   *
> +   * A.5   ┌─────┐     offset of new extent after lo
> +   *       │ lo  │
> +   *       └─────┘
> +   *               ↑
> +   *               ▓▓▓▓▓▓▓▓▓
> +   *     => do nothing
> +   *
> +   * B.0   lo == hi or hi == -1 (there is no higher extent)
> +   *     => do nothing
> +   *
> +   * B.1          ┌─────┐     extent hi is fully covered
> +   *              │ hi  │     by new extent
> +   *              └─────┘
> +   *                       ↑
> +   *       ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
> +   *     => erase hi
> +   *     => creates situation B.0
> +   *
> +   * B.2          ┌─────┐     extent hi is partly overlapped
> +   *              │ hi  │     by new extent
> +   *              └─────┘
> +   *                 ↑
> +   *       ▓▓▓▓▓▓▓▓▓▓▓
> +   *     => shrink hi
> +   *     => creates situation B.0
> +   *
> +   * In addition extents [lo+1...hi-1] (if they exist) are erased
> +   * since they must be fully covered by the new extent.
> +   *
> +   * At the end we're in situation A.0/A.5 and B.0, and so no old
> +   * extents overlap the new extent.
> +   *
> +   * The insertion can then be done with optional coalescing (see code
> +   * below for details - this is a lot simpler).
> +   */
> +
> +  /* A.0 */
> +  if (lo == -1)
> +    goto erase_middle_extents;
> +  /* A.5 */
> +  if (offset >= map->extents[lo].offset +
map->extents[lo].length)
> +    goto erase_middle_extents;
> +
> +  /* A.1 or A.2 */
> +  if (offset == map->extents[lo].offset) {
> +    if (map->extents[lo].offset + map->extents[lo].length <=
offset + length) {
> +      /* A.1 */
> +      erase_extents (map, lo, 1);
> +    }
> +    else {
> +      /* A.2 */
> +      uint64_t end = map->extents[lo].offset +
map->extents[lo].length;
> +      map->extents[lo].offset = offset + length;
> +      map->extents[lo].length = end - map->extents[lo].offset;
> +    }
> +    goto again;
> +  }
> +
> +  /* A.3 */
> +  if (map->extents[lo].offset < offset &&
> +      map->extents[lo].offset + map->extents[lo].length > offset
+ length) {
> +    new_extent.offset = offset + length;
> +    new_extent.length > +      map->extents[lo].offset +
map->extents[lo].length - new_extent.offset;
> +    new_extent.type = map->extents[lo].type;
> +    map->extents[lo].length = offset - map->extents[lo].offset;
> +    if (insert_extent (map, new_extent, lo+1) == -1)
> +      return -1;
> +    goto again;
> +  }
> +
> +  /* We must be in situation A.4 assuming analysis above is exhaustive. */
> +  assert (offset > map->extents[lo].offset);
> +  assert (offset < map->extents[lo].offset +
map->extents[lo].length);
> +  assert (map->extents[lo].offset + map->extents[lo].length <=
offset + length);
> +
> +  map->extents[lo].length = offset - map->extents[lo].offset;
> +
> + erase_middle_extents:
> +  /* Erase extents [lo+1..hi-1] if they exist as they must be
> +   * completely covered by the new extent.
> +   */
> +  if (hi >= 0 && lo+1 < hi) {
> +    erase_extents (map, lo+1, hi-lo-1);
> +    goto again;
> +  }
> +
> +  /* B.0 */
> +  if (lo == hi || hi == -1)
> +    goto insert;
> +
> +  /* B.1 */
> +  if (map->extents[hi].offset + map->extents[hi].length <= offset
+ length) {
> +    erase_extents (map, hi, 1);
For worst-case guest behavior, this can result in several memmove()
calls during a single _add(). I guess the way to be more efficient is to
use a different data structure rather than an array; I also suspect that
we want to cater to the typical case of a plugin calling _add() in
ascending order (so actions at the end of the list should be fast, no
matter whether we use array or some other structure).
> +    goto again;
> +  }
> +
> +  /* We must be in situation B.2 if the analysis above is exhaustive. */
> +  assert (map->extents[hi].offset < offset + length);
> +
> +  uint64_t end = map->extents[hi].offset + map->extents[hi].length;
> +  map->extents[hi].offset = offset + length;
> +  map->extents[hi].length = end - map->extents[hi].offset;
> +
> + insert:
> +  /* Now we know there is no extent which overlaps with the new
> +   * extent, and the new extent must be inserted after lo.
> +   */
> +
> +  /* However we have to deal with possibly coalescing the new extent
> +   * with lo and/or lo+1.
> +   */
Can we also take the shortcut that if we are in a gap, AND the _add is
for type 0, we don't have to do anything (because coalescing with the
gap is still the correct behavior)?
> +  coalesce_below > +    lo >= 0 &&
> +    map->extents[lo].type == type &&
> +    map->extents[lo].offset + map->extents[lo].length == offset;
> +  coalesce_above > +    lo+1 < map->nr_extents &&
> +    map->extents[lo+1].type == type &&
> +    map->extents[lo+1].offset == offset + length;
> +
> +  if (coalesce_below && coalesce_above) {
> +    /* Coalesce lo + new extent + lo+1
> +     * => Extend lo to cover the whole range and remove lo+1.
> +     */
> +    map->extents[lo].length > +      map->extents[lo+1].offset +
map->extents[lo+1].length
> +      - map->extents[lo].offset;
> +    erase_extents (map, lo+1, 1);
> +  }
> +  else if (coalesce_below) {
> +    /* Coalesce lo + new extent.
> +     * => Extend lo to cover new extent.
> +     */
> +    map->extents[lo].length = offset + length -
map->extents[lo].offset;
> +  }
> +  else if (coalesce_above) {
> +    /* Coalesce new extent + lo+1
> +     * => Extend lo+1 to cover new extent.
> +     */
> +    map->extents[lo+1].length > +      map->extents[lo+1].offset
+ map->extents[lo+1].length - offset;
> +    map->extents[lo+1].offset = offset;
> +  }
> +  else {
> +    /* Normal case: Allocate a new extent and insert it after lo. */
> +    new_extent.offset = offset;
> +    new_extent.length = length;
> +    new_extent.type = type;
> +    if (insert_extent (map, new_extent, lo+1) == -1)
> +      return -1;
> +  }
> +
> +  return 0;
Hairy, but well-commented and appears to be correct on my late-night
reading.
> +}
> +
> +int
> +nbdkit_extents_foreach (const struct nbdkit_extents_map *map,
> +                        nbdkit_extents_callback fn, void *opaque,
> +                        uint32_t flags,
> +                        uint64_t range_offset, uint64_t range_length)
> +{
> +  const bool flag_range = flags & NBDKIT_EXTENTS_FOREACH_FLAG_RANGE;
> +  const bool flag_one = flags & NBDKIT_EXTENTS_FOREACH_FLAG_ONE;
> +  uint64_t offset, next;
> +  size_t i;
> +
> +  if (flag_range) {
> +    if (range_offset + range_length < range_offset) {
> +      nbdkit_error ("nbdkit_extents_foreach: invalid range: "
> +                    "offset=%" PRIu64 " length=%"
PRIu64,
> +                    range_offset, range_length);
> +      return -1;
> +    }
Do we still need FLAG_RANGE for this call if we instead change _new() to
take a range? (That is, if we ignore _add calls outside of the initial
range, then it is apparant that traversal will automatically be bounded
without having to request different bounds - and I'm having a hard time
seeing where a filter would want to request anything less than the full
range it passed to next_ops, other than when doing short reads with
FLAG_ONE).

The NBD protocol allows for short replies (whether or not FLAG_ONE was
requested) - should we have a fifth type that the plugin can call
_add(END) as a way of specifically marking that they did not provide
extent information beyond that point, and therefore nbdkit must do a
short reply (instead of treating the rest of the range as data)?  (For
example, in qemu with the qcow2 file, a block status operation never
reads more than one L2 cluster; requests that overlap regions of the
disk that would require reading a second L2 cluster are instead
truncated).  The END type would not be transmitted over the NBD wire,
but may make some other operations easier (such as your addressing your
comment about a plugin honoring REQ_ONE having subtle semantics if it
doesn't take coalescing of subsequent offsets into account).

> +  }
> +  else {
> +    range_offset = 0;
> +    range_length = INT64_MAX + UINT64_C(1);
> +  }
> +
> +  for (i = 0, offset = range_offset;
> +       offset < range_offset + range_length;
> +       offset = next) {
> +    next = offset;
> +
> +    if (i >= map->nr_extents) {
> +      /* Beyond last extent => Type 0 to end of range. */
> +      next = range_offset + range_length;
> +      if (fn (offset, next - offset, 0, opaque) == -1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    if (offset < map->extents[i].offset) {
> +      /* In a gap before the i'th extent => Type 0 to next offset.
*/
Did we guarantee that type 0 always appears as a gap, or is it sometimes
a gap and sometimes an actual extent entry?
> +      next = MIN (map->extents[i].offset, range_offset + range_length);
> +      if (fn (offset, next - offset, 0, opaque) == -1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    if (offset < map->extents[i].offset + map->extents[i].length)
{
> +      /* In an extent. */
> +      next = MIN (map->extents[i].offset + map->extents[i].length,
> +                  range_offset + range_length);
> +      if (fn (offset, next - offset, map->extents[i].type, opaque) ==
-1)
> +        return -1;
> +      if (flag_one)
> +        return 0;
> +      continue;
> +    }
> +
> +    /* After an extent, increment i and iterate. */
> +    i++;
> +  }
> +
> +  return 0;
> +}
> +
> +#ifdef IN_TEST_EXTENTS
That's as far as I got today.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Richard W.M. Jones

2019-Mar-20 11:50 UTC

head link

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

On Wed, Mar 20, 2019 at 12:35:33AM -0500, Eric Blake
wrote:> On 3/19/19 11:34 AM, Richard W.M. Jones wrote:
> > This pair of calls allows plugins to describe which extents in the
> > virtual disk are allocated, holes or zeroes.
> > ---
> 
> 
> > + void nbdkit_extents_free (struct nbdkit_extents_map *);
> > +
> > +Frees an existing extents map.
> 
> Is this like free() where it is safe to call on NULL?
Yes - otherwise CLEANUP_EXTENTS_FREE wouldn't work.  Do we need to
document this?
> > +=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
> > +
> > +If this flag is included then the C<range_offset> and
C<range_length>
> > +parameters are used, so only extents overlapping
> > +C<[range_offset...range_offset+range_length-1]> are returned.
> 
> Must range_offset+range_length-1 lie within the reported
> next_opts->get_size(), or can range extend beyond the end of the plugin
> (that is, can I pass range == -1 to get from a given offset to the end
> of the file, or do I have to compute the end range to avoid internal
> errors)?
Extent maps themselves are independent of the size of the plugin, and
the foreach function will let you put any range there you want.

However the server won't accept any NBD_CMD_BLOCK_STATUS request with
a range beyond the end of the export size (valid_range in
server/protocol.c), and so the core server code should not use a range
which is outside the file (block_status_final_map in the same file).
>From the point of view of filters that might call the foreachfunction, they shouldn't choose a range which is outside the
offset/count passed to the layer below, since the layer below is only
guaranteed to set the extents correctly for that range (except in the
case of REQ_ONE where the rules are even more complicated
unfortunately).
> > +++ b/docs/nbdkit-plugin.pod
> 
> > +The callback should detect and return the list of extents overlapping
> > +the range C<[offset...offset+count-1]>.  The
C<extents_map> parameter
> > +points to an opaque object which the callback should fill in by
> > +calling C<nbdkit_extent_add> etc.  See L</Extents map>
below.
> > +
> > +The C<flags> parameter may contain the flag
C<NBDKIT_FLAG_REQ_ONE>
> > +which means that the client is only requesting information about the
> > +extent overlapping C<offset>.  The plugin may ignore this flag,
or as
> > +an optimization it may return just a single extent for
C<offset>.
> > +(Note that implementing this optimization correctly is difficult and
> > +subtle.  You must ensure that the upper limit of the single extent
> > +returned is correct.  If unsure then the safest option is to ignore
> > +this flag.)
> 
> Indeed. Or in other words, since the extent map coalesces adjacent
> regions with the same type, merely calling nbdkit_extent_add(0) exactly
> once won't work, because that coalesces with the extent map's
default of
> the entire image already being reported as data; so to report a single
> data extent that ends sooner than the client's requested range, it is
> necessary to report a different extent starting at the next byte.
Unfortunately the current implementation doesn't coalesce type = 0
with the "background" extent ...  It probably should do to reduce
surprises [see below].

As you say though, rules around REQ_ONE are very complex - I
implemented it for the file plugin, but my initial implementation for
the VDDK plugin was incorrect and I removed it.  Unfortunately because
VDDK QueryAllocatedBlocks is so slow we will need to implement this
optimization at some point and try to get it right.
> > +If there is an error, C<.extents> should call
C<nbdkit_error> with an
> > +error message, and C<nbdkit_set_error> to record an appropriate
error
> > +(unless C<errno> is sufficient), then return C<-1>.
> > +
> > +=head3 Extents map
> > +
> > +The plugin C<extents> callback is passed an opaque pointer
C<struct
> > +nbdkit_extents_map *extents_map>.  This structure represents a
list of
> > +L<filesystem
extents|https://en.wikipedia.org/wiki/Extent_(file_systems)>
> > +describing which areas of the disk are allocated, which are sparse
> > +(“holes”), and, if supported, which are zeroes.  Initially the map
> > +describes a single allocated data extent covering the entire virtual
> > +disk.
> > +
> > +The C<extents> callback has to fill in any holes or zeroes
areas by
> > +calling C<nbdkit_extent*> functions, to describe all extents
> > +overlapping the range C<[offset...offset+count-1]>.  (nbdkit
will
> > +ignore extents outside this range so plugins may, if it is easier to
> > +implement, return all extent information for the whole disk.)
> 
> Worth emphasizing that reporting DATA is always a safe fallback, and
> that plugins should favor speed over precision?  (If the time it takes
> you to report ZERO or SPARSE is not significantly faster than the time
> for .pread to just read the extent in question, then reporting DATA is
> the better option).
Yes I'll add this observation.
> > +++ b/server/extents.c
> 
> > +/* Trivial implementation as an ordered list of extents.  We will
> > + * probably have to change this to a more efficient data structure if
> > + * we have plugins with lots of extents.
> > + *
> > + * Invariants:
> > + * - extents are stored in ascending order
> > + * - extents must not overlap
> > + * - adjacent extents must not have the same type
> > + *
> 
> Any invariant about not exceeding .get_size()?  (I guess that's not
> easily possible unless nbdkit_extents_new() were to take a parameter, as
> the max size may differ for a filter compared to what the filter passes
> to the plugin).  I'm slightly worried about a malicious plugin that
> purposefully slams us with add() calls way beyond the end of the image.
> Maybe you want a magic type for marking EOF when the map is first
> created, with different rules on how things coalesce?  (Then again, if
> the plugin crashes because it manages memory poorly, that's not really
> the end of the world; we're more worried about a client triggering bad
> behavior, not a plugin).
> 
> Hmm, what if we had the signature nbdkit_extents_new(offset, length) to
> bound the extents we bother to track from the user - we still allow the
> user to call nbdkit_extents_add() with information outside of the bounds
> we care about, but merely return 0 on those cases without allocating any
> memory?  Then we are guaranteed that the guest cannot allocate more
> memory than proportional to the size of the client's request (which is
> currently capped at 32 bits by the protocol).
It sounds like it would make the implementation more complex ...
> > +/* Find the extent before or containing offset.  If offset is inside
> > + * the extent, the extent index is returned.  If offset is not in any
> > + * extent, but there is an extent with a lower offset, then the index
> > + * of the nearest lower extent is returned.  If offset is before the
> > + * zeroth extent then -1 is returned.
> 
> Technically, aren't gaps in the map treated as implicit extents of type
> 0? Then again, I guess you return the next lower extent (rather than the
> gap) to make the coalescing nicer.
... but another possible way to implement the structure would be to
have an explicit extent covering every part of the range (either
[0 to INT64_MAX], or [offset to offset+length-1]).  This would be a
bit like the common/range struct.

I'll see if that makes the implementation easier.
> For worst-case guest behavior, this can result in several memmove()
> calls during a single _add(). I guess the way to be more efficient is to
> use a different data structure rather than an array; I also suspect that
> we want to cater to the typical case of a plugin calling _add() in
> ascending order (so actions at the end of the list should be fast, no
> matter whether we use array or some other structure).
I believe the structure we really want for efficiency is:

https://en.wikipedia.org/wiki/Interval_tree

I tried implementing one of these a while back and it's very complex.

I've cut some other comments about the structure because I want to see
if implementing this using an array which always contains extents
covering the full range is simpler.

[...]
> The NBD protocol allows for short replies (whether or not FLAG_ONE was
> requested) - should we have a fifth type that the plugin can call
> _add(END) as a way of specifically marking that they did not provide
> extent information beyond that point, and therefore nbdkit must do a
> short reply (instead of treating the rest of the range as data)?  (For
> example, in qemu with the qcow2 file, a block status operation never
> reads more than one L2 cluster; requests that overlap regions of the
> disk that would require reading a second L2 cluster are instead
> truncated).  The END type would not be transmitted over the NBD wire,
> but may make some other operations easier (such as your addressing your
> comment about a plugin honoring REQ_ONE having subtle semantics if it
> doesn't take coalescing of subsequent offsets into account).
OK I didn't know this about the protocol but will see if it can be
implemented.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

Eric Blake

2019-Mar-20 12:36 UTC

head link

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

On 3/20/19 6:50 AM, Richard W.M. Jones wrote:> On Wed, Mar 20, 2019 at 12:35:33AM -0500, Eric Blake wrote:
>> On 3/19/19 11:34 AM, Richard W.M. Jones wrote:
>>> This pair of calls allows plugins to describe which extents in the
>>> virtual disk are allocated, holes or zeroes.
>>> ---
>>
>>
>>> + void nbdkit_extents_free (struct nbdkit_extents_map *);
>>> +
>>> +Frees an existing extents map.
>>
>> Is this like free() where it is safe to call on NULL?
> 
> Yes - otherwise CLEANUP_EXTENTS_FREE wouldn't work.  Do we need to
> document this?
Documenting it makes it obvious that this is valid:

myextents (...) {
  nbdkit_extents_map *map2 = NULL;
  if (allocate_something_else() < 0)
    goto cleanup;
  map2 = nbdkit_extents_new ();
  ...
 cleanup:
  nbdkit_extents_free (map2);

rather than requiring an 'if (map2)' in the cleanup code.  On the other
hand, only filters (and our core code) call it, so maybe just a comment
in the source code is sufficient (we don't have to worry about
out-of-tree filters the way we do about out-of-tree plugins).
> 
>>> +=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
>>> +
>>> +If this flag is included then the C<range_offset> and
C<range_length>
>>> +parameters are used, so only extents overlapping
>>> +C<[range_offset...range_offset+range_length-1]> are
returned.
>>
>> Must range_offset+range_length-1 lie within the reported
>> next_opts->get_size(), or can range extend beyond the end of the
plugin
>> (that is, can I pass range == -1 to get from a given offset to the end
>> of the file, or do I have to compute the end range to avoid internal
>> errors)?
> 
> Extent maps themselves are independent of the size of the plugin, and
> the foreach function will let you put any range there you want.
> 
> However the server won't accept any NBD_CMD_BLOCK_STATUS request with
> a range beyond the end of the export size (valid_range in
> server/protocol.c), and so the core server code should not use a range
> which is outside the file (block_status_final_map in the same file).
> 
>>From the point of view of filters that might call the foreach
> function, they shouldn't choose a range which is outside the
> offset/count passed to the layer below, since the layer below is only
> guaranteed to set the extents correctly for that range (except in the
> case of REQ_ONE where the rules are even more complicated
> unfortunately).
Or even only the initial subset of the range.
> 
>>> +++ b/docs/nbdkit-plugin.pod
>>
>>> +The callback should detect and return the list of extents
overlapping
>>> +the range C<[offset...offset+count-1]>.  The
C<extents_map> parameter
>>> +points to an opaque object which the callback should fill in by
>>> +calling C<nbdkit_extent_add> etc.  See L</Extents map>
below.
>>> +
>>> +The C<flags> parameter may contain the flag
C<NBDKIT_FLAG_REQ_ONE>
>>> +which means that the client is only requesting information about
the
>>> +extent overlapping C<offset>.  The plugin may ignore this
flag, or as
>>> +an optimization it may return just a single extent for
C<offset>.
>>> +(Note that implementing this optimization correctly is difficult
and
>>> +subtle.  You must ensure that the upper limit of the single extent
>>> +returned is correct.  If unsure then the safest option is to
ignore
>>> +this flag.)
>>
>> Indeed. Or in other words, since the extent map coalesces adjacent
>> regions with the same type, merely calling nbdkit_extent_add(0) exactly
>> once won't work, because that coalesces with the extent map's
default of
>> the entire image already being reported as data; so to report a single
>> data extent that ends sooner than the client's requested range, it
is
>> necessary to report a different extent starting at the next byte.
> 
> Unfortunately the current implementation doesn't coalesce type = 0
> with the "background" extent ...  It probably should do to reduce
> surprises [see below].
> 
> As you say though, rules around REQ_ONE are very complex - I
> implemented it for the file plugin, but my initial implementation for
> the VDDK plugin was incorrect and I removed it.  Unfortunately because
> VDDK QueryAllocatedBlocks is so slow we will need to implement this
> optimization at some point and try to get it right.
The more I think about it, the more I like a fifth type END as an
explicit way to guarantee an extent beyond which the plugin did not
provide information, and that nbdkit should just do short replies when
encountering END. (The NBD protocol requires that NBD_CMD_BLOCK_STATUS
makes progress on success - it must either fail with an error, or report
at least one extent successfully - so if a plugin creates an END extent
directly on the requested offset, nbdkit has to treat that as an error
return back to the client; all other offsets for END are merely an early
termination hint).

>> Technically, aren't gaps in the map treated as implicit extents of
type
>> 0? Then again, I guess you return the next lower extent (rather than
the
>> gap) to make the coalescing nicer.
> 
> ... but another possible way to implement the structure would be to
> have an explicit extent covering every part of the range (either
> [0 to INT64_MAX], or [offset to offset+length-1]).  This would be a
> bit like the common/range struct.
> 
> I'll see if that makes the implementation easier.
In other words, if nbdkit_extents_new() prepopulates an extent, then
there are no gaps to worry about, and you are always guaranteed to find
an existing extent that coincides with the offset given to
nbdkit_extents_add().
> 
>> For worst-case guest behavior, this can result in several memmove()
>> calls during a single _add(). I guess the way to be more efficient is
to
>> use a different data structure rather than an array; I also suspect
that
>> we want to cater to the typical case of a plugin calling _add() in
>> ascending order (so actions at the end of the list should be fast, no
>> matter whether we use array or some other structure).
> 
> I believe the structure we really want for efficiency is:
> 
> https://en.wikipedia.org/wiki/Interval_tree
> 
> I tried implementing one of these a while back and it's very complex.
A full-fledged interval tree allows overlapping intervals; you at least
have the benefit that due to splitting/coalescing you always have a
linear list with no overlap. Still, a balanced binary heap with O(log n)
performance is probably better than an array with O(n) performance, if n
can get large. But as I said earlier, it's okay to get the interface
right with a slower (but easier) implementation, then later switch the
implementation without breaking the interface for speedups.

>> The NBD protocol allows for short replies (whether or not FLAG_ONE was
>> requested) - should we have a fifth type that the plugin can call
>> _add(END) as a way of specifically marking that they did not provide
>> extent information beyond that point, and therefore nbdkit must do a
>> short reply (instead of treating the rest of the range as data)?  (For
>> example, in qemu with the qcow2 file, a block status operation never
>> reads more than one L2 cluster; requests that overlap regions of the
>> disk that would require reading a second L2 cluster are instead
>> truncated).  The END type would not be transmitted over the NBD wire,
>> but may make some other operations easier (such as your addressing your
>> comment about a plugin honoring REQ_ONE having subtle semantics if it
>> doesn't take coalescing of subsequent offsets into account).
> 
> OK I didn't know this about the protocol but will see if it can be
> implemented.
The protocol also allows returning information beyond the end if it does
not create any more extents, but I think that part of the protocol is
harder to give as a contract to plugins; maybe you do it by
pre-allocating the map with two extents: the default for the range, and
an unlimited END extent immediately beyond the range, so that it is
obvious if the user overwrite END to provide more data.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Maybe Matching Threads

Search for more seemingly similar threads

Libguestfs - Mar 2019 - Re: [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

Maybe Matching Threads