thr3ads.net - Libguestfs - [Libguestfs] Supporting sparse disks in nbdkit [Mar 2019]

If this information is useful, please help other people find it:
Share via:

Richard W.M. Jones

2019-Mar-08 13:48 UTC

[Libguestfs] Supporting sparse disks in nbdkit

I've posted a couple of patches towards the ultimate goal of
implementing NBD_CMD_BLOCK_STATUS / base:allocation in nbdkit.  Before
I can do the final patch I think we need to discuss how this would be
exposed to plugins since at the end of the day they need to implement
the feature.

Background reading:
 - preparatory patches:
     https://www.redhat.com/archives/libguestfs/2019-March/msg00013.html
     https://www.redhat.com/archives/libguestfs/2019-March/msg00016.html
 - NBD protocol, see in particular NBD_CMD_BLOCK_STATUS and
     NBD_REPLY_TYPE_BLOCK_STATUS:
     https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md

I think we shouldn't modify the pread() callback.  If we decide to
implement Structured Replies properly at some point in the future we
might need to do that, but it's not necessary now.

We could introduce a new call ‘extents’ to return the list of extents.
I believe it would look like this:

  struct nbdkit_extent {
    uint64_t offset;
    uint32_t length;  // XXX is 32 bit right here?
    uint32_t flag;    // hole, zero, data ... more in future?
  };

  int extents (void *handle, uint32_t count, uint64_t offset,
               uint32_t flags /* always 0? */,
               size_t *nr_extents, struct nbdkit_extent *extents);

The function is meant to scan [offset, offset+count-1] and return a
list of all extents overlapping this range, and their status (hole,
zero, data).

To make writing plugins easier we could say that extents don't need to
be returned in order, and may include extents which don't actually
overlap the requested range.  Also missing regions would mean "hole"
(makes writing the VDDK plugin easier), and adjacent extents of the
same type would be coalesced automatically.  But it's an error if
returned extents overlap each other.

nbdkit would need to do some massaging on this to get it into the
right format for NBD_CMD_BLOCK_STATUS.  (I'm very confused about what
NBD_CMD_FLAG_REQ_ONE is supposed to do.)

We will also need a corresponding ‘can_extents’, which is analogous to
‘can_write’ etc and is what would control the output of
NBD_OPT_{SET,LIST}_META_CONTEXT.

For nbdkit-file-filter:

- Fairly simple implementation using SEEK_HOLE/SEEK_DATA.

- Not sure how we detect zeroes without reading the file.

For nbdkit-memory-plugin:

 - Pretty simple implementation, which can even detect non-hole zeroes.

For VDDK:

 - VixDiskLib_QueryAllocatedBlocks can return allocated blocks, but
   doesn't return holes separately (they are assumed from what is
   omitted from the list).  No support for detecting zeroes that I can
   see.

Some existing filters would have to be modified to correctly adjust
‘extents’ offsets:

 - nbdkit-offset-filter

 - nbdkit-partition-filter

 - nbdkit-truncate-filter (? maybe not)

 - nbdkit-xz-filter is complicated: XZ files support sparseness so in
   theory we should try to return this data

Your thoughts on this appreciated,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

Eric Blake

2019-Mar-08 16:50 UTC

head link

Re: [Libguestfs] Supporting sparse disks in nbdkit

On 3/8/19 7:48 AM, Richard W.M. Jones wrote:> I've posted a couple of patches towards the ultimate goal of
> implementing NBD_CMD_BLOCK_STATUS / base:allocation in nbdkit.  Before
> I can do the final patch I think we need to discuss how this would be
> exposed to plugins since at the end of the day they need to implement
> the feature.
Quick thoughts for now; I'll probably have to spend more time on this
over the weekend and reply with more details or ideas later.
> 
> Background reading:
>  - preparatory patches:
>      https://www.redhat.com/archives/libguestfs/2019-March/msg00013.html
>      https://www.redhat.com/archives/libguestfs/2019-March/msg00016.html
>  - NBD protocol, see in particular NBD_CMD_BLOCK_STATUS and
>      NBD_REPLY_TYPE_BLOCK_STATUS:
>      https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md
> 
> I think we shouldn't modify the pread() callback.  If we decide to
> implement Structured Replies properly at some point in the future we
> might need to do that, but it's not necessary now.
Agreed; pread should stay for simple, and a plugin/filter that supports
structured reads should have a new set of callbacks. A while ago, we had
a thread about split callbacks for better parallelism - where nbdkit
calls the plugin to start a request (which returns control to nbdkit
immediately), then the plugin calls a callback to complete the request.
That may be worth adding now.
> 
> We could introduce a new call ‘extents’ to return the list of extents.
> I believe it would look like this:
> 
>   struct nbdkit_extent {
>     uint64_t offset;
>     uint32_t length;  // XXX is 32 bit right here?
For now, the NBD spec only supports 32-bit lengths on an extent;
although there has been talk of how we would extend the protocol to
allow the client to make 64-bit length requests (not for read or write,
but definitely useful for flush, write zeros, block status); at which
point, a 64-bit block status returning a 64-bit extent length makes
sense. Perhaps we should require clients to use 64-bit lengths from the
outset, in anticipation of NBD spec improvements? (but then nbdkit has
to post-process those values down to a 32-bit result when talking to a
client that did not negotiate 64-bit commands)
>     uint32_t flag;    // hole, zero, data ... more in future?
>   };
> 
>   int extents (void *handle, uint32_t count, uint64_t offset,
>                uint32_t flags /* always 0? */,
Unless we support the REQ_ONE flag (where the client specifically wants
only one extent returned).
>                size_t *nr_extents, struct nbdkit_extent *extents);
> 
> The function is meant to scan [offset, offset+count-1] and return a
> list of all extents overlapping this range, and their status (hole,
> zero, data).
> 
> To make writing plugins easier we could say that extents don't need to
> be returned in order, and may include extents which don't actually
> overlap the requested range.  Also missing regions would mean
"hole"
> (makes writing the VDDK plugin easier), and adjacent extents of the
> same type would be coalesced automatically.  But it's an error if
> returned extents overlap each other.
I'd go one step further - missing regions prior to a present region mean
"hole"; missing regions after the last present region mean "no
data
available, don't advertise status to the client".  This is because the
NBD spec intentionally tries to allow both:

- a server has more information available cheaply, allow it to return
beyond the end so that a smart client can reduce the number of
subsequent queries needed (a dumb client will ignore the extra data and
probably query on the tail, but that doesn't hurt the server); but to
avoid too much client confusion, any extra info reported has to be
limited to a single extent, and coalesced with an extent that overlaps
the guest's actual request

- a server cannot cheaply obtain information over the entire length
requested by the client, so it returns just an initial answer over the
head, and the client has to ask again to find out about the tail

The NBD spec also took care to require a successful return to make
progress, so a client that wants to iterate over the entire export can
write a sane loop (even if the client has to coalesce identical status
in adjacent returns, it at least doesn't have to worry about getting
stuck on an inf-loop for a 0-length return).
> 
> nbdkit would need to do some massaging on this to get it into the
> right format for NBD_CMD_BLOCK_STATUS.  (I'm very confused about what
> NBD_CMD_FLAG_REQ_ONE is supposed to do.)
REQ_ONE says that a successful answer has to be exactly one extent,
covering only the head of the region. (That is, force the server to
behave as if obtaining information beyond the first extent is
expensive). It also tells the server that it must not report extra
information beyond the initial request (because at least qemu 3.0 would
assert that the server supplied too much information).

You're also right that nbdkit can do REQ_ONE massaging, so whether we
pass REQ_ONE on to the client or not makes it sound like the client
needs a tri-state opt-in (no block status support, block status but let
nbdkit handle REQ_ONE, block status and the plugin can handle REQ_ONE
efficiently itself)
> 
> We will also need a corresponding ‘can_extents’, which is analogous to
> ‘can_write’ etc and is what would control the output of
> NBD_OPT_{SET,LIST}_META_CONTEXT.
Yep, and I think it needs to be tri-state, like can_fua, as argued above.
> 
> For nbdkit-file-filter:
file-plugin ?
> 
> - Fairly simple implementation using SEEK_HOLE/SEEK_DATA.
> 
> - Not sure how we detect zeroes without reading the file.
Not detecting allocated zeroes is okay; reporting the holes as zeroes is
safe.

Or take a leaf from the existing nozero-filter, where we default to
maximum speed (no detection of allocated zeroes), but where a filter can
be added that does read-based zero detection (slower, but potentially
reports more zero blocks).
> 
> For nbdkit-memory-plugin:
> 
>  - Pretty simple implementation, which can even detect non-hole zeroes.
> 
> For VDDK:
> 
>  - VixDiskLib_QueryAllocatedBlocks can return allocated blocks, but
>    doesn't return holes separately (they are assumed from what is
>    omitted from the list).  No support for detecting zeroes that I can
>    see.
> 
> Some existing filters would have to be modified to correctly adjust
> ‘extents’ offsets:
Yeah, we'll have to think about every single filter, and whether it can
even allow .can_extents, or must override the plugin.
> 
>  - nbdkit-offset-filter
> 
>  - nbdkit-partition-filter
> 
>  - nbdkit-truncate-filter (? maybe not)
There's also an obvious use of the truncate-filter that treats the
underlying plugin as fully allocated (if the plugin does not
.can_extents), but where the truncated tail reports as sparse/zero.
> 
>  - nbdkit-xz-filter is complicated: XZ files support sparseness so in
>    theory we should try to return this data
cache and cow filters need some thought.

error needs a plan for whether we can inject errors on block_status
requests, as well as consideration for whether it munges plugin extents
based on current error policy.

log needs a patch to log the new callback.
> 
> Your thoughts on this appreciated,
> 
Looking forward to seeing what you come up with, but it sounds like you
are on track for supporting NBD_CMD_BLOCK_STATUS.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org

Richard W.M. Jones

2019-Mar-11 19:44 UTC

head link

Re: [Libguestfs] Supporting sparse disks in nbdkit

On Fri, Mar 08, 2019 at 10:50:35AM -0600, Eric Blake
wrote:> >   int extents (void *handle, uint32_t count, uint64_t offset,
> >                uint32_t flags /* always 0? */,
> 
> Unless we support the REQ_ONE flag (where the client specifically wants
> only one extent returned).
...> > nbdkit would need to do some massaging on this to get it into the
> > right format for NBD_CMD_BLOCK_STATUS.  (I'm very confused about
what
> > NBD_CMD_FLAG_REQ_ONE is supposed to do.)
> 
> REQ_ONE says that a successful answer has to be exactly one extent,
> covering only the head of the region. (That is, force the server to
> behave as if obtaining information beyond the first extent is
> expensive). It also tells the server that it must not report extra
> information beyond the initial request (because at least qemu 3.0 would
> assert that the server supplied too much information).
> 
> You're also right that nbdkit can do REQ_ONE massaging, so whether we
> pass REQ_ONE on to the client or not makes it sound like the client
> needs a tri-state opt-in (no block status support, block status but let
> nbdkit handle REQ_ONE, block status and the plugin can handle REQ_ONE
> efficiently itself)
I had a think about the REQ_ONE flag a little.  What do you think
about this plan?

If NBD_CMD_FLAG_REQ_ONE is passed by the nbd client, nbdkit does not
need to pass it to the plugin.  Instead nbdkit would turn the request
into:

extents (count = 1, offset = <offset>)

The plugin could still respond with multiple extents, but nbdkit would
pick only the one covering the <offset> to send back.

In this case can_extents would return just a boolean since there is no
special supported needed by the plugin AFAICT.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/

Seemingly Similar Threads

Search for more seemingly similar threads

Libguestfs - Mar 2019 - Supporting sparse disks in nbdkit

[Libguestfs] Supporting sparse disks in nbdkit

Re: [Libguestfs] Supporting sparse disks in nbdkit

Re: [Libguestfs] Supporting sparse disks in nbdkit

Seemingly Similar Threads