I've posted a couple of patches towards the ultimate goal of implementing NBD_CMD_BLOCK_STATUS / base:allocation in nbdkit. Before I can do the final patch I think we need to discuss how this would be exposed to plugins since at the end of the day they need to implement the feature. Background reading: - preparatory patches: https://www.redhat.com/archives/libguestfs/2019-March/msg00013.html https://www.redhat.com/archives/libguestfs/2019-March/msg00016.html - NBD protocol, see in particular NBD_CMD_BLOCK_STATUS and NBD_REPLY_TYPE_BLOCK_STATUS: https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md I think we shouldn't modify the pread() callback. If we decide to implement Structured Replies properly at some point in the future we might need to do that, but it's not necessary now. We could introduce a new call ‘extents’ to return the list of extents. I believe it would look like this: struct nbdkit_extent { uint64_t offset; uint32_t length; // XXX is 32 bit right here? uint32_t flag; // hole, zero, data ... more in future? }; int extents (void *handle, uint32_t count, uint64_t offset, uint32_t flags /* always 0? */, size_t *nr_extents, struct nbdkit_extent *extents); The function is meant to scan [offset, offset+count-1] and return a list of all extents overlapping this range, and their status (hole, zero, data). To make writing plugins easier we could say that extents don't need to be returned in order, and may include extents which don't actually overlap the requested range. Also missing regions would mean "hole" (makes writing the VDDK plugin easier), and adjacent extents of the same type would be coalesced automatically. But it's an error if returned extents overlap each other. nbdkit would need to do some massaging on this to get it into the right format for NBD_CMD_BLOCK_STATUS. (I'm very confused about what NBD_CMD_FLAG_REQ_ONE is supposed to do.) We will also need a corresponding ‘can_extents’, which is analogous to ‘can_write’ etc and is what would control the output of NBD_OPT_{SET,LIST}_META_CONTEXT. For nbdkit-file-filter: - Fairly simple implementation using SEEK_HOLE/SEEK_DATA. - Not sure how we detect zeroes without reading the file. For nbdkit-memory-plugin: - Pretty simple implementation, which can even detect non-hole zeroes. For VDDK: - VixDiskLib_QueryAllocatedBlocks can return allocated blocks, but doesn't return holes separately (they are assumed from what is omitted from the list). No support for detecting zeroes that I can see. Some existing filters would have to be modified to correctly adjust ‘extents’ offsets: - nbdkit-offset-filter - nbdkit-partition-filter - nbdkit-truncate-filter (? maybe not) - nbdkit-xz-filter is complicated: XZ files support sparseness so in theory we should try to return this data Your thoughts on this appreciated, Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
On 3/8/19 7:48 AM, Richard W.M. Jones wrote:> I've posted a couple of patches towards the ultimate goal of > implementing NBD_CMD_BLOCK_STATUS / base:allocation in nbdkit. Before > I can do the final patch I think we need to discuss how this would be > exposed to plugins since at the end of the day they need to implement > the feature.Quick thoughts for now; I'll probably have to spend more time on this over the weekend and reply with more details or ideas later.> > Background reading: > - preparatory patches: > https://www.redhat.com/archives/libguestfs/2019-March/msg00013.html > https://www.redhat.com/archives/libguestfs/2019-March/msg00016.html > - NBD protocol, see in particular NBD_CMD_BLOCK_STATUS and > NBD_REPLY_TYPE_BLOCK_STATUS: > https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md > > I think we shouldn't modify the pread() callback. If we decide to > implement Structured Replies properly at some point in the future we > might need to do that, but it's not necessary now.Agreed; pread should stay for simple, and a plugin/filter that supports structured reads should have a new set of callbacks. A while ago, we had a thread about split callbacks for better parallelism - where nbdkit calls the plugin to start a request (which returns control to nbdkit immediately), then the plugin calls a callback to complete the request. That may be worth adding now.> > We could introduce a new call ‘extents’ to return the list of extents. > I believe it would look like this: > > struct nbdkit_extent { > uint64_t offset; > uint32_t length; // XXX is 32 bit right here?For now, the NBD spec only supports 32-bit lengths on an extent; although there has been talk of how we would extend the protocol to allow the client to make 64-bit length requests (not for read or write, but definitely useful for flush, write zeros, block status); at which point, a 64-bit block status returning a 64-bit extent length makes sense. Perhaps we should require clients to use 64-bit lengths from the outset, in anticipation of NBD spec improvements? (but then nbdkit has to post-process those values down to a 32-bit result when talking to a client that did not negotiate 64-bit commands)> uint32_t flag; // hole, zero, data ... more in future? > }; > > int extents (void *handle, uint32_t count, uint64_t offset, > uint32_t flags /* always 0? */,Unless we support the REQ_ONE flag (where the client specifically wants only one extent returned).> size_t *nr_extents, struct nbdkit_extent *extents); > > The function is meant to scan [offset, offset+count-1] and return a > list of all extents overlapping this range, and their status (hole, > zero, data). > > To make writing plugins easier we could say that extents don't need to > be returned in order, and may include extents which don't actually > overlap the requested range. Also missing regions would mean "hole" > (makes writing the VDDK plugin easier), and adjacent extents of the > same type would be coalesced automatically. But it's an error if > returned extents overlap each other.I'd go one step further - missing regions prior to a present region mean "hole"; missing regions after the last present region mean "no data available, don't advertise status to the client". This is because the NBD spec intentionally tries to allow both: - a server has more information available cheaply, allow it to return beyond the end so that a smart client can reduce the number of subsequent queries needed (a dumb client will ignore the extra data and probably query on the tail, but that doesn't hurt the server); but to avoid too much client confusion, any extra info reported has to be limited to a single extent, and coalesced with an extent that overlaps the guest's actual request - a server cannot cheaply obtain information over the entire length requested by the client, so it returns just an initial answer over the head, and the client has to ask again to find out about the tail The NBD spec also took care to require a successful return to make progress, so a client that wants to iterate over the entire export can write a sane loop (even if the client has to coalesce identical status in adjacent returns, it at least doesn't have to worry about getting stuck on an inf-loop for a 0-length return).> > nbdkit would need to do some massaging on this to get it into the > right format for NBD_CMD_BLOCK_STATUS. (I'm very confused about what > NBD_CMD_FLAG_REQ_ONE is supposed to do.)REQ_ONE says that a successful answer has to be exactly one extent, covering only the head of the region. (That is, force the server to behave as if obtaining information beyond the first extent is expensive). It also tells the server that it must not report extra information beyond the initial request (because at least qemu 3.0 would assert that the server supplied too much information). You're also right that nbdkit can do REQ_ONE massaging, so whether we pass REQ_ONE on to the client or not makes it sound like the client needs a tri-state opt-in (no block status support, block status but let nbdkit handle REQ_ONE, block status and the plugin can handle REQ_ONE efficiently itself)> > We will also need a corresponding ‘can_extents’, which is analogous to > ‘can_write’ etc and is what would control the output of > NBD_OPT_{SET,LIST}_META_CONTEXT.Yep, and I think it needs to be tri-state, like can_fua, as argued above.> > For nbdkit-file-filter:file-plugin ?> > - Fairly simple implementation using SEEK_HOLE/SEEK_DATA. > > - Not sure how we detect zeroes without reading the file.Not detecting allocated zeroes is okay; reporting the holes as zeroes is safe. Or take a leaf from the existing nozero-filter, where we default to maximum speed (no detection of allocated zeroes), but where a filter can be added that does read-based zero detection (slower, but potentially reports more zero blocks).> > For nbdkit-memory-plugin: > > - Pretty simple implementation, which can even detect non-hole zeroes. > > For VDDK: > > - VixDiskLib_QueryAllocatedBlocks can return allocated blocks, but > doesn't return holes separately (they are assumed from what is > omitted from the list). No support for detecting zeroes that I can > see. > > Some existing filters would have to be modified to correctly adjust > ‘extents’ offsets:Yeah, we'll have to think about every single filter, and whether it can even allow .can_extents, or must override the plugin.> > - nbdkit-offset-filter > > - nbdkit-partition-filter > > - nbdkit-truncate-filter (? maybe not)There's also an obvious use of the truncate-filter that treats the underlying plugin as fully allocated (if the plugin does not .can_extents), but where the truncated tail reports as sparse/zero.> > - nbdkit-xz-filter is complicated: XZ files support sparseness so in > theory we should try to return this datacache and cow filters need some thought. error needs a plan for whether we can inject errors on block_status requests, as well as consideration for whether it munges plugin extents based on current error policy. log needs a patch to log the new callback.> > Your thoughts on this appreciated, >Looking forward to seeing what you come up with, but it sounds like you are on track for supporting NBD_CMD_BLOCK_STATUS. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
Richard W.M. Jones
2019-Mar-11 19:44 UTC
Re: [Libguestfs] Supporting sparse disks in nbdkit
On Fri, Mar 08, 2019 at 10:50:35AM -0600, Eric Blake wrote:> > int extents (void *handle, uint32_t count, uint64_t offset, > > uint32_t flags /* always 0? */, > > Unless we support the REQ_ONE flag (where the client specifically wants > only one extent returned)....> > nbdkit would need to do some massaging on this to get it into the > > right format for NBD_CMD_BLOCK_STATUS. (I'm very confused about what > > NBD_CMD_FLAG_REQ_ONE is supposed to do.) > > REQ_ONE says that a successful answer has to be exactly one extent, > covering only the head of the region. (That is, force the server to > behave as if obtaining information beyond the first extent is > expensive). It also tells the server that it must not report extra > information beyond the initial request (because at least qemu 3.0 would > assert that the server supplied too much information). > > You're also right that nbdkit can do REQ_ONE massaging, so whether we > pass REQ_ONE on to the client or not makes it sound like the client > needs a tri-state opt-in (no block status support, block status but let > nbdkit handle REQ_ONE, block status and the plugin can handle REQ_ONE > efficiently itself)I had a think about the REQ_ONE flag a little. What do you think about this plan? If NBD_CMD_FLAG_REQ_ONE is passed by the nbd client, nbdkit does not need to pass it to the plugin. Instead nbdkit would turn the request into: extents (count = 1, offset = <offset>) The plugin could still respond with multiple extents, but nbdkit would pick only the one covering the <offset> to send back. In this case can_extents would return just a boolean since there is no special supported needed by the plugin AFAICT. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://people.redhat.com/~rjones/virt-df/