I'm thinking of adding one or more callbacks to nbdkit to let plugins/filters enforce various block size alignments (for example, the swab filter requires 2/4/8 alignment, or VDDK requires 512 alignment, etc.). The NBD protocol currently has NBD_INFO_BLOCK_SIZE, which can be sent in reply to NBD_OPT_GO to tell the client about sizing constraints; qemu already implements it as both client and server, so we have some reasonable testing setups (although libnbd will also need some additions to make it easier to expose constraints to the user, and/or new convenience APIs to do blocksize-style read-modify-write on the libnbd client side rather than needing the blocksize filter on the NBD server side).

But NBD_INFO_BLOCK_SIZE is not the full picture: it only covers minimum block size, preferred block size, and maximum data size. There has been discussion on the NBD list about also advertising maximum action sizes (the discussion has mentioned trim and/or zero, but even cache could benefit from a larger buffer size than pread), which means we should be thinking about supporting future protocol extensions in whatever we expose to plugins.

So, I'm thinking something like the following.

New enum:
  NBDKIT_BLOCK_SIZE_GET_MINIMUM
  NBDKIT_BLOCK_SIZE_GET_PREFERRED
  NBDKIT_BLOCK_SIZE_GET_MAX_DATA
  NBDKIT_BLOCK_SIZE_GET_MAX_TRIM
  NBDKIT_BLOCK_SIZE_GET_MAX_ZERO
  NBDKIT_BLOCK_SIZE_GET_MAX_CACHE

along with a new callback for plugins:

  int64_t block_size (void *handle, int which);

where 'which' is one of the enum values. A future nbdkit might request an enum value not recognized at the time the plugin was compiled, so the recommended behavior is that a plugin returns -1 on error, 0 to let nbdkit pick a sane default (including when 'which' was unknown), or a positive value for an actual limit. For now, nbdkit would advertise the MIN, PREFERRED, and MAX_DATA limits (to clients that use NBD_OPT_GO), while the others are enforced only internally.
The idea is that if the plugin installs a limit, a client request that violates that limit will fail with EINVAL for being unaligned, without even asking the plugin to attempt the request. nbdkit calls the plugin callback once per connection per reasonable 'which' (caching the results the way it does for .get_size, and skipping limits where .can_FOO fails). Returning int64_t allows us to reuse this callback unchanged when we add v3 callbacks, although values larger than 4G make no difference at present. I thought the use of an enum was nicer than filling in a struct whose size might change, or adding one callback per limit.

Filters can relax limits (such as blocksize turning a plugin MIN 512 into an advertised MIN 1, by doing read-modify-write as needed) or tighten limits (the file plugin has MIN 1, but the swab filter imposes a tighter MIN 2).

Constraints between the limits are as follows:

MIN: must be a power of 2, between 1 and 64k; .get_size and .extents must return results aligned to MIN (as any unaligned tail or extent transition is inaccessible using only aligned operations). Defaults to 1.

PREFERRED: must be a power of 2, between max(512, MIN) and 2M (this upper limit is not specified by the NBD spec, but matches what qemu's implementation uses as the maximum qcow2 cluster size). Defaults to max(4k, MIN).

MAX_DATA: must be a multiple of MIN and at least as large as PREFERRED; values larger than 64M are okay but are clamped by nbdkit's own internal limits. Defaults to 64M.

MAX_TRIM, MAX_ZERO, MAX_CACHE: must be a multiple of MIN, and should be at least as big as MAX_DATA. Values of 4G and larger are clamped by nbdkit's own internal limits. Default to 4G-MIN for TRIM/ZERO, and to MAX_DATA for CACHE.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
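[A sketch of what a plugin implementing the proposed callback might look like. The NBDKIT_BLOCK_SIZE_GET_* names and the .block_size signature come from the proposal above; the enum values and the 512/4k choices are illustrative (modeled on the VDDK alignment example), not an existing nbdkit API.]

```c
#include <stdint.h>

/* Hypothetical constants matching the proposed enum; actual values
 * would be assigned by nbdkit-plugin.h. */
enum {
  NBDKIT_BLOCK_SIZE_GET_MINIMUM = 0,
  NBDKIT_BLOCK_SIZE_GET_PREFERRED,
  NBDKIT_BLOCK_SIZE_GET_MAX_DATA,
  NBDKIT_BLOCK_SIZE_GET_MAX_TRIM,
  NBDKIT_BLOCK_SIZE_GET_MAX_ZERO,
  NBDKIT_BLOCK_SIZE_GET_MAX_CACHE,
};

/* Sketch of a plugin's .block_size callback, for a plugin that (like
 * VDDK) needs 512-byte alignment. */
static int64_t
block_size (void *handle, int which)
{
  switch (which) {
  case NBDKIT_BLOCK_SIZE_GET_MINIMUM:
    return 512;               /* all requests must be 512-aligned */
  case NBDKIT_BLOCK_SIZE_GET_PREFERRED:
    return 4096;              /* efficient I/O granularity */
  default:
    /* Return 0 so nbdkit picks a sane default, which also covers any
     * 'which' value added after this plugin was compiled. */
    return 0;
  }
}
```

Returning 0 for unrecognized values, rather than -1, is what keeps the single-callback-plus-enum design forward compatible when nbdkit later grows new 'which' values.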
Richard W.M. Jones
2020-Jul-23 22:00 UTC
Re: [Libguestfs] RFC: nbdkit block size advertisement
On Thu, Jul 23, 2020 at 10:54:31AM -0500, Eric Blake wrote:
> I'm thinking of adding one or more callbacks to nbdkit to let plugins/filters enforce various block size alignments (for example, the swab filter requires 2/4/8 alignment, or VDDK requires 512 alignment, etc). The NBD protocol currently has NBD_INFO_BLOCK_SIZE which can be sent in reply to NBD_OPT_GO to tell the client about sizing constraints; qemu already implements it as both client and server, so we have some reasonable testing setups (although libnbd will also need some additions to make it easier to expose constraints to the user, and/or add new convenience APIs to do blocksize-style read-modify-write at the libnbd client side rather than needing the blocksize filter in the NBD server side).
>
> But NBD_INFO_BLOCK_SIZE is not the full picture: it only covers minimum block size, preferred block size, and maximum data size; there has been discussion on the NBD list about also advertising maximum action size (discussion has mentioned trim and/or zero, but even cache could benefit from larger buffer size than pread), which means we should be thinking about supporting future protocol extensions in whatever we expose to plugins.
>
> So, I'm thinking something like the following:
>
> New enum:
>   NBDKIT_BLOCK_SIZE_GET_MINIMUM
>   NBDKIT_BLOCK_SIZE_GET_PREFERRED
>   NBDKIT_BLOCK_SIZE_GET_MAX_DATA
>   NBDKIT_BLOCK_SIZE_GET_MAX_TRIM
>   NBDKIT_BLOCK_SIZE_GET_MAX_ZERO
>   NBDKIT_BLOCK_SIZE_GET_MAX_CACHE

enum or int?  I think there are ABI problems with enums, although probably not unless we have more than 256 cases?

> along with a new callback for plugins:
>
>   int64_t block_size (void *handle, int which);
>
> where 'which' is one of the enum values.
> A future nbdkit might request an enum value not recognized at the time the plugin was compiled, so the recommended behavior is that a plugin returns -1 on error, 0 to let nbdkit pick a sane default (including when 'which' was unknown), or a positive value for an actual limit. For now, nbdkit would advertise MIN, PREFERRED, and MAX_DATA limits (to clients that use NBD_OPT_GO), while the others are enforced only internally. The idea is that if the plugin installs a limit, a client request that violates that limit will fail with EINVAL for being unaligned, without even asking the plugin to try the response.

Isn't the plan that the server would try to resolve the problem - eg. by making a RMW request or splitting a request?  This would be especially the case for internal requests, but I could also see it having value for clients which either ignore the block size information, or don't implement it correctly/completely.

> nbdkit calls the plugin callback once per connection per reasonable 'which' (caching the results like it does for .get_size, and skipping limits where .can_FOO fails). Returning int64_t allows us to reuse this callback without change when we add v3 callbacks, although values larger than 4G make no difference at present. I thought the use of an enum was nicer than filling in a struct whose size might change, or adding one callback per limit.

Yes, the enum/int seems like a better idea than dealing with structs, and is also easier to map into other programming languages.  The overhead of a few extra indirect function calls is negligible because it's only once per connection.

The only other alternative I can see would be to put these into struct nbdkit_plugin (as int64_t fields), but we have always regretted using plain fields instead of functions in this struct, eg.
thread_model, config_help, etc.

> Filters can relax limits (such as blocksize turning a plugin MIN 512 into an advertised MIN 1, by doing read-modify-write as needed) or tighten limits (the file plugin has MIN 1, but the swab filter imposes a tighter MIN 2). Constraints between limits are as follows:

Can the server do this transparently between layers?  Might be a lot easier to implement it once.

> MIN: must be power of 2, between 1 and 64k; .get_size and .extents must return results aligned to MIN (as any unaligned tail or extent transition is inaccessible using only aligned operations). Defaults to 1.
>
> PREFERRED: must be power of 2, between max(512, MIN) and 2M (this upper limit is not specified by NBD spec, but matches qemu's implementation of what it uses as the max qcow2 cluster size). Defaults to max(4k, MIN).
>
> MAX_DATA: must be multiple of MIN and at least as large as PREFERRED; values larger than 64M are okay but clamped by nbdkit's own internal limits. Defaults to 64M.
>
> MAX_TRIM, MAX_ZERO, MAX_CACHE: must be multiple of MIN, should be at least as big as MAX_DATA. Values 4G and larger are clamped by nbdkit's own internal limits. Defaults to 4G-MIN for TRIM/ZERO, and MAX_DATA for CACHE.

Sounds reasonable otherwise.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
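[The constraint rules in the proposal (power-of-2 checks, defaults, clamping to internal limits) are mechanical enough to sketch in code. This is a hypothetical illustration of the server-side validation; the helper names and the 64M cap constant are assumptions, not nbdkit internals.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed internal cap on data requests, matching the 64M default
 * mentioned in the proposal. */
#define MAX_REQUEST_SIZE (64 * 1024 * 1024)

static bool
is_power_of_2 (uint64_t v)
{
  return v != 0 && (v & (v - 1)) == 0;
}

/* Effective MIN limit: the plugin's value if valid, the default of 1
 * if the plugin returned 0, or -1 if the plugin's value violates the
 * constraints (power of 2, between 1 and 64k). */
static int64_t
effective_min (int64_t plugin_min)
{
  if (plugin_min == 0)
    return 1;                                   /* default */
  if (plugin_min < 0 || plugin_min > 64 * 1024 ||
      !is_power_of_2 ((uint64_t) plugin_min))
    return -1;                                  /* constraint violated */
  return plugin_min;
}

/* Effective MAX_DATA limit: must be a multiple of MIN; larger values
 * are allowed but clamped by the server's own cap.  (A fuller version
 * would also check MAX_DATA >= PREFERRED.) */
static int64_t
effective_max_data (int64_t plugin_max, int64_t min)
{
  if (plugin_max == 0)
    return MAX_REQUEST_SIZE;                    /* default 64M */
  if (plugin_max < min || plugin_max % min != 0)
    return -1;
  return plugin_max > MAX_REQUEST_SIZE ? MAX_REQUEST_SIZE : plugin_max;
}
```

This also shows why 0-means-default matters: the validation layer can distinguish "plugin has no opinion" from "plugin insists on a value" without a separate can_FOO-style probe.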
On 7/23/20 5:00 PM, Richard W.M. Jones wrote:
> On Thu, Jul 23, 2020 at 10:54:31AM -0500, Eric Blake wrote:
>> New enum:
>>   NBDKIT_BLOCK_SIZE_GET_MINIMUM
>>   NBDKIT_BLOCK_SIZE_GET_PREFERRED
>>   NBDKIT_BLOCK_SIZE_GET_MAX_DATA
>>   NBDKIT_BLOCK_SIZE_GET_MAX_TRIM
>>   NBDKIT_BLOCK_SIZE_GET_MAX_ZERO
>>   NBDKIT_BLOCK_SIZE_GET_MAX_CACHE
>
> enum or int?  I think there are ABI problems with enums, although probably not unless we have more than 256 cases?

Definitely typed as 'int' in the .block_size prototype (you're right that enums and ABI are not a game I want to play with); but whether we use an enum or a series of #defines to give values to those constants is a matter of taste.
Porting to other language bindings, where the language has saner enum support than C, can of course use an enum.

>> The idea is that if the plugin installs a limit, a client request that violates that limit will fail with EINVAL for being unaligned, without even asking the plugin to try the response.
>
> Isn't the plan that the server would try to resolve the problem - eg. by making a RMW request or splitting a request?  This would be especially the case for internal requests, but I could also see it having value for clients which either ignore the block size information, or don't implement it correctly/completely.

The blocksize filter can do that. I'm leaning more towards having the core server fail such requests with -EINVAL, and using the blocksize filter when you want the server to fix things on your behalf, rather than bloating the core server when most clients are already fairly well-behaved (qemu, for example, generally sticks to 512-byte boundaries).

Also, the blocksize filter currently has the limitation that it serializes all requests (I was lazy and wrote it with a single bounce buffer shared among all requests, rather than allocating a bounce buffer per unaligned I/O). There are also locking questions to worry about: suppose you have a plugin that wants 4k alignment, but a client sends two non-overlapping 512-byte writes in parallel to the same 4k portion of the disk.
We _really_ want our rmw sequences to be atomic.  That is,

  thread 1                                   thread 2
  start handling write(len=512, offset=0)
                                             start handling write(len=512, offset=512)
  widen to .pread(len=4k, offset=0)
                                             widen to .pread(len=4k, offset=0)
  modify buffer
                                             modify buffer
  commit with .pwrite(len=4k, offset=0)
                                             commit with .pwrite(len=4k, offset=0)

is a BAD example of not locking, because thread 2 undoes the work of thread 1, even though the client had no overlap in its parallel requests and thus no reason to expect corruption.  We do not want to be causing data corruption during our rmw cycle, so we have to stall any secondary I/O request that overlaps with a request where we are already using a bounce buffer.

>> nbdkit calls the plugin callback once per connection per reasonable 'which' (caching the results like it does for .get_size, and skipping limits where .can_FOO fails). Returning int64_t allows us to reuse this callback without change when we add v3 callbacks, although values larger than 4G make no difference at present. I thought the use of an enum was nicer than filling in a struct whose size might change, or adding one callback per limit.
>
> Yes, the enum/int seems like a better idea than dealing with structs, and is also easier to map into other programming languages.  The overhead of a few extra indirect function calls is negligible because it's only once per connection.
>
> The only other alternative I can see would be to put these into struct nbdkit_plugin (as int64_t fields), but we have always regretted using plain fields instead of functions in this struct, eg. thread_model, config_help, etc.

Yeah, I can _totally_ see these values being data-dependent (for example, in the file plugin, based on whether you used an option for turning on O_DIRECT, and whether the underlying filesystem advertises a preferred size).
So a function is the only way to keep it runtime-adjustable.

>> Filters can relax limits (such as blocksize turning a plugin MIN 512 into an advertised MIN 1, by doing read-modify-write as needed) or tighten limits (the file plugin has MIN 1, but the swab filter imposes a tighter MIN 2). Constraints between limits are as follows:
>
> Can the server do this transparently between layers?  Might be a lot easier to implement it once.

Well, it's only filters that would ever tweak things, and even then, there aren't that many filters that want to alter alignments.  I guess I just have to code up a first round to see how it looks.

> Sounds reasonable otherwise.
>
> Rich.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
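[The "stall any overlapping secondary I/O" requirement from Eric's race example can be sketched with an in-flight table guarded by a mutex/condvar. This is a hypothetical illustration of the serialization a per-request bounce-buffer blocksize filter would need, not actual nbdkit or blocksize filter code; all names are invented.]

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_INFLIGHT 16

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Aligned byte ranges with an rmw cycle currently in progress. */
static struct { uint64_t start, end; bool used; } inflight[NR_INFLIGHT];

static bool
overlaps_inflight (uint64_t start, uint64_t end)
{
  for (int i = 0; i < NR_INFLIGHT; i++)
    if (inflight[i].used &&
        start < inflight[i].end && inflight[i].start < end)
      return true;
  return false;
}

/* Widen the request to 'align' boundaries (align must be a power of
 * 2), then claim the widened range, waiting while any overlapping rmw
 * cycle is in flight.  Returns a slot id to pass to end_rmw, or -1 if
 * the table is full. */
static int
begin_rmw (uint64_t offset, uint64_t count, uint64_t align)
{
  uint64_t start = offset & ~(align - 1);
  uint64_t end = (offset + count + align - 1) & ~(align - 1);

  pthread_mutex_lock (&lock);
  while (overlaps_inflight (start, end))
    pthread_cond_wait (&cond, &lock);   /* stall the second writer */
  for (int i = 0; i < NR_INFLIGHT; i++)
    if (!inflight[i].used) {
      inflight[i].start = start;
      inflight[i].end = end;
      inflight[i].used = true;
      pthread_mutex_unlock (&lock);
      return i;
    }
  pthread_mutex_unlock (&lock);
  return -1;
}

static void
end_rmw (int slot)
{
  pthread_mutex_lock (&lock);
  inflight[slot].used = false;
  pthread_cond_broadcast (&cond);       /* wake stalled requests */
  pthread_mutex_unlock (&lock);
}
```

In the two-writes example above, both 512-byte writes widen to the same [0, 4k) range, so the second begin_rmw blocks in pthread_cond_wait until the first caller's end_rmw, making the pread/modify/pwrite cycle atomic with respect to overlapping requests while still letting non-overlapping unaligned I/O proceed in parallel.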