On Wed, Mar 20, 2019 at 11:20:15AM -0500, Eric Blake wrote:
> On 3/20/19 10:31 AM, Richard W.M. Jones wrote:
> > I think the extents map is just too complicated and is unnecessarily
> > so. How about instead we define the plugin interface to be:
> >
> >   int can_extents (void *handle);  // as before
> >   int extents (void *handle, uint32_t count, uint64_t offset,
> >                uint32_t flags, struct nbdkit_extents_list *list);
> >
> > and have the extents_list be a simple list. The first extent you add
> > must start at offset. And you're only allowed to append monotonically
> > increasing adjacent extents to it. Plugins must return at least one
> > extent, but are not required to return more than one extent
> > (regardless of flags).
>
> Well, required to return at least one for the command to succeed (if they
> return 0, then we synthesize an error, perhaps EIO, to keep the
> connection to the client still alive).
>
> I take it you'd still have to use an nbdkit_extent_add(list, count,
> type), so that additions to the list are still calling malloc() from
> nbdkit's perspective (as you don't want the plugin to be realloc()ing
> the list)?
Yes, I think we should preserve an abstract data type, not least
because it would allow us to implement a different structure in
future.
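
To make this concrete, here is roughly what I'd expect a trivial plugin
callback to look like under this scheme. The names (struct
nbdkit_extents_list, nbdkit_extent_add) are the ones from the sketch
above, but the exact signatures and type values are of course still up
for discussion:

  /* Illustrative sketch only -- signatures and type values not final. */
  static int
  example_extents (void *handle, uint32_t count, uint64_t offset,
                   uint32_t flags, struct nbdkit_extents_list *list)
  {
    /* One extent covering the whole request, starting (implicitly) at
     * offset.  Any further extents appended would have to continue
     * contiguously from the end of this one.
     */
    if (nbdkit_extent_add (list, count, 0 /* allocated data */) == -1)
      return -1;

    /* If NBDKIT_FLAG_REQ_ONE is set we can simply stop here, but
     * returning a single extent is allowed in any case.
     */
    return 0;
  }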
> > This would be simple enough to implement for both VDDK and file. (For
> > VDDK the plugin needs to synthesize the holes but that's only a slight
> > complication).
> >
> > NBDKIT_FLAG_REQ_ONE is now simple to implement (if generating the
> > extents, stop after generating the first one; if it's the server
> > processing the extents, take the first extent in the list).
> >
> > The ‘end’ that we discussed before is now implicit (it's the end byte
> > of the last extent in the list).
>
> So a plugin may give us more extents than we need, but can also easily
> stop giving us extents as early as they want (and as long as they gave
> us at least one, the overall call is successful). The requirement on
> the plugin giving us extents in monotonically increasing order is not
> too onerous; all your earlier implementations were naturally already
> doing that (in other words, the full flexibility of random-access extent
> probing rather than linear is over-engineered).
>
> I like the idea.
>
> I'm trying to figure out if filters would still need a foreach visitor.
> But my initial guess is no. Consider - the earlier proposal required
> the offset filter to do:
>
> +  struct nbdkit_extents_map *map2;
> +
> +  map2 = nbdkit_extents_new ();
> +  if (map2 == NULL)
> +    return -1;
> +  if (next_ops->extents (nxdata, count, offs + offset,
> +                         flags, map2, err) == -1) {
> +    nbdkit_extents_free (map2);
> +    return -1;
> +  }
> +
> +  /* Transform offsets in map2, return result in extents_map. */
> +  if (nbdkit_extents_foreach (map2, subtract_offset, extents_map,
> +                              NBDKIT_EXTENTS_FOREACH_FLAG_RANGE,
> +                              offset, range) == -1) {
> +    nbdkit_extents_free (map2);
> +    return -1;
> +  }
> +  nbdkit_extents_free (map2);
>
> because the plugin was calling:
>
> +  if (nbdkit_extent_add (extents_map, offset, n, type) == -1)
>
> But if we drop offset from the interface, the offset filter simplifies to:
>
>   return next_ops->extents (nxdata, count, offs + offset, flags, list,
>                             err);
>
> because the plugin will be calling:
>
>   if (nbdkit_extent_add (list, n, type) == -1)
For a simple case like this we might even allow filters to simply
update the offsets in the list. It requires exposing the inner
workings of the data type to filters, though.
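
For instance, if the list happened to be a flat array of (offset,
length, type) entries, the offset filter could call into the plugin and
then rewrite the offsets in place. The accessors below are pure
invention to illustrate the idea, not a proposal for the actual API:

  /* Hypothetical sketch -- nbdkit_extents_count/nth do not exist;
   * nbdkit_extents_nth would return a pointer to one (offset, length,
   * type) entry, and 'offset' is the filter's configured shift (a
   * static in the real offset filter).
   */
  static int
  offset_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
                  void *handle, uint32_t count, uint64_t offs,
                  uint32_t flags, struct nbdkit_extents_list *list,
                  int *err)
  {
    size_t i;

    if (next_ops->extents (nxdata, count, offs + offset, flags,
                           list, err) == -1)
      return -1;

    /* Shift every extent back down by the filter's offset, modifying
     * the list in place rather than copying it.
     */
    for (i = 0; i < nbdkit_extents_count (list); ++i)
      nbdkit_extents_nth (list, i)->offset -= offset;

    return 0;
  }

Whether that's actually neater than keeping the type fully opaque is
debatable.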
> > Also an observation: qemu's nbd client only ever issues block status
> > requests with the req-one flag set, so perhaps we should optimize for
> > that case.
>
> I hope to get to the point where future qemu doesn't send the req-one
> flag. There are several threads on the qemu list about how qemu-img is
> slower than it needs to be because it is throwing away useful
> information, and where it is aggravated by the kernel's abysmal lseek()
> performance on tmpfs. But until qemu learns useful caching, you're
> right that most existing NBD clients that request block status do so one
> extent at a time (because I don't know of any other existing NBD clients
> that use BLOCK_STATUS yet).
Is it ever possible to cache block status results? What happens in
the (admittedly unusual) case where two writers are hitting the same
NBD server? For example, if the server is implementing a cluster
filesystem.
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top