Nir Soffer
2022-Jan-31 11:01 UTC
[Libguestfs] [PATCH libnbd] copy: Implement destination preferred block size
On Mon, Jan 31, 2022 at 11:49 AM Laszlo Ersek <lersek at redhat.com> wrote:
>
> On 01/28/22 21:36, Richard W.M. Jones wrote:
> > [NB: I think this is a failed attempt, so shouldn't go upstream, and
> > doesn't need to be reviewed.]
> >
> > When nbdcopy writes to an NBD server it ignores the server's
> > minimum/preferred block size.  This hasn't caused a problem until now,
> > but it turns out that in at least one case we care about (writing to
> > qemu-nbd backed by a compressed qcow2 file) we must obey the minimum &
> > preferred block size of 64K.
> >
> > For the longer story on that see this thread:
> > https://lists.nongnu.org/archive/html/qemu-devel/2022-01/threads.html#06108
> >
> > This patch attempts to fix this.  The uncontroversial part of this
> > patch adds a global "blocksize" variable which is the destination
> > preferred block size.
> >
> > The tricky part of this patch tries to ensure that writes to the
> > destination obey this block size constraint.
> >
> > Since nbdcopy is driven by the extent map read from the source, the
> > theory behind this implementation is that we read the extent map and
> > then try to "adjust" it so that extents which are not aligned to the
> > block size grow, shrink or are deleted.  It proved to be very
> > difficult to get that right, but you can see the implementation in
> > the new function "adjust_extents_for_blocksize".
> >
> > Unfortunately not only is this difficult to implement, but the theory
> > is wrong.  Read requests are actually split up into smaller
> > subrequests on write (look for "copy_subcommand" in
> > multi-thread-copying.c).  So this doesn't really solve the problem.
> >
> > So I think in my second version I will look at adjusting the NBD
> > destination driver (nbd-ops.c) directly so that it turns unaligned
> > writes into buffered read/modify/write operations (which can be
> > optimized quite a lot because we understand the write pattern and
> > know that the main program doesn't go backwards within blocks).
>
> This seems very similar to a problem I had faced a decade ago, in the
> parallel decompressor of lbzip2.  The core of the idea is this: use a
> priority queue data structure (there are multiple data structures good
> for that; a red-black tree is one).  The key (for sorting; aka the
> "priority") is the offset at which the block exists.  Additionally,
> maintain the "next offset to write" in a variable (basically the file
> size that has been written out thus far, contiguously).
>
> Whenever a new block arrives, do the following:
>
> - place it in the priority queue, using its offset as key
>
> - if the offset of the "front block" of the priority queue equals the
>   "next offset to write": loop through the initial (= front) contiguous
>   sequence of blocks in the queue, and every time the offset range
>   reaches or exceeds the output block size, write out (and pop) the
>   affected full blocks, and update "next offset to write" as well.  Any
>   incompletely written out input block (= "tail block") will remain
>   having the lowest key (= offset) in the queue, so its key (= where to
>   start writing out the tail the next time) can be adjusted in-place
>   (it will remain at the front).

This was needed because you could not control the size of the block read
from storage, right?  With NBD, we always get a full read - there is no
way to get a short read, and we control the size of the read.
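For illustration, a minimal sketch of the reassembly scheme described in
the quoted text above - buffer blocks sorted by offset and flush the
contiguous prefix - with all names invented here; it omits coalescing to
the output block size and the tail-block adjustment:

/* Sketch: buffer out-of-order blocks keyed by offset and flush the
 * contiguous prefix as it forms.  Error checking omitted. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct pending_block {
  uint64_t offset;              /* key: where this data belongs */
  size_t len;
  char *data;
  struct pending_block *next;   /* list kept sorted by offset */
};

static struct pending_block *queue;   /* sorted singly-linked list */
static uint64_t next_offset;          /* written out contiguously so far */

/* Insert a newly arrived block in offset order. */
static void
enqueue (uint64_t offset, const char *data, size_t len)
{
  struct pending_block *b = malloc (sizeof *b);
  b->offset = offset;
  b->len = len;
  b->data = malloc (len);
  memcpy (b->data, data, len);

  struct pending_block **p = &queue;
  while (*p && (*p)->offset < offset)
    p = &(*p)->next;
  b->next = *p;
  *p = b;
}

/* While the front block is the next one in line, write it out and pop it. */
static void
flush_ready (void (*write_out) (uint64_t offset, const char *data, size_t len))
{
  while (queue && queue->offset == next_offset) {
    struct pending_block *b = queue;
    write_out (b->offset, b->data, b->len);
    next_offset += b->len;
    queue = b->next;
    free (b->data);
    free (b);
  }
}

A red-black tree or a binary heap would replace the linked list if the
number of pending blocks grows large, as Laszlo notes.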
If you skip a small hole, the nbd server will send:

  data chunk
  hole chunk (no payload)
  data chunk

The hole chunk will be converted to a zeroed area on the client side, and
then you would write a complete block (64k) to the server.  So the write
to the server will be less efficient than the read.  It would be nicer if
all of this was handled on the qemu-nbd side by the qemu block layer.

> If we're really sure that input blocks will never arrive out-of-order,
> then a simple FIFO should work (with the same logic); priorities are
> not needed, just a check (abort) in case an input block arrives either
> out-of-order (going backwards, or going forward, creating a gap).
>
> With this in mind I'm unsure if RMW is needed; just buffered writes
> should suffice I think.

Out-of-order reads are possible, but we always read a multiple of 64k, so
it does not matter: qcow2 does not care about the order of the clusters;
it compresses one cluster at a time, if I understand it correctly.

Nir
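As a small illustration of that point (a sketch only - the extent
descriptor and the write helper are hypothetical, not nbdcopy code): a
hole inside a 64k destination block simply becomes a zero-filled range of
the buffer before the single aligned write goes out.

#include <stdint.h>
#include <string.h>

#define BLOCKSIZE 65536

/* Hypothetical extent descriptor: a data or hole range within one block. */
struct extent {
  uint64_t offset;     /* offset within the block */
  uint64_t length;
  int is_hole;         /* hole chunk: no payload was sent */
  const char *data;    /* valid only when !is_hole */
};

/* Assemble one complete 64k block from its extents and write it in a
 * single request.  write_block() stands in for the destination write. */
static void
write_one_block (uint64_t block_offset,
                 const struct extent *ext, size_t nr_ext,
                 void (*write_block) (uint64_t offset, const char *buf,
                                      size_t len))
{
  char buf[BLOCKSIZE];

  for (size_t i = 0; i < nr_ext; i++) {
    if (ext[i].is_hole)
      memset (buf + ext[i].offset, 0, ext[i].length);   /* hole -> zeroes */
    else
      memcpy (buf + ext[i].offset, ext[i].data, ext[i].length);
  }

  write_block (block_offset, buf, BLOCKSIZE);           /* one aligned write */
}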
Richard W.M. Jones
2022-Jan-31 12:43 UTC
[Libguestfs] [PATCH libnbd] copy: Implement destination preferred block size
The way nbdcopy works now is that it reads the extent map from the source
(in pieces).  This extent map is a flat, contiguous array describing data
+ holes.  The size of those regions is whatever the source gives us, and
we don't do any further processing on it[1].  Importantly it is unrelated
to the block size of the destination.

This map drives the work loop (multi-thread-copying.c).  Holes begin an
asynchronous zero command immediately.  Data causes an asynch read of the
source, and when that command completes (other commands are running at
the same time in the same thread) we do sparseness detection and based on
that we issue one or more asynch write and asynch zero commands to the
destination.  As a further complication there are several threads working
in parallel on large blocks (128M) of the source.

The result of all this is that the destination driver (eg. nbd-ops.c)
sees a mix of zero and write requests, mostly out of order, and with no
particular block size.  [This is different from what I said before - I
wrongly said that the driver always saw requests in order.]  It should
never write to the same byte in the output twice, and each byte should be
written (or zeroed) exactly once.

My proposal is to forward all requests that are aligned to the preferred
block size straight to the destination, and also any non-aligned requests
that contain whole blocks - those have to be split into fragment +
aligned part + fragment subrequests at block boundaries.

For the fragments it's tricky.  My first idea was to calloc a block and
copy each fragment's data into this block (that's what I meant when I was
talking about read-modify-write), but it's hard to know when a block is
complete.  (Bitmap?)

I think what we can do instead is to save the fragments, indexed by
destination block number and offset.  It should be relatively easy to
test, when a new fragment arrives, whether we have all of the fragments
for a particular block, so we can then write out the whole block to the
destination.

Particular attention also has to be paid to the final block which might
be a short block (or can it not be?).  Also if we get to the end and
discover we have fragments left over then it would seem to indicate a bug
in the program.

I've not yet implemented any of this.  Given how much other stuff I've
got to do it seems like RHEL 9.1 material.

Rich.

[1] We really ought to coalesce adjacent regions of the same type, and
unconditionally throwing away very small holes (eg. < 4K) might be worth
doing in case the source is not well-behaved.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
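A rough sketch of the splitting step proposed above, assuming the
preferred block size is a power of two; handle_fragment and
handle_aligned are hypothetical stand-ins for "save for later assembly"
and "forward to the destination":

#include <stdint.h>

/* Split one request [offset, offset+count) on preferred-block-size
 * boundaries.  Whole aligned blocks are forwarded as-is; the unaligned
 * head and tail fragments are saved for later assembly. */
static void
split_request (uint64_t offset, uint64_t count, uint64_t blocksize,
               void (*handle_fragment) (uint64_t offset, uint64_t count),
               void (*handle_aligned) (uint64_t offset, uint64_t count))
{
  uint64_t end = offset + count;
  uint64_t aligned_start = (offset + blocksize - 1) & ~(blocksize - 1);
  uint64_t aligned_end = end & ~(blocksize - 1);

  if (aligned_start >= aligned_end) {
    /* No whole block inside the request: one or two fragments. */
    if (offset < aligned_start && aligned_start < end) {
      handle_fragment (offset, aligned_start - offset);
      handle_fragment (aligned_start, end - aligned_start);
    }
    else
      handle_fragment (offset, count);
    return;
  }

  if (offset < aligned_start)              /* head fragment */
    handle_fragment (offset, aligned_start - offset);

  handle_aligned (aligned_start, aligned_end - aligned_start);

  if (aligned_end < end)                   /* tail fragment */
    handle_fragment (aligned_end, end - aligned_end);
}

Only the fragments need the block-assembly bookkeeping; everything handed
to handle_aligned can go straight out as write or zero commands.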
Laszlo Ersek
2022-Feb-01 07:30 UTC
[Libguestfs] [PATCH libnbd] copy: Implement destination preferred block size
On 01/31/22 13:43, Richard W.M. Jones wrote:

> I think what we can do instead is to save the fragments, indexed by
> destination block number and offset.  It should be relatively easy to
> test, when a new fragment arrives, whether we have all of the fragments
> for a particular block, so we can then write out the whole block to the
> destination.

Yes, this is the idea that the priority queue serves so well.  It's
basically "insert in the right place, pop the front".  If you know the
"subrequest" size (such that all requests that need to be queued and then
coalesced for the actual output block size are a whole multiple of the
subrequest size), then a bitmap is likely the simplest data structure.

> Particular attention also has to be paid to the final block which
> might be a short block (or can it not be?).  Also if we get to the end
> and discover we have fragments left over then it would seem to
> indicate a bug in the program.

We're not supposed to reach the final block in the queue as long as there
are gaps before it.

> I've not yet implemented any of this.  Given how much other stuff I've
> got to do it seems like RHEL 9.1 material.

I agree.

Thanks
Laszlo
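For what it's worth, a tiny sketch of the bitmap idea, assuming a fixed
subrequest size that divides the output block size (the sizes and names
below are made up for illustration):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCKSIZE 65536         /* destination preferred block size */
#define SUBSIZE   4096          /* assumed fixed subrequest size */
#define NR_SUBS   (BLOCKSIZE / SUBSIZE)

/* One partially assembled output block. */
struct block_state {
  char data[BLOCKSIZE];
  bool present[NR_SUBS];        /* the bitmap: which subrequests arrived */
  unsigned nr_present;
};

/* Record one subrequest (block_offset is its offset within the block, a
 * multiple of SUBSIZE).  Returns true when the block is complete and can
 * be written out to the destination. */
static bool
add_subrequest (struct block_state *b, uint64_t block_offset,
                const char *data, size_t len)
{
  size_t i = block_offset / SUBSIZE;

  memcpy (b->data + block_offset, data, len);
  if (!b->present[i]) {
    b->present[i] = true;
    b->nr_present++;
  }
  return b->nr_present == NR_SUBS;
}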