thr3ads.net - Libguestfs - [Libguestfs] Performance regression in -o rhv-upload questions [Feb 2022]

If this information is useful, please help other people find it:
Share via:

Nir Soffer

2022-Feb-08 14:44 UTC

[Libguestfs] Performance regression in -o rhv-upload questions

On Tue, Feb 8, 2022 at 3:27 PM Richard W.M. Jones <rjones at redhat.com>
wrote:>
> Hi Nir,
>
> https://bugzilla.redhat.com/show_bug.cgi?id=2039255#c4
>
> I'm looking for some advice/help with a performance regression in
> virt-v2v between 1.44.2 and the latest version.  It's very obvious and
> reproducible when I do a conversion from a local disk to local RHV
> using -o rhv-upload, specifically:
>
> $ time ./run virt-v2v -i disk /var/tmp/fedora-35.qcow2 -o rhv-upload -oc
https://ovirt4410.home.annexia.org/ovirt-engine/api -op /tmp/ovirt-passwd -oo
rhv-direct -os ovirt-data -on test5 -of raw
>
> The guest is an ordinary Fedora guest containing some junk data.
>
> Now that I've been able to reproduce the problem locally, it turns out
> to be not at all what I thought it was going to be.  The timings show
> that:
>
>  - new virt-v2v is much faster doing the copy, but
>
>  - new virt-v2v takes ages finalizing the transfer (about 10x longer
>    than the old version)
>
> virt-v2v 1.44.2:
> Complete log: http://oirase.annexia.org/tmp/virt-v2v-1.44.2-rhv-upload.log
>
>   [  63.1] Copying disk 1/1
>   ...
>   transfer 7aecd359-3706-49f5-8a0c-c8799f7b100a is finalizing_success
>   transfer 7aecd359-3706-49f5-8a0c-c8799f7b100a is finished_success
>   transfer 7aecd359-3706-49f5-8a0c-c8799f7b100a finalized in 6.105 seconds
>   [  89.2] Creating output metadata
>   [  89.8] Finishing off
>
> virt-v2v 1.45.97:
> Complete log: http://oirase.annexia.org/tmp/virt-v2v-1.45.97-rhv-upload.log
>
>   [  71.0] Copying disk 1/1
>   [  82.7] Creating output metadata
>   ...
>   transfer 6ea3d724-16f9-4bda-a33e-69a783480abc is finalizing_success
>   transfer 6ea3d724-16f9-4bda-a33e-69a783480abc is finished_success
>   transfer 6ea3d724-16f9-4bda-a33e-69a783480abc finalized in 61.552 seconds
>   [ 144.9] Finishing off
This happens because virt-v2v try to finalize the transfer *before* closing the
connections to imageio server.

Current imageio release mark the a ticket as canceled, but will not remove it
if the ticket has open connections from clients. If the clients are
idle, the connection
closes after 60 seconds, so when engine tries again to remove the ticket, the
operation succeeds.

Upstream version improved this flow to wait only for ongoing requests. If there
are no ongoing requests the ticket is removed immediately, ignoring the open
connections. The connections will be closed when the client closes the
connection
on when reading from the socket times out.

You can test upstream version from here:
https://github.com/oVirt/ovirt-imageio/actions/runs/1811058094

But if you want to be compatible with released imageio version, you
must close the connections to imageio *before* finalizing the transfers.

We already discussed this issue last year, when we handled the bug
when vddk blocks for many minutes during block status and imageio
connection is droped, and you opened a bug for this.
> It's not a completely like-for-like comparison because the rhv-upload
> backend changed a lot between these versions.  In particular if I was
> going to pick some suspicious change it would be my refactoring here
> which was supposed to be neutral but maybe isn't:
>
>  
https://github.com/libguestfs/virt-v2v/commit/143a22860216b94d3a81706193088d50c03fc35c
>
> Unfortunately this commit is almost impossible to revert because of
> deep restructuring of surrounding code.
The refactoring is not the issue, the issue is that we do not terminate
the output nbdkit before finalizing.

For some output we cannot do this since we want to query nbdkit for
more info (e.g block status) but for rhv upload we don't have anything
to do with nbkdit instance and we must close the connection before
we finalize, so we should close the output right after we finish the copy.
> Another idea:
>
> Old virt-v2v uses qemu-img convert which does not flush by default.
> New virt-v2v uses nbdcopy with the --flush option, so it will call
> imageio PATCH /path ... "op":"flush" at the end. 
However removing
> nbdcopy --flush didn't help very much (only a couple of seconds off).
Calling flush is right, we cannot remove it. Although there is bug in nbdkit,
and if flushes during close() flow even if you did not send a flush.
> Do you have any other ideas?
>
> What exactly does imageio do during the finalizing_success state?
The flow is this:
- engine send request to vdsm to delete the ticket
- vdsm connects to imageio control socket and send DELETE /tickets/{ticket-id}
- imageio mark the ticket as canceled, so no new request can succeed
  and no new connection to attach to the ticket
- imageio mark all ongoing requests as canceled, so if the request is in
  read/write loop, it will abort on the the next read/write complete
- upstream: imageio waits until all ongoing operations complete
- release: imageio waits until all connections are closed
- if the ticket could not be removed in 60 seconds, imageio returns 409 Conflict
- imageio returns 200 OK
- engine retries the delete if it failed
- when engine succeeds and finish other finalization tasks, the image
transfer will
  be reported as finished.

I would expect the flow to time out on virt-v2v side, since we use 60 seconds
timeout, and when removing a ticket times out, it takes 60 seconds, so you
should not succeed to finalize in 61 seconds.  There is a bug in ovirt
engine that cause this flow to succeed.

Nir

Richard W.M. Jones

2022-Feb-08 14:48 UTC

head link

[Libguestfs] Performance regression in -o rhv-upload questions

On Tue, Feb 08, 2022 at 04:44:33PM +0200, Nir Soffer
wrote:> On Tue, Feb 8, 2022 at 3:27 PM Richard W.M. Jones <rjones at
redhat.com> wrote:
> > virt-v2v 1.44.2:
> > Complete log:
http://oirase.annexia.org/tmp/virt-v2v-1.44.2-rhv-upload.log
> >
> >   [  63.1] Copying disk 1/1
> >   ...
> >   transfer 7aecd359-3706-49f5-8a0c-c8799f7b100a is finalizing_success
> >   transfer 7aecd359-3706-49f5-8a0c-c8799f7b100a is finished_success
> >   transfer 7aecd359-3706-49f5-8a0c-c8799f7b100a finalized in 6.105
seconds
> >   [  89.2] Creating output metadata
> >   [  89.8] Finishing off
> >
> > virt-v2v 1.45.97:
> > Complete log:
http://oirase.annexia.org/tmp/virt-v2v-1.45.97-rhv-upload.log
> >
> >   [  71.0] Copying disk 1/1
> >   [  82.7] Creating output metadata
> >   ...
> >   transfer 6ea3d724-16f9-4bda-a33e-69a783480abc is finalizing_success
> >   transfer 6ea3d724-16f9-4bda-a33e-69a783480abc is finished_success
> >   transfer 6ea3d724-16f9-4bda-a33e-69a783480abc finalized in 61.552
seconds
> >   [ 144.9] Finishing off
>
> This happens because virt-v2v try to finalize the transfer *before*
> closing the connections to imageio server.
>
> Current imageio release mark the a ticket as canceled, but will not
> remove it if the ticket has open connections from clients. If the
> clients are idle, the connection closes after 60 seconds, so when
> engine tries again to remove the ticket, the operation succeeds.
Ah ha!  I should be able to fix this easily enough ...  Let me try a
few things and then I'll come back to the rest of your message.

Rich.
> Upstream version improved this flow to wait only for ongoing requests. If
there
> are no ongoing requests the ticket is removed immediately, ignoring the
open
> connections. The connections will be closed when the client closes the
> connection
> on when reading from the socket times out.
> 
> You can test upstream version from here:
> https://github.com/oVirt/ovirt-imageio/actions/runs/1811058094
> 
> But if you want to be compatible with released imageio version, you
> must close the connections to imageio *before* finalizing the transfers.
> 
> We already discussed this issue last year, when we handled the bug
> when vddk blocks for many minutes during block status and imageio
> connection is droped, and you opened a bug for this.
> 
> > It's not a completely like-for-like comparison because the
rhv-upload
> > backend changed a lot between these versions.  In particular if I was
> > going to pick some suspicious change it would be my refactoring here
> > which was supposed to be neutral but maybe isn't:
> >
> >  
https://github.com/libguestfs/virt-v2v/commit/143a22860216b94d3a81706193088d50c03fc35c
> >
> > Unfortunately this commit is almost impossible to revert because of
> > deep restructuring of surrounding code.
> 
> The refactoring is not the issue, the issue is that we do not terminate
> the output nbdkit before finalizing.
> 
> For some output we cannot do this since we want to query nbdkit for
> more info (e.g block status) but for rhv upload we don't have anything
> to do with nbkdit instance and we must close the connection before
> we finalize, so we should close the output right after we finish the copy.
> 
> > Another idea:
> >
> > Old virt-v2v uses qemu-img convert which does not flush by default.
> > New virt-v2v uses nbdcopy with the --flush option, so it will call
> > imageio PATCH /path ... "op":"flush" at the end. 
However removing
> > nbdcopy --flush didn't help very much (only a couple of seconds
off).
> 
> Calling flush is right, we cannot remove it. Although there is bug in
nbdkit,
> and if flushes during close() flow even if you did not send a flush.
> 
> > Do you have any other ideas?
> >
> > What exactly does imageio do during the finalizing_success state?
> 
> The flow is this:
> - engine send request to vdsm to delete the ticket
> - vdsm connects to imageio control socket and send DELETE
/tickets/{ticket-id}
> - imageio mark the ticket as canceled, so no new request can succeed
>   and no new connection to attach to the ticket
> - imageio mark all ongoing requests as canceled, so if the request is in
>   read/write loop, it will abort on the the next read/write complete
> - upstream: imageio waits until all ongoing operations complete
> - release: imageio waits until all connections are closed
> - if the ticket could not be removed in 60 seconds, imageio returns 409
Conflict
> - imageio returns 200 OK
> - engine retries the delete if it failed
> - when engine succeeds and finish other finalization tasks, the image
> transfer will
>   be reported as finished.
> 
> I would expect the flow to time out on virt-v2v side, since we use 60
seconds
> timeout, and when removing a ticket times out, it takes 60 seconds, so you
> should not succeed to finalize in 61 seconds.  There is a bug in ovirt
> engine that cause this flow to succeed.
> 
> Nir
-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org

Richard W.M. Jones

2022-Feb-08 15:14 UTC

head link

[Libguestfs] Performance regression in -o rhv-upload questions

On Tue, Feb 08, 2022 at 04:44:33PM +0200, Nir Soffer
wrote:> This happens because virt-v2v try to finalize the transfer *before* closing
the
> connections to imageio server.
> 
> Current imageio release mark the a ticket as canceled, but will not remove
it
> if the ticket has open connections from clients. If the clients are
> idle, the connection
> closes after 60 seconds, so when engine tries again to remove the ticket,
the
> operation succeeds.
I posted this patch:
https://listman.redhat.com/archives/libguestfs/2022-February/msg00113.html

It's possibly not quite right.  We now just kill nbdkit and
immediately go on to finalization, so there might still be a race
(since killing nbdkit only starts the process of nbdkit shutting down,
closing the Python plugin etc).  I don't know if this is a concern or
not, but the patch does avoid the 60s timeout in my limited testing.
> Upstream version improved this flow to wait only for ongoing
> requests. If there are no ongoing requests the ticket is removed
> immediately, ignoring the open connections. The connections will be
> closed when the client closes the connection on when reading from
> the socket times out.
"Ongoing" here means that an HTTP request is being processed?
> You can test upstream version from here:
> https://github.com/oVirt/ovirt-imageio/actions/runs/1811058094
> 
> But if you want to be compatible with released imageio version, you
> must close the connections to imageio *before* finalizing the transfers.
> 
> We already discussed this issue last year, when we handled the bug
> when vddk blocks for many minutes during block status and imageio
> connection is droped, and you opened a bug for this.
> 
> > It's not a completely like-for-like comparison because the
rhv-upload
> > backend changed a lot between these versions.  In particular if I was
> > going to pick some suspicious change it would be my refactoring here
> > which was supposed to be neutral but maybe isn't:
> >
> >  
https://github.com/libguestfs/virt-v2v/commit/143a22860216b94d3a81706193088d50c03fc35c
> >
> > Unfortunately this commit is almost impossible to revert because of
> > deep restructuring of surrounding code.
> 
> The refactoring is not the issue, the issue is that we do not terminate
> the output nbdkit before finalizing.
>
> For some output we cannot do this since we want to query nbdkit for
> more info (e.g block status) but for rhv upload we don't have anything
> to do with nbkdit instance and we must close the connection before
> we finalize, so we should close the output right after we finish the copy.
It's not really desirable to kill nbdkit like I did in my patch.  We
do really want to open up virt-v2v to allow other processes to query
those sockets during the transfer.

An alternative to the patch I posted might be to have a way to request
that nbdkit disconnects its HTTP pool.

I will also note that HTTP is supposed to be stateless.  An open TCP
connection should not have meaning.
> > Another idea:
> >
> > Old virt-v2v uses qemu-img convert which does not flush by default.
> > New virt-v2v uses nbdcopy with the --flush option, so it will call
> > imageio PATCH /path ... "op":"flush" at the end. 
However removing
> > nbdcopy --flush didn't help very much (only a couple of seconds
off).
> 
> Calling flush is right, we cannot remove it. Although there is bug in
nbdkit,
> and if flushes during close() flow even if you did not send a flush.
> 
> > Do you have any other ideas?
> >
> > What exactly does imageio do during the finalizing_success state?
> 
> The flow is this:
> - engine send request to vdsm to delete the ticket
> - vdsm connects to imageio control socket and send DELETE
/tickets/{ticket-id}
> - imageio mark the ticket as canceled, so no new request can succeed
>   and no new connection to attach to the ticket
> - imageio mark all ongoing requests as canceled, so if the request is in
>   read/write loop, it will abort on the the next read/write complete
> - upstream: imageio waits until all ongoing operations complete
> - release: imageio waits until all connections are closed
This may answer my HTTP stateless point above.
> - if the ticket could not be removed in 60 seconds, imageio returns 409
Conflict
> - imageio returns 200 OK
> - engine retries the delete if it failed
> - when engine succeeds and finish other finalization tasks, the image
> transfer will
>   be reported as finished.
> 
> I would expect the flow to time out on virt-v2v side, since we use 60
seconds
> timeout, and when removing a ticket times out, it takes 60 seconds, so you
> should not succeed to finalize in 61 seconds.  There is a bug in ovirt
> engine that cause this flow to succeed.
Thanks,

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Libguestfs - Feb 2022 - Performance regression in -o rhv-upload questions

[Libguestfs] Performance regression in -o rhv-upload questions

[Libguestfs] Performance regression in -o rhv-upload questions

[Libguestfs] Performance regression in -o rhv-upload questions