Nir Soffer
2021-Jan-21 11:56 UTC
[Libguestfs] [PATCH v2] v2v: rhv-upload-plugin: Defer imageio connection
When using vddk input with certain vmware versions, qemu-img may spend a
lot of time getting source image extents. If getting image extents takes
more than 60 seconds, imageio server closes the idle connection, and the
transfer will fail on the first write with:

nbdkit: python[1]: error: /var/tmp/rhvupload.0OKqWA/rhv-upload-plugin.py: pwrite: error: Traceback (most recent call last):
  File "/var/tmp/rhvupload.0OKqWA/rhv-upload-plugin.py", line 94, in wrapper
    return func(h, *args)
  File "/var/tmp/rhvupload.0OKqWA/rhv-upload-plugin.py", line 230, in pwrite
    r = http.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1361, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 280, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

This happens only when not using a unix socket, for example when running
on a non-ovirt host, or an ovirt host from another data center, or when
using -oo rhv_direct=false.

When using a unix socket, we close the initial HTTPSConnection and create
a new UnixHTTPConnection. This connection is not connected to the server
yet. When qemu-img tries to access the server, the connection is
connected automatically.

Fix the issue by closing the initial connection used to get server
options and initialize the handle, and storing a closed connection in
the handle.

Tested by adding a 300 millisecond delay in the nbdkit file plugin. Due
to the way qemu-img works, this causes more than 2 minutes of delay
after open() but before the first pwrite(). Without this change, the
import fails consistently when using rhv_direct=false.

Signed-off-by: Nir Soffer <nsoffer at redhat.com>
---
 v2v/rhv-upload-plugin.py | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/v2v/rhv-upload-plugin.py b/v2v/rhv-upload-plugin.py
index 8c11012b..90b3af82 100644
--- a/v2v/rhv-upload-plugin.py
+++ b/v2v/rhv-upload-plugin.py
@@ -117,6 +117,15 @@ def open(readonly):
         destination_url = parse_transfer_url(transfer)
         http = create_http(destination_url)
         options = get_options(http, destination_url)
+
+        # Close the initial connection to imageio server. When qemu-img will
+        # try to access the server, HTTPConnection will reconnect
+        # automatically. If we keep this connection idle and qemu-img is too
+        # slow getting image extents, imageio server may close the connection,
+        # and the import will fail on the first write.
+        # See https://bugzilla.redhat.com/1916176.
+        http.close()
+
         http = optimize_http(http, host, options)
     except:
         cancel_transfer(connection, transfer)
--
2.26.2
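[The fix relies on standard http.client behavior: close() drops the
socket, and the next request reconnects transparently because
HTTPConnection.auto_open is enabled by default. A minimal sketch of this
behavior, not part of the patch; example.com stands in for the imageio
server:

    import http.client

    conn = http.client.HTTPSConnection("example.com")

    conn.request("OPTIONS", "/")    # opens the TCP connection
    conn.getresponse().read()

    conn.close()                    # socket dropped; nothing sits idle

    # No explicit connect() is needed: with auto_open set, the next
    # request reopens the socket before sending.
    conn.request("GET", "/")
    print(conn.getresponse().status)
]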
Richard W.M. Jones
2021-Jan-21 12:07 UTC
[Libguestfs] [PATCH v2] v2v: rhv-upload-plugin: Defer imageio connection
On Thu, Jan 21, 2021 at 01:56:55PM +0200, Nir Soffer wrote:
> This happens only when not using a unix socket, for example when running
> on a non-ovirt host, or an ovirt host from another data center, or when
> using -oo rhv_direct=false.
>
> When using a unix socket, we close the initial HTTPSConnection and create
> a new UnixHTTPConnection. This connection is not connected to the server
> yet. When qemu-img tries to access the server, the connection is
> connected automatically.
>
> Fix the issue by closing the initial connection used to get server
> options and initialize the handle, and storing a closed connection in
> the handle.

I think this description could be a bit clearer. If I'm understanding it
correctly, something like this would be better:

"
This happens only when not using a unix socket, for example when running
on a non-ovirt host, or an ovirt host from another data center, or when
using -oo rhv_direct=false.

After this change, in the unix domain socket case:

 - We close the socket.
 - We call optimize_http which immediately reopens the unix socket.

In the non-unix domain socket case:

 - We close the socket.
 - We call optimize_http which does nothing.
 - Later when qemu-img starts to read or write to nbdkit, HTTPConnection
   transparently reconnects to the oVirt server.

Doing it this way avoids keeping the TCP connection idle after the
initial open(), avoiding the bug described above.
"

> --- a/v2v/rhv-upload-plugin.py
> +++ b/v2v/rhv-upload-plugin.py
> @@ -117,6 +117,15 @@ def open(readonly):
>          destination_url = parse_transfer_url(transfer)
>          http = create_http(destination_url)
>          options = get_options(http, destination_url)
> +
> +        # Close the initial connection to imageio server. When qemu-img will
> +        # try to access the server, HTTPConnection will reconnect
> +        # automatically. If we keep this connection idle and qemu-img is too
> +        # slow getting image extents, imageio server may close the connection,
> +        # and the import will fail on the first write.
> +        # See https://bugzilla.redhat.com/1916176.
> +        http.close()
> +
>          http = optimize_http(http, host, options)
>      except:
>          cancel_transfer(connection, transfer)

Anyway, please feel free to push it now if you want.

Thanks, Rich.

-- 
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and build
Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
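[To make the two cases above concrete, here is a simplified,
self-contained sketch of the optimize_http step. This is not the actual
v2v code: the "unix_socket" option key and the UnixHTTPConnection helper
mirror the plugin, but the bodies are reduced to the essentials:

    import socket
    from http.client import HTTPConnection

    class UnixHTTPConnection(HTTPConnection):
        """HTTP connection over a unix domain socket (sketch of the
        plugin's helper)."""

        def __init__(self, path):
            self.path = path
            super().__init__("localhost")

        def connect(self):
            self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            self.sock.connect(self.path)

    def optimize_http(http, host, options):
        unix_socket = options.get("unix_socket")

        if host is not None and unix_socket is not None:
            # Unix case: replace the closed TCP connection with a unix
            # socket connection, connected on its first request.
            return UnixHTTPConnection(unix_socket)

        # Non-unix case: return the closed TCP connection as-is; it
        # reconnects automatically when qemu-img starts reading or
        # writing.
        return http
]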
Nir Soffer
2021-Jan-21 12:16 UTC
[Libguestfs] [PATCH v2] v2v: rhv-upload-plugin: Defer imageio connection
On Thu, Jan 21, 2021 at 1:57 PM Nir Soffer <nirsof at gmail.com> wrote:
...

How I tested:

I could not use the example python plugin with the added delay, since
os.preadv() is not available in python 3.6 (added in 3.7). I could not
use the delay filter since it supports only integer delays. With a
fedora 32 image with 231 extents, qemu-img calls block_status 960 times
during import, so a whole-second delay makes the test far too slow.
Instead I used this hack:

$ git diff
diff --git a/plugins/file/file.c b/plugins/file/file.c
index 1651079a..99a95af0 100644
--- a/plugins/file/file.c
+++ b/plugins/file/file.c
@@ -785,6 +785,8 @@ do_extents (void *handle, uint32_t count, uint64_t offset,
   const bool req_one = flags & NBDKIT_FLAG_REQ_ONE;
   uint64_t end = offset + count;
 
+  usleep(300000);
+
   do {
     off_t pos;

Started nbdkit with the fedora 32 image:

$ ./nbdkit -f -v ./plugins/file/.libs/nbdkit-file-plugin.so file=/var/tmp/fedora-32.raw

Added an overlay to make it work with virt-v2v -i disk:

$ qemu-img create -f qcow2 -b nbd://localhost -F raw /var/tmp/fedora-32.qcow2

With this, getting all extents takes more than 60 seconds, enough to get
the connection closed:

$ time qemu-img map --output json nbd://localhost >/dev/null

real    1m9.416s
user    0m0.007s
sys     0m0.017s

I imported the qcow2 image on an ovirt host, with -oo
rhv_direct=true/false. Looking at the imageio log, we see that the
initial connection is closed, and the second connection is made once
qemu finishes getting extents for the entire image.

I also tried to import using the imageio client, using this pipeline:

    imageio nbd client -> qemu-nbd -> overlay.qcow2 -> nbdkit

On the imageio side, we always try to get extents for 2g, and we don't
use NBD_CMD_FLAG_REQ_ONE. But on the nbdkit side we see that all extents
calls use req_one=1. Looks like the issue is this code in the qemu nbd
driver:

1676 static int coroutine_fn nbd_client_co_block_status(
1677     BlockDriverState *bs, bool want_zero, int64_t offset, int64_t bytes,
1678     int64_t *pnum, int64_t *map, BlockDriverState **file)
1679 {
...
1685     NBDRequest request = {
1686         .type = NBD_CMD_BLOCK_STATUS,
1687         .from = offset,
1688         .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
1689                    MIN(bytes, s->info.size - offset)),
1690         .flags = NBD_CMD_FLAG_REQ_ONE,
1691     };

Also qemu-img convert is getting extents twice:

1. One call per extent before starting the copy:

nbdkit: file.0: debug: file: extents count=196608 offset=851968 req_one=1
nbdkit: file.1: debug: file: extents count=196608 offset=1376256 req_one=1
nbdkit: file.3: debug: file: extents count=393216 offset=1638400 req_one=1
nbdkit: file.4: debug: file: extents count=458752 offset=2686976 req_one=1
nbdkit: file.5: debug: file: extents count=983040 offset=3211264 req_one=1
...
nbdkit: file.14: debug: file: extents count=524288 offset=6441402368 req_one=1

2. One call per extent before each pread(), during copy:

nbdkit: file.0: debug: file: extents count=196608 offset=851968 req_one=1
nbdkit: file.1: debug: file: extents count=196608 offset=1376256 req_one=1
nbdkit: file.3: debug: file: extents count=393216 offset=1638400 req_one=1
nbdkit: file.4: debug: file: extents count=458752 offset=2686976 req_one=1
nbdkit: file.5: debug: file: pread count=458752 offset=2686976
nbdkit: file.5: debug: file: extents count=983040 offset=3211264 req_one=1
nbdkit: file.6: debug: file: pread count=983040 offset=3211264
nbdkit: file.7: debug: file: extents count=2031616 offset=4259840 req_one=1
nbdkit: file.8: debug: file: pread count=2031616 offset=4259840
nbdkit: file.9: debug: file: extents count=11599872 offset=6356992 req_one=1
nbdkit: file.10: debug: file: pread count=2097152 offset=6356992
nbdkit: file.11: debug: file: pread count=2097152 offset=8454144
...
nbdkit: file.14: debug: file: extents count=393216 offset=5591269376 req_one=1
nbdkit: file.6: debug: file: pread count=393216 offset=5591269376
nbdkit: file.6: debug: file: extents count=313851904 offset=5591728128 req_one=1
nbdkit: file.4: debug: file: pread count=1048576 offset=5591728128
nbdkit: file.4: debug: file: extents count=312803328 offset=5592776704 req_one=1
nbdkit: file.8: debug: file: extents count=535232512 offset=5905580032 req_one=1
nbdkit: file.3: debug: file: extents count=524288 offset=6441402368 req_one=1
nbdkit: file.13: debug: client sent NBD_CMD_DISC, closing connection

In total: 960 extents calls per import. A tiny delay in getting extents
can slow down the entire import; see the rough arithmetic sketch below.

Nir
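[A back-of-the-envelope check of these numbers; this is my sketch, and
it assumes one req_one extents call per extent during qemu-img map,
consistent with the 231-extent image above:

    delay = 0.3          # the injected usleep(300000), per extents call

    map_calls = 231      # one extents call per extent in the image
    print(map_calls * delay)      # ~69s, matching the 1m9.416s qemu-img map

    import_calls = 960   # extents calls seen during a full import
    print(import_calls * delay)   # ~288s added to the import by the delay alone
]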