Nir Soffer
2021-Jan-21 11:56 UTC
[Libguestfs] [PATCH v2] v2v: rhv-upload-plugin: Defer imageio connection
When using vddk input with certain VMware versions, qemu-img may spend
a lot of time getting source image extents. If getting image extents
takes more than 60 seconds, the imageio server closes the idle
connection, and the transfer fails on the first write with:
nbdkit: python[1]: error: /var/tmp/rhvupload.0OKqWA/rhv-upload-plugin.py: pwrite: error:
Traceback (most recent call last):
  File "/var/tmp/rhvupload.0OKqWA/rhv-upload-plugin.py", line 94, in wrapper
    return func(h, *args)
  File "/var/tmp/rhvupload.0OKqWA/rhv-upload-plugin.py", line 230, in pwrite
    r = http.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1361, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 311, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 280, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
This happens only when not using a unix socket, for example when
running on a non-oVirt host, on an oVirt host from another data
center, or when using -oo rhv_direct=false.
When using a unix socket, we close the initial HTTPSConnection and
create a new UnixHTTPConnection. This connection is not connected to
the server yet; when qemu-img tries to access the server, the
connection is established automatically.
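
For illustration, a minimal sketch of such a lazily-connecting
UnixHTTPConnection, modeled on the plugin's class of the same name
(details may differ):

import socket
from http.client import HTTPConnection

class UnixHTTPConnection(HTTPConnection):
    """HTTP connection over a unix domain socket, connecting lazily."""

    def __init__(self, path, timeout=socket.getdefaulttimeout()):
        # Path to the unix socket, e.g. the imageio daemon socket.
        self.path = path
        super().__init__("localhost", timeout=timeout)

    def connect(self):
        # Called by http.client on the first request, not in __init__,
        # so a stored instance holds no open socket until it is used.
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        if self.timeout is not None:
            self.sock.settimeout(self.timeout)
        self.sock.connect(self.path)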
Fix the issue by closing the initial connection, which is used only to
get the server options and initialize the handle, and storing the
closed connection in the handle.
Tested by adding a 300 millisecond delay in the nbdkit file plugin.
Due to the way qemu-img works, this causes more than 2 minutes of
delay after open() but before the first pwrite(). Without this change,
the import fails consistently when using rhv_direct=false.
Signed-off-by: Nir Soffer <nsoffer at redhat.com>
---
v2v/rhv-upload-plugin.py | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/v2v/rhv-upload-plugin.py b/v2v/rhv-upload-plugin.py
index 8c11012b..90b3af82 100644
--- a/v2v/rhv-upload-plugin.py
+++ b/v2v/rhv-upload-plugin.py
@@ -117,6 +117,15 @@ def open(readonly):
         destination_url = parse_transfer_url(transfer)
         http = create_http(destination_url)
         options = get_options(http, destination_url)
+
+        # Close the initial connection to imageio server. When qemu-img will
+        # try to access the server, HTTPConnection will reconnect
+        # automatically. If we keep this connection idle and qemu-img is too
+        # slow getting image extents, imageio server may close the connection,
+        # and the import will fail on the first write.
+        # See https://bugzilla.redhat.com/1916176.
+        http.close()
+
         http = optimize_http(http, host, options)
     except:
         cancel_transfer(connection, transfer)
--
2.26.2
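
As an aside, the reconnect behavior the patch relies on can be shown
with http.client alone; a minimal sketch, assuming a hypothetical
imageio host, port and ticket path:

import http.client

# Hypothetical imageio endpoint, for illustration only.
conn = http.client.HTTPSConnection("ovirt-host.example.com", 54322)

conn.request("OPTIONS", "/images/ticket-id")  # opens the TCP connection
conn.getresponse().read()

conn.close()  # close it so it cannot be dropped while sitting idle

# HTTPConnection.auto_open is enabled by default, so this request
# transparently opens a new TCP connection instead of raising.
conn.request("GET", "/images/ticket-id/extents")
reply = conn.getresponse()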
Richard W.M. Jones
2021-Jan-21 12:07 UTC
[Libguestfs] [PATCH v2] v2v: rhv-upload-plugin: Defer imageio connection
On Thu, Jan 21, 2021 at 01:56:55PM +0200, Nir Soffer wrote:
> This happens only when not using a unix socket, for example when
> running on a non-oVirt host, on an oVirt host from another data
> center, or when using -oo rhv_direct=false.
>
> When using a unix socket, we close the initial HTTPSConnection and
> create a new UnixHTTPConnection. This connection is not connected to
> the server yet; when qemu-img tries to access the server, the
> connection is established automatically.
>
> Fix the issue by closing the initial connection, which is used only
> to get the server options and initialize the handle, and storing the
> closed connection in the handle.

I think this description could be a bit clearer. If I'm understanding
it correctly, this description could be better:

"
This happens only when not using a unix socket, for example when
running on a non-oVirt host, on an oVirt host from another data
center, or when using -oo rhv_direct=false.

After this change, in the unix domain socket case:

- We close the socket.
- We call optimize_http which immediately reopens the unix socket.

In the non-unix domain socket case:

- We close the socket.
- We call optimize_http which does nothing.
- Later when qemu-img starts to read or write to nbdkit,
  HTTPConnection transparently reconnects to the oVirt server.

Doing it this way avoids keeping the TCP connection idle after the
initial open(), avoiding the bug described above.
"

> --- a/v2v/rhv-upload-plugin.py
> +++ b/v2v/rhv-upload-plugin.py
> @@ -117,6 +117,15 @@ def open(readonly):
>          destination_url = parse_transfer_url(transfer)
>          http = create_http(destination_url)
>          options = get_options(http, destination_url)
> +
> +        # Close the initial connection to imageio server. When qemu-img will
> +        # try to access the server, HTTPConnection will reconnect
> +        # automatically. If we keep this connection idle and qemu-img is too
> +        # slow getting image extents, imageio server may close the connection,
> +        # and the import will fail on the first write.
> +        # See https://bugzilla.redhat.com/1916176.
> +        http.close()
> +
>          http = optimize_http(http, host, options)
>      except:
>          cancel_transfer(connection, transfer)

Anyway, please feel free to push it now if you want.

Thanks, Rich.

--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW
Nir Soffer
2021-Jan-21 12:16 UTC
[Libguestfs] [PATCH v2] v2v: rhv-upload-plugin: Defer imageio connection
On Thu, Jan 21, 2021 at 1:57 PM Nir Soffer <nirsof at gmail.com> wrote:
...
How I tested:

I could not use the example python plugin with the added delay, since
os.preadv() is not available in python 3.6 (it was added in 3.7).

I could not use the delay filter since it supports only integer
delays. With a Fedora 32 image with 231 extents, qemu-img calls
block_status 960 times during import, and this is too slow to test.

So I used this hack:
$ git diff
diff --git a/plugins/file/file.c b/plugins/file/file.c
index 1651079a..99a95af0 100644
--- a/plugins/file/file.c
+++ b/plugins/file/file.c
@@ -785,6 +785,8 @@ do_extents (void *handle, uint32_t count, uint64_t offset,
   const bool req_one = flags & NBDKIT_FLAG_REQ_ONE;
   uint64_t end = offset + count;

+  usleep(300000);
+
   do {
     off_t pos;
Started nbdkit with the Fedora 32 image:

$ ./nbdkit -f -v ./plugins/file/.libs/nbdkit-file-plugin.so \
      file=/var/tmp/fedora-32.raw
Added an overlay to make it work with virt-v2v -i disk:

$ qemu-img create -f qcow2 -b nbd://localhost -F raw /var/tmp/fedora-32.qcow2
With this, getting all extents takes more than 60 seconds (231 extents
x 0.3s per block_status call is about 69s), enough to get the
connection closed:

$ time qemu-img map --output json nbd://localhost >/dev/null

real    1m9.416s
user    0m0.007s
sys     0m0.017s
I imported the qcow2 image on an oVirt host, with -oo
rhv_direct=true/false. Looking at the imageio log, we see that the
initial connection is closed, and the second connection is made once
qemu finishes getting extents for the entire image.
I also tried to import using the imageio client, using this pipeline:

    imageio nbd client -> qemu-nbd -> overlay.qcow2 -> nbdkit

On the imageio side, we always try to get extents for 2g, and we don't
use NBD_CMD_FLAG_REQ_ONE. But on nbdkit we see that all extents calls
use req_one=1.
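
For reference, a sketch of this kind of import using the imageio
client; the ovirt_imageio.client.upload() call, the transfer URL and
the cafile below are assumptions and may differ between imageio
versions:

from ovirt_imageio import client

# Upload the local qcow2 overlay. The client runs qemu-nbd internally
# to read the overlay, which in turn reads the raw image from nbdkit
# over NBD, matching the pipeline above.
client.upload(
    "/var/tmp/fedora-32.qcow2",
    "https://ovirt-host.example.com:54322/images/ticket-id",
    "ca.pem",
)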
Looks like the issue is this code in the qemu nbd driver, which always
sets NBD_CMD_FLAG_REQ_ONE, asking the server to return at most one
extent per call:

static int coroutine_fn nbd_client_co_block_status(
        BlockDriverState *bs, bool want_zero, int64_t offset,
        int64_t bytes, int64_t *pnum, int64_t *map,
        BlockDriverState **file)
{
...
    NBDRequest request = {
        .type = NBD_CMD_BLOCK_STATUS,
        .from = offset,
        .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
                   MIN(bytes, s->info.size - offset)),
        .flags = NBD_CMD_FLAG_REQ_ONE,
    };
Also, qemu-img convert gets extents twice:

1. One call per extent before starting the copy:
nbdkit: file.0: debug: file: extents count=196608 offset=851968 req_one=1
nbdkit: file.1: debug: file: extents count=196608 offset=1376256 req_one=1
nbdkit: file.3: debug: file: extents count=393216 offset=1638400 req_one=1
nbdkit: file.4: debug: file: extents count=458752 offset=2686976 req_one=1
nbdkit: file.5: debug: file: extents count=983040 offset=3211264 req_one=1
...
nbdkit: file.14: debug: file: extents count=524288 offset=6441402368 req_one=1
2. One call per extent before each pread(), during copy:
nbdkit: file.0: debug: file: extents count=196608 offset=851968 req_one=1
nbdkit: file.1: debug: file: extents count=196608 offset=1376256 req_one=1
nbdkit: file.3: debug: file: extents count=393216 offset=1638400 req_one=1
nbdkit: file.4: debug: file: extents count=458752 offset=2686976 req_one=1
nbdkit: file.5: debug: file: pread count=458752 offset=2686976
nbdkit: file.5: debug: file: extents count=983040 offset=3211264 req_one=1
nbdkit: file.6: debug: file: pread count=983040 offset=3211264
nbdkit: file.7: debug: file: extents count=2031616 offset=4259840 req_one=1
nbdkit: file.8: debug: file: pread count=2031616 offset=4259840
nbdkit: file.9: debug: file: extents count=11599872 offset=6356992 req_one=1
nbdkit: file.10: debug: file: pread count=2097152 offset=6356992
nbdkit: file.11: debug: file: pread count=2097152 offset=8454144
...
nbdkit: file.14: debug: file: extents count=393216 offset=5591269376 req_one=1
nbdkit: file.6: debug: file: pread count=393216 offset=5591269376
nbdkit: file.6: debug: file: extents count=313851904 offset=5591728128 req_one=1
nbdkit: file.4: debug: file: pread count=1048576 offset=5591728128
nbdkit: file.4: debug: file: extents count=312803328 offset=5592776704 req_one=1
nbdkit: file.8: debug: file: extents count=535232512 offset=5905580032 req_one=1
nbdkit: file.3: debug: file: extents count=524288 offset=6441402368 req_one=1
nbdkit: file.13: debug: client sent NBD_CMD_DISC, closing connection
Total of 960 extents calls per import; with the 300 millisecond delay,
that alone adds about 960 x 0.3 = 288 seconds. Even a tiny delay in
getting extents can slow down the entire import.
Nir