Kevin Wolf
2018-Apr-10 13:48 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
> On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones <rjones@redhat.com> wrote:
> >
> > We now have true zeroing support in oVirt imageio, thanks for that.
> >
> > However a problem is that ‘qemu-img convert’ issues zero requests for
> > the whole disk before starting the transfer.  It does this using 32 MB
> > requests which take approx. 1 second each to execute on the oVirt side.
> >
> > Two problems therefore:
> >
> > (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
> > 20 minutes).  Furthermore there is no progress indication while this
> > is happening.
> >
> > Nothing bad happens: because it is making frequent requests there
> > is no timeout.
> >
> > (2) I suspect that because we don't have trim support that this is
> > actually causing the disk to get fully allocated on the target.
> >
> > The NBD requests are sent with may_trim=1 so we could turn these
> > into trim requests, but obviously cannot do that while there is no
> > trim support.
>
> It sounds like nbdkit is emulating trim with zero instead of noop.
>
> I'm not sure what qemu-img is trying to do, I hope the NBD maintainer on
> the qemu side can explain this.

qemu-img tries to efficiently zero out the whole device at once so that
it doesn't have to use individual small write requests for unallocated
parts of the image later on.

The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
so it's not that efficient after all.  I'm not sure if there is a real
reason for this, but Eric should know.

> However, since you suggest that we could use "trim" requests for these
> requests, it means that these requests are advisory (since trim is), and
> we can just ignore them if the server does not support trim.

What qemu-img sends shouldn't be an NBD_CMD_TRIM request (which is indeed
advisory), but an NBD_CMD_WRITE_ZEROES request.  qemu-img relies on the
image actually being zeroed after this.

Kevin
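[To see how the 32 MB cap turns into the twenty minutes Rich observed: a 40 GB disk split into 32 MiB write-zero requests needs 1280 of them, and at roughly one second per request on the oVirt side that is about 21 minutes. The Python sketch below is illustrative only, not qemu code; send_write_zeroes is a hypothetical callback standing in for one NBD_CMD_WRITE_ZEROES request.]

# Illustration of the arithmetic above -- not qemu source code.
MAX_PWRITE_ZEROES = 32 * 1024 * 1024      # the 32 MB cap discussed in this thread

def zero_whole_device(send_write_zeroes, disk_size):
    """Issue capped write-zero requests covering the whole device."""
    offset, requests = 0, 0
    while offset < disk_size:
        length = min(MAX_PWRITE_ZEROES, disk_size - offset)
        send_write_zeroes(offset, length)  # hypothetical NBD_CMD_WRITE_ZEROES
        offset += length
        requests += 1
    return requests

# 40 GiB / 32 MiB = 1280 requests; at ~1 s each that is ~21 minutes.
print(zero_whole_device(lambda off, n: None, 40 * 1024**3))   # prints 1280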
Nir Soffer
2018-Apr-10 14:07 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 4:48 PM Kevin Wolf <kwolf@redhat.com> wrote:
> Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
> > On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones <rjones@redhat.com> wrote:
> > >
> > > We now have true zeroing support in oVirt imageio, thanks for that.
> > >
> > > However a problem is that ‘qemu-img convert’ issues zero requests for
> > > the whole disk before starting the transfer.  It does this using 32 MB
> > > requests which take approx. 1 second each to execute on the oVirt side.
> > >
> > > Two problems therefore:
> > >
> > > (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
> > > 20 minutes).  Furthermore there is no progress indication while this
> > > is happening.
> > >
> > > Nothing bad happens: because it is making frequent requests there
> > > is no timeout.
> > >
> > > (2) I suspect that because we don't have trim support that this is
> > > actually causing the disk to get fully allocated on the target.
> > >
> > > The NBD requests are sent with may_trim=1 so we could turn these
> > > into trim requests, but obviously cannot do that while there is no
> > > trim support.
> >
> > It sounds like nbdkit is emulating trim with zero instead of noop.
> >
> > I'm not sure what qemu-img is trying to do, I hope the NBD maintainer on
> > the qemu side can explain this.
>
> qemu-img tries to efficiently zero out the whole device at once so that
> it doesn't have to use individual small write requests for unallocated
> parts of the image later on.

This makes sense if the device is backed by a block device on the oVirt
side and the NBD server supports efficient zeroing.  But in this case the
device is backed by an empty sparse file on NFS, and oVirt does not yet
support efficient zeroing; we just write zeros manually.

I think this should be handled on the virt-v2v plugin side.  When zeroing
a raw file image, you can ignore zero requests beyond the highest write
offset, since the plugin created a new image and we know that the image
is empty.

When the destination is a block device we cannot avoid zeroing, since a
block device may contain junk data (we usually get dirty empty images
from our local xtremio server).

> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> so it's not that efficient after all.  I'm not sure if there is a real
> reason for this, but Eric should know.

We support zero with unlimited size without sending any payload to oVirt,
so there is no reason to limit zero requests by max_pwrite_zeroes.  This
limit may make sense when zero is emulated using pwrite.

> > However, since you suggest that we could use "trim" requests for these
> > requests, it means that these requests are advisory (since trim is), and
> > we can just ignore them if the server does not support trim.
>
> What qemu-img sends shouldn't be an NBD_CMD_TRIM request (which is indeed
> advisory), but an NBD_CMD_WRITE_ZEROES request.  qemu-img relies on the
> image actually being zeroed after this.

So it seems that may_trim=1 is wrong, since trim cannot replace zero.

Nir
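[A minimal sketch of the "skip zeroes above the highest write offset" idea Nir describes, in the style of an nbdkit Python plugin. The highestwrite handle field mirrors the patch discussed later in the thread, but upload_data() is a hypothetical stub standing in for the HTTP PUT to imageio; this is not the actual rhv-upload code.]

# Sketch only: ignore zero requests for ranges that were never written,
# which is safe because the plugin just created an empty sparse image.

def upload_data(h, offset, buf):
    # Hypothetical stand-in for the plugin's real HTTP PUT to imageio.
    h['bytes_sent'] = h.get('bytes_sent', 0) + len(buf)

def pwrite(h, buf, offset):
    upload_data(h, offset, buf)
    # Track how far real data extends; beyond this the new image is zero.
    h['highestwrite'] = max(h.get('highestwrite', 0), offset + len(buf))

def zero(h, count, offset, may_trim):
    # may_trim is ignored here, as in the current plugin.
    if offset >= h.get('highestwrite', 0):
        return                      # already reads back as zeroes, drop it
    # Below the watermark we must really zero; with no efficient zero
    # support on the file path this currently means writing zero bytes.
    upload_data(h, offset, b"\0" * count)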
Richard W.M. Jones
2018-Apr-10 14:30 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 03:48:11PM +0200, Kevin Wolf wrote:
> Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
> > On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones <rjones@redhat.com> wrote:
> > >
> > > We now have true zeroing support in oVirt imageio, thanks for that.
> > >
> > > However a problem is that ‘qemu-img convert’ issues zero requests for
> > > the whole disk before starting the transfer.  It does this using 32 MB
> > > requests which take approx. 1 second each to execute on the oVirt side.
> > >
> > > Two problems therefore:
> > >
> > > (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
> > > 20 minutes).  Furthermore there is no progress indication while this
> > > is happening.
> > >
> > > Nothing bad happens: because it is making frequent requests there
> > > is no timeout.
> > >
> > > (2) I suspect that because we don't have trim support that this is
> > > actually causing the disk to get fully allocated on the target.
> > >
> > > The NBD requests are sent with may_trim=1 so we could turn these
> > > into trim requests, but obviously cannot do that while there is no
> > > trim support.
> >
> > It sounds like nbdkit is emulating trim with zero instead of noop.
> >
> > I'm not sure what qemu-img is trying to do, I hope the NBD maintainer on
> > the qemu side can explain this.
>
> qemu-img tries to efficiently zero out the whole device at once so that
> it doesn't have to use individual small write requests for unallocated
> parts of the image later on.
>
> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> so it's not that efficient after all.  I'm not sure if there is a real
> reason for this, but Eric should know.
>
> > However, since you suggest that we could use "trim" requests for these
> > requests, it means that these requests are advisory (since trim is), and
> > we can just ignore them if the server does not support trim.
>
> What qemu-img sends shouldn't be an NBD_CMD_TRIM request (which is indeed
> advisory), but an NBD_CMD_WRITE_ZEROES request.  qemu-img relies on the
> image actually being zeroed after this.

Yup, it's actually sending NBD_CMD_WRITE_ZEROES with the flag
NBD_CMD_FLAG_NO_HOLE clear (not set).

I think Eric needs to comment here ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
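[To spell out the relationship Rich mentions: nbdkit's may_trim is simply the inverse of the NO_HOLE flag on the wire. A tiny illustrative snippet; the flag's bit position comes from the NBD spec quoted later in the thread, while the function name is hypothetical.]

# Illustrative only: deriving nbdkit's may_trim from the flags of an
# incoming NBD_CMD_WRITE_ZEROES request.
NBD_CMD_FLAG_NO_HOLE = 1 << 1   # bit 1 in the NBD protocol spec

def may_trim_from_flags(flags):
    # qemu-img convert leaves NO_HOLE clear, so this yields may_trim=True.
    return not (flags & NBD_CMD_FLAG_NO_HOLE)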
Richard W.M. Jones
2018-Apr-10 14:40 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 02:07:33PM +0000, Nir Soffer wrote:
> This makes sense if the device is backed by a block device on the oVirt
> side and the NBD server supports efficient zeroing.  But in this case the
> device is backed by an empty sparse file on NFS, and oVirt does not yet
> support efficient zeroing; we just write zeros manually.
>
> I think this should be handled on the virt-v2v plugin side.  When zeroing
> a raw file image, you can ignore zero requests beyond the highest write
> offset, since the plugin created a new image and we know that the image
> is empty.
>
> When the destination is a block device we cannot avoid zeroing, since a
> block device may contain junk data (we usually get dirty empty images
> from our local xtremio server).

(Off topic for qemu-block but ...)  We don't have enough information at
our end to know about any of this.

> > The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> > so it's not that efficient after all.  I'm not sure if there is a real
> > reason for this, but Eric should know.
>
> We support zero with unlimited size without sending any payload to oVirt,
> so there is no reason to limit zero requests by max_pwrite_zeroes.  This
> limit may make sense when zero is emulated using pwrite.

Yes, this seems wrong, but I'd want Eric to comment.

> > > However, since you suggest that we could use "trim" requests for these
> > > requests, it means that these requests are advisory (since trim is), and
> > > we can just ignore them if the server does not support trim.
> >
> > What qemu-img sends shouldn't be an NBD_CMD_TRIM request (which is indeed
> > advisory), but an NBD_CMD_WRITE_ZEROES request.  qemu-img relies on the
> > image actually being zeroed after this.
>
> So it seems that may_trim=1 is wrong, since trim cannot replace zero.

Note that the current plugin ignores may_trim.  It is not used at all, so
it's not relevant to this problem.

However this flag actually corresponds to the inverse of
NBD_CMD_FLAG_NO_HOLE, which is defined by the NBD spec* as:

  bit 1, NBD_CMD_FLAG_NO_HOLE; valid during NBD_CMD_WRITE_ZEROES.
  SHOULD be set to 1 if the client wants to ensure that the server does
  not create a hole.  The client MAY send NBD_CMD_FLAG_NO_HOLE even if
  NBD_FLAG_SEND_TRIM was not set in the transmission flags field.  The
  server MUST support the use of this flag if it advertises
  NBD_FLAG_SEND_WRITE_ZEROES.

qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
(hence in the plugin we see may_trim=1), and I believe that qemu-img is
correct because it doesn't want to force preallocation.

Rich.

* https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
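[To make the flag concrete, here is a rough sketch of a zero() callback that honours may_trim against a local sparse file, using Linux fallocate(2) hole punching as a stand-in backend. This is illustrative only: the real rhv-upload plugin talks to imageio over HTTP, the h['fd'] field is hypothetical, and the ctypes call is Linux-specific.]

# Sketch: may_trim=1 lets us punch a hole (which reads back as zeroes);
# NBD_CMD_FLAG_NO_HOLE (may_trim=0) forces us to allocate real zeroes.
import ctypes, ctypes.util, os

FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def punch_hole(fd, offset, count):
    # Linux fallocate(2); the punched range is guaranteed to read as zero.
    ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         ctypes.c_long(offset), ctypes.c_long(count))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

def zero(h, count, offset, may_trim):
    fd = h['fd']                    # hypothetical: fd of the target image
    if may_trim:
        punch_hole(fd, offset, count)
        return
    # NO_HOLE was set: write real zeroes in modest chunks.
    buf = b"\0" * (1024 * 1024)
    done = 0
    while done < count:
        n = min(len(buf), count - done)
        os.pwrite(fd, buf[:n], offset + done)
        done += n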
Eric Blake
2018-Apr-10 14:52 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On 04/10/2018 09:07 AM, Nir Soffer wrote:
> On Tue, Apr 10, 2018 at 4:48 PM Kevin Wolf <kwolf@redhat.com> wrote:
>
>> Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
>>> On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones <rjones@redhat.com>
>>> wrote:
>>>
>>>> We now have true zeroing support in oVirt imageio, thanks for that.
>>>>
>>>> However a problem is that ‘qemu-img convert’ issues zero requests for
>>>> the whole disk before starting the transfer.  It does this using 32 MB
>>>> requests which take approx. 1 second each to execute on the oVirt side.
>>>>
>>>> Two problems therefore:
>>>>
>>>> (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
>>>> 20 minutes).  Furthermore there is no progress indication while this
>>>> is happening.

This is going to be true whether you write zeroes in 32M chunks or in 2G
chunks - it takes a long time to write actual zeroes to a block device if
you are unsure whether the device already contains zeroes.

There is more overhead in sending 64 requests of 32M each than 1 request
for 2G; there's a question of whether that's in the noise (slightly more
data sent over the wire) or impactful (because you have to wait for more
round trips, where the time spent waiting for traffic is on par with the
time spent writing zeroes for a single request).

The only way that a write zeroes request is not going to be slower than a
normal write is if the block device itself supports an efficient way to
guarantee that the sectors of the disk will read as zero (for example,
using things like WRITE_SAME on iSCSI devices).

>>>> Nothing bad happens: because it is making frequent requests there
>>>> is no timeout.
>>>>
>>>> (2) I suspect that because we don't have trim support that this is
>>>> actually causing the disk to get fully allocated on the target.
>>>>
>>>> The NBD requests are sent with may_trim=1 so we could turn these
>>>> into trim requests, but obviously cannot do that while there is no
>>>> trim support.

In fact, if a trim request guarantees that you can read back zeroes
regardless of what was previously on the block device, then that is
precisely what you SHOULD be doing to make write zeroes more efficient
(but only when may_trim=1).

>>> It sounds like nbdkit is emulating trim with zero instead of noop.

No, qemu-img is NOT requesting trim, it is requesting write zeroes.  You
can implement write zeroes with a trim if the trim will read back as
zeroes.  But while trim is advisory, write zeroes has mandatory semantics
for what you read back (where may_trim=1 is a determining factor in
whether the write MUST allocate or MAY trim; ignoring may_trim and always
allocating is semantically correct but may be slower, while trimming is
correct only when may_trim=1).

>>> I'm not sure what qemu-img is trying to do, I hope the NBD maintainer on
>>> the qemu side can explain this.
>>
>> qemu-img tries to efficiently zero out the whole device at once so that
>> it doesn't have to use individual small write requests for unallocated
>> parts of the image later on.

At one point, there was a proposal to have the NBD protocol add something
where the server could advertise to the client at initial connection time
that the export is starting life with ALL sectors zeroed.  (Easy to prove
for a just-created sparse file, a bit harder to prove for a block device,
although at least some iSCSI devices do have queries to learn if the
entire device is unallocated.)
This has not yet been implemented in the NBD protocol, but may be worth
doing; it is slightly redundant with the NBD_CMD_BLOCK_STATUS that qemu
2.12 is introducing (in that the client can perform that sort of query
itself rather than the server advertising it at initial connection), but
it may be easy enough to implement even where NBD_CMD_BLOCK_STATUS is
difficult that it would still allow qemu-img to operate more efficiently
in some situations.  qemu-img DOES know how to skip zeroing a block device
if it knows up front that the device already reads as all zeroes, so the
missing piece of information is getting NBD to tell that to qemu-img.
Meanwhile, NBD_CMD_BLOCK_STATUS is still quite a ways from being supported
in nbdkit, so that's not anything that rhv-upload can exploit any time
soon.

> This makes sense if the device is backed by a block device on the oVirt
> side and the NBD server supports efficient zeroing.  But in this case the
> device is backed by an empty sparse file on NFS, and oVirt does not yet
> support efficient zeroing; we just write zeros manually.
>
> I think this should be handled on the virt-v2v plugin side.  When zeroing
> a raw file image, you can ignore zero requests beyond the highest write
> offset, since the plugin created a new image and we know that the image
> is empty.

Didn't Rich already try to do that?

+def emulate_zero(h, count, offset):
+    # qemu-img convert starts by trying to zero/trim the whole device.
+    # Since we've just created a new disk it's safe to ignore these
+    # requests as long as they are smaller than the highest write seen.
+    # After that we must emulate them with writes.
+    if offset+count < h['highestwrite']:

Or is the problem that emulate_zero() is only being called if:

+    # Unlike the trim and flush calls, there is no 'can_zero' method
+    # so nbdkit could call this even if the server doesn't support
+    # zeroing.  If this is the case we must emulate.
+    if not h['can_zero']:
+        emulate_zero(h, count, offset)
+        return

rather than doing the 'highestwrite' check unconditionally even when
oVirt supports zero requests?

> When the destination is a block device we cannot avoid zeroing, since a
> block device may contain junk data (we usually get dirty empty images
> from our local xtremio server).

And that's why qemu-img is starting life with write zeroes requests -
because it needs to guarantee that the image either already started as
all zeroes, or that zeroes are written to overwrite junk data.

>> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
>> so it's not that efficient after all.  I'm not sure if there is a real
>> reason for this, but Eric should know.

Yes, I do know.  But it missed qemu 2.12; it's another NBD spec proposal
where I'm also going to submit a qemu patch:
https://lists.debian.org/nbd/2018/03/msg00017.html

Right now, the NBD protocol has no clean distinction between the maximum
size of a data request (hard limit of 32M for NBD_CMD_WRITE in qemu-img)
and the maximum length of a request with no accompanying data
(NBD_CMD_WRITE_ZEROES).  Once we add NBD_INFO_ZERO_SIZE, it becomes
obvious that sending a 2G NBD_CMD_WRITE_ZEROES request makes sense even
when 32M is the maximum for a normal write; but until that point, qemu is
being conservative and capping EVERYTHING to the 32M limit.

There's also talk about enhancing NBD to support sizes larger than 4G by
adding an extension that permits 64-bit lengths, but that's further off
in the "nice idea, but not yet documented or implemented" category.

> We support zero with unlimited size without sending any payload to oVirt,
> so there is no reason to limit zero requests by max_pwrite_zeroes.  This
> limit may make sense when zero is emulated using pwrite.

Even when write zeroes is emulated by falling back to pwrite, the pwrite
can be done in a loop (however, then you get into the game of whether
writing 2G of zeroes takes long enough that you really DO want to enforce
a write zero maximum smaller than 4G, if only to guarantee more frequent
traffic and avoid timing out).

>>> However, since you suggest that we could use "trim" requests for these
>>> requests, it means that these requests are advisory (since trim is), and
>>> we can just ignore them if the server does not support trim.
>>
>> What qemu-img sends shouldn't be an NBD_CMD_TRIM request (which is indeed
>> advisory), but an NBD_CMD_WRITE_ZEROES request.  qemu-img relies on the
>> image actually being zeroed after this.
>
> So it seems that may_trim=1 is wrong, since trim cannot replace zero.

No, 'may_trim=1' means you may trim, IF you can guarantee that you can
read back as zero.  If trim can't guarantee a read back of zero, then
may_trim=1 must be ignored and the server must do a write instead.  The
client should always be able to request may_trim=1, whether or not the
server can actually do a trim as an optimization.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
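[Putting Eric's points together for the file-backed case - do the cheap 'highestwrite' check unconditionally, use a single server-side zero when it is available, and otherwise emulate with bounded writes so no single request runs long enough to time out - a zero() callback might look roughly like the sketch below. This is not the actual rhv-upload plugin: h['highestwrite'] and h['can_zero'] follow the patch quoted earlier, while zero_range() and write_zeros() are hypothetical stubs for the imageio PATCH/PUT calls.]

# Rough sketch of the reordering discussed above; stubs, not real imageio calls.

MAX_EMULATED_CHUNK = 32 * 1024 * 1024   # keep each emulated request short

def zero_range(h, offset, count):
    pass                                # hypothetical imageio "zero" PATCH

def write_zeros(h, offset, count):
    pass                                # hypothetical PUT of `count` zero bytes

def zero(h, count, offset, may_trim):
    # 1. Freshly created image: anything beyond the highest write so far
    #    already reads back as zeroes, whether or not the server can zero.
    if offset >= h['highestwrite']:
        return
    # 2. Server-side zeroing available: one request, no payload.
    if h['can_zero']:
        zero_range(h, offset, count)
        return
    # 3. Emulate with real writes, in bounded chunks so that no single
    #    request runs long enough to risk a timeout.
    done = 0
    while done < count:
        n = min(MAX_EMULATED_CHUNK, count - done)
        write_zeros(h, offset + done, n)
        done += n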