Richard W.M. Jones
2018-Apr-10 14:40 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 02:07:33PM +0000, Nir Soffer wrote:
> This makes sense if the device is backed by a block device on oVirt side,
> and the NBD support efficient zeroing. But in this case the device is backed
> by an empty sparse file on NFS, and oVirt does not support yet efficient
> zeroing, we just write zeros manually.
>
> I think should be handled on virt-v2v plugin side. When zeroing a file raw
> image, you can ignore zero requests after the highest write offset, since
> the plugin created a new image, and we know that the image is empty.
>
> When the destination is a block device we cannot avoid zeroing since a block
> device may contain junk data (we usually get dirty empty images from our
> local xtremio server).

(Off topic for qemu-block but ...)  We don't have enough information
at our end to know about any of this.

> > The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> > so it's not that efficient after all. I'm not sure if there is a real
> > reason for this, but Eric should know.
>
> We support zero with unlimited size without sending any payload to oVirt,
> so there is no reason to limit zero request by max_pwrite_zeros. This limit
> may make sense when zero is emulated using pwrite.

Yes, this seems wrong, but I'd want Eric to comment.

> > > However, since you suggest that we could use "trim" request for these
> > > requests, it means that these requests are advisory (since trim is), and
> > > we can just ignore them if the server does not support trim.
> >
> > What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
> > advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
> > image actually being zeroed after this.
>
> So it seems that may_trim=1 is wrong, since trim cannot replace zero.

Note that the current plugin ignores may_trim.  It is not used at all,
so it's not relevant to this problem.

However this flag actually corresponds to the inverse of
NBD_CMD_FLAG_NO_HOLE which is defined by the NBD spec as:

  bit 1, NBD_CMD_FLAG_NO_HOLE; valid during
  NBD_CMD_WRITE_ZEROES. SHOULD be set to 1 if the client wants to
  ensure that the server does not create a hole. The client MAY send
  NBD_CMD_FLAG_NO_HOLE even if NBD_FLAG_SEND_TRIM was not set in the
  transmission flags field. The server MUST support the use of this
  flag if it advertises NBD_FLAG_SEND_WRITE_ZEROES. *

qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
(hence in the plugin we see may_trim=1), and I believe that qemu-img
is correct because it doesn't want to force preallocation.

Rich.

* https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
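[Editorial note: in an nbdkit Python plugin the flag described above surfaces as the may_trim argument of the zero() callback. The following is a minimal sketch only, not the actual rhv-upload plugin; punch_hole() and zero_range() are hypothetical helpers standing in for whatever the plugin would really do on the imageio side.]

    # Minimal sketch of an nbdkit Python plugin zero() callback.
    # may_trim is the inverse of NBD_CMD_FLAG_NO_HOLE: may_trim=1 means the
    # client does not mind if the range becomes a hole, as long as it reads
    # back as zeroes; may_trim=0 means the range must stay allocated.

    def zero(h, count, offset, may_trim):
        if may_trim:
            # A hole is acceptable - use the cheapest way to produce zeroes.
            punch_hole(h, offset, count)      # hypothetical helper
        else:
            # Must not create a hole - zero the range but keep it allocated.
            zero_range(h, offset, count)      # hypothetical helper

Since qemu-img convert does not set NBD_CMD_FLAG_NO_HOLE, a plugin driving this conversion will almost always see may_trim=1.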
Eric Blake
2018-Apr-10 15:00 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On 04/10/2018 09:40 AM, Richard W.M. Jones wrote:
>> When the destination is a block device we cannot avoid zeroing since a block
>> device may contain junk data (we usually get dirty empty images from our
>> local xtremio server).
>
> (Off topic for qemu-block but ...) We don't have enough information
> at our end to know about any of this.

Yep, see my other email about a possible NBD protocol extension to
actually let the client learn up-front if the exported device is known
to start in an all-zero state.

>>> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
>>> so it's not that efficient after all. I'm not sure if there is a real
>>> reason for this, but Eric should know.
>>
>> We support zero with unlimited size without sending any payload to oVirt,
>> so there is no reason to limit zero request by max_pwrite_zeros. This limit
>> may make sense when zero is emulated using pwrite.
>
> Yes, this seems wrong, but I'd want Eric to comment.

The 32M cap is currently the fault of qemu-img, not nbdkit (nbdkit is
not further reducing the size of the zero requests it passes on to
oVirt); and I explained in the other email about how qemu 2.13 will fix
things to send larger zero requests (hmm, that means nbdkit really
needs to start supporting NBD_OPT_GO, as that is what qemu will be
relying on to learn the larger limits).

>>>> However, since you suggest that we could use "trim" request for these
>>>> requests, it means that these requests are advisory (since trim is), and
>>>> we can just ignore them if the server does not support trim.
>>>
>>> What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
>>> advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
>>> image actually being zeroed after this.
>>
>> So it seems that may_trim=1 is wrong, since trim cannot replace zero.
>
> Note that the current plugin ignores may_trim. It is not used at all,
> so it's not relevant to this problem.
>
> However this flag actually corresponds to the inverse of
> NBD_CMD_FLAG_NO_HOLE which is defined by the NBD spec as:
>
>   bit 1, NBD_CMD_FLAG_NO_HOLE; valid during
>   NBD_CMD_WRITE_ZEROES. SHOULD be set to 1 if the client wants to
>   ensure that the server does not create a hole. The client MAY send
>   NBD_CMD_FLAG_NO_HOLE even if NBD_FLAG_SEND_TRIM was not set in the
>   transmission flags field. The server MUST support the use of this
>   flag if it advertises NBD_FLAG_SEND_WRITE_ZEROES. *
>
> qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
> (hence in the plugin we see may_trim=1), and I believe that qemu-img
> is correct because it doesn't want to force preallocation.

Yes, the flag usage is correct, and you are also correct that the
'may_trim' flag of nbdkit is the inverse bit sense of the
NBD_CMD_FLAG_NO_HOLE of the NBD protocol; it's all a documentation game
in deciding whether having a bit be 0 or 1 in the default state made
more sense.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
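[Editorial note: to put a rough number on why the cap matters here, with an otherwise empty destination every 32 MiB of virtual disk costs one NBD_CMD_WRITE_ZEROES round trip. Back-of-the-envelope arithmetic only; the 100 GiB disk size and the near-4 GiB chunk size are illustrative assumptions.]

    # Rough round-trip count for zeroing an empty disk over NBD.
    GiB = 1024 ** 3
    MiB = 1024 ** 2

    disk_size = 100 * GiB            # illustrative virtual disk size
    cap_now   = 32 * MiB             # current max_pwrite_zeroes in the NBD driver
    cap_big   = 4 * GiB - 512        # roughly the 32-bit length-field ceiling

    print(-(-disk_size // cap_now))  # 3200 write-zero requests with the 32 MiB cap
    print(-(-disk_size // cap_big))  # 26 requests with near-4 GiB chunks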
Nir Soffer
2018-Apr-10 15:25 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 5:50 PM Richard W.M. Jones <rjones@redhat.com> wrote:
> On Tue, Apr 10, 2018 at 02:07:33PM +0000, Nir Soffer wrote:
> > This makes sense if the device is backed by a block device on oVirt side,
> > and the NBD support efficient zeroing. But in this case the device is
> > backed by an empty sparse file on NFS, and oVirt does not support yet
> > efficient zeroing, we just write zeros manually.
> >
> > I think should be handled on virt-v2v plugin side. When zeroing a file raw
> > image, you can ignore zero requests after the highest write offset, since
> > the plugin created a new image, and we know that the image is empty.
> >
> > When the destination is a block device we cannot avoid zeroing since a
> > block device may contain junk data (we usually get dirty empty images
> > from our local xtremio server).
>
> (Off topic for qemu-block but ...) We don't have enough information
> at our end to know about any of this.

Can't you use this logic in the oVirt plugin?

    file based storage  -> skip initial zeroing
    block based storage -> use initial zeroing

Do you think that publishing disk capabilities in the SDK will solve this?

> > > The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> > > so it's not that efficient after all. I'm not sure if there is a real
> > > reason for this, but Eric should know.
> >
> > We support zero with unlimited size without sending any payload to oVirt,
> > so there is no reason to limit zero request by max_pwrite_zeros. This
> > limit may make sense when zero is emulated using pwrite.
>
> Yes, this seems wrong, but I'd want Eric to comment.
>
> > > > However, since you suggest that we could use "trim" request for these
> > > > requests, it means that these requests are advisory (since trim is),
> > > > and we can just ignore them if the server does not support trim.
> > >
> > > What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
> > > advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
> > > image actually being zeroed after this.
> >
> > So it seems that may_trim=1 is wrong, since trim cannot replace zero.
>
> Note that the current plugin ignores may_trim. It is not used at all,
> so it's not relevant to this problem.
>
> However this flag actually corresponds to the inverse of
> NBD_CMD_FLAG_NO_HOLE which is defined by the NBD spec as:
>
>   bit 1, NBD_CMD_FLAG_NO_HOLE; valid during
>   NBD_CMD_WRITE_ZEROES. SHOULD be set to 1 if the client wants to
>   ensure that the server does not create a hole. The client MAY send
>   NBD_CMD_FLAG_NO_HOLE even if NBD_FLAG_SEND_TRIM was not set in the
>   transmission flags field. The server MUST support the use of this
>   flag if it advertises NBD_FLAG_SEND_WRITE_ZEROES. *
>
> qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
> (hence in the plugin we see may_trim=1), and I believe that qemu-img
> is correct because it doesn't want to force preallocation.

So once oVirt supports efficient zeroing, this flag may be translated to
(for file based storage):

    may_trim=1 -> fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
    may_trim=0 -> fallocate(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE)

We planned to choose this by default on the oVirt side, based on disk
type. For a preallocated disk we never want to use FALLOC_FL_PUNCH_HOLE,
and for a sparse disk we always want to use FALLOC_FL_PUNCH_HOLE unless
it is not supported.

It seems that we need to add a "trim" or "punch_hole" flag to the
PATCH/zero request, so you can hint oVirt how you want to zero. oVirt
will choose what to do based on storage type (file/block), user request
(trim/notrim), and disk type (thin/preallocated).

I think we can start using this flag when we publish the "trim" feature.

Nir
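[Editorial note: a minimal sketch of the mapping described above for file-based storage, calling fallocate(2) through ctypes. The constants come from linux/falloc.h; 64-bit Linux is assumed for the off_t argument types. This is an illustration under those assumptions, not imageio's actual implementation.]

    import ctypes
    import os

    # Flags from linux/falloc.h.
    FALLOC_FL_KEEP_SIZE  = 0x01
    FALLOC_FL_PUNCH_HOLE = 0x02   # must be ORed with FALLOC_FL_KEEP_SIZE
    FALLOC_FL_ZERO_RANGE = 0x10

    _libc = ctypes.CDLL("libc.so.6", use_errno=True)
    _libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                                ctypes.c_long, ctypes.c_long]  # 64-bit off_t

    def zero_range_file(fd, offset, length, may_trim):
        """Zero [offset, offset+length) of a raw file image.

        may_trim=1: a hole is fine, so deallocate the range.
        may_trim=0: keep the range allocated but make it read as zeroes.
        """
        if may_trim:
            mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
        else:
            mode = FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE
        if _libc.fallocate(fd, mode, offset, length) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))

A real implementation would also need a fallback to writing literal zeros when the filesystem rejects FALLOC_FL_ZERO_RANGE or FALLOC_FL_PUNCH_HOLE (EOPNOTSUPP), which is the "unless it is not supported" case above.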
Nir Soffer
2018-Apr-10 15:31 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 6:00 PM Eric Blake <eblake@redhat.com> wrote:
> On 04/10/2018 09:40 AM, Richard W.M. Jones wrote:
> >> When the destination is a block device we cannot avoid zeroing since a
> >> block device may contain junk data (we usually get dirty empty images
> >> from our local xtremio server).
> >
> > (Off topic for qemu-block but ...) We don't have enough information
> > at our end to know about any of this.
>
> Yep, see my other email about a possible NBD protocol extension to
> actually let the client learn up-front if the exported device is known
> to start in an all-zero state.

In the future we can report the all-zero state in OPTIONS, but since this
info is already known on the engine side when creating a disk, I think
reporting it in the engine is better.

Nir
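[Editorial note: purely as an illustration of where such a hint could be consumed, whichever side ends up reporting it. The get_disk_info() helper and the 'zero_initialized' field below are invented names; at the time of this thread neither the imageio OPTIONS response nor the engine SDK reports this.]

    # Hypothetical consumption of a "disk starts out all zeroes" hint.
    info = get_disk_info(disk_id)   # invented helper: engine SDK or OPTIONS
    skip_initial_zeroing = (info.get("storage") == "file"
                            and info.get("zero_initialized", False))

    def zero(h, count, offset, may_trim):
        if skip_initial_zeroing:
            return                      # new sparse file already reads as zeroes
        emulate_zero(h, count, offset)  # hypothetical fallback: write zeroes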
Richard W.M. Jones
2018-Apr-10 15:51 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 03:25:47PM +0000, Nir Soffer wrote:
> On Tue, Apr 10, 2018 at 5:50 PM Richard W.M. Jones <rjones@redhat.com>
> wrote:
> > On Tue, Apr 10, 2018 at 02:07:33PM +0000, Nir Soffer wrote:
> > > This makes sense if the device is backed by a block device on oVirt
> > > side, and the NBD support efficient zeroing. But in this case the
> > > device is backed by an empty sparse file on NFS, and oVirt does not
> > > support yet efficient zeroing, we just write zeros manually.
> > >
> > > I think should be handled on virt-v2v plugin side. When zeroing a file
> > > raw image, you can ignore zero requests after the highest write offset,
> > > since the plugin created a new image, and we know that the image is
> > > empty.
> > >
> > > When the destination is a block device we cannot avoid zeroing since a
> > > block device may contain junk data (we usually get dirty empty images
> > > from our local xtremio server).
> >
> > (Off topic for qemu-block but ...) We don't have enough information
> > at our end to know about any of this.
>
> Can't you use this logic in the oVirt plugin?
>
>     file based storage  -> skip initial zeroing
>     block based storage -> use initial zeroing
>
> Do you think that publishing disk capabilities in the SDK will solve this?

The plugin would have to do some complicated gymnastics.  It would
have to keep track of which areas of the disk have been written and
ignore NBD_CMD_WRITE_ZEROES for other areas, except if block-based
storage is being used.  And so yes, we'd also need the imageio API to
publish that information to the plugin.  So it's possible but not
trivial.

By the way I think we're slowly reimplementing NBD in the imageio API.
Dan Berrange pointed out earlier on that it might be easier if imageio
just exposed NBD, or if we found a way to tunnel NBD requests over web
sockets (in the former case nbdkit would not be needed, in the latter
case nbdkit could act as a bridge).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
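[Editorial note: a rough sketch of those gymnastics on the plugin side, under stated assumptions. h is the per-connection handle dict; 'highest_write' and 'is_block' are made-up fields, and send_data()/emulate_zero() stand in for the real imageio requests. This is not the actual rhv-upload plugin code.]

    # Skip zeroing of never-written regions of a freshly created sparse file,
    # but always zero on block-based storage, which may contain junk.

    def open(readonly):
        # 'is_block' would have to come from information imageio publishes;
        # today the plugin has no way to know it.
        return {'highest_write': 0, 'is_block': False}

    def pwrite(h, buf, offset):
        send_data(h, buf, offset)                    # hypothetical data upload
        h['highest_write'] = max(h['highest_write'], offset + len(buf))

    def zero(h, count, offset, may_trim):
        if not h['is_block'] and offset >= h['highest_write']:
            # The image was created empty and nothing was written here yet,
            # so it already reads back as zeroes.
            return
        emulate_zero(h, count, offset, may_trim)     # hypothetical zero request

Even this is simplistic: it only skips requests that lie entirely above the highest written offset, and handling interior never-written holes would need a proper extent map, which is part of why it is "possible but not trivial".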