Richard W.M. Jones
2018-Apr-10 14:40 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 02:07:33PM +0000, Nir Soffer wrote:
> This makes sense if the device is backed by a block device on oVirt
> side, and the NBD server supports efficient zeroing.  But in this case
> the device is backed by an empty sparse file on NFS, and oVirt does not
> yet support efficient zeroing, we just write zeros manually.
>
> I think this should be handled on the virt-v2v plugin side.  When
> zeroing a raw file image, you can ignore zero requests after the highest
> write offset, since the plugin created a new image, and we know that the
> image is empty.
>
> When the destination is a block device we cannot avoid zeroing since a
> block device may contain junk data (we usually get dirty empty images
> from our local xtremio server).

(Off topic for qemu-block but ...)  We don't have enough information
at our end to know about any of this.

> > The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> > so it's not that efficient after all.  I'm not sure if there is a real
> > reason for this, but Eric should know.
>
> We support zero with unlimited size without sending any payload to
> oVirt, so there is no reason to limit zero requests by
> max_pwrite_zeroes.  This limit may make sense when zero is emulated
> using pwrite.

Yes, this seems wrong, but I'd want Eric to comment.

> > > However, since you suggest that we could use "trim" requests for
> > > these requests, it means that these requests are advisory (since
> > > trim is), and we can just ignore them if the server does not support
> > > trim.
> >
> > What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is
> > indeed advisory), but a NBD_CMD_WRITE_ZEROES request.  qemu-img relies
> > on the image actually being zeroed after this.
>
> So it seems that may_trim=1 is wrong, since trim cannot replace zero.

Note that the current plugin ignores may_trim.  It is not used at all,
so it's not relevant to this problem.

However this flag actually corresponds to the inverse of
NBD_CMD_FLAG_NO_HOLE which is defined by the NBD spec as:

  bit 1, NBD_CMD_FLAG_NO_HOLE; valid during NBD_CMD_WRITE_ZEROES.
  SHOULD be set to 1 if the client wants to ensure that the server does
  not create a hole.  The client MAY send NBD_CMD_FLAG_NO_HOLE even if
  NBD_FLAG_SEND_TRIM was not set in the transmission flags field.  The
  server MUST support the use of this flag if it advertises
  NBD_FLAG_SEND_WRITE_ZEROES. *

qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
(hence in the plugin we see may_trim=1), and I believe that qemu-img
is correct because it doesn't want to force preallocation.

Rich.

* https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
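As a concrete illustration of the may_trim semantics discussed above, here
is a minimal sketch of an nbdkit Python plugin zero() callback against a
plain local raw file.  This is not the rhv-upload plugin (which forwards
requests to imageio over HTTP); the backing file path, the ctypes fallocate
shim and the flag values are assumptions for Linux, and the callback
signatures follow the nbdkit Python plugin API of this era.

    # Sketch only: may_trim is the inverse of NBD_CMD_FLAG_NO_HOLE.  When
    # may_trim=1 we may punch a hole (the range must still read back as
    # zeroes); when may_trim=0 we must not create a hole, so we write
    # literal zeroes.
    import ctypes
    import os

    # Assumed constants from <linux/falloc.h>.
    FALLOC_FL_KEEP_SIZE = 0x01
    FALLOC_FL_PUNCH_HOLE = 0x02

    libc = ctypes.CDLL("libc.so.6", use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_longlong, ctypes.c_longlong]

    DISK = "/var/tmp/disk.img"   # hypothetical backing file

    def open(readonly):
        return os.open(DISK, os.O_RDONLY if readonly else os.O_RDWR)

    def get_size(fd):
        return os.fstat(fd).st_size

    def pread(fd, count, offset):
        return os.pread(fd, count, offset)

    def pwrite(fd, buf, offset):
        os.pwrite(fd, buf, offset)

    def zero(fd, count, offset, may_trim):
        if may_trim:
            # Client did not send NBD_CMD_FLAG_NO_HOLE: a hole is fine.
            if libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              offset, count) == 0:
                return
        # NO_HOLE requested, or hole punching failed: write real zeroes.
        os.pwrite(fd, b"\0" * count, offset)

It could be loaded with something like "nbdkit python ./zero-example.py";
the exact invocation depends on the nbdkit version.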
Eric Blake
2018-Apr-10 15:00 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On 04/10/2018 09:40 AM, Richard W.M. Jones wrote:
>> When the destination is a block device we cannot avoid zeroing since a
>> block device may contain junk data (we usually get dirty empty images
>> from our local xtremio server).
>
> (Off topic for qemu-block but ...)  We don't have enough information
> at our end to know about any of this.

Yep, see my other email about a possible NBD protocol extension to
actually let the client learn up-front if the exported device is known
to start in an all-zero state.

>>> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
>>> so it's not that efficient after all.  I'm not sure if there is a real
>>> reason for this, but Eric should know.
>>
>> We support zero with unlimited size without sending any payload to
>> oVirt, so there is no reason to limit zero requests by
>> max_pwrite_zeroes.  This limit may make sense when zero is emulated
>> using pwrite.
>
> Yes, this seems wrong, but I'd want Eric to comment.

The 32M cap is currently the fault of qemu-img, not nbdkit (nbdkit is
not further reducing the size of the zero requests it passes on to
oVirt); and I explained in the other email how qemu 2.13 will fix
things to send larger zero requests (hmm, that means nbdkit really
needs to start supporting NBD_OPT_GO, as that is what qemu will be
relying on to learn the larger limits).

>>>> However, since you suggest that we could use "trim" requests for
>>>> these requests, it means that these requests are advisory (since
>>>> trim is), and we can just ignore them if the server does not support
>>>> trim.
>>>
>>> What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is
>>> indeed advisory), but a NBD_CMD_WRITE_ZEROES request.  qemu-img relies
>>> on the image actually being zeroed after this.
>>
>> So it seems that may_trim=1 is wrong, since trim cannot replace zero.
>
> Note that the current plugin ignores may_trim.  It is not used at all,
> so it's not relevant to this problem.
>
> However this flag actually corresponds to the inverse of
> NBD_CMD_FLAG_NO_HOLE which is defined by the NBD spec as:
>
>   bit 1, NBD_CMD_FLAG_NO_HOLE; valid during NBD_CMD_WRITE_ZEROES.
>   SHOULD be set to 1 if the client wants to ensure that the server does
>   not create a hole.  The client MAY send NBD_CMD_FLAG_NO_HOLE even if
>   NBD_FLAG_SEND_TRIM was not set in the transmission flags field.  The
>   server MUST support the use of this flag if it advertises
>   NBD_FLAG_SEND_WRITE_ZEROES. *
>
> qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
> (hence in the plugin we see may_trim=1), and I believe that qemu-img
> is correct because it doesn't want to force preallocation.

Yes, the flag usage is correct, and you are also correct that the
'may_trim' flag of nbdkit is the inverse bit sense of the
NBD_CMD_FLAG_NO_HOLE of the NBD protocol; it's all a documentation game
in deciding whether having a bit be 0 or 1 in the default state made
more sense.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
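To make the cost of the 32M cap concrete, here is a small client-side
sketch (not qemu or nbdkit code; send_write_zeroes is a stand-in for
whatever transport is used, not a real API) showing how a capped
write-zeroes limit turns one logical "zero the whole disk" operation into
thousands of separate round trips, whereas a server accepting unlimited
zero requests could be told about the whole region at once.

    MAX_PWRITE_ZEROES = 32 * 1024 * 1024  # the cap discussed above

    def zero_region(send_write_zeroes, offset, length, cap=MAX_PWRITE_ZEROES):
        """Zero [offset, offset+length), issuing requests no larger than cap."""
        requests = 0
        end = offset + length
        while offset < end:
            chunk = min(cap, end - offset)
            send_write_zeroes(offset, chunk)   # one round trip per chunk
            offset += chunk
            requests += 1
        return requests

    if __name__ == "__main__":
        sent = []
        n = zero_region(lambda off, cnt: sent.append((off, cnt)),
                        0, 100 * 1024 ** 3)    # a 100 GiB disk
        print(n)  # 3200 separate requests at 32 MiB each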
Nir Soffer
2018-Apr-10 15:25 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 5:50 PM Richard W.M. Jones <rjones@redhat.com> wrote:

> On Tue, Apr 10, 2018 at 02:07:33PM +0000, Nir Soffer wrote:
> > This makes sense if the device is backed by a block device on oVirt
> > side, and the NBD server supports efficient zeroing.  But in this case
> > the device is backed by an empty sparse file on NFS, and oVirt does
> > not yet support efficient zeroing, we just write zeros manually.
> >
> > I think this should be handled on the virt-v2v plugin side.  When
> > zeroing a raw file image, you can ignore zero requests after the
> > highest write offset, since the plugin created a new image, and we
> > know that the image is empty.
> >
> > When the destination is a block device we cannot avoid zeroing since
> > a block device may contain junk data (we usually get dirty empty
> > images from our local xtremio server).
>
> (Off topic for qemu-block but ...)  We don't have enough information
> at our end to know about any of this.

Can't you use this logic in the oVirt plugin?

file based storage -> skip initial zeroing
block based storage -> use initial zeroing

Do you think that publishing disk capabilities in the SDK will solve this?

> > > The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> > > so it's not that efficient after all.  I'm not sure if there is a real
> > > reason for this, but Eric should know.
> >
> > We support zero with unlimited size without sending any payload to
> > oVirt, so there is no reason to limit zero requests by
> > max_pwrite_zeroes.  This limit may make sense when zero is emulated
> > using pwrite.
>
> Yes, this seems wrong, but I'd want Eric to comment.
>
> > > > However, since you suggest that we could use "trim" requests for
> > > > these requests, it means that these requests are advisory (since
> > > > trim is), and we can just ignore them if the server does not
> > > > support trim.
> > >
> > > What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is
> > > indeed advisory), but a NBD_CMD_WRITE_ZEROES request.  qemu-img
> > > relies on the image actually being zeroed after this.
> >
> > So it seems that may_trim=1 is wrong, since trim cannot replace zero.
>
> Note that the current plugin ignores may_trim.  It is not used at all,
> so it's not relevant to this problem.
>
> However this flag actually corresponds to the inverse of
> NBD_CMD_FLAG_NO_HOLE which is defined by the NBD spec as:
>
>   bit 1, NBD_CMD_FLAG_NO_HOLE; valid during NBD_CMD_WRITE_ZEROES.
>   SHOULD be set to 1 if the client wants to ensure that the server does
>   not create a hole.  The client MAY send NBD_CMD_FLAG_NO_HOLE even if
>   NBD_FLAG_SEND_TRIM was not set in the transmission flags field.  The
>   server MUST support the use of this flag if it advertises
>   NBD_FLAG_SEND_WRITE_ZEROES. *
>
> qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
> (hence in the plugin we see may_trim=1), and I believe that qemu-img
> is correct because it doesn't want to force preallocation.

So once oVirt supports efficient zeroing, this flag may be translated to
(for file based storage):

may_trim=1 -> fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
may_trim=0 -> fallocate(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE)

We planned to choose this by default on the oVirt side, based on disk
type.  For a preallocated disk we never want to use FALLOC_FL_PUNCH_HOLE,
and for a sparse disk we always want to use FALLOC_FL_PUNCH_HOLE unless
it is not supported.

It seems that we need to add a "trim" or "punch_hole" flag to the
PATCH/zero request, so you can hint to oVirt how you want to zero.
oVirt will choose what to do based on storage type (file/block), user
request (trim/notrim), and disk type (thin/preallocated).

I think we can start to use this flag when we publish the "trim"
feature.

Nir
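A sketch of the server-side dispatch described here.  It is purely
hypothetical: the trim/punch_hole hint on the PATCH/zero request is only a
proposal at this point, none of these names exist in imageio, and the flag
values are assumed from <linux/falloc.h>.

    FALLOC_FL_KEEP_SIZE = 0x01   # assumed <linux/falloc.h> values
    FALLOC_FL_PUNCH_HOLE = 0x02
    FALLOC_FL_ZERO_RANGE = 0x10

    def choose_zero_mode(storage, disk, client_trim_hint):
        """
        storage:          "file" or "block"
        disk:             "thin" (sparse) or "preallocated"
        client_trim_hint: the proposed trim/punch_hole flag from PATCH/zero
                          (maps to nbdkit may_trim, i.e. !NBD_CMD_FLAG_NO_HOLE)
        """
        if storage == "file":
            if disk == "thin" and client_trim_hint:
                # Sparse disk and the client allows holes: deallocate.
                return FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
            # Preallocated disk, or the client insists on allocation:
            # zero without creating holes.
            return FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE
        # Block device: no holes to punch; zero some other way (e.g. by
        # writing zeroes manually).  None here means "no fallocate mode".
        return None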
Nir Soffer
2018-Apr-10 15:31 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 6:00 PM Eric Blake <eblake@redhat.com> wrote:

> On 04/10/2018 09:40 AM, Richard W.M. Jones wrote:
> >> When the destination is a block device we cannot avoid zeroing since
> >> a block device may contain junk data (we usually get dirty empty
> >> images from our local xtremio server).
> >
> > (Off topic for qemu-block but ...)  We don't have enough information
> > at our end to know about any of this.
>
> Yep, see my other email about a possible NBD protocol extension to
> actually let the client learn up-front if the exported device is known
> to start in an all-zero state.

In the future we can report the all-zero state in OPTIONS, but since
this info is already known on the engine side when the disk is created,
I think reporting it in the engine is better.

Nir
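For example, if the engine reported such a property on the disk it creates,
the plugin could decide up front whether any initial zeroing is needed at
all.  The attribute names and structure below are invented for illustration
and do not exist in the SDK or imageio.

    def needs_initial_zeroing(disk_info):
        """
        disk_info is assumed to come from the engine SDK when the disk is
        created, e.g. {"storage": "file", "zero_initialized": True}.
        """
        if disk_info.get("zero_initialized"):
            # Freshly created file-based image known to read as zeroes:
            # write-zeroes over never-written ranges can be a no-op.
            return False
        # Block-based or recycled storage may contain junk; zero it.
        return True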
Richard W.M. Jones
2018-Apr-10 15:51 UTC
Re: [Libguestfs] [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk
On Tue, Apr 10, 2018 at 03:25:47PM +0000, Nir Soffer wrote:
> On Tue, Apr 10, 2018 at 5:50 PM Richard W.M. Jones <rjones@redhat.com>
> wrote:
>
> > On Tue, Apr 10, 2018 at 02:07:33PM +0000, Nir Soffer wrote:
> > > This makes sense if the device is backed by a block device on oVirt
> > > side, and the NBD server supports efficient zeroing.  But in this
> > > case the device is backed by an empty sparse file on NFS, and oVirt
> > > does not yet support efficient zeroing, we just write zeros
> > > manually.
> > >
> > > I think this should be handled on the virt-v2v plugin side.  When
> > > zeroing a raw file image, you can ignore zero requests after the
> > > highest write offset, since the plugin created a new image, and we
> > > know that the image is empty.
> > >
> > > When the destination is a block device we cannot avoid zeroing
> > > since a block device may contain junk data (we usually get dirty
> > > empty images from our local xtremio server).
> >
> > (Off topic for qemu-block but ...)  We don't have enough information
> > at our end to know about any of this.
>
> Can't you use this logic in the oVirt plugin?
>
> file based storage -> skip initial zeroing
> block based storage -> use initial zeroing
>
> Do you think that publishing disk capabilities in the SDK will solve
> this?

The plugin would have to do some complicated gymnastics.  It would
have to keep track of which areas of the disk have been written and
ignore NBD_CMD_WRITE_ZEROES for other areas, except if block-based
storage is being used.  And so yes we'd also need the imageio API to
publish that information to the plugin.  So it's possible but not
trivial.

By the way I think we're slowly reimplementing NBD in the imageio API.
Dan Berrange pointed out earlier on that it might be easier if imageio
just exposed NBD, or if we found a way to tunnel NBD requests over web
sockets (in the former case nbdkit would not be needed, in the latter
case nbdkit could act as a bridge).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org
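A rough sketch of those gymnastics for the file-backed case, where the
freshly created image is known to read as zeroes: track the byte ranges
that have actually been written and turn zero requests that fall entirely
outside them into no-ops.  The interval bookkeeping is deliberately simple,
and the send_data/send_zero/is_block_storage names are illustrative only,
not the real plugin or imageio API.

    class WrittenTracker:
        """Remember which byte ranges of the new image have been written."""

        def __init__(self):
            self.extents = []          # list of (start, end), kept merged

        def record_write(self, offset, length):
            start, end = offset, offset + length
            merged = []
            for s, e in self.extents:
                if e < start or s > end:         # disjoint: keep as-is
                    merged.append((s, e))
                else:                            # overlapping/adjacent: absorb
                    start, end = min(s, start), max(e, end)
            merged.append((start, end))
            self.extents = sorted(merged)

        def overlaps(self, offset, length):
            start, end = offset, offset + length
            return any(s < end and start < e for s, e in self.extents)

    tracker = WrittenTracker()

    def pwrite(h, buf, offset):
        tracker.record_write(offset, len(buf))
        h.send_data(buf, offset)                 # hypothetical transport call

    def zero(h, count, offset, may_trim):
        if h.is_block_storage or tracker.overlaps(offset, count):
            h.send_zero(count, offset)           # hypothetical transport call
        # else: the range was never written and the new sparse file image
        # already reads as zeroes -- nothing to do.

    if __name__ == "__main__":
        t = WrittenTracker()
        t.record_write(0, 4096)              # e.g. header written by qemu-img
        print(t.overlaps(0, 512))            # True  -> must forward the zero
        print(t.overlaps(1 << 30, 65536))    # False -> safe to skip on files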