Richard W.M. Jones
2021-May-26 14:15 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 04:49:50PM +0300, Nir Soffer wrote:
> On Wed, May 26, 2021 at 4:03 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> > In my testing, nbdcopy is a clear 4x faster than qemu-img convert, with
> > 4 also happening to be the default number of connections/threads.
> > Why use nbdcopy --connections=1?  That completely disables threads in
> > nbdcopy.
>
> Because qemu-nbd does not report multi-conn when writing, so practically
> you get one NBD handle for writing.

Let's see if we can fix that.  Crippling nbdcopy because of a missing
feature in qemu-nbd isn't right.  I wonder what Eric's reasoning for
multi-conn not being safe is.

> > Also I'm not sure if --flush is fair (it depends on what
> > qemu-img does, which I don't know).
>
> qemu is flushing at the end of the operation. Not flushing is cheating :-)

That's fair enough.  I will add that flag to my future tests.

I also pushed these commits to disable malloc checking outside tests:

https://gitlab.com/nbdkit/libnbd/-/commit/88e72dcb1631b315957f5f98e3cdfcdd1fd0fe29
https://gitlab.com/nbdkit/nbdkit/-/commit/6039780f3bb0617650fa1fa4c1399b0d3f1dcb26

> > The other interesting things are the qemu-img convert flags you're using:
> >
> >   -m 16    number of coroutines, default is 8
>
> We use 8 in RHV since the difference is very small, and when running
> concurrent copies it does not matter.  Since we use up to 64 concurrent
> requests in nbdcopy, it is useful to compare a similar setup in qemu.

I'm not really clear on the relationship (in qemu-img) between the
number of coroutines, the number of pthreads and the number of requests
in flight.  At this rate I'm going to have to look at the source :-)

> >   -W       out of order writes, but the manual says "This is only
> >            recommended for preallocated devices like host devices or
> >            other raw block devices" which is a very unclear
> >            recommendation to me.  What's special about host devices
> >            versus (eg) files or qcow2 files which means -W wouldn't
> >            always be recommended?
>
> This is how RHV uses qemu-img convert when copying to raw preallocated
> volumes.  Using -W can be up to 6x faster.  We use the same for imageio
> for any type of disk.  This is the reason I tested this way.
>
> -W is equivalent to the nbdcopy multithreaded copy using a single
> connection.
>
> qemu-img does N concurrent reads.  If you don't specify -W, it writes
> the data in the right order (based on offset).  If a read did not
> finish, the copy loop waits until the read completes before writing.
> This ensures exactly one concurrent write, and it is much slower.

Thanks - interesting.  Still not sure why you wouldn't want to use
this flag all the time.

See also:
https://lists.nongnu.org/archive/html/qemu-discuss/2021-05/msg00070.html

...

> This shows that nbdcopy works better when the latency is
> (practically) zero, copying data from memory to memory.  This is
> useful for minimizing overhead in nbdcopy, but when copying real
> images with real storage with direct I/O the time to write the data
> to storage hides everything else.
>
> Would it be useful to add latency in the sparse-random plugin, so it
> behaves more like real storage?  (or maybe it is already possible
> with a filter?)

We could use one of these filters:

https://libguestfs.org/nbdkit-delay-filter.1.html
https://libguestfs.org/nbdkit-rate-filter.1.html

Something like "--filter=delay wdelay=1ms" might be more realistic.  To
simulate NVMe we might need to be able to specify microseconds there.

Rich.
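For example, a rough (and untested) setup along these lines - the size,
seed and paths are invented here - would put about 1ms of simulated
write latency in front of the destination while keeping the source in
memory:

  # source: sparse-random data served from memory
  nbdkit -U /tmp/src.sock --exit-with-parent sparse-random size=16G seed=42 &

  # destination: a raw file behind the delay filter, so every write
  # sees roughly 1ms of artificial storage latency
  truncate -s 16G /var/tmp/dst.img
  nbdkit -U /tmp/dst.sock --exit-with-parent --filter=delay \
         file /var/tmp/dst.img wdelay=1ms &

  # copy, flushing at the end so we are not cheating
  nbdcopy --flush 'nbd+unix:///?socket=/tmp/src.sock' \
                  'nbd+unix:///?socket=/tmp/dst.sock'

The rate filter could be layered on in the same way if we also wanted
to cap throughput rather than add per-request latency.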
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/
Nir Soffer
2021-May-26 15:15 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 5:15 PM Richard W.M. Jones <rjones at redhat.com> wrote:
...
> > >   -W       out of order writes, but the manual says "This is only
> > >            recommended for preallocated devices like host devices or
> > >            other raw block devices" which is a very unclear
> > >            recommendation to me.  What's special about host devices
> > >            versus (eg) files or qcow2 files which means -W wouldn't
> > >            always be recommended?
> >
> > This is how RHV uses qemu-img convert when copying to raw preallocated
> > volumes.  Using -W can be up to 6x faster.  We use the same for imageio
> > for any type of disk.  This is the reason I tested this way.
> >
> > -W is equivalent to the nbdcopy multithreaded copy using a single
> > connection.
> >
> > qemu-img does N concurrent reads.  If you don't specify -W, it writes
> > the data in the right order (based on offset).  If a read did not
> > finish, the copy loop waits until the read completes before writing.
> > This ensures exactly one concurrent write, and it is much slower.
>
> Thanks - interesting.  Still not sure why you wouldn't want to use
> this flag all the time.

We started to use -W for this bug:
https://bugzilla.redhat.com/1511891#c57

The comment shows that this can be up to 6 times faster.  When copying
to NFS the improvement was smaller, so we felt it was safer to avoid
this optimization.

But since then we have used unordered writes in imageio and in virt-v2v
rhv upload (master), so I think we can simplify and always use
unordered writes.

> See also:
> https://lists.nongnu.org/archive/html/qemu-discuss/2021-05/msg00070.html

...
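For reference, this is roughly the shape of the invocation (the paths
are placeholders, and -m 16 matches Richard's test rather than the 8 we
use in RHV); the only difference between the two runs is -W:

  # without -W: writes are issued in offset order, at most one write
  # in flight at a time
  qemu-img convert -n -p -f raw -O raw -m 16 \
      src.img /dev/vg/preallocated-lv

  # with -W: each read is written out as soon as it completes, so
  # writes can be issued out of order and in parallel
  qemu-img convert -n -p -f raw -O raw -m 16 -W \
      src.img /dev/vg/preallocated-lv

(-n skips creating the target, since the preallocated volume already
exists.)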
Eric Blake
2021-May-26 21:36 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 03:15:13PM +0100, Richard W.M. Jones wrote:
> On Wed, May 26, 2021 at 04:49:50PM +0300, Nir Soffer wrote:
> > On Wed, May 26, 2021 at 4:03 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> > > In my testing, nbdcopy is a clear 4x faster than qemu-img convert, with
> > > 4 also happening to be the default number of connections/threads.
> > > Why use nbdcopy --connections=1?  That completely disables threads in
> > > nbdcopy.
> >
> > Because qemu-nbd does not report multi-conn when writing, so practically
> > you get one NBD handle for writing.
>
> Let's see if we can fix that.  Crippling nbdcopy because of a missing
> feature in qemu-nbd isn't right.  I wonder what Eric's reasoning for
> multi-conn not being safe is.

multi-conn implies that connection A writes, connection B flushes, and
connection C is then guaranteed to read what connection A wrote.
Furthermore, if clients A and B plan on doing overlapping writes, the
presence of multi-conn means that whoever flushes last is guaranteed to
have that last write stick.  Without multi-conn, even if A writes, B
writes, B flushes, then A flushes, you can end up with A's data (rather
than B's) as the final contents on disk, because the separate
connections are allowed to have separate caching regions where the
order of flushes determines which cache (with potentially stale data)
gets flushed when.

And note that the effect of overlapping writes may happen even when
your client requests are not overlapping: if clients A and B both write
distinct 512-byte regions within a larger 4k page, a server performing
RMW caching of that page will behave as though there are overlapping
writes.

During nbdcopy or qemu-img convert, we aren't reading what we just
wrote and can easily arrange to avoid overlapping writes, so we don't
care about the bulk of the semantics of multi-conn (other than it being
a nice hint that the server accepts multiple clients).  So at the end
of the day, it boils down to:

If the server advertised multi-conn: connect multiple clients, then
when all of them are done writing, only ONE client has to flush, and
the flush will be valid for what all of the clients wrote.

If the server did not advertise multi-conn, but still allows multiple
clients: connect those clients, write to distinct areas (avoid
overlapping writes, and hopefully your writes are sized large enough
that you are also avoiding overlapping cache granularities); then when
all clients are finished writing, ALL of them must call flush (ideally,
only the first flush takes any real time, and the server can optimize
the later flushes as having nothing further to flush - but we can't
guarantee that).

New enough nbdkit also has the multi-conn filter where you can play
around with different multi-conn policies :)

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
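A rough sketch of playing with those policies (the socket path and size
are invented, and the mode names are from memory, so check
nbdkit-multi-conn-filter(1) for the exact spelling): serve the same
plugin with the multi-conn flag emulated or stripped, then ask libnbd
what the client actually sees:

  # advertise multi-conn on top of the memory plugin ...
  nbdkit -U /tmp/nbd.sock --filter=multi-conn memory size=1G \
         multi-conn-mode=emulate

  # ... or strip it, forcing the "every connection must flush" strategy
  # nbdkit -U /tmp/nbd.sock --filter=multi-conn memory size=1G \
  #        multi-conn-mode=disable

  # a client checks the flag to decide whether one flush is enough or
  # whether every connection must flush
  nbdsh -u 'nbd+unix:///?socket=/tmp/nbd.sock' -c 'print(h.can_multi_conn())'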