Richard W.M. Jones
2021-May-26 13:03 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 02:50:32PM +0300, Nir Soffer wrote:
> Basically all give very similar results.
>
> # hyperfine "./copy-libev $SRC $DST" "qemu-img convert -n -W -m 16 -S
> 1048576 $SRC $DST" "../copy/nbdcopy --sparse=1048576
> --request-size=1048576 --flush --requests=16 --connections=1 $SRC
> $DST"
> Benchmark #1: ./copy-libev nbd+unix:///?socket=/tmp/src.sock
> nbd+unix:///?socket=/tmp/dst.sock
>   Time (mean ± σ):     103.514 s ±  0.836 s    [User: 7.153 s, System: 19.422 s]
>   Range (min … max):   102.265 s … 104.824 s    10 runs
>
> Benchmark #2: qemu-img convert -n -W -m 16 -S 1048576
> nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
>   Time (mean ± σ):     103.104 s ±  0.899 s    [User: 2.897 s, System: 25.204 s]
>   Range (min … max):   101.958 s … 104.499 s    10 runs
>
> Benchmark #3: ../copy/nbdcopy --sparse=1048576 --request-size=1048576
> --flush --requests=16 --connections=1
> nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
>   Time (mean ± σ):     104.085 s ±  0.977 s    [User: 7.188 s, System: 19.965 s]
>   Range (min … max):   102.314 s … 105.153 s    10 runs

In my testing below, nbdcopy is a clear 4x faster than qemu-img
convert, with 4 also happening to be the default number of
connections/threads.

Why use nbdcopy --connections=1?  That completely disables threads in
nbdcopy.  Also I'm not sure if --flush is fair (it depends on what
qemu-img does, which I don't know).

The other interesting things are the qemu-img convert flags you're using:

  -m 16   number of coroutines, default is 8

  -W      out of order writes, but the manual says "This is only
          recommended for preallocated devices like host devices or
          other raw block devices", which is a very unclear
          recommendation to me.  What's special about host devices
          versus (eg) files or qcow2 files which means -W wouldn't
          always be recommended?

Anyway I tried various settings to see if I could improve the
performance of qemu-img convert vs nbdcopy using the sparse-random
test harness.  The results seem to confirm what has been said in this
thread so far.

libnbd-1.7.11-1.fc35.x86_64
nbdkit-1.25.8-2.fc35.x86_64
qemu-img-6.0.0-1.fc35.x86_64

$ hyperfine 'nbdkit -U - sparse-random size=100G --run "qemu-img convert \$uri \$uri"' 'nbdkit -U - sparse-random size=100G --run "qemu-img convert -m 16 -W \$uri \$uri"' 'nbdkit -U - sparse-random size=100G --run "nbdcopy \$uri \$uri"' 'nbdkit -U - sparse-random size=100G --run "nbdcopy --request-size=1048576 --requests=16 \$uri \$uri"'
Benchmark #1: nbdkit -U - sparse-random size=100G --run "qemu-img convert \$uri \$uri"
  Time (mean ± σ):     17.245 s ±  1.004 s    [User: 28.611 s, System: 7.219 s]
  Range (min … max):   15.711 s … 18.930 s    10 runs

Benchmark #2: nbdkit -U - sparse-random size=100G --run "qemu-img convert -m 16 -W \$uri \$uri"
  Time (mean ± σ):      8.618 s ±  0.266 s    [User: 33.091 s, System: 7.331 s]
  Range (min … max):    8.130 s …  8.943 s    10 runs

Benchmark #3: nbdkit -U - sparse-random size=100G --run "nbdcopy \$uri \$uri"
  Time (mean ± σ):      5.227 s ±  0.153 s    [User: 34.299 s, System: 30.136 s]
  Range (min … max):    5.049 s …  5.439 s    10 runs

Benchmark #4: nbdkit -U - sparse-random size=100G --run "nbdcopy --request-size=1048576 --requests=16 \$uri \$uri"
  Time (mean ± σ):      4.198 s ±  0.197 s    [User: 32.105 s, System: 24.562 s]
  Range (min … max):    3.868 s …  4.474 s    10 runs

Summary
  'nbdkit -U - sparse-random size=100G --run "nbdcopy --request-size=1048576 --requests=16 \$uri \$uri"' ran
    1.25 ± 0.07 times faster than 'nbdkit -U - sparse-random size=100G --run "nbdcopy \$uri \$uri"'
    2.05 ± 0.12 times faster than 'nbdkit -U - sparse-random size=100G --run "qemu-img convert -m 16 -W \$uri \$uri"'
    4.11 ± 0.31 times faster than 'nbdkit -U - sparse-random size=100G --run "qemu-img convert \$uri \$uri"'

> ## Compare nbdcopy request size with 16 requests and one connection

This is testing 4 connections I think?  Or is the destination not
advertising multi-conn?

> ## Compare number of requests with multiple connections
>
> To enable multiple connections to the destination, I hacked nbdcopy
> to ignore whether the destination reports multi-conn and always use
> multiple connections.  This is how we use qemu-nbd with imageio in RHV.

So qemu-nbd doesn't advertise multi-conn?  I'd prefer if we fixed qemu-nbd.
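A quick way to check what the server actually advertises (an untested
sketch, assuming the qemu-nbd destination from the commands above is
still listening on /tmp/dst.sock) is to point nbdinfo at the export
and look at the flags it prints:

  # List the destination export's advertised features; the socket
  # path here is the $DST used in the hyperfine command above.
  $ nbdinfo 'nbd+unix:///?socket=/tmp/dst.sock' | grep -i multi

If multi-conn is not among the advertised flags, that would explain
why only one connection is effectively usable for writes.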
Anyway, interesting stuff, thanks.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html
Nir Soffer
2021-May-26 13:49 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 4:03 PM Richard W.M. Jones <rjones at redhat.com> wrote:
>
> On Wed, May 26, 2021 at 02:50:32PM +0300, Nir Soffer wrote:
> > Basically all give very similar results.
> >
> > # hyperfine "./copy-libev $SRC $DST" "qemu-img convert -n -W -m 16 -S
> > 1048576 $SRC $DST" "../copy/nbdcopy --sparse=1048576
> > --request-size=1048576 --flush --requests=16 --connections=1 $SRC
> > $DST"
> > Benchmark #1: ./copy-libev nbd+unix:///?socket=/tmp/src.sock
> > nbd+unix:///?socket=/tmp/dst.sock
> >   Time (mean ± σ):     103.514 s ±  0.836 s    [User: 7.153 s, System: 19.422 s]
> >   Range (min … max):   102.265 s … 104.824 s    10 runs
> >
> > Benchmark #2: qemu-img convert -n -W -m 16 -S 1048576
> > nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
> >   Time (mean ± σ):     103.104 s ±  0.899 s    [User: 2.897 s, System: 25.204 s]
> >   Range (min … max):   101.958 s … 104.499 s    10 runs
> >
> > Benchmark #3: ../copy/nbdcopy --sparse=1048576 --request-size=1048576
> > --flush --requests=16 --connections=1
> > nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
> >   Time (mean ± σ):     104.085 s ±  0.977 s    [User: 7.188 s, System: 19.965 s]
> >   Range (min … max):   102.314 s … 105.153 s    10 runs
>
> In my testing below, nbdcopy is a clear 4x faster than qemu-img
> convert, with 4 also happening to be the default number of
> connections/threads.
>
> Why use nbdcopy --connections=1?  That completely disables threads in
> nbdcopy.

Because qemu-nbd does not report multi-conn when writing, so
practically you get one nbd handle for writing.

> Also I'm not sure if --flush is fair (it depends on what
> qemu-img does, which I don't know).

qemu is flushing at the end of the operation. Not flushing is cheating :-)

> The other interesting things are the qemu-img convert flags you're using:
>
>   -m 16   number of coroutines, default is 8

We use 8 in RHV since the difference is very small, and when running
concurrent copies it does not matter. Since we use up to 64 concurrent
requests in nbdcopy, it is useful to compare a similar setup in qemu.

>   -W      out of order writes, but the manual says "This is only
>           recommended for preallocated devices like host devices or
>           other raw block devices", which is a very unclear
>           recommendation to me.  What's special about host devices
>           versus (eg) files or qcow2 files which means -W wouldn't
>           always be recommended?

This is how RHV uses qemu-img convert when copying to raw preallocated
volumes. Using -W can be up to 6x faster. We use the same for imageio
for any type of disk. This is the reason I tested this way.

-W is equivalent to the nbdcopy multithreaded copy using a single
connection. qemu-img does N concurrent reads. If you don't specify -W,
it writes the data in the right order (based on offset). If a read did
not finish, the copy loop waits until the read completes before
writing. This ensures exactly one concurrent write, and it is much
slower.

> Anyway I tried various settings to see if I could improve the
> performance of qemu-img convert vs nbdcopy using the sparse-random
> test harness.  The results seem to confirm what has been said in this
> thread so far.
>
> libnbd-1.7.11-1.fc35.x86_64
> nbdkit-1.25.8-2.fc35.x86_64
> qemu-img-6.0.0-1.fc35.x86_64
>
> $ hyperfine 'nbdkit -U - sparse-random size=100G --run "qemu-img convert \$uri \$uri"' 'nbdkit -U - sparse-random size=100G --run "qemu-img convert -m 16 -W \$uri \$uri"' 'nbdkit -U - sparse-random size=100G --run "nbdcopy \$uri \$uri"' 'nbdkit -U - sparse-random size=100G --run "nbdcopy --request-size=1048576 --requests=16 \$uri \$uri"'
> Benchmark #1: nbdkit -U - sparse-random size=100G --run "qemu-img convert \$uri \$uri"
>   Time (mean ± σ):     17.245 s ±  1.004 s    [User: 28.611 s, System: 7.219 s]
>   Range (min … max):   15.711 s … 18.930 s    10 runs
>
> Benchmark #2: nbdkit -U - sparse-random size=100G --run "qemu-img convert -m 16 -W \$uri \$uri"
>   Time (mean ± σ):      8.618 s ±  0.266 s    [User: 33.091 s, System: 7.331 s]
>   Range (min … max):    8.130 s …  8.943 s    10 runs
>
> Benchmark #3: nbdkit -U - sparse-random size=100G --run "nbdcopy \$uri \$uri"
>   Time (mean ± σ):      5.227 s ±  0.153 s    [User: 34.299 s, System: 30.136 s]
>   Range (min … max):    5.049 s …  5.439 s    10 runs
>
> Benchmark #4: nbdkit -U - sparse-random size=100G --run "nbdcopy --request-size=1048576 --requests=16 \$uri \$uri"
>   Time (mean ± σ):      4.198 s ±  0.197 s    [User: 32.105 s, System: 24.562 s]
>   Range (min … max):    3.868 s …  4.474 s    10 runs
>
> Summary
>   'nbdkit -U - sparse-random size=100G --run "nbdcopy --request-size=1048576 --requests=16 \$uri \$uri"' ran
>     1.25 ± 0.07 times faster than 'nbdkit -U - sparse-random size=100G --run "nbdcopy \$uri \$uri"'
>     2.05 ± 0.12 times faster than 'nbdkit -U - sparse-random size=100G --run "qemu-img convert -m 16 -W \$uri \$uri"'
>     4.11 ± 0.31 times faster than 'nbdkit -U - sparse-random size=100G --run "qemu-img convert \$uri \$uri"'

This shows that nbdcopy works better when the latency is (practically)
zero, copying data from memory to memory. This is useful for minimizing
overhead in nbdcopy, but when copying real images to real storage with
direct I/O, the time to write the data to storage hides everything else.

Would it be useful to add latency in the sparse-random plugin, so it
behaves more like real storage? (or maybe it is already possible with
a filter?)
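Maybe nbdkit's delay filter already covers that. A rough, untested
sketch (the 1ms rdelay/wdelay values are arbitrary assumptions, just
to add a storage-like per-request latency on top of sparse-random):

$ hyperfine 'nbdkit -U - --filter=delay sparse-random size=100G rdelay=1ms wdelay=1ms --run "nbdcopy \$uri \$uri"'

Each read and write request then pays an extra ~1ms, which should look
a bit more like real storage than a pure memory-to-memory copy.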
> > ## Compare nbdcopy request size with 16 requests and one connection
>
> This is testing 4 connections I think?  Or is the destination not
> advertising multi-conn?
>
> > ## Compare number of requests with multiple connections
> >
> > To enable multiple connections to the destination, I hacked nbdcopy
> > to ignore whether the destination reports multi-conn and always use
> > multiple connections.  This is how we use qemu-nbd with imageio in RHV.
>
> So qemu-nbd doesn't advertise multi-conn?  I'd prefer if we fixed qemu-nbd.

Right, but according to Eric, advertising multi-conn is not safe with
the current code. But he confirmed that using multiple writers writing
to distinct ranges is safe, so nbdcopy can use this now for an extra
10% performance in real copies.

Nir