Richard W.M. Jones
2021-May-26 10:25 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 10:32:08AM +0100, Richard W.M. Jones wrote:
> On Wed, May 26, 2021 at 11:40:11AM +0300, Nir Soffer wrote:
> > On Tue, May 25, 2021 at 9:06 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> > > I ran perf as below. Although nbdcopy and nbdkit themselves do not
> > > require root (and usually should _not_ be run as root), in this case
> > > perf must be run as root, so everything has to be run as root.
> > >
> > > # perf record -a -g --call-graph=dwarf ./nbdkit -U - sparse-random size=1T --run "MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri"
> >
> > This uses 64 requests with a request size of 32m. In my tests using
> > --requests 16 --request-size 1048576 is faster. Did you try to profile
> > this?
>
> Interesting! No I didn't. In fact I just assumed that larger request
> sizes / number of parallel requests would be better.

This is the topology of the machine I ran the tests on:

https://rwmj.files.wordpress.com/2019/09/screenshot_2019-09-04_11-08-41.png

Even a single 32MB buffer isn't going to fit in any cache, so reducing
the buffer size should be a win, and once the buffers fit within the L3
cache, reusing them should also be a win.

That's the theory anyway ... Using --request-size=1048576 changes the
flamegraph quite dramatically (see new attachment).

[What is the meaning of the swapper stack traces? Are they coming
from idle cores?]

The test runs slightly faster:

$ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy \$uri \$uri"'
Benchmark #1: nbdkit -U - sparse-random size=1T --run "nbdcopy \$uri \$uri"
  Time (mean ± σ):     47.407 s ±  0.953 s    [User: 347.982 s, System: 276.220 s]
  Range (min … max):   46.474 s … 49.373 s    10 runs

$ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size=1048576 \$uri \$uri"'
Benchmark #1: nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size=1048576 \$uri \$uri"
  Time (mean ± σ):     43.796 s ±  0.799 s    [User: 328.134 s, System: 252.775 s]
  Range (min … max):   42.289 s … 44.917 s    10 runs

(Note the buffers are still not being reused.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nbdcopy3.svg.xz
Type: application/x-xz
Size: 28744 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20210526/84fd7c76/attachment.xz>
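The scrubbed attachment above is the new flamegraph (xz-compressed for the list). The exact
invocation used to render it isn't shown in the thread, but the usual way to turn the perf.data
recorded with the command above into such an SVG is Brendan Gregg's FlameGraph scripts:
perf script dumps the recorded samples (run as root, like perf record), then the two scripts
fold the stacks and render the graph. Roughly, with illustrative paths and file names:

# perf script > out.perf
# git clone https://github.com/brendangregg/FlameGraph
# ./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
# ./FlameGraph/flamegraph.pl out.folded > nbdcopy.svg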
Nir Soffer
2021-May-26 11:50 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 1:25 PM Richard W.M. Jones <rjones at redhat.com> wrote:
>
> On Wed, May 26, 2021 at 10:32:08AM +0100, Richard W.M. Jones wrote:
> > On Wed, May 26, 2021 at 11:40:11AM +0300, Nir Soffer wrote:
> > > On Tue, May 25, 2021 at 9:06 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> > > > I ran perf as below. Although nbdcopy and nbdkit themselves do not
> > > > require root (and usually should _not_ be run as root), in this case
> > > > perf must be run as root, so everything has to be run as root.
> > > >
> > > > # perf record -a -g --call-graph=dwarf ./nbdkit -U - sparse-random size=1T --run "MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri"
> > >
> > > This uses 64 requests with a request size of 32m. In my tests using
> > > --requests 16 --request-size 1048576 is faster. Did you try to profile
> > > this?
> >
> > Interesting! No I didn't. In fact I just assumed that larger request
> > sizes / number of parallel requests would be better.
>
> This is the topology of the machine I ran the tests on:
>
> https://rwmj.files.wordpress.com/2019/09/screenshot_2019-09-04_11-08-41.png
>
> Even a single 32MB buffer isn't going to fit in any cache, so reducing
> the buffer size should be a win, and once the buffers fit within the L3
> cache, reusing them should also be a win.
>
> That's the theory anyway ... Using --request-size=1048576 changes the
> flamegraph quite dramatically (see new attachment).

Interestingly, malloc is now about 35% of the time (6.6/18.4) of the
worker thread.

> [What is the meaning of the swapper stack traces? Are they coming
> from idle cores?]
>
> The test runs slightly faster:
>
> $ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy \$uri \$uri"'
> Benchmark #1: nbdkit -U - sparse-random size=1T --run "nbdcopy \$uri \$uri"
>   Time (mean ± σ):     47.407 s ±  0.953 s    [User: 347.982 s, System: 276.220 s]
>   Range (min … max):   46.474 s … 49.373 s    10 runs
>
> $ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size=1048576 \$uri \$uri"'
> Benchmark #1: nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size=1048576 \$uri \$uri"
>   Time (mean ± σ):     43.796 s ±  0.799 s    [User: 328.134 s, System: 252.775 s]
>   Range (min … max):   42.289 s … 44.917 s    10 runs

Adding --requests 16 is faster with a real server, copying real images
on shared storage.
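For the local sparse-random harness above, combining both suggested
settings would look something like this (not benchmarked in this
thread):

$ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy --requests=16 --request-size=1048576 \$uri \$uri"'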
These flamegraphs are awesome!

Here are results from tests I did a few months ago in the RHV scale lab.

## Server

model name : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
cores: 40
memory: 503g

## Source

Dell Express Flash PM1725b 3.2TB SFF

According to the Dell site, this is:
http://image-us.samsung.com/SamsungUS/PIM/Samsung_1725b_Product.pdf

# qemu-img info /scratch/nsoffer-v2v.qcow2
image: /scratch/nsoffer-v2v.qcow2
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 66.5 GiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

Exported with qemu-nbd:

qemu-nbd --persistent --shared=8 --format=qcow2 --cache=none --aio=native --read-only /scratch/nsoffer-v2v.qcow2 --socket /tmp/src.sock

(Using the configuration used by oVirt when exporting disks for backup.)

## Destination

NetApp LUN connected via FC over 4 paths:

# multipath -ll
3600a098038304437415d4b6a59682f76 dm-4 NETAPP,LUN C-Mode
size=1.0T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=50 status=active
| |- 8:0:1:0 sdf 8:80 active ready running
| `- 8:0:0:0 sdd 8:48 active ready running
`-+- policy='service-time 0' prio=10 status=enabled
  |- 1:0:0:0 sde 8:64 active ready running
  `- 1:0:1:0 sdg 8:96 active ready running

The disk is a logical volume on this LUN:

# qemu-img info -U /dev/f7b5c299-df2a-42bc-85d7-b60027f14e00/8825cff6-a9ef-4f8a-b159-97d77e21cf03
image: /dev/f7b5c299-df2a-42bc-85d7-b60027f14e00/8825cff6-a9ef-4f8a-b159-97d77e21cf03
file format: qcow2
virtual size: 100 GiB (107374182400 bytes)
disk size: 0 B
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false

Exported with qemu-nbd:

qemu-nbd --persistent --shared=8 --format=qcow2 --cache=none --aio=native /root/nsoffer/target-disk --socket /tmp/dst.sock

## Compare qemu-img convert, nbdcopy and copy-libev with similar sparse settings

Basically all give very similar results.

# hyperfine "./copy-libev $SRC $DST" "qemu-img convert -n -W -m 16 -S 1048576 $SRC $DST" "../copy/nbdcopy --sparse=1048576 --request-size=1048576 --flush --requests=16 --connections=1 $SRC $DST"

Benchmark #1: ./copy-libev nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     103.514 s ±  0.836 s    [User: 7.153 s, System: 19.422 s]
  Range (min … max):   102.265 s … 104.824 s    10 runs

Benchmark #2: qemu-img convert -n -W -m 16 -S 1048576 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     103.104 s ±  0.899 s    [User: 2.897 s, System: 25.204 s]
  Range (min … max):   101.958 s … 104.499 s    10 runs

Benchmark #3: ../copy/nbdcopy --sparse=1048576 --request-size=1048576 --flush --requests=16 --connections=1 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     104.085 s ±  0.977 s    [User: 7.188 s, System: 19.965 s]
  Range (min … max):   102.314 s … 105.153 s    10 runs

Summary
  'qemu-img convert -n -W -m 16 -S 1048576 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock' ran
    1.00 ± 0.01 times faster than './copy-libev nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock'
    1.01 ± 0.01 times faster than '../copy/nbdcopy --sparse=1048576 --request-size=1048576 --flush --requests=16 --connections=1 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock'

## Compare nbdcopy request sizes with 16 requests and one connection

# hyperfine "./copy-libev nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock"

Benchmark #1: ./copy-libev nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     104.195 s ±  1.911 s    [User: 8.652 s, System: 18.887 s]
  Range (min … max):   102.474 s … 108.660 s    10 runs

# hyperfine -L r 524288,1048576,2097152 --export-json nbdcopy-nbd-to-nbd-request-size.json "./nbdcopy --requests=16 --request-size={r} nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock"

Benchmark #1: ./nbdcopy --requests=16 --request-size=524288 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     108.251 s ±  0.942 s    [User: 5.538 s, System: 21.327 s]
  Range (min … max):   107.098 s … 110.019 s    10 runs

Benchmark #2: ./nbdcopy --requests=16 --request-size=1048576 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     105.973 s ±  0.732 s    [User: 7.901 s, System: 22.064 s]
  Range (min … max):   104.915 s … 107.003 s    10 runs

Benchmark #3: ./nbdcopy --requests=16 --request-size=2097152 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     109.151 s ±  1.355 s    [User: 9.898 s, System: 26.591 s]
  Range (min … max):   107.168 s … 111.176 s    10 runs

Summary
  './nbdcopy --requests=16 --request-size=1048576 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock' ran
    1.02 ± 0.01 times faster than './nbdcopy --requests=16 --request-size=524288 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock'
    1.03 ± 0.01 times faster than './nbdcopy --requests=16 --request-size=2097152 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock'

## Compare number of requests with multiple connections

To enable multiple connections to the destination, I hacked nbdcopy to
ignore the destination's multi-conn flag and always use multiple
connections (see the nbdinfo note below). This is how we use qemu-nbd
with imageio in RHV.

This shows about 10% better performance. The best result was with 4
requests per connection, but the difference between 4, 8, and 16 is not
significant.

# hyperfine -r3 -L r 1,2,4,8,16 "./nbdcopy --flush --request-size=1048576 --requests={r} --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock"

Benchmark #1: ./nbdcopy --flush --request-size=1048576 --requests=1 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     117.876 s ±  1.612 s    [User: 6.968 s, System: 23.676 s]
  Range (min … max):   116.163 s … 119.363 s    3 runs

Benchmark #2: ./nbdcopy --flush --request-size=1048576 --requests=2 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     96.447 s ±  0.319 s    [User: 8.216 s, System: 23.213 s]
  Range (min … max):   96.192 s … 96.805 s    3 runs

Benchmark #3: ./nbdcopy --flush --request-size=1048576 --requests=4 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     91.356 s ±  0.339 s    [User: 10.269 s, System: 23.029 s]
  Range (min … max):   91.013 s … 91.691 s    3 runs

Benchmark #4: ./nbdcopy --flush --request-size=1048576 --requests=8 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     91.387 s ±  0.965 s    [User: 12.699 s, System: 26.156 s]
  Range (min … max):   90.786 s … 92.500 s    3 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark #5: ./nbdcopy --flush --request-size=1048576 --requests=16 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock
  Time (mean ± σ):     91.637 s ±  0.861 s    [User: 13.816 s, System: 31.043 s]
  Range (min … max):   91.077 s … 92.629 s    3 runs

Summary
  './nbdcopy --flush --request-size=1048576 --requests=4 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock' ran
    1.00 ± 0.01 times faster than './nbdcopy --flush --request-size=1048576 --requests=8 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock'
    1.00 ± 0.01 times faster than './nbdcopy --flush --request-size=1048576 --requests=16 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock'
    1.06 ± 0.01 times faster than './nbdcopy --flush --request-size=1048576 --requests=2 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock'
    1.29 ± 0.02 times faster than './nbdcopy --flush --request-size=1048576 --requests=1 --connections=4 nbd+unix:///?socket=/tmp/src.sock nbd+unix:///?socket=/tmp/dst.sock'

Nir
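Whether the destination export advertises multi-conn (the flag the hack
above ignores) can be checked with nbdinfo from libnbd; assuming the
same socket path as in the setup above, and noting that the output
field name may differ between libnbd versions:

$ nbdinfo 'nbd+unix:///?socket=/tmp/dst.sock' | grep multi_conn

At the time, qemu-nbd generally did not advertise multi-conn for
writable exports, which is presumably why forcing multiple connections
required a hack.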