Richard W.M. Jones
2021-May-25 18:04 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
Hi Abhay,

FYI I thought I would document the successful commands I am using to
benchmark nbdcopy and produce the flame graphs that you saw this
morning. Attached is a very recent flame graph produced using this
method.

Firstly, I'm running everything on Fedora 34, with selected packages
upgraded to Fedora Rawhide. However, any reasonably recent Linux
distro should work fine. You will need to install the perf tool.

Compile libnbd & nbdkit from git source, following the instructions in
the respective README files:

https://gitlab.com/nbdkit/libnbd
https://gitlab.com/nbdkit/nbdkit

I have nbdkit and libnbd checked out in adjacent directories. This is
important so that commands like "./nbdkit" and "../libnbd/run nbdcopy"
work. There's more information about this in the READMEs.

I ran perf as below. Although nbdcopy and nbdkit themselves do not
require root (and usually should _not_ be run as root), in this case
perf must be run as root, so everything has to be run as root.

# perf record -a -g --call-graph=dwarf ./nbdkit -U - sparse-random size=1T --run "MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri"

Some things to explain:

* The output is perf.data in the local directory. This file may be
  huge (22 GB for me!).

* I am running this from the nbdkit directory, so ./nbdkit runs the
  locally compiled copy of nbdkit. This allows me to make quick
  changes to nbdkit and see the effects immediately.

* I am running nbdcopy using "../libnbd/run nbdcopy", so that's from
  the adjacent locally compiled libnbd directory. Again, the reason
  for this is so I can make changes, recompile libnbd, and see the
  effect quickly.

* "MALLOC_CHECK_=" is needed for complicated reasons to do with how
  the nbdkit wrapper enables malloc-checking. We should probably
  provide a way to disable malloc-checking when benchmarking, because
  it adds overhead for no benefit, but I've not done that yet
  (patches welcome!).
* The test harness is nbdkit-sparse-random-plugin, documented here:
  https://libguestfs.org/nbdkit-sparse-random-plugin.1.html

* I'm using DWARF debugging info to generate call stacks, which is
  more reliable than the default (frame pointers).

* The -a option means I'm measuring events on the whole machine. You
  can read the perf manual to find out how to measure only a single
  process (eg. just nbdkit or just nbdcopy). But actually measuring
  the whole machine gives a truer picture, I believe.

* If the test takes too long to run or runs out of space, try
  adjusting the size (1T = 1 terabyte) downwards, eg. 512G, 256G, ...,
  until it fits. Although nbdkit doesn't store the virtual disk or
  use very much memory at all, the test does appear to stress the
  Linux VMM, and the amount of perf.data generated can be huge.

Then I run this long command to generate the flame graph. Again, it
must be run as root:

# perf script | ../FlameGraph/stackcollapse-perf.pl | ../FlameGraph/flamegraph.pl > nbdcopy.svg

* This reads perf.data as input.

* Brendan Gregg's FlameGraph code is checked out in another adjacent
  directory.

You can open the SVG file in a web browser. Try clicking around -
it's interactive.

If you get stuck, ask questions, we're here to help.

Rich.

--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nbdcopy2.svg.xz
Type: application/x-xz
Size: 28112 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20210525/3b36a3cd/attachment.xz>
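[Editor's note: the two root commands above can be collected into one small script. This is a sketch, not part of the original mail: it assumes the same adjacent-directory layout (current directory is the nbdkit checkout, with ../libnbd and ../FlameGraph alongside it), and it only prints the commands rather than executing them, since both steps need root and the locally compiled trees.]

```shell
#!/bin/sh
# Sketch of the record + flame graph pipeline from the mail above.
# Assumed layout: cwd = nbdkit checkout; ../libnbd and ../FlameGraph adjacent.
# Commands are printed, not run: both require root and the local builds.

SIZE=1T    # shrink to 512G, 256G, ... if perf.data grows too large

# Note the single quotes around the --run argument: $uri must reach
# nbdkit unexpanded, because nbdkit itself substitutes it.
record="perf record -a -g --call-graph=dwarf \
./nbdkit -U - sparse-random size=$SIZE \
--run 'MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri'"

# Reads perf.data from the current directory, writes nbdcopy.svg.
graph="perf script | ../FlameGraph/stackcollapse-perf.pl \
| ../FlameGraph/flamegraph.pl > nbdcopy.svg"

echo "$record"
echo "$graph"
```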
Abhay Raj Singh
2021-May-25 18:18 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
Thanks! I was able to get perf.data and can analyse it using perf
report, which showed the expected output, but flamegraph gave errors
for quite some time. It turns out I had to use stackcollapse-perf.pl
instead of stackcollapse.pl.

Thanks!!
Nir Soffer
2021-May-26 08:40 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Tue, May 25, 2021 at 9:06 PM Richard W.M. Jones <rjones at redhat.com> wrote:

> I ran perf as below. Although nbdcopy and nbdkit themselves do not
> require root (and usually should _not_ be run as root), in this case
> perf must be run as root, so everything has to be run as root.
>
> # perf record -a -g --call-graph=dwarf ./nbdkit -U - sparse-random size=1T --run "MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri"

This uses 64 requests with a request size of 32m. In my tests, using
--requests 16 --request-size 1048576 is faster. Did you try to
profile this?

> * "MALLOC_CHECK_=" is needed because of complicated reasons to do
>   with how the nbdkit wrapper enables malloc-checking. We should
>   probably provide a way to disable malloc-checking when benchmarking
>   because it adds overhead for no benefit, but I've not done that yet
>   (patches welcome!)

Why enable malloc checking in nbdkit when profiling nbdcopy?

> * The test harness is nbdkit-sparse-random-plugin, documented here:
>   https://libguestfs.org/nbdkit-sparse-random-plugin.1.html

Does it create a similar pattern to real-world images, or more like
the worst case?
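[Editor's note: one way to see why fewer, smaller requests might help is the amount of buffer memory in flight. The 64 x 32M and 16 x 1M figures are from the message above; reading the product as the maximum in-flight data is this sketch's assumption, not something stated in the thread.]

```shell
# Maximum in-flight data = number of requests x request size.
default_mib=$((64 * 32))   # default: 64 requests x 32 MiB
tuned_mib=$((16 * 1))      # --requests 16 --request-size 1048576 (1 MiB)

echo "default: ${default_mib} MiB in flight"   # 2048 MiB = 2 GiB
echo "tuned:   ${tuned_mib} MiB in flight"     # 16 MiB
```

Two gigabytes of buffers in flight plausibly stresses the allocator far more than sixteen megabytes, which would fit the malloc/free cost visible in the flame graph.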
In my tests, using the nbdkit memory and pattern plugins was way more
stable compared with real images via qemu-nbd/nbdkit, but real images
give more realistic results :-)

Maybe we can extract the extents from a real image, and add a plugin
accepting JSON extents and inventing data for the data extents?

> * I'm using DWARF debugging info to generate call stacks, which is
>   more reliable than the default (frame pointers).

When I tried to use perf, I did not get proper call stacks; maybe
this was the reason.

> * The -a option means I'm measuring events on the whole machine. You
>   can read the perf manual to find out how to measure only a single
>   process (eg. just nbdkit or just nbdcopy). But actually measuring
>   the whole machine gives a truer picture, I believe.

Why profile the whole machine? I would profile only nbdcopy or
nbdkit, depending on what we are trying to focus on.

Looking at the attached flame graph, if we focus on the nbdcopy
worker_thread, and sort by time:

    poll_both_ends:       14.53%  (58%)
    malloc:                5.55%  (22%)
    nbd_ops_async_read:    4.34%  (17%)
    nbd_ops_get_extents:   0.52%  (2%)

If we look into poll_both_ends:

    send:  10.17%  (69%)
    free:   4.53%  (31%)

So we have a lot of opportunities to optimize by allocating all
buffers up front, as done in examples/libev-copy. But I'm not sure we
would see the same picture when using smaller buffers
(--request-size 1m).

nbd_ops_async_read is surprising - this is an async operation that
should consume no time. Why does it take 17% of the time?

Thanks for the info!
Nir
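[Editor's note: the bracketed relative shares above follow from the absolute percentages: each function's fraction of the worker_thread total (14.53 + 5.55 + 4.34 + 0.52 = 24.94). A quick check of the arithmetic, not part of the original mail:]

```shell
# Recompute the bracketed relative shares from the absolute percentages.
awk 'BEGIN {
    total = 14.53 + 5.55 + 4.34 + 0.52              # worker_thread = 24.94
    printf "poll_both_ends      %.0f%%\n", 14.53 / total * 100   # 58%
    printf "malloc              %.0f%%\n",  5.55 / total * 100   # 22%
    printf "nbd_ops_async_read  %.0f%%\n",  4.34 / total * 100   # 17%
    printf "nbd_ops_get_extents %.0f%%\n",  0.52 / total * 100   # 2%
}'
```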