Richard W.M. Jones
2021-May-26 09:32 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 11:40:11AM +0300, Nir Soffer wrote:
> On Tue, May 25, 2021 at 9:06 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> > I ran perf as below.  Although nbdcopy and nbdkit themselves do not
> > require root (and usually should _not_ be run as root), in this case
> > perf must be run as root, so everything has to be run as root.
> >
> > # perf record -a -g --call-graph=dwarf ./nbdkit -U - sparse-random size=1T --run "MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri"
>
> This uses 64 requests with a request size of 32m. In my tests using
> --requests 16 --request-size 1048576 is faster. Did you try to profile
> this?

Interesting!  No I didn't.  In fact I just assumed that larger request
sizes / numbers of parallel requests would be better.

This would be an interesting and easy avenue for Abhay to explore too.

> > Some things to explain:
> >
> > * The output is perf.data in the local directory.  This file may be
> >   huge (22GB for me!)
> >
> > * I am running this from the nbdkit directory, so ./nbdkit runs the
> >   locally compiled copy of nbdkit.  This allows me to make quick
> >   changes to nbdkit and see the effects immediately.
> >
> > * I am running nbdcopy using "../libnbd/run nbdcopy", so that's from
> >   the adjacent locally compiled libnbd directory.  Again the reason
> >   for this is so I can make changes, recompile libnbd, and see the
> >   effect quickly.
> >
> > * "MALLOC_CHECK_=" is needed because of complicated reasons to do
> >   with how the nbdkit wrapper enables malloc-checking.  We should
> >   probably provide a way to disable malloc-checking when benchmarking
> >   because it adds overhead for no benefit, but I've not done that yet
> >   (patches welcome!)
>
> Why enable malloc checking in nbdkit when profiling nbdcopy?

There's no good reason.  It's because the ./nbdkit wrapper is designed
for two purposes: to run nbdkit from the local build directory without
installing it, and as a way to run nbdkit during tests.  In tests we
want malloc checking because it's an easy way to discover some memory
problems, but it's not useful for the other purpose.  I need to work
out some way to turn on malloc checking only when running the tests.

> > * The test harness is nbdkit-sparse-random-plugin, documented here:
> >   https://libguestfs.org/nbdkit-sparse-random-plugin.1.html
>
> Does it create a similar pattern to real world images, or more like
> the worst case?
>
> In my tests using nbdkit memory and pattern plugins was way more
> stable compared with real images via qemu-nbd/nbdkit, but real images
> give more realistic results :-)

It's meant to make virtual images which consist of runs of blocks and
holes, and therefore look at least somewhat like disk images:

https://gitlab.com/nbdkit/nbdkit/-/blob/65d0d80da15ed01ed1f9d82fef316596948b7536/plugins/sparse-random/sparse-random.c#L135

The problem with using real disk images is that they're huge, the test
tends to be dominated by other effects like disk I/O and memory, and
the results are not easily reproducible by other people.  (With
sparse-random, setting the seed and other parameters should produce
bit-for-bit identical images.)

The problem with the memory and pattern plugins is that the sparseness
is unrealistic.

> Maybe we can extract the extents from a real image, and add a plugin
> accepting json extents and inventing data for the data extents?

Something like this would also work.
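One rough sketch of that idea: qemu-img map can already dump the extent
layout of a real image as JSON, which such a plugin could replay with
invented data.  (The "extents-json" plugin named below is hypothetical
and does not exist yet.)

  $ qemu-img map --output=json real-disk.qcow2 > extents.json
  $ ./nbdkit -U - extents-json extents=extents.json \
      --run "../libnbd/run nbdcopy \$uri \$uri"    # extents-json is hypothetical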
> > * I'm using DWARF debugging info to generate call stacks, which is
> >   more reliable than the default (frame pointers).
>
> When I tried to use perf, I did not get proper call stacks, maybe this
> was the reason.

Yes, it is infuriatingly difficult to get reliable stacks out of perf.
The latest bug I'm wrestling with is:

https://bugzilla.redhat.com/show_bug.cgi?id=1963865

> > * The -a option means I'm measuring events on the whole machine.  You
> >   can read the perf manual to find out how to measure only a single
> >   process (eg. just nbdkit or just nbdcopy).  But actually measuring
> >   the whole machine gives a truer picture, I believe.
>
> Why profile the whole machine? I would profile only nbdcopy or nbdkit,
> depending on what we are trying to focus on.

Actually I think there are use cases for both.  However if I'd only
profiled nbdcopy then I wouldn't have noticed that the sparse-random
plugin had a horrible hotspot in one function!  (now fixed)

> Looking in the attached flame graph, if we focus on the nbdcopy
> worker_thread, and sort by time:
>
>   poll_both_ends:       14.53% (58%)
>   malloc:                5.55% (22%)
>   nbd_ops_async_read:    4.34% (17%)
>   nbd_ops_get_extents:   0.52% (2%)
>
> If we focus into poll_both_ends:
>
>   send:                 10.17% (69%)
>   free:                  4.53% (31%)
>
> So we have a lot of opportunities to optimize by allocating all
> buffers up front as done in examples/libev-copy. But I'm not sure we
> would see the same picture when using smaller buffers
> (--request-size 1m).

Also about 1/3rd of all the time is spent copying data to and from
userspace.  And even though, as you point out, nbdcopy does too much
malloc, it's surprising that the overhead of malloc is so large.

So for context, Abhay is investigating io_uring.  I think this allows
us to register buffers with the kernel and have the kernel copy data
directly into and out of those buffers.  If it's possible, this should
solve both problems.

One thing I'm not clear about is how exactly Unix domain sockets work.
Do they use packets and framing (and therefore skbs)?  I kind of assume
not, since it's just two processes on the same host producing and
consuming data.  If skbs are involved in the network layer then
presumably there are fewer opportunities to do zero copy.

(Note that in the real world of virt-v2v and hypervisor migration we
expect to use a mix of Unix domain sockets and TCP, depending on the
exact source and destination of the copy.)

> nbd_ops_async_read is surprising - this is an async operation that should
> consume no time. Why does it take 17% of the time?

Yeah, there are a few mysteries like this.  More things for Abhay to
investigate ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
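[For reference, a flame graph like the ones attached in this thread can
presumably be produced from the perf.data recorded above using Brendan
Gregg's FlameGraph scripts; this is only a sketch, as the exact commands
used are not shown in the thread:

  # perf script > out.perf                      # reads ./perf.data; run as root
  $ stackcollapse-perf.pl out.perf > out.folded
  $ flamegraph.pl out.folded > nbdcopy.svg
]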
Richard W.M. Jones
2021-May-26 10:25 UTC
[Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy
On Wed, May 26, 2021 at 10:32:08AM +0100, Richard W.M. Jones wrote:
> On Wed, May 26, 2021 at 11:40:11AM +0300, Nir Soffer wrote:
> > On Tue, May 25, 2021 at 9:06 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> > > I ran perf as below.  Although nbdcopy and nbdkit themselves do not
> > > require root (and usually should _not_ be run as root), in this case
> > > perf must be run as root, so everything has to be run as root.
> > >
> > > # perf record -a -g --call-graph=dwarf ./nbdkit -U - sparse-random size=1T --run "MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri"
> >
> > This uses 64 requests with a request size of 32m. In my tests using
> > --requests 16 --request-size 1048576 is faster. Did you try to profile
> > this?
>
> Interesting!  No I didn't.  In fact I just assumed that larger request
> sizes / numbers of parallel requests would be better.

This is the topology of the machine I ran the tests on:

https://rwmj.files.wordpress.com/2019/09/screenshot_2019-09-04_11-08-41.png

Even a single 32MB buffer isn't going to fit in any cache, so reducing
the buffer size should be a win, and once the buffers are within the
size of the L3 cache, reusing them should also be a win.  That's the
theory anyway ...

Using --request-size=1048576 changes the flamegraph quite dramatically
(see new attachment).

[What is the meaning of the swapper stack traces?  Are they coming from
idle cores?]

The test runs slightly faster:

$ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy \$uri \$uri"'
Benchmark #1: nbdkit -U - sparse-random size=1T --run "nbdcopy \$uri \$uri"
  Time (mean ± σ):     47.407 s ±  0.953 s    [User: 347.982 s, System: 276.220 s]
  Range (min … max):   46.474 s … 49.373 s    10 runs

$ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size=1048576 \$uri \$uri"'
Benchmark #1: nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size=1048576 \$uri \$uri"
  Time (mean ± σ):     43.796 s ±  0.799 s    [User: 328.134 s, System: 252.775 s]
  Range (min … max):   42.289 s … 44.917 s    10 runs

(Note the buffers are still not being reused.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nbdcopy3.svg.xz
Type: application/x-xz
Size: 28744 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/libguestfs/attachments/20210526/84fd7c76/attachment.xz>
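[Following on from the two runs above, hyperfine's --parameter-list
option could be used to sweep several request sizes in one go.  This is
only a sketch; the values are illustrative, 33554432 being the 32M
default mentioned earlier:

  $ hyperfine -L rs 262144,1048576,33554432 \
      'nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size={rs} \$uri \$uri"'

hyperfine then prints a summary comparing the relative speed of the
parameterised runs.]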