Nir Soffer
2021-Jun-20 19:21 UTC
[Libguestfs] [PATCH libnbd 2/2] copy: Set default request-size to 2**18 (262144 bytes)
On Sun, Jun 20, 2021 at 7:46 PM Richard W.M. Jones <rjones at redhat.com> wrote:
>
> As Nir has often pointed out, our current default request buffer size
> (32MB) is too large, resulting in nbdcopy being as much as 2½ times
> slower than it could be.
>
> The optimum buffer size most likely depends on the hardware, and may
> even vary over time as machines get generally larger caches.  To
> explore the problem I used this command:
>
>   $ hyperfine -P rs 15 25 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**{rs})) \$uri \$uri"'

This uses the same process for serving both reads and writes, which may
be different from real-world usage, where one process is used for reading
and one for writing.

> On my 2019-era AMD server with 32GB of RAM and 64MB * 4 of L3 cache,
> 2**18 (262144) was the optimum when I tested all sizes between
> 2**15 (32K) and 2**25 (32M, the current default).
>
>   Summary
>     'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"' ran
>       1.03 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>       1.06 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
>       1.09 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'

The difference is very small up to this point.

>       1.23 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>       1.26 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>       1.39 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>       1.45 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
>       1.61 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>       1.94 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>       2.47 ± 0.08 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
>
> On my 2018-era Intel laptop with a measly 8 MB of L3 cache, the optimum
> size is one power-of-2 smaller (but 2**18 is still an improvement):
>
>   Summary
>     'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"' ran

This matches results I got when testing the libev example on a Lenovo
T480s (~2018) and a Dell Optiplex 9080 (~2012).

>       1.05 ± 0.19 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
>       1.06 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>       1.10 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
>       1.22 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>       1.29 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
>       1.33 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>       1.35 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>       1.38 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>       1.45 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>       1.63 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
>
> To get an idea of the best request size on something rather different,
> this is a Raspberry Pi 4B.  I had to reduce the copy size down by a
> factor of 10 (to 10G) to make it run in a reasonable time.  2**18 is
> about 8% slower than the optimum choice (2**15).  It's still
> significantly better than our current default.
>
>   Summary
>     'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"' ran
>       1.00 ± 0.04 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>       1.03 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
>       1.04 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>       1.05 ± 0.08 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>       1.05 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>       1.07 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
>       1.08 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
>       1.15 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>       1.28 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>       1.35 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'

But all these results do not test a real-world copy.  They test copying
from memory to memory with zero (practical) latency.

When I tested using real storage on a real server, I got the best results
using 16 requests, one connection, and a request size of 1 MiB.
4 connections with 4 requests per connection and the same request size
seem to be ~10% faster in these conditions.

I posted more info on these tests here:
https://listman.redhat.com/archives/libguestfs/2021-May/msg00124.html

Of course testing with other servers or storage can show different results,
and it is impossible to find a value that will work best in all cases.
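Since the hyperfine one-liner above serves reads and writes from a single
nbdkit process, a two-process variant of the same test, with the request
count made explicit, could look something like this (an untested sketch;
the socket paths are made up, and the flags follow nbdcopy(1)):

  # Serve source and destination from two separate nbdkit processes
  # (nbdkit forks into the background by default).
  $ nbdkit -U /tmp/src.sock sparse-random size=100G seed=1
  $ nbdkit -U /tmp/dst.sock sparse-random size=100G seed=1

  # Copy with an explicit request size and number of in-flight requests
  # (the connection count is left at the default), then kill both servers.
  $ nbdcopy --request-size=$((2**20)) --requests=16 \
      'nbd+unix:///?socket=/tmp/src.sock' 'nbd+unix:///?socket=/tmp/dst.sock'
  $ pkill nbdkit

This still copies memory to memory, but at least it separates the reading
and writing servers and lets the number of requests be varied alongside
the request size.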
I think we need to test both the number of requests and connections to
improve the defaults.

> ---
>  copy/main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/copy/main.c b/copy/main.c
> index 0fddfc3..70534b5 100644
> --- a/copy/main.c
> +++ b/copy/main.c
> @@ -50,7 +50,7 @@ bool flush;                     /* --flush flag */
>  unsigned max_requests = 64;     /* --requests */
>  bool progress;                  /* -p flag */
>  int progress_fd = -1;           /* --progress=FD */
> -unsigned request_size = MAX_REQUEST_SIZE; /* --request-size */
> +unsigned request_size = 1<<18;  /* --request-size */

But this is clearly a better default.

>  unsigned sparse_size = 4096;    /* --sparse */
>  bool synchronous;               /* --synchronous flag */
>  unsigned threads;               /* --threads */
> --
> 2.32.0
>

Nir
Richard W.M. Jones
2021-Jun-20 20:58 UTC
[Libguestfs] [PATCH libnbd 2/2] copy: Set default request-size to 2**18 (262144 bytes)
On Sun, Jun 20, 2021 at 10:21:52PM +0300, Nir Soffer wrote:
> On Sun, Jun 20, 2021 at 7:46 PM Richard W.M. Jones <rjones at redhat.com> wrote:
> >
> > As Nir has often pointed out, our current default request buffer size
> > (32MB) is too large, resulting in nbdcopy being as much as 2½ times
> > slower than it could be.
> >
> > The optimum buffer size most likely depends on the hardware, and may
> > even vary over time as machines get generally larger caches.  To
> > explore the problem I used this command:
> >
> >   $ hyperfine -P rs 15 25 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**{rs})) \$uri \$uri"'
>
> This uses the same process for serving both reads and writes, which may
> be different from real-world usage, where one process is used for reading
> and one for writing.

Yes, we do need to check whether nbdkit-sparse-random-plugin is a
realistic test of copying disk images, which I've not really done.
However, as far as I know, having only a single nbdkit process shouldn't
itself be a problem?  (More below ...)

> > On my 2019-era AMD server with 32GB of RAM and 64MB * 4 of L3 cache,
> > 2**18 (262144) was the optimum when I tested all sizes between
> > 2**15 (32K) and 2**25 (32M, the current default).
> >
> >   Summary
> >     'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"' ran
> >       1.03 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
> >       1.06 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
> >       1.09 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
>
> The difference is very small up to this point.

I'm going to guess that we bump into some working set vs cache size
wall.  We need to instrument nbdcopy to print the actual working set,
since it's hard to calculate it from first principles.  (Actually I'm
hoping Abhay will look at this.)

...

> But all these results do not test a real-world copy.  They test copying
> from memory to memory with zero (practical) latency.

(Continuing from above ...)  The problem with doing a practical test is
that we'd need to set up, e.g., 3 machines, and things like network
bandwidth would likely mask the performance of nbdcopy, since I'm pretty
sure nbdcopy can copy much faster than any network we could hope to set
up.

Latency could be more of an issue, since it's plausible that on an NBD
connection with non-zero latency we'd get better performance by using
more parallel requests.  At least, that's my intuition based on the idea
that if we have only 1 request in flight at a time, then we suffer a full
round trip per request, and if that RTT is non-zero then obviously more
parallelism than one request at a time would be better.

The upshot is that we could end up optimizing for the Unix domain socket
(near-zero latency) case and find that performance over a real network
needs different parameters.

But I do like that nbdkit-sparse-random-plugin makes the tests easy to
do and highly reproducible.
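One way to keep that reproducibility while getting a first feel for the
latency question might be to re-run the same sweep with nbdkit-delay-filter
injecting an artificial delay on every read and write (a rough sketch, not
something I've measured; the 1ms figure is arbitrary and only stands in for
a real network round trip):

  $ hyperfine -P rs 15 25 \
      'nbdkit -U - --filter=delay sparse-random size=100G seed=1 rdelay=1ms wdelay=1ms --run "nbdcopy --request-size=\$((2**{rs})) \$uri \$uri"'

If the intuition above is right, the optimum request size (and request
count) should shift as the delay grows.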
This is surprising:

> When I tested using real storage on a real server, I got the best results
> using 16 requests, one connection, and a request size of 1 MiB.

When I've tested multi-conn, more has almost always been better.

> 4 connections with 4 requests per connection and the same request size
> seem to be ~10% faster in these conditions.

So that's 10% faster than 16 requests x 1 connection?

Anyhow I agree it's a 3D (or higher dimension) space that needs to be
explored.

> I posted more info on these tests here:
> https://listman.redhat.com/archives/libguestfs/2021-May/msg00124.html

IIUC there is one high-end server.  Are both the source and the
destination located on the NetApp SAN?  It wasn't clear.  Also, since
it's been about 15 years since I touched FC, what's the bandwidth/latency
like on modern systems?  I'm concerned that using a SAN could just be a
bottleneck to testing, and we'd be better off testing everything with
local NVMe.

> Of course testing with other servers or storage can show different results,
> and it is impossible to find a value that will work best in all cases.
>
> I think we need to test both the number of requests and connections to
> improve the defaults.

Agreed.

> > ---
> >  copy/main.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/copy/main.c b/copy/main.c
> > index 0fddfc3..70534b5 100644
> > --- a/copy/main.c
> > +++ b/copy/main.c
> > @@ -50,7 +50,7 @@ bool flush;                     /* --flush flag */
> >  unsigned max_requests = 64;     /* --requests */
> >  bool progress;                  /* -p flag */
> >  int progress_fd = -1;           /* --progress=FD */
> > -unsigned request_size = MAX_REQUEST_SIZE; /* --request-size */
> > +unsigned request_size = 1<<18;  /* --request-size */
>
> But this is clearly a better default.
>
> >  unsigned sparse_size = 4096;    /* --sparse */
> >  bool synchronous;               /* --synchronous flag */
> >  unsigned threads;               /* --threads */
> > --
> > 2.32.0

I think I'm going to push this, and then I'll discuss with Abhay whether
it's possible to explore more questions:

* Comparing and characterising nbdkit-sparse-random-plugin vs "real"
  copies.

* Exploring #connections vs #requests vs #request size and seeing if
  there's a better minimum or set of recommendations.

Rich.

--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine.  Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/