Richard W.M. Jones
2021-Jun-20 16:46 UTC
[Libguestfs] [PATCH libnbd 2/2] copy: Set default request-size to 2**18 (262144 bytes)
As Nir has often pointed out, our current default request buffer size
(32MB) is too large, resulting in nbdcopy being as much as 2½ times
slower than it could be.

The optimum buffer size most likely depends on the hardware, and may
even vary over time as machines get generally larger caches.  To
explore the problem I used this command:

$ hyperfine -P rs 15 25 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**{rs})) \$uri \$uri"'

On my 2019-era AMD server with 32GB of RAM and 64MB * 4 of L3 cache,
2**18 (262144) was the optimum when I tested all sizes between
2**15 (32K) and 2**25 (32M, the current default).

Summary
  'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"' ran
    1.03 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
    1.06 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
    1.09 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
    1.23 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
    1.26 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
    1.39 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
    1.45 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
    1.61 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
    1.94 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
    2.47 ± 0.08 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'

On my 2018-era Intel laptop with a measly 8 MB of L3 cache the optimum
size is one power-of-2 smaller (but 2**18 is still an improvement):

Summary
  'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"' ran
    1.05 ± 0.19 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
    1.06 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
    1.10 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
    1.22 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
    1.29 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
    1.33 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
    1.35 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
    1.38 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
    1.45 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
    1.63 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'

To get an idea of the best request size on something rather different,
this is a Raspberry Pi 4B.  I had to reduce the copy size by a
factor of 10 (to 10G) to make it run in a reasonable time.  2**18 is
about 8% slower than the optimum choice (2**15).  It's still
significantly better than our current default.

Summary
  'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"' ran
    1.00 ± 0.04 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
    1.03 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
    1.04 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
    1.05 ± 0.08 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
    1.05 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
    1.07 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
    1.08 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
    1.15 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
    1.28 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
    1.35 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
---
 copy/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/copy/main.c b/copy/main.c
index 0fddfc3..70534b5 100644
--- a/copy/main.c
+++ b/copy/main.c
@@ -50,7 +50,7 @@ bool flush;                     /* --flush flag */
 unsigned max_requests = 64;     /* --requests */
 bool progress;                  /* -p flag */
 int progress_fd = -1;           /* --progress=FD */
-unsigned request_size = MAX_REQUEST_SIZE;  /* --request-size */
+unsigned request_size = 1<<18;  /* --request-size */
 unsigned sparse_size = 4096;    /* --sparse */
 bool synchronous;               /* --synchronous flag */
 unsigned threads;               /* --threads */
-- 
2.32.0
Nir Soffer
2021-Jun-20 19:21 UTC
[Libguestfs] [PATCH libnbd 2/2] copy: Set default request-size to 2**18 (262144 bytes)
On Sun, Jun 20, 2021 at 7:46 PM Richard W.M. Jones <rjones at redhat.com> wrote:
>
> As Nir has often pointed out, our current default request buffer size
> (32MB) is too large, resulting in nbdcopy being as much as 2½ times
> slower than it could be.
>
> The optimum buffer size most likely depends on the hardware, and may
> even vary over time as machines get generally larger caches.  To
> explore the problem I used this command:
>
> $ hyperfine -P rs 15 25 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**{rs})) \$uri \$uri"'

This uses the same process for serving both reads and writes, which may
be different from real-world usage, where one process is used for
reading and another for writing.

> On my 2019-era AMD server with 32GB of RAM and 64MB * 4 of L3 cache,
> 2**18 (262144) was the optimum when I tested all sizes between
> 2**15 (32K) and 2**25 (32M, the current default).
>
> Summary
>   'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"' ran
>     1.03 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>     1.06 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
>     1.09 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'

The difference is very small up to this point.

>     1.23 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>     1.26 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>     1.39 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>     1.45 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
>     1.61 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>     1.94 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>     2.47 ± 0.08 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
>
> On my 2018-era Intel laptop with a measly 8 MB of L3 cache the optimum
> size is one power-of-2 smaller (but 2**18 is still an improvement):
>
> Summary
>   'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"' ran

This matches results I got when testing the libev example on Lenovo
T480s (~2018) and Dell Optiplex 9080 (~2012).

>     1.05 ± 0.19 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
>     1.06 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>     1.10 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
>     1.22 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>     1.29 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
>     1.33 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>     1.35 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>     1.38 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>     1.45 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>     1.63 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
>
> To get an idea of the best request size on something rather different,
> this is a Raspberry Pi 4B.  I had to reduce the copy size by a
> factor of 10 (to 10G) to make it run in a reasonable time.  2**18 is
> about 8% slower than the optimum choice (2**15).  It's still
> significantly better than our current default.
>
> Summary
>   'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"' ran
>     1.00 ± 0.04 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
>     1.03 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
>     1.04 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
>     1.05 ± 0.08 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
>     1.05 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
>     1.07 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
>     1.08 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
>     1.15 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
>     1.28 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
>     1.35 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'

But none of these results test a real-world copy.  They test copying
from memory to memory with (practically) zero latency.

When I tested using real storage on a real server, I got the best
results using 16 requests, one connection, and a request size of 1m.

4 connections with 4 requests per connection and the same request size
seem to be ~10% faster in these conditions.

I posted more info on these tests here:
https://listman.redhat.com/archives/libguestfs/2021-May/msg00124.html

Of course testing with other servers or storage can show different
results, and it is impossible to find a value that will work best in
all cases.

I think we need to test both the number of requests and the number of
connections to improve the defaults.

> ---
>  copy/main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/copy/main.c b/copy/main.c
> index 0fddfc3..70534b5 100644
> --- a/copy/main.c
> +++ b/copy/main.c
> @@ -50,7 +50,7 @@ bool flush;                     /* --flush flag */
>  unsigned max_requests = 64;     /* --requests */
>  bool progress;                  /* -p flag */
>  int progress_fd = -1;           /* --progress=FD */
> -unsigned request_size = MAX_REQUEST_SIZE;  /* --request-size */
> +unsigned request_size = 1<<18;  /* --request-size */

But this is clearly a better default.

>  unsigned sparse_size = 4096;    /* --sparse */
>  bool synchronous;               /* --synchronous flag */
>  unsigned threads;               /* --threads */
> --
> 2.32.0
>

Nir