On 01/10/22 16:52, Richard W.M. Jones wrote:
>
> For the raw format local disk to local disk conversion, it's possible
> to regain most of the performance by adding
> --request-size=$(( 16 * 1024 * 1024 )) to the nbdcopy command.  The
> patch below is not suitable for going upstream but it can be used for
> testing:
>
> diff --git a/v2v/v2v.ml b/v2v/v2v.ml
> index 47e6e937..ece3b7d9 100644
> --- a/v2v/v2v.ml
> +++ b/v2v/v2v.ml
> @@ -613,6 +613,7 @@ and nbdcopy output_alloc input_uri output_uri
>    let cmd = ref [] in
>    List.push_back_list cmd [ "nbdcopy"; input_uri; output_uri ];
>    List.push_back cmd "--flush";
> +  List.push_back cmd "--request-size=16777216";
>    (*List.push_back cmd "--verbose";*)
>    if not (quiet ()) then List.push_back cmd "--progress";
>    if output_alloc = Types.Preallocated then List.push_back cmd "--allocated";
>
> The problem is of course this is a pessimisation for other
> conversions.  It's known to make at least qcow2 to qcow2, and all VDDK
> conversions worse.  So we'd have to make it conditional on doing a raw
> format local conversion, which is a pretty ugly hack.  Even worse, the
> exact size (16M) varies for me when I test this on different machines
> and HDDs vs SSDs.  On my very fast AMD machine with an SSD, the
> nbdcopy default request size (256K) is fastest and larger sizes are
> very slightly slower.
>
> I can imagine an "adaptive nbdcopy" which adjusts these parameters
> while copying in order to find the best performance.  A little bit
> hard to implement ...
>
> I'm also still wondering exactly why a larger request size is better
> in this case.  You can easily reproduce the effect using the attached
> test script and adjusting --request-size.  You'll need to build the
> standard test guest, see part 1.

(The following thought occurred to me last evening.)

In modular v2v, we use multi-threaded nbdkit instances, and
multi-threaded nbdcopy instances. (IIUC.) I think that should result in
quite a bit of thrashing, on both source and destination disks, no?
That should be especially visible on HDDs, but perhaps also on SSDs
(dependent on request size, as you mention above).

The worst case is likely when both nbdcopy processes operate on the
same physical HDD (i.e., spinning rust).

qemu-img is single-threaded, so even if it reads from and writes to the
same physical hard disk, it generates just two "parallel" request
streams, which both the disk and the kernel's IO scheduler can cope
with more easily. According to the nbdcopy manual, the default thread
count is the "number of processor cores available"; with a high thread
count, the "sliding window of requests" is likely indistinguishable
from real random access.

Also I (vaguely) gather that nbdcopy bypasses the page cache (or does
it only sync automatically at the end? I don't remember). If the page
cache is avoided, then it has no chance to mitigate the thrashing,
especially on HDDs -- but even on SSDs, if the drive's internal cache
is not large enough (considering the individual request size and the
number of random requests in flight in parallel), the degradation
should be visible.

Can you tweak (i.e., lower) the thread count of both nbdcopy processes,
let's say to "1", for starters?

Thanks!
Laszlo
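(A rough way to test both ideas, the thread count and the request size,
on a local raw-to-raw copy outside of virt-v2v might look like the
sketch below; the /var/tmp paths are placeholders, and the --threads
and --request-size options are as documented in the nbdcopy manual.)

  #!/bin/sh -
  # Sketch: sweep nbdcopy thread count and request size over a local
  # raw-to-raw copy, timing each combination.
  src=/var/tmp/source.img   # placeholder: a raw test image
  dst=/var/tmp/dest.img     # placeholder: destination on the disk under test

  for threads in 1 2 4 8; do
      for rs in 262144 1048576 16777216; do
          rm -f "$dst"
          echo "== threads=$threads request-size=$rs =="
          time nbdcopy --flush --threads="$threads" --request-size="$rs" \
               "$src" "$dst"
      done
  done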
On 01/11/22 08:00, Laszlo Ersek wrote:
> On 01/10/22 16:52, Richard W.M. Jones wrote:
>> For the raw format local disk to local disk conversion, it's possible
>> to regain most of the performance by adding
>> --request-size=$(( 16 * 1024 * 1024 )) to the nbdcopy command.
> [...]
> Can you tweak (i.e., lower) the thread count of both nbdcopy
> processes, let's say to "1", for starters?

I meant to add: the "--request-size=16777216" option, *where it helps*,
effectively serializes the request stream. A single request is not
broken up into smaller *parallel* requests; thus, if you have huge
(like 16 MiB) requests, that effectively de-randomizes the accesses in
the sliding window.

Thanks
Laszlo
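(One way to check whether the larger request size really de-randomizes
the accesses would be to watch the disks while the copy runs; a sketch,
with /dev/sdX and /dev/sdY standing in for the source and destination
disks:)

  # Sketch: watch the disks while the copy runs.
  iostat -dxm 5 /dev/sdX /dev/sdY &
  # rareq-sz/wareq-sz (avgrq-sz on older sysstat) should grow with
  # --request-size if the large requests really reach the disk unsplit;
  # aqu-sz and %util show how hard each disk is being driven.
  nbdcopy --flush --request-size=16777216 \
      /var/tmp/source.img /var/tmp/dest.img
  kill %1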
On 01/11/22 08:00, Laszlo Ersek wrote:
> [...]
> qemu-img is single-threaded,

Hmmm, not necessarily; according to the manual, "qemu-img convert" uses
(by default) 8 coroutines. There's also the -W flag ("out of order
writes"), which I don't know whether the original virt-v2v used.

Laszlo
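(For comparison with the old, non-modular path, the coroutine count and
out-of-order writes can be varied directly on a qemu-img copy of the
same raw image; a sketch with placeholder file names, using the -m and
-W options from qemu-img(1):)

  # Sketch: the same raw-to-raw copy with qemu-img convert, varying the
  # number of coroutines (-m, default 8) and out-of-order writes (-W).
  time qemu-img convert -p -O raw         /var/tmp/source.img /var/tmp/dest.img
  time qemu-img convert -p -O raw -m 1    /var/tmp/source.img /var/tmp/dest.img
  time qemu-img convert -p -O raw -m 8 -W /var/tmp/source.img /var/tmp/dest.img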
Richard W.M. Jones
2022-Jan-11 10:53 UTC
[Libguestfs] Virt-v2v performance benchmarking part 3
On Tue, Jan 11, 2022 at 08:00:57AM +0100, Laszlo Ersek wrote:
> (The following thought occurred to me last evening.)
>
> In modular v2v, we use multi-threaded nbdkit instances, and
> multi-threaded nbdcopy instances. (IIUC.) I think that should result
> in quite a bit of thrashing, on both source and destination disks, no?
> That should be especially visible on HDDs, but perhaps also on SSDs
> (dependent on request size, as you mention above).

This is very possible.  Also the HDD machine has only 2 cores /
4 threads (the SSD machine has 12 cores / 24 threads), so we are
heavily overcommitting software threads to hardware threads.

In contrast, in the VDDK to local disk case (which, if you recall,
performs fine) nbdcopy won't be using multiple threads.  This is
because multi-conn is not enabled on the input (VDDK) side [see:
https://listman.redhat.com/archives/libguestfs/2021-December/msg00172.html]
so nbdcopy will use a single thread and a single NBD connection to both
the input and output sides.

Within nbdkit-vddk-plugin we're using some threads to handle parallel
requests on the same NBD connection (using VDDK's Async method).
Within nbdkit-file-plugin on the output side it will also use multiple
threads to handle parallel requests, again on a single NBD connection.
But the copy is generally sequential.

> The worst case is likely when both nbdcopy processes operate on the
> same physical HDD (i.e., spinning rust).
>
> qemu-img is single-threaded, so even if it reads from and writes to
> the same physical hard disk, it generates just two "parallel" request
> streams, which both the disk and the kernel's IO scheduler can cope
> with more easily. According to the nbdcopy manual, the default thread
> count is the "number of processor cores available"; with a high thread
> count, the "sliding window of requests" is likely indistinguishable
> from real random access.
>
> Also I (vaguely) gather that nbdcopy bypasses the page cache (or does
> it only sync automatically at the end? I don't remember).

Yes, both nbdcopy and nbdkit-file-plugin with cache=none (which we are
using) will attempt to minimize use of the page cache.

> If the page cache is avoided, then it has no chance to mitigate the
> thrashing, especially on HDDs -- but even on SSDs, if the drive's
> internal cache is not large enough (considering the individual request
> size and the number of random requests in flight in parallel), the
> degradation should be visible.

I don't understand the mechanism by which this could happen.  We're
reading and writing blocks which are much larger than the filesystem
block size (the fs block size is probably 1K or 4K; the minimum nbdcopy
block size in any test is 256K).  And we read and write each block
exactly once, and we never revisit that data after reading/writing.
Can the page cache help?

> Can you tweak (i.e., lower) the thread count of both nbdcopy
> processes, let's say to "1", for starters?

Using nbdcopy -T 1 and leaving --request-size at the default does
actually improve performance quite a bit, although not as much as
increasing the request size.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v
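(For reference, the local raw output path can be approximated outside
virt-v2v with a single nbdkit file instance plus nbdcopy; this is only
a rough sketch with placeholder file names, using the cache= and
fadvise= parameters from nbdkit-file-plugin(1) and -T from nbdcopy(1),
not the exact command virt-v2v runs.)

  # Sketch: serve the source image with nbdkit-file-plugin, avoiding
  # the page cache as the modular path does, and copy it with nbdcopy
  # restricted to a single thread.
  nbdkit -U - file /var/tmp/source.img cache=none fadvise=sequential \
         --run 'nbdcopy --flush -T 1 "$uri" /var/tmp/dest.img'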