Petr Beneš
2022-May-09 18:52 UTC
Slow VM start/revert, when trying to start/revert dozens of VMs in parallel
Hi,

my problem can be described simply: libvirt can't handle starting dozens of VMs at the same time. (Technically it can, but it's really slow.)

We have an AMD machine with 256 logical cores and 1.5 TB of RAM. On that machine there are roughly 200 VMs. Each VM is the same: 8 GB of RAM, 4 vCPUs. Half of them are Win7 x86, the other half Win7 x64. The VMs use qcow2 disk images, which reside on a ramdisk (tmpfs).

We use these machines for automatic malware analysis, so our scenario consists of this cycle:
- revert the VM to a running state
- execute a sample inside the VM for ~1-2 minutes
- shut down the VM

Of course, this results in multiple VMs trying to start at the same time. At first, reverts/starts are really fast: a second or two. After about a minute, "revertToSnapshot" suddenly takes 10-15 seconds, which is really unacceptable. For comparison, we're running the same scenario on Proxmox, where revertToSnapshot usually takes 2 seconds.

A few notes:
- Because of this fast cycle (~2-3 minutes) and because VMs take 10-15 seconds to start, there are barely more than 25-30 VMs running at once. We would really love to utilise the full potential of this beast of a machine and have at least ~100 VMs running at any given time.
- While this is running, the average CPU load isn't higher than 25%, and only about 280 GB of RAM is used. Therefore, we are not limited by our resources.
- When the framework is running and libvirt is doing its best to start our VMs, I noticed that every libvirt operation is suddenly very slow. Even a simple "virsh list [--all]" takes a few seconds to complete, even though it finishes instantly when no VM is running/starting.

I tried searching for this issue, but didn't really find anything besides this presentation:
https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Scalability-and-Stability-of-libvirt-Experiences-with-Very-Large-Hosts-Marc-Hartmayer-IBM-1.pdf

However, I couldn't find those commits in your upstream.

Is this a known issue? Or is there some setting I don't know of that would magically make the VMs start faster?

As for steps to reproduce: I don't think anything special is needed. Just try to start/destroy several VMs in a loop. The presentation above even provides a one-liner for that.

```
# For multiple domains:
# while virsh start $vm && virsh destroy $vm; do : ; done
# -> ~30s hang ups of the libvirtd main loop
```

Best Regards,
Petr
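
To put numbers on the slowdown, the one-liner above could be extended to time each start/destroy cycle while several domains cycle in parallel. This is only a minimal sketch under a few assumptions: the domains are already defined, timings have one-second resolution, and the `VIRSH` variable and the function names `time_cycle` and `run_parallel` are hypothetical helpers, not part of libvirt.

```shell
#!/bin/sh
# Hypothetical override: set VIRSH=true for a dry run without libvirt.
VIRSH="${VIRSH:-virsh}"

# Time one start/destroy cycle of a single domain, in whole seconds.
time_cycle() {
    vm="$1"
    t0=$(date +%s)
    $VIRSH start "$vm" >/dev/null 2>&1 && $VIRSH destroy "$vm" >/dev/null 2>&1
    rc=$?
    t1=$(date +%s)
    echo "$vm rc=$rc seconds=$((t1 - t0))"
}

# Run `count` cycles for every listed domain, all domains in parallel.
run_parallel() {
    count="$1"; shift
    for vm in "$@"; do
        (
            i=0
            while [ "$i" -lt "$count" ]; do
                time_cycle "$vm"
                i=$((i + 1))
            done
        ) &
    done
    wait
}
```

Something like `run_parallel 10 vm1 vm2 vm3` (placeholder domain names) prints one line per cycle, so a jump from 1-2 seconds to 10-15 seconds should be visible in the `seconds=` column as the number of parallel domains grows.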
Petr Beneš
2022-May-09 18:58 UTC
Slow VM start/revert, when trying to start/revert dozens of VMs in parallel
I forgot to mention:

```
$ lsb_release -a
Distributor ID: Ubuntu
Description:    Ubuntu 22.04 LTS
Release:        22.04
Codename:       jammy

$ virsh version
Compiled against library: libvirt 8.0.0
Using library: libvirt 8.0.0
Using API: QEMU 8.0.0
Running hypervisor: QEMU 6.2.0
```
Daniel P. Berrangé
2022-May-10 08:08 UTC
Slow VM start/revert, when trying to start/revert dozens of VMs in parallel
On Mon, May 09, 2022 at 06:52:32PM +0000, Petr Beneš wrote:
> Hi,
>
> my problem can be described simply: libvirt can't handle starting dozens of VMs at the same time.
>
> (technically, it can, but it's really slow.)
>
> We have an AMD machine with 256 logical cores and 1.5T ram.
> On that machine there is roughly 200 VMs.
> Each VM is the same: 8GB of RAM, 4 VCPUs. Half of them is Win7 x86, the other half is Win7 x64.
> VMs are using qcow2 as the disk image. These images reside in the ramdisk (tmpfs).
>
> We use these machines for automatic malware analysis, so our scenario consists of this cycle:
> - reverting VM to a running state
> - execute sample inside of the VM for ~1-2 minutes
> - shutdown the VM
>
> Of course, this results in multiple VMs trying to start at the same time.
> At first, reverts/starts are really fast - second or two.
> After about a minute, the "revertToSnapshot" suddenly takes 10-15 seconds, which is really unacceptable.
> For comparison, we're running the same scenario on Proxmox, where the revertToSnapshot usually takes 2 seconds.

Can you share the XML configuration of one of your guests - assuming they all have the same basic configuration.

As a gut feeling it sounds to me like it could be initially fast due to utilization of host I/O cache, but then slows down due to having to flush data to disk / read fresh from disk. This could be the case if the disk configuration cache mode is set to certain values, so the XML config will show us this info.

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
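
For reference, the cache mode in question is the `cache` attribute on each disk's `<driver>` element in the domain XML. Below is a rough sketch of one way to pull it out, assuming the one-element-per-line layout that `virsh dumpxml` normally prints; `disk_cache_modes` is a hypothetical helper name, not a libvirt tool.

```shell
# Read domain XML on stdin and print the cache setting of each disk.
# Only <driver> lines inside a <disk> element are considered, so the
# <driver> elements of interfaces or controllers are ignored.
disk_cache_modes() {
    awk '
        /<disk[ >]/ { in_disk = 1 }
        /<\/disk>/  { in_disk = 0 }
        in_disk && /<driver / {
            # "." stands in for the quote character around the value
            if (match($0, /cache=.[a-z]+./))
                print substr($0, RSTART, RLENGTH)
            else
                print "cache not set (hypervisor default)"
        }
    '
}
```

It would be used as `virsh dumpxml "$vm" | disk_cache_modes`, where `$vm` is a placeholder for one of the guest names. Sharing the full XML is still more useful, since other disk settings may matter as well.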