Michal Privoznik
2018-Sep-17 13:08 UTC
Re: [libvirt-users] NUMA issues on virtualized hosts
On 09/14/2018 03:36 PM, Lukas Hejtmanek wrote:
> Hello,
>
> ok, I found that the cpu pinning was wrong, so I corrected it to be 1:1. The issue
> with iozone remains the same.
>
> The spec is running, however, it runs slower than in the 1-NUMA case.
>
> The corrected XML looks as follows:

[Reformatted XML for better reading]

  <cpu mode="host-passthrough">
    <topology sockets="8" cores="4" threads="1"/>
    <numa>
      <cell cpus="0-3" memory="62000000"/>
      <cell cpus="4-7" memory="62000000"/>
      <cell cpus="8-11" memory="62000000"/>
      <cell cpus="12-15" memory="62000000"/>
      <cell cpus="16-19" memory="62000000"/>
      <cell cpus="20-23" memory="62000000"/>
      <cell cpus="24-27" memory="62000000"/>
      <cell cpus="28-31" memory="62000000"/>
    </numa>
  </cpu>
  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="1"/>
    <vcpupin vcpu="2" cpuset="2"/>
    <vcpupin vcpu="3" cpuset="3"/>
    <vcpupin vcpu="4" cpuset="4"/>
    <vcpupin vcpu="5" cpuset="5"/>
    <vcpupin vcpu="6" cpuset="6"/>
    <vcpupin vcpu="7" cpuset="7"/>
    <vcpupin vcpu="8" cpuset="8"/>
    <vcpupin vcpu="9" cpuset="9"/>
    <vcpupin vcpu="10" cpuset="10"/>
    <vcpupin vcpu="11" cpuset="11"/>
    <vcpupin vcpu="12" cpuset="12"/>
    <vcpupin vcpu="13" cpuset="13"/>
    <vcpupin vcpu="14" cpuset="14"/>
    <vcpupin vcpu="15" cpuset="15"/>
    <vcpupin vcpu="16" cpuset="16"/>
    <vcpupin vcpu="17" cpuset="17"/>
    <vcpupin vcpu="18" cpuset="18"/>
    <vcpupin vcpu="19" cpuset="19"/>
    <vcpupin vcpu="20" cpuset="20"/>
    <vcpupin vcpu="21" cpuset="21"/>
    <vcpupin vcpu="22" cpuset="22"/>
    <vcpupin vcpu="23" cpuset="23"/>
    <vcpupin vcpu="24" cpuset="24"/>
    <vcpupin vcpu="25" cpuset="25"/>
    <vcpupin vcpu="26" cpuset="26"/>
    <vcpupin vcpu="27" cpuset="27"/>
    <vcpupin vcpu="28" cpuset="28"/>
    <vcpupin vcpu="29" cpuset="29"/>
    <vcpupin vcpu="30" cpuset="30"/>
    <vcpupin vcpu="31" cpuset="31"/>
  </cputune>
  <numatune>
    <memory mode="strict" nodeset="0-7"/>
  </numatune>

However, this is not enough. This XML pins only vCPUs, not guest memory. So
while, say, vCPU #0 is pinned onto physical CPU #0, the memory for guest NUMA
node #0 might be allocated on host NUMA node #7 (for instance). You need to add:

  <numatune>
    <memnode cellid="0" mode="strict" nodeset="0"/>
    <memnode cellid="1" mode="strict" nodeset="1"/>
    ...
  </numatune>

This will also ensure guest memory pinning. But wait, there is more. In your
later e-mails you mention slow disk I/O. This might be caused by various
variables, but the most obvious one in this case is the qemu I/O loop, I'd say.
Without iothreads, qemu has only one I/O loop, so if your guest issues writes
from all 32 cores at once, this loop is unable to handle them (performance
wise) and hence the performance drop. You can try enabling iothreads:

https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation

This is a qemu feature that allows you to create more I/O threads and also pin
them. Here is an example of how to use them:

https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/iothreads-disk.xml;h=0aa32c392300c0a86ad26185292ebc7a0d85d588;hb=HEAD

And here is an example of how to pin them:

https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/cputune-iothreads.xml;h=311a1d3604177d9699edf7132a75f387aa57ad6f;hb=HEAD

Also, since iothreads are capable of handling any I/O, they can be used for
other devices too, not only disks - for instance, interfaces.

Hopefully, this will boost your performance.

Regards,
Michal (who is a bit envious of your machine :-P)
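For illustration, here is a minimal sketch of what the two linked example files
combine to in domain XML: an iothread count, pinning of those threads, and a
disk served by one of them. The thread count, cpusets, image path and target
name below are made up for the example; this is a sketch, not a drop-in config.

  <domain type='kvm'>
    ...
    <iothreads>2</iothreads>                                   <!-- example count: two extra I/O threads -->
    <cputune>
      <iothreadpin iothread='1' cpuset='0-3'/>                 <!-- pin iothread #1 to host CPUs 0-3 (illustrative) -->
      <iothreadpin iothread='2' cpuset='4-7'/>                 <!-- pin iothread #2 to host CPUs 4-7 (illustrative) -->
    </cputune>
    <devices>
      <disk type='file' device='disk'>
        <driver name='qemu' type='qcow2' iothread='1'/>        <!-- this disk's I/O is handled by iothread #1 -->
        <source file='/var/lib/libvirt/images/scratch.qcow2'/> <!-- path is made up for the example -->
        <target dev='vde' bus='virtio'/>
      </disk>
    </devices>
  </domain>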
Lukas Hejtmanek
2018-Sep-17 14:59 UTC
Re: [libvirt-users] NUMA issues on virtualized hosts
Hello,

so the current domain configuration:

  <cpu mode='host-passthrough'>
    <topology sockets='8' cores='4' threads='1'/>
    <numa>
      <cell cpus='0-3' memory='62000000'/>
      <cell cpus='4-7' memory='62000000'/>
      <cell cpus='8-11' memory='62000000'/>
      <cell cpus='12-15' memory='62000000'/>
      <cell cpus='16-19' memory='62000000'/>
      <cell cpus='20-23' memory='62000000'/>
      <cell cpus='24-27' memory='62000000'/>
      <cell cpus='28-31' memory='62000000'/>
    </numa>
  </cpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
    <vcpupin vcpu='8' cpuset='8'/>
    <vcpupin vcpu='9' cpuset='9'/>
    <vcpupin vcpu='10' cpuset='10'/>
    <vcpupin vcpu='11' cpuset='11'/>
    <vcpupin vcpu='12' cpuset='12'/>
    <vcpupin vcpu='13' cpuset='13'/>
    <vcpupin vcpu='14' cpuset='14'/>
    <vcpupin vcpu='15' cpuset='15'/>
    <vcpupin vcpu='16' cpuset='16'/>
    <vcpupin vcpu='17' cpuset='17'/>
    <vcpupin vcpu='18' cpuset='18'/>
    <vcpupin vcpu='19' cpuset='19'/>
    <vcpupin vcpu='20' cpuset='20'/>
    <vcpupin vcpu='21' cpuset='21'/>
    <vcpupin vcpu='22' cpuset='22'/>
    <vcpupin vcpu='23' cpuset='23'/>
    <vcpupin vcpu='24' cpuset='24'/>
    <vcpupin vcpu='25' cpuset='25'/>
    <vcpupin vcpu='26' cpuset='26'/>
    <vcpupin vcpu='27' cpuset='27'/>
    <vcpupin vcpu='28' cpuset='28'/>
    <vcpupin vcpu='29' cpuset='29'/>
    <vcpupin vcpu='30' cpuset='30'/>
    <vcpupin vcpu='31' cpuset='31'/>
  </cputune>
  <numatune>
    <memnode cellid="0" mode="strict" nodeset="0"/>
    <memnode cellid="1" mode="strict" nodeset="1"/>
    <memnode cellid="2" mode="strict" nodeset="2"/>
    <memnode cellid="3" mode="strict" nodeset="3"/>
    <memnode cellid="4" mode="strict" nodeset="4"/>
    <memnode cellid="5" mode="strict" nodeset="5"/>
    <memnode cellid="6" mode="strict" nodeset="6"/>
    <memnode cellid="7" mode="strict" nodeset="7"/>
  </numatune>

hopefully, I got it right.

Good news is that the spec benchmark looks promising. The first test, bwaves,
finished in 1003 seconds compared to 1700 seconds in the previous, wrong case.
So far so good.

Bad news is that iozone is still the same. There might be some misunderstanding.

I have two cases:

1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap.
Swap a lot. It usually eats the whole swap partition and kswapd runs at 100%
CPU; swappiness, dirty_ratio and company do not improve things at all. However,
I believe this is just the wrong option for scratch disks where one can expect
a huge I/O load. Moreover, the hypervisor is a poor machine with only a little
memory left (ok, in my case about 10 GB available), so it does not make sense
to use that memory for additional cache/disk buffers.

2) cache=none. In this case, performance is better (only a few percent behind
bare metal). However, as soon as the size of the stored data approaches the
size of the virtual machine's memory, writes stop and iozone eats a whole CPU.
It looks like it is searching for free pages and that gets harder and harder,
but I am not sure - I am not skilled in this area.

Here, you can clearly see that it starts writes, does the writes, then takes a
pause, writes again, and so on, but the pauses get longer and longer:
https://pastebin.com/2gfPFgb9
The output is until the very end of iozone (I cancelled it by ctrl-c).

It seems that this is not happening on a 2-NUMA node with rotational disks
only. It is partly happening on a 2-NUMA node with 2 NVMe SSDs - partly meaning
that there are also pauses in writes, but it finishes, though at reduced speed.
On a 1-NUMA node, with the same test, I can see steady writes from the very
beginning to the very end at roughly the same speed.

Maybe it could be related to the fact that the NVMe is a PCI device that is
linked to one NUMA node only?

As for iothreads, I have only 1 disk (the vde) that is exposed to a high I/O
load, so I believe more I/O threads are not applicable here. If I understand
correctly, I cannot assign more iothreads to a single device. And it does not
seem to be iothread-related, as the same scenario in the 1-NUMA configuration
works OK (I mean that memory penalties can be huge as it does not reflect the
real NUMA topology, but the disk speed is OK anyway).

And as for that machine, what about this one? :)

[root@urga1 ~]$ free -g
              total        used        free      shared  buff/cache   available
Mem:           5857          75        5746           0          35        5768
...
NUMA node47 CPU(s):     376-383

this is not virtualized though :)
--
Lukáš Hejtmánek
Linux Administrator only because Full Time Multitasking Ninja is not an official job title
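For reference, the cache mode compared in the message above is set on the
disk's driver element of the domain XML. A minimal sketch of the cache=none
variant follows; the image path is made up, the format is assumed to be qcow2
(as discussed later in the thread), and io='native' is an extra assumption
worth testing rather than something taken from the thread:

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2' cache='none' io='native'/>  <!-- cache='none' is "case 2" above; io='native' is an added assumption -->
    <source file='/var/lib/libvirt/images/scratch.qcow2'/>       <!-- illustrative path -->
    <target dev='vde' bus='virtio'/>
  </disk>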
Michal Privoznik
2018-Sep-18 07:50 UTC
Re: [libvirt-users] NUMA issues on virtualized hosts
On 09/17/2018 04:59 PM, Lukas Hejtmanek wrote:
> Hello,
>
> so the current domain configuration:
>
>   <cpu mode='host-passthrough'>
>     <topology sockets='8' cores='4' threads='1'/>
>     <numa>
>       <cell cpus='0-3' memory='62000000'/>
>       <cell cpus='4-7' memory='62000000'/>
>       <cell cpus='8-11' memory='62000000'/>
>       <cell cpus='12-15' memory='62000000'/>
>       <cell cpus='16-19' memory='62000000'/>
>       <cell cpus='20-23' memory='62000000'/>
>       <cell cpus='24-27' memory='62000000'/>
>       <cell cpus='28-31' memory='62000000'/>
>     </numa>
>   </cpu>
>   <cputune>
>     <vcpupin vcpu='0' cpuset='0'/>
>     <vcpupin vcpu='1' cpuset='1'/>
>     <vcpupin vcpu='2' cpuset='2'/>
>     <vcpupin vcpu='3' cpuset='3'/>
>     <vcpupin vcpu='4' cpuset='4'/>
>     <vcpupin vcpu='5' cpuset='5'/>
>     <vcpupin vcpu='6' cpuset='6'/>
>     <vcpupin vcpu='7' cpuset='7'/>
>     <vcpupin vcpu='8' cpuset='8'/>
>     <vcpupin vcpu='9' cpuset='9'/>
>     <vcpupin vcpu='10' cpuset='10'/>
>     <vcpupin vcpu='11' cpuset='11'/>
>     <vcpupin vcpu='12' cpuset='12'/>
>     <vcpupin vcpu='13' cpuset='13'/>
>     <vcpupin vcpu='14' cpuset='14'/>
>     <vcpupin vcpu='15' cpuset='15'/>
>     <vcpupin vcpu='16' cpuset='16'/>
>     <vcpupin vcpu='17' cpuset='17'/>
>     <vcpupin vcpu='18' cpuset='18'/>
>     <vcpupin vcpu='19' cpuset='19'/>
>     <vcpupin vcpu='20' cpuset='20'/>
>     <vcpupin vcpu='21' cpuset='21'/>
>     <vcpupin vcpu='22' cpuset='22'/>
>     <vcpupin vcpu='23' cpuset='23'/>
>     <vcpupin vcpu='24' cpuset='24'/>
>     <vcpupin vcpu='25' cpuset='25'/>
>     <vcpupin vcpu='26' cpuset='26'/>
>     <vcpupin vcpu='27' cpuset='27'/>
>     <vcpupin vcpu='28' cpuset='28'/>
>     <vcpupin vcpu='29' cpuset='29'/>
>     <vcpupin vcpu='30' cpuset='30'/>
>     <vcpupin vcpu='31' cpuset='31'/>
>   </cputune>
>   <numatune>
>     <memnode cellid="0" mode="strict" nodeset="0"/>
>     <memnode cellid="1" mode="strict" nodeset="1"/>
>     <memnode cellid="2" mode="strict" nodeset="2"/>
>     <memnode cellid="3" mode="strict" nodeset="3"/>
>     <memnode cellid="4" mode="strict" nodeset="4"/>
>     <memnode cellid="5" mode="strict" nodeset="5"/>
>     <memnode cellid="6" mode="strict" nodeset="6"/>
>     <memnode cellid="7" mode="strict" nodeset="7"/>
>   </numatune>
>
> hopefully, I got it right.

Yes, looking good.

> Good news is that the spec benchmark looks promising. The first test, bwaves,
> finished in 1003 seconds compared to 1700 seconds in the previous, wrong case.
> So far so good.

Very well, this means that the config above is correct.

> Bad news is that iozone is still the same. There might be some
> misunderstanding.
>
> I have two cases:
>
> 1) cache=unsafe. In this case, I can see that the hypervisor is prone to swap.
> Swap a lot. It usually eats the whole swap partition and kswapd runs at 100%
> CPU; swappiness, dirty_ratio and company do not improve things at all.
> However, I believe this is just the wrong option for scratch disks where one
> can expect a huge I/O load. Moreover, the hypervisor is a poor machine with
> only a little memory left (ok, in my case about 10 GB available), so it does
> not make sense to use that memory for additional cache/disk buffers.

One thing that just occurred to me - is the qcow2 file fully allocated?

  # qemu-img info /var/lib/libvirt/images/fedora.qcow2
  ..
  virtual size: 20G (21474836480 bytes)
  disk size: 7.0G
  ..

This is NOT a fully allocated qcow2.

> 2) cache=none. In this case, performance is better (only a few percent behind
> bare metal). However, as soon as the size of the stored data approaches the
> size of the virtual machine's memory, writes stop and iozone eats a whole CPU.
> It looks like it is searching for free pages and that gets harder and harder,
> but I am not sure - I am not skilled in this area.

Hmm. Could it be that the SSD doesn't have enough free blocks and thus writes
are throttled? Can you fstrim it and see if that helps?

> Here, you can clearly see that it starts writes, does the writes, then takes
> a pause, writes again, and so on, but the pauses get longer and longer:
> https://pastebin.com/2gfPFgb9
> The output is until the very end of iozone (I cancelled it by ctrl-c).
>
> It seems that this is not happening on a 2-NUMA node with rotational disks
> only. It is partly happening on a 2-NUMA node with 2 NVMe SSDs - partly
> meaning that there are also pauses in writes, but it finishes, though at
> reduced speed. On a 1-NUMA node, with the same test, I can see steady writes
> from the very beginning to the very end at roughly the same speed.
>
> Maybe it could be related to the fact that the NVMe is a PCI device that is
> linked to one NUMA node only?

Can be. I don't know qemu internals that much to know if it's capable of doing
zero-copy disk writes.

> As for iothreads, I have only 1 disk (the vde) that is exposed to a high I/O
> load, so I believe more I/O threads are not applicable here. If I understand
> correctly, I cannot assign more iothreads to a single device. And it does not
> seem to be iothread-related, as the same scenario in the 1-NUMA configuration
> works OK (I mean that memory penalties can be huge as it does not reflect the
> real NUMA topology, but the disk speed is OK anyway).

Ah, since it's only one disk, iothreads will not help much here. Still worth
giving it a shot ;-) Remember, iothreads are for all I/O, not disk I/O only.

Anyway, this is the point where I have to say "I don't know". Sorry. Try
contacting the qemu guys:

qemu-discuss@nongnu.org
qemu-devel@nongnu.org

Michal
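On the full-allocation point above: if the image lives in a libvirt storage
pool, one way to ask for a preallocated volume is to set allocation equal to
capacity in the volume XML. This is only a rough sketch with a made-up name and
sizes, and whether a qcow2 volume actually ends up with fully allocated data
blocks (rather than just metadata) depends on the libvirt version, so it is
worth re-checking with qemu-img info afterwards:

  <volume>
    <name>scratch.qcow2</name>                 <!-- name and sizes are made up for the example -->
    <capacity unit='G'>20</capacity>
    <allocation unit='G'>20</allocation>       <!-- allocation == capacity requests preallocation -->
    <target>
      <format type='qcow2'/>
    </target>
  </volume>

Such a volume could then be created with virsh vol-create <pool> volume.xml.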