Hello,

I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node, in the performance 8-NUMA configuration.

This is from the hypervisor:

[root@hde10 ~]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               1800.000
CPU max MHz:           2400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4800.05
Virtualization:        AMD-V
L1d cache:             32K
L1i cache:             64K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-3,32-35
NUMA node1 CPU(s):     4-7,36-39
NUMA node2 CPU(s):     8-11,40-43
NUMA node3 CPU(s):     12-15,44-47
NUMA node4 CPU(s):     16-19,48-51
NUMA node5 CPU(s):     20-23,52-55
NUMA node6 CPU(s):     24-27,56-59
NUMA node7 CPU(s):     28-31,60-63

I'm running one big virtual machine on this hypervisor - almost the whole memory plus all physical CPUs. This is what I'm seeing inside:

root@zenon10:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             8
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            23
Model:                 1
Model name:            AMD EPYC 7351 16-Core Processor
Stepping:              2
CPU MHz:               2400.000
BogoMIPS:              4800.00
Virtualization:        AMD-V
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             64K
L1i cache:             64K
L2 cache:              512K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
NUMA node2 CPU(s):     8-11
NUMA node3 CPU(s):     12-15
NUMA node4 CPU(s):     16-19
NUMA node5 CPU(s):     20-23
NUMA node6 CPU(s):     24-27
NUMA node7 CPU(s):     28-31

This is the virtual node configuration (I tried different numatune settings, but the result was the same):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>one-55782</name>
  <vcpu><![CDATA[32]]></vcpu>
  <cputune>
    <shares>32768</shares>
  </cputune>
  <memory>507904000</memory>
  <os>
    <type arch='x86_64'>hvm</type>
  </os>
  <devices>
    <emulator><![CDATA[/usr/bin/kvm]]></emulator>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.0'/>
      <target dev='vda'/>
      <driver name='qemu' type='qcow2' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
      <target dev='vdc'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.2'/>
      <target dev='vdd'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
      <target dev='vde'/>
      <driver name='qemu' type='raw' cache='unsafe'/>
    </disk>
    <disk type='file' device='cdrom'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.4'/>
      <target dev='vdb'/>
      <readonly/>
      <driver name='qemu' type='raw'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>
      <mac address='02:00:93:fb:3b:78'/>
      <target dev='one-55782-0'/>
      <model type='virtio'/>
      <filterref filter='no-arp-mac-spoofing'>
        <parameter name='IP' value='147.251.59.120'/>
      </filterref>
    </interface>
  </devices>
  <features>
    <pae/>
    <acpi/>
  </features>
  <!-- RAW data follows: -->
  <cpu mode='host-passthrough'>
    <topology sockets='8' cores='4' threads='1'/>
    <numa>
      <cell cpus='0-3' memory='62000000'/>
      <cell cpus='4-7' memory='62000000'/>
      <cell cpus='8-11' memory='62000000'/>
      <cell cpus='12-15' memory='62000000'/>
      <cell cpus='16-19' memory='62000000'/>
      <cell cpus='20-23' memory='62000000'/>
      <cell cpus='24-27' memory='62000000'/>
      <cell cpus='28-31' memory='62000000'/>
    </numa>
  </cpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='2'/>
    <vcpupin vcpu='2' cpuset='4'/>
    <vcpupin vcpu='3' cpuset='6'/>
    <vcpupin vcpu='4' cpuset='8'/>
    <vcpupin vcpu='5' cpuset='10'/>
    <vcpupin vcpu='6' cpuset='12'/>
    <vcpupin vcpu='7' cpuset='14'/>
    <vcpupin vcpu='8' cpuset='16'/>
    <vcpupin vcpu='9' cpuset='18'/>
    <vcpupin vcpu='10' cpuset='20'/>
    <vcpupin vcpu='11' cpuset='22'/>
    <vcpupin vcpu='12' cpuset='24'/>
    <vcpupin vcpu='13' cpuset='26'/>
    <vcpupin vcpu='14' cpuset='28'/>
    <vcpupin vcpu='15' cpuset='30'/>
    <vcpupin vcpu='16' cpuset='1'/>
    <vcpupin vcpu='17' cpuset='3'/>
    <vcpupin vcpu='18' cpuset='5'/>
    <vcpupin vcpu='19' cpuset='7'/>
    <vcpupin vcpu='20' cpuset='9'/>
    <vcpupin vcpu='21' cpuset='11'/>
    <vcpupin vcpu='22' cpuset='13'/>
    <vcpupin vcpu='23' cpuset='15'/>
    <vcpupin vcpu='24' cpuset='17'/>
    <vcpupin vcpu='25' cpuset='19'/>
    <vcpupin vcpu='26' cpuset='21'/>
    <vcpupin vcpu='27' cpuset='23'/>
    <vcpupin vcpu='28' cpuset='25'/>
    <vcpupin vcpu='29' cpuset='27'/>
    <vcpupin vcpu='30' cpuset='29'/>
    <vcpupin vcpu='31' cpuset='31'/>
  </cputune>
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>
  <devices>
    <serial type='pty'><target port='0'/></serial>
    <console type='pty'><target type='serial' port='0'/></console>
    <channel type='pty'><target type='virtio' name='org.qemu.guest_agent.0'/></channel>
  </devices>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source><address domain='0x0' bus='0x11' slot='0x0' function='0x1'/></source>
    </hostdev>
  </devices>
  <devices>
    <controller type='pci' index='1' model='pci-bridge'/>
    <controller type='pci' index='2' model='pci-bridge'/>
    <controller type='pci' index='3' model='pci-bridge'/>
    <controller type='pci' index='4' model='pci-bridge'/>
    <controller type='pci' index='5' model='pci-bridge'/>
  </devices>
  <metadata>
    <system_datastore><![CDATA[/opt/opennebula/var/datastores/108/55782]]></system_datastore>
  </metadata>
</domain>

If I run, e.g., spec2017 in the virtual machine, I can see:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1350 root      20   0  843136 830068   2524 R  78.1  0.2 513:16.16 bwaves_r_base.m
 2456 root      20   0  804608 791264   2524 R  76.6  0.2 491:39.92 bwaves_r_base.m
 4631 root      20   0  843136 829892   2344 R  75.8  0.2 450:16.04 bwaves_r_base.m
 6441 root      20   0  802580 790212   2532 R  75.0  0.2 120:37.54 bwaves_r_base.m
 7991 root      20   0  784676 772092   2576 R  75.0  0.2 387:15.39 bwaves_r_base.m
 8142 root      20   0  843136 830044   2496 R  75.0  0.2 384:39.02 bwaves_r_base.m
 8234 root      20   0  843136 830064   2524 R  75.0  0.2  99:04.48 bwaves_r_base.m
 8578 root      20   0  749240 736604   2468 R  73.4  0.2 375:45.66 bwaves_r_base.m
 9974 root      20   0  784676 771984   2468 R  73.4  0.2 348:01.36 bwaves_r_base.m
10396 root      20   0  802580 790264   2576 R  73.4  0.2 340:08.40 bwaves_r_base.m
12932 root      20   0  843136 830024   2480 R  73.4  0.2 288:39.76 bwaves_r_base.m
13113 root      20   0  784676 771864   2348 R  71.9  0.2 284:47.34 bwaves_r_base.m
13518 root      20   0  784676 762816   2540 R  71.9  0.2 276:31.58 bwaves_r_base.m
14443 root      20   0  784676 771984   2468 R  71.9  0.2 260:01.82 bwaves_r_base.m
12791 root      20   0  784676 772060   2544 R  70.3  0.2 291:43.96 bwaves_r_base.m
10544 root      20   0  843136 830068   2520 R  68.8  0.2 336:47.43 bwaves_r_base.m
15464 root      20   0  784676 762880   2608 R  60.9  0.2 239:19.14 bwaves_r_base.m
15487 root      20   0  784676 772048   2532 R  60.2  0.2 238:37.07 bwaves_r_base.m
16824 root      20   0  784676 772120   2604 R  55.5  0.2 212:10.92 bwaves_r_base.m
17255 root      20   0  843136 830012   2468 R  54.7  0.2 203:22.89 bwaves_r_base.m
17962 root      20   0  784676 772004   2488 R  54.7  0.2 188:26.07 bwaves_r_base.m
17505 root      20   0  843136 830068   2520 R  53.1  0.2 198:04.25 bwaves_r_base.m
27767 root      20   0  784676 771860   2344 R  52.3  0.2 592:25.95 bwaves_r_base.m
24458 root      20   0  843136 829888   2344 R  50.8  0.2 658:23.70 bwaves_r_base.m
30746 root      20   0  747376 735160   2604 R  43.0  0.2 556:47.67 bwaves_r_base.m

The CPU TIME should be roughly the same for all of them, but huge differences are obvious.

This is what I see on the hypervisor:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
18201 oneadmin  20   0  474.0g 473.3g   1732 S  2459 94.0  33332:54 kvm
  369 root      20   0       0      0      0 R 100.0  0.0 768:12.85 kswapd1
  368 root      20   0       0      0      0 R  94.1  0.0 869:05.61 kswapd0

I.e., kswapd is eating a whole CPU each, although swap is turned off:

[root@hde10 ~]# free
              total        used        free      shared  buff/cache   available
Mem:      528151432   503432580     1214048       34740    23504804    21907800
Swap:             0           0           0

The hypervisor is:

[root@hde10 ~]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

qemu-kvm-1.5.3-156.el7_5.5.x86_64

The virtual machine runs Debian 9.

Moreover, I'm using this type of disk for the virtual machines:

<disk type='file' device='disk'>
  <source file='/opt/opennebula/var/datastores/108/55782/disk.3'/>
  <target dev='vde'/>
  <driver name='qemu' type='raw' cache='unsafe'/>
</disk>

If I keep cache='unsafe' and run an iozone test on really big files (e.g., 8x 100 GB), I can see huge cache pressure on the hypervisor - all 8 kswapd threads run at 100 % and slow things down. The disk under the datastore is an Intel 4500 NVMe SSD.

If I set cache='none', the kswapds stay idle and disk writes are pretty fast; however, with the 8-NUMA configuration, writes slow down to less than 10 MB/s as soon as the amount of data written roughly equals the memory size of the virtual node. iozone shows 100 % CPU usage thereafter, and it seems to be traversing page lists. If I do the same with a 1-NUMA configuration, everything is OK except for a performance penalty of about 25 %.

--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
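[A note for anyone debugging a similar setup: whether the guest's memory actually landed on the expected host NUMA nodes - and which nodes kswapd is fighting over - can be checked on the hypervisor with numastat. A minimal sketch, assuming the numactl tools are installed; 18201 is the qemu-kvm PID from the top output above:

numastat -m          # system-wide per-node memory breakdown, incl. file cache per node
numastat -p 18201    # per-node memory footprint of the qemu-kvm process

If one node shows near-zero free memory while others have plenty, that matches the kswapd behaviour described above.]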
Lukas Hejtmanek
2018-Sep-14 13:36 UTC
Re: [libvirt-users] NUMA issues on virtualized hosts
Hello,

ok, I found that the CPU pinning was wrong, so I corrected it to be 1:1. The issue with iozone remains the same.

The spec is running; however, it runs slower than in the 1-NUMA case.

The corrected XML looks as follows:

<cpu mode='host-passthrough'>
  <topology sockets='8' cores='4' threads='1'/>
  <numa>
    <cell cpus='0-3' memory='62000000'/>
    <cell cpus='4-7' memory='62000000'/>
    <cell cpus='8-11' memory='62000000'/>
    <cell cpus='12-15' memory='62000000'/>
    <cell cpus='16-19' memory='62000000'/>
    <cell cpus='20-23' memory='62000000'/>
    <cell cpus='24-27' memory='62000000'/>
    <cell cpus='28-31' memory='62000000'/>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <vcpupin vcpu='4' cpuset='4'/>
  <vcpupin vcpu='5' cpuset='5'/>
  <vcpupin vcpu='6' cpuset='6'/>
  <vcpupin vcpu='7' cpuset='7'/>
  <vcpupin vcpu='8' cpuset='8'/>
  <vcpupin vcpu='9' cpuset='9'/>
  <vcpupin vcpu='10' cpuset='10'/>
  <vcpupin vcpu='11' cpuset='11'/>
  <vcpupin vcpu='12' cpuset='12'/>
  <vcpupin vcpu='13' cpuset='13'/>
  <vcpupin vcpu='14' cpuset='14'/>
  <vcpupin vcpu='15' cpuset='15'/>
  <vcpupin vcpu='16' cpuset='16'/>
  <vcpupin vcpu='17' cpuset='17'/>
  <vcpupin vcpu='18' cpuset='18'/>
  <vcpupin vcpu='19' cpuset='19'/>
  <vcpupin vcpu='20' cpuset='20'/>
  <vcpupin vcpu='21' cpuset='21'/>
  <vcpupin vcpu='22' cpuset='22'/>
  <vcpupin vcpu='23' cpuset='23'/>
  <vcpupin vcpu='24' cpuset='24'/>
  <vcpupin vcpu='25' cpuset='25'/>
  <vcpupin vcpu='26' cpuset='26'/>
  <vcpupin vcpu='27' cpuset='27'/>
  <vcpupin vcpu='28' cpuset='28'/>
  <vcpupin vcpu='29' cpuset='29'/>
  <vcpupin vcpu='30' cpuset='30'/>
  <vcpupin vcpu='31' cpuset='31'/>
</cputune>
<numatune>
  <memory mode='strict' nodeset='0-7'/>
</numatune>

In this case, the first part took more than 1700 seconds; the 1-NUMA config finishes in 1646 seconds. The hypervisor itself with a 1-NUMA config finishes in 1470 seconds, and with the 8-NUMA config in 900 seconds.

On Fri, Sep 14, 2018 at 02:06:26PM +0200, Lukas Hejtmanek wrote:
> Hello,
>
> I have a cluster with AMD EPYC 7351 CPUs, two CPUs per node, in the performance
> 8-NUMA configuration. [...]
--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
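[To verify that a 1:1 pinning like the one above actually took effect, libvirt can report the live vCPU placement. A sketch using stock virsh commands against the domain from the XML (one-55782):

virsh vcpupin one-55782     # effective vCPU -> pCPU pinning
virsh vcpuinfo one-55782    # current CPU and affinity of each vCPU
virsh freecell --all        # free memory per host NUMA cell

If vcpupin matches the XML but freecell shows one cell exhausted, the problem is memory placement rather than CPU placement.]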
Lukas Hejtmanek
2018-Sep-14 13:40 UTC
Re: [libvirt-users] NUMA issues on virtualized hosts
Hello again,

when the iozone writes are slow, this is how slabtop looks:

    OBJS   ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
62476752 62476728   0%    0.10K 1601968       39   6407872K buffer_head
 1000678   999168   0%    0.56K  142954        7    571816K radix_tree_node
  132184   125911   0%    0.03K    1066      124      4264K kmalloc-32
  118496   118224   0%    0.12K    3703       32     14812K kmalloc-node
   73206    56467   0%    0.19K    3486       21     13944K dentry
   34816    33247   0%    0.12K    1024       34      4096K kernfs_node_cache
   34496    29031   0%    0.06K     539       64      2156K kmalloc-64
   23283    22707   0%    1.05K    7761        3     31044K ext4_inode_cache
   16940    16052   0%    0.57K    2420        7      9680K inode_cache
   14464     4124   0%    0.06K     226       64       904K anon_vma_chain
   11900    11841   0%    0.14K     425       28      1700K ext4_groupinfo_4k
   11312     9861   0%    0.50K    1414        8      5656K kmalloc-512
   10692    10066   0%    0.04K     108       99       432K ext4_extent_status
   10688     4238   0%    0.25K     668       16      2672K kmalloc-256
    8120     2420   0%    0.07K     145       56       580K anon_vma
    8040     4563   0%    0.20K     402       20      1608K vm_area_struct
    7488     3845   0%    0.12K     234       32       936K kmalloc-96
    7456     7061   0%    1.00K    1864        4      7456K kmalloc-1024
    7234     7227   0%    4.00K    7234        1     28936K kmalloc-4096

And this is /proc/$PID/stack of iozone while it is eating CPU but not writing data:

[<ffffffffba78151b>] find_get_entry+0x1b/0x100
[<ffffffffba781de0>] pagecache_get_page+0x30/0x2a0
[<ffffffffc06ec12b>] ext4_da_get_block_prep+0x27b/0x440 [ext4]
[<ffffffffba840d8b>] __find_get_block_slow+0x3b/0x150
[<ffffffffba840ebd>] unmap_underlying_metadata+0x1d/0x70
[<ffffffffc06ec960>] ext4_block_write_begin+0x2e0/0x520 [ext4]
[<ffffffffc06ebeb0>] ext4_inode_attach_jinode.part.72+0xa0/0xa0 [ext4]
[<ffffffffc041f9f9>] jbd2__journal_start+0xd9/0x1e0 [jbd2]
[<ffffffffba80511a>] __check_object_size+0xfa/0x1d8
[<ffffffffba946b85>] iov_iter_copy_from_user_atomic+0xa5/0x330
[<ffffffffba780dcb>] generic_perform_write+0xfb/0x1d0
[<ffffffffba7831ca>] __generic_file_write_iter+0x16a/0x1b0
[<ffffffffc06e7220>] ext4_file_write_iter+0x90/0x370 [ext4]
[<ffffffffc06e7190>] ext4_dax_fault+0x140/0x140 [ext4]
[<ffffffffba6aef01>] update_curr+0xe1/0x160
[<ffffffffba808890>] new_sync_write+0xe0/0x130
[<ffffffffba809010>] vfs_write+0xb0/0x190
[<ffffffffba80a452>] SyS_write+0x52/0xc0
[<ffffffffba603b7d>] do_syscall_64+0x8d/0xf0
[<ffffffffbac15c4e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
[<ffffffffffffffff>] 0xffffffffffffffff

On Fri, Sep 14, 2018 at 03:36:59PM +0200, Lukas Hejtmanek wrote:
> Hello,
>
> ok, I found that cpu pinning was wrong, so I corrected it to be 1:1. The issue
> with iozone remains the same. [...]
--
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja
is not an official job title
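[Since buffer_head dominates the slab listing above, one purely diagnostic test is to drop the guest's page cache and reclaimable slab and see whether the stalled iozone writes recover. A sketch, run as root inside the guest - this only tests the hypothesis, it is not a fix:

sync                                 # flush dirty data first
echo 3 > /proc/sys/vm/drop_caches    # drop page cache plus dentries/inodes (and their buffer_heads)

If throughput recovers immediately afterwards, reclaim of those 6+ GB of buffer_heads is indeed what iozone is stuck on.]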
Michal Privoznik
2018-Sep-17 13:08 UTC
Re: [libvirt-users] NUMA issues on virtualized hosts
On 09/14/2018 03:36 PM, Lukas Hejtmanek wrote:
> Hello,
>
> ok, I found that cpu pinning was wrong, so I corrected it to be 1:1. The issue
> with iozone remains the same.
>
> The spec is running, however, it runs slower than 1-NUMA case.
>
> The corrected XML looks like follows:

[Reformatted XML for better reading]

<cpu mode="host-passthrough">
  <topology sockets="8" cores="4" threads="1"/>
  <numa>
    <cell cpus="0-3" memory="62000000"/>
    <cell cpus="4-7" memory="62000000"/>
    <cell cpus="8-11" memory="62000000"/>
    <cell cpus="12-15" memory="62000000"/>
    <cell cpus="16-19" memory="62000000"/>
    <cell cpus="20-23" memory="62000000"/>
    <cell cpus="24-27" memory="62000000"/>
    <cell cpus="28-31" memory="62000000"/>
  </numa>
</cpu>
<cputune>
  <vcpupin vcpu="0" cpuset="0"/>
  <vcpupin vcpu="1" cpuset="1"/>
  <vcpupin vcpu="2" cpuset="2"/>
  <vcpupin vcpu="3" cpuset="3"/>
  <vcpupin vcpu="4" cpuset="4"/>
  <vcpupin vcpu="5" cpuset="5"/>
  <vcpupin vcpu="6" cpuset="6"/>
  <vcpupin vcpu="7" cpuset="7"/>
  <vcpupin vcpu="8" cpuset="8"/>
  <vcpupin vcpu="9" cpuset="9"/>
  <vcpupin vcpu="10" cpuset="10"/>
  <vcpupin vcpu="11" cpuset="11"/>
  <vcpupin vcpu="12" cpuset="12"/>
  <vcpupin vcpu="13" cpuset="13"/>
  <vcpupin vcpu="14" cpuset="14"/>
  <vcpupin vcpu="15" cpuset="15"/>
  <vcpupin vcpu="16" cpuset="16"/>
  <vcpupin vcpu="17" cpuset="17"/>
  <vcpupin vcpu="18" cpuset="18"/>
  <vcpupin vcpu="19" cpuset="19"/>
  <vcpupin vcpu="20" cpuset="20"/>
  <vcpupin vcpu="21" cpuset="21"/>
  <vcpupin vcpu="22" cpuset="22"/>
  <vcpupin vcpu="23" cpuset="23"/>
  <vcpupin vcpu="24" cpuset="24"/>
  <vcpupin vcpu="25" cpuset="25"/>
  <vcpupin vcpu="26" cpuset="26"/>
  <vcpupin vcpu="27" cpuset="27"/>
  <vcpupin vcpu="28" cpuset="28"/>
  <vcpupin vcpu="29" cpuset="29"/>
  <vcpupin vcpu="30" cpuset="30"/>
  <vcpupin vcpu="31" cpuset="31"/>
</cputune>
<numatune>
  <memory mode="strict" nodeset="0-7"/>
</numatune>

However, this is not enough. This XML pins only vCPUs, not guest memory. So while, say, vCPU #0 is pinned onto physical CPU #0, the memory for guest NUMA node #0 might be allocated on host NUMA node #7 (for instance). You need to add:

<numatune>
  <memnode cellid="0" mode="strict" nodeset="0"/>
  <memnode cellid="1" mode="strict" nodeset="1"/>
  ...
</numatune>

This will ensure the guest memory pinning as well.

But wait, there is more. In your later e-mails you mention slow disk I/O. This might be caused by various variables, but the most obvious one in this case is the qemu I/O loop, I'd say. Without iothreads, qemu has only one I/O loop, so if your guest issues writes from all 32 cores at once, this loop cannot keep up (performance-wise) - hence the performance drop. You can try enabling iothreads:

https://libvirt.org/formatdomain.html#elementsIOThreadsAllocation

This is a qemu feature that allows you to create more I/O threads and also pin them. This is an example of how to use them:

https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/iothreads-disk.xml;h=0aa32c392300c0a86ad26185292ebc7a0d85d588;hb=HEAD

And this is an example of how to pin them:

https://libvirt.org/git/?p=libvirt.git;a=blob;f=tests/qemuxml2argvdata/cputune-iothreads.xml;h=311a1d3604177d9699edf7132a75f387aa57ad6f;hb=HEAD

Also, since iothreads are capable of handling just about any I/O, they can be used for other devices too, not only disks - for instance, interfaces.

Hopefully, this will boost your performance.

Regards,
Michal (who is a bit envious of your machine :-P)
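[Pulling the advice above together, a minimal sketch of what the relevant additions to the domain XML could look like. The iothread count of 4, the iothreadpin cpusets, and the per-disk iothread assignment are illustrative assumptions, not tested values; note also that iothread support may require a newer qemu than the stock qemu-kvm 1.5.3 shipped with CentOS 7:

<domain type='kvm'>
  <!-- four extra I/O event loops in addition to qemu's main loop -->
  <iothreads>4</iothreads>
  <cputune>
    <!-- keep the 1:1 vcpupin entries from above, then pin the I/O threads too -->
    <iothreadpin iothread='1' cpuset='0-3'/>
    <iothreadpin iothread='2' cpuset='8-11'/>
    <iothreadpin iothread='3' cpuset='16-19'/>
    <iothreadpin iothread='4' cpuset='24-27'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0-7'/>
    <!-- pin each guest NUMA cell to the matching host node -->
    <memnode cellid='0' mode='strict' nodeset='0'/>
    <memnode cellid='1' mode='strict' nodeset='1'/>
    <memnode cellid='2' mode='strict' nodeset='2'/>
    <memnode cellid='3' mode='strict' nodeset='3'/>
    <memnode cellid='4' mode='strict' nodeset='4'/>
    <memnode cellid='5' mode='strict' nodeset='5'/>
    <memnode cellid='6' mode='strict' nodeset='6'/>
    <memnode cellid='7' mode='strict' nodeset='7'/>
  </numatune>
  <devices>
    <!-- one of the virtio disks from the original XML, served by iothread 1 -->
    <disk type='file' device='disk'>
      <source file='/opt/opennebula/var/datastores/108/55782/disk.1'/>
      <target dev='vdc'/>
      <!-- cache='none' avoided the host-side kswapd storm in the tests above -->
      <driver name='qemu' type='raw' cache='none' iothread='1'/>
    </disk>
  </devices>
</domain>

Spreading the disks across the four iothreads (iothread='1' through '4') would keep a single event loop from serializing all guest writes.]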