Xen vNUMA introduction

This series of patches introduces vNUMA topology awareness and
provides interfaces and data structures to enable vNUMA for
PV guests. There is a plan to extend this support to dom0 and
HVM domains.

vNUMA topology must also be supported by the PV guest kernel, so the
corresponding Linux patches should be applied.

Xen patches available:
https://git.gitorious.org/xenvnuma/xenvnuma_rfcdrop.git v1

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when
running workloads on NUMA machines.
The Xen vNUMA implementation provides a way to create vNUMA-enabled
guests on NUMA/UMA machines and to map the vNUMA topology to physical
NUMA nodes in an optimal way.

Xen vNUMA support

The current set of patches introduces a subop hypercall that is
available to enlightened PV guests with the vNUMA patches applied.

The domain structure was modified to reflect the per-domain vNUMA
topology for use in other vNUMA-aware subsystems (e.g. ballooning).

libxc

libxc provides interfaces to build PV guests with vNUMA support and, on
NUMA machines, performs the initial memory allocation on physical NUMA
nodes. This is implemented by using the nodemap formed by automatic
NUMA placement. Details are in patch #3.

libxl

libxl provides a way to predefine the vNUMA topology in the VM config:
number of vnodes, memory arrangement, vcpu to vnode assignment, and
distance map.

PV guest

As of now, only PV guests can take advantage of vNUMA functionality.
The vNUMA Linux patches should be applied and NUMA support should be
compiled into the kernel.

Examples of booting vNUMA enabled PV Linux guest on real NUMA machine:

1. Automatic vNUMA placement on real NUMA machine:

VM config:

memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = "0, 2"
vnode_to_pnode = [0, 1]
#cpus="0-3"

Xen:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN) Node 0: 2097152
(XEN) Node 1: 2097152
(XEN) Domain has 4 vnodes
(XEN) vnode 0 - pnode 0 (4096) MB,
(XEN) vnode 1 - pnode 0 (4096) MB,
(XEN) vnode 2 - pnode 1 (4096) MB,
(XEN) vnode 3 - pnode 1 (4096) MB,
(XEN) Domain vcpu to vnode: 0 1 2 3

dmesg on pv guest:

[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xffffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1ffffffff]
[ 0.000000] node 2: [mem 0x200000000-0x2ffffffff]
[ 0.000000] node 3: [mem 0x300000000-0x3ffffffff]
[ 0.000000] On node 0 totalpages: 1048479
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 1044480 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 2 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 3 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched to apic NOOP
[ 0.000000] nr_irqs_gsi: 16
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[ 0.000000] e820: cannot find a gap in the 32bit address range
[ 0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[ 0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 4.4-unstable (preserve-AD)
[ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4
[ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152
[ 0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152
[ 0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3

pv guest: numactl --hardware:

root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

Comments:
None of the vNUMA configuration options is valid, so default values
were used. Since the machine is a NUMA machine and no vcpu pinning is
defined, the automatic NUMA node selection mechanism is used, and you
can see how the vnodes were split across the physical nodes.

2. vNUMA enabled guest, no default values, real NUMA machine

VM config:

memory = 4096
vcpus = 4
name = "rc9"
vnodes = 2
vnumamem = [2048, 2048]
vdistance = [10, 40, 40, 10]
vnuma_vcpumap ="1, 0, 1, 0"
vnuma_vnodemap = [1, 0]

Xen:

(XEN) 'u' pressed -> dumping numa info (now-0xA86:BD6C8829)
(XEN) idx0 -> NODE0 start->0 size->4521984 free->131471
(XEN) phys_to_nid(0000000000001000) -> 0 should be 0
(XEN) idx1 -> NODE1 start->4521984 size->4194304 free->341610
(XEN) phys_to_nid(0000000450001000) -> 1 should be 1
(XEN) CPU0 -> NODE0
(XEN) CPU1 -> NODE0
(XEN) CPU2 -> NODE0
(XEN) CPU3 -> NODE0
(XEN) CPU4 -> NODE1
(XEN) CPU5 -> NODE1
(XEN) CPU6 -> NODE1
(XEN) CPU7 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 6 (total: 1048576):
(XEN) Node 0: 524288
(XEN) Node 1: 524288
(XEN) Domain has 2 vnodes
(XEN) vnode 0 - pnode 1 (2048) MB,
(XEN) vnode 1 - pnode 0 (2048) MB,
(XEN) Domain vcpu to vnode: 1 0 1 0

pv guest dmesg:

[ 0.000000] NUMA: Initialized distance table, cnt=2
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] NODE_DATA [mem 0x7ffd9000-0x7fffffff]
[ 0.000000] Initmem setup node 1 [mem 0x80000000-0xffffffff]
[ 0.000000] NODE_DATA [mem 0xff7f8000-0xff81efff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0x7fffffff]
[ 0.000000] node 1: [mem 0x80000000-0xffffffff]
[ 0.000000] On node 0 totalpages: 524191
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 7112 pages used for memmap
[ 0.000000] DMA32 zone: 520192 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 524288
[ 0.000000] DMA32 zone: 7168 pages used for memmap
[ 0.000000] DMA32 zone: 524288 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched to apic NOOP
[ 0.000000] nr_irqs_gsi: 16
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[ 0.000000] e820: cannot find a gap in the 32bit address range
[ 0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[ 0.000000] e820: [mem 0x100100000-0x1004fffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 4.4-unstable (preserve-AD)
[ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:2
[ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007fc00000 s85376 r8192 d21120 u1048576
[ 0.000000] pcpu-alloc: s85376 r8192 d21120 u1048576 alloc=1*2097152
[ 0.000000] pcpu-alloc: [0] 0 2 [1] 1 3

pv guest:

root@heatpipe:~# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 1 3
node 0 size: 2011 MB
node 0 free: 1975 MB
node 1 cpus: 0 2
node 1 size: 2003 MB
node 1 free: 1983 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

Comments:
In this case every config option is valid, and the guest gets exactly
the vNUMA topology defined in the VM config file.

Elena Ufimtseva (7):
  xen: vNUMA support for PV guests
  libxc: Plumb Xen with vNUMA topology for domain.
  libxc: vnodes allocation on NUMA nodes.
  libxl: vNUMA supporting interface.
  libxl: vNUMA configuration parser
  xen: adds vNUMA info debug-key u
  xl: docs for xl config vnuma options

 docs/man/xl.cfg.pod.5        |   47 +++++++
 tools/libxc/xc_dom.h         |   10 ++
 tools/libxc/xc_dom_x86.c     |   85 ++++++++++--
 tools/libxc/xc_domain.c      |   59 +++++++++
 tools/libxc/xenctrl.h        |    9 ++
 tools/libxc/xg_private.h     |    1 +
 tools/libxl/libxl.c          |   19 +++
 tools/libxl/libxl.h          |   18 +++
 tools/libxl/libxl_arch.h     |    8 ++
 tools/libxl/libxl_dom.c      |  186 +++++++++++++++++++++++++-
 tools/libxl/libxl_internal.h |    3 +
 tools/libxl/libxl_types.idl  |    5 +-
 tools/libxl/libxl_x86.c      |   53 ++++++++
 tools/libxl/xl_cmdimpl.c     |  294 +++++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/numa.c          |   20 ++-
 xen/common/domain.c          |   10 ++
 xen/common/domctl.c          |   72 +++++++++++
 xen/common/memory.c          |   41 ++++++
 xen/include/public/domctl.h  |   14 ++
 xen/include/public/memory.h  |    8 ++
 xen/include/xen/domain.h     |   11 ++
 xen/include/xen/sched.h      |    1 +
 xen/include/xen/vnuma.h      |   18 +++
 23 files changed, 979 insertions(+), 13 deletions(-)
 create mode 100644 xen/include/xen/vnuma.h

--
1.7.10.4
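
For readers comparing the two example configs above, the following is a minimal sketch of the kind of consistency checks that would explain why example 1 falls back to defaults while example 2 is accepted. It is not the parser from this series: the function name, the dictionary keys (`mem`, `distance`, `vcpu_to_vnode`, `vnode_to_pnode`) and the exact rules (one memory size per vnode summing to the domain memory, a full vnodes x vnodes distance matrix, one map entry per vcpu and per vnode) are assumptions made for illustration only.

```python
def check_vnuma_config(cfg):
    """Return a list of problems found in a vNUMA config (empty if none).

    The rules below are assumptions for illustration, not the checks
    actually performed by the xl/libxl code in this series.
    """
    problems = []
    vnodes = cfg["vnodes"]
    vcpus = cfg["vcpus"]

    # One memory size (in MB) per vnode, adding up to the domain memory.
    mem = cfg.get("mem", [])
    if len(mem) != vnodes:
        problems.append("mem: expected %d entries, got %d" % (vnodes, len(mem)))
    elif sum(mem) != cfg["memory"]:
        problems.append("mem: sizes do not add up to domain memory")

    # Distances given as a flattened vnodes x vnodes matrix.
    distance = cfg.get("distance", [])
    if len(distance) != vnodes * vnodes:
        problems.append("distance: expected %d entries, got %d"
                        % (vnodes * vnodes, len(distance)))

    # One vnode index per vcpu, each index naming an existing vnode.
    vcpu_map = cfg.get("vcpu_to_vnode", [])
    if len(vcpu_map) != vcpus or any(not 0 <= v < vnodes for v in vcpu_map):
        problems.append("vcpu_to_vnode: need one valid vnode per vcpu")

    # One physical node per vnode.
    vnode_map = cfg.get("vnode_to_pnode", [])
    if len(vnode_map) != vnodes:
        problems.append("vnode_to_pnode: expected %d entries, got %d"
                        % (vnodes, len(vnode_map)))

    return problems


# Values taken from the two VM configs in this letter; key names are hypothetical.
EXAMPLE_1 = {"memory": 16384, "vcpus": 4, "vnodes": 4, "mem": [10, 10],
             "distance": [10, 30, 10, 30], "vcpu_to_vnode": [0, 2],
             "vnode_to_pnode": [0, 1]}
EXAMPLE_2 = {"memory": 4096, "vcpus": 4, "vnodes": 2, "mem": [2048, 2048],
             "distance": [10, 40, 40, 10], "vcpu_to_vnode": [1, 0, 1, 0],
             "vnode_to_pnode": [1, 0]}

if __name__ == "__main__":
    for name, cfg in (("example 1", EXAMPLE_1), ("example 2", EXAMPLE_2)):
        print(name, check_vnuma_config(cfg) or "ok")
```

Under these assumed rules, every vNUMA option in example 1 is sized for two vnodes while the config asks for four, so all four checks fail, which is consistent with the defaults seen in the first boot log. Example 2 passes every check.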