vNUMA introduction

This series of patches introduces vNUMA topology awareness and provides interfaces and data structures to enable vNUMA for PV guests. There is a plan to extend this support to dom0 and HVM domains.

vNUMA topology support must also be present in the PV guest kernel; the corresponding patches should be applied.

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when running workloads on NUMA machines. The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on NUMA/UMA machines and to map the vNUMA topology to physical NUMA nodes in an optimal way.

Xen vNUMA support

The current set of patches introduces a subop hypercall that is available to enlightened PV guests with the vNUMA patches applied. The domain structure was modified to hold the per-domain vNUMA topology for use by other vNUMA-aware subsystems (e.g. ballooning).

libxc

libxc provides interfaces to build PV guests with vNUMA support and, on NUMA machines, performs the initial memory allocation on physical NUMA nodes. This is implemented by utilizing the nodemap formed by automatic NUMA placement. Details are in patch #3.

libxl

libxl provides a way to predefine the vNUMA topology in the VM config: number of vnodes, memory arrangement, vcpu-to-vnode assignment, and the distance map.

PV guest

As of now, only PV guests can take advantage of the vNUMA functionality. The vNUMA Linux patches should be applied and NUMA support should be compiled into the kernel.

This patch set can be pulled from https://git.gitorious.org/xenvnuma/xenvnuma.git:v6
Linux patch set: https://git.gitorious.org/xenvnuma/linuxvnuma.git:v6

Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

1. Automatic vNUMA placement on a h/w NUMA machine:

VM config:

memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]

Xen:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN) Node 0: 2097152
(XEN) Node 1: 2097152
(XEN) Domain has 4 vnodes
(XEN) vnode 0 - pnode 0 (4096) MB
(XEN) vnode 1 - pnode 0 (4096) MB
(XEN) vnode 2 - pnode 1 (4096) MB
(XEN) vnode 3 - pnode 1 (4096) MB
(XEN) Domain vcpu to vnode:
(XEN) 0 1 2 3

dmesg on pv guest:

[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xffffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1ffffffff]
[ 0.000000] node 2: [mem 0x200000000-0x2ffffffff]
[ 0.000000] node 3: [mem 0x300000000-0x3ffffffff]
[ 0.000000] On node 0 totalpages: 1048479
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 1044480 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 2 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 3 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched
to apic NOOP [ 0.000000] nr_irqs_gsi: 16 [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff] [ 0.000000] e820: cannot find a gap in the 32bit address range [ 0.000000] e820: PCI devices with unassigned 32bit BARs may break! [ 0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices [ 0.000000] Booting paravirtualized kernel on Xen [ 0.000000] Xen version: 4.4-unstable (preserve-AD) [ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4 [ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152 [ 0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152 [ 0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 pv guest: numactl --hardware: root@heatpipe:~# numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 node 0 size: 4031 MB node 0 free: 3997 MB node 1 cpus: 1 node 1 size: 4039 MB node 1 free: 4022 MB node 2 cpus: 2 node 2 size: 4039 MB node 2 free: 4023 MB node 3 cpus: 3 node 3 size: 3975 MB node 3 free: 3963 MB node distances: node 0 1 2 3 0: 10 20 20 20 1: 20 10 20 20 2: 20 20 10 20 3: 20 20 20 10 Comments: None of the configuration options are correct so default values were used. Since machine is NUMA machine and there is no vcpu pinning defines, NUMA automatic node selection mechanism is used and you can see how vnodes were split across physical nodes. 2. Example with e820_host = 1 (32GB real NUMA machines, two nodes). pv config: memory = 4000 vcpus = 8 # The name of the domain, change this if you want more than 1 VM. name = "null" vnodes = 4 #vnumamem = [3000, 1000] vdistance = [10, 40] #vnuma_vcpumap = [1, 0, 3, 2] vnuma_vnodemap = [1, 0, 1, 0] #vnuma_autoplacement = 1 e820_host = 1 guest boot: [ 0.000000] Initializing cgroup subsys cpuset [ 0.000000] Initializing cgroup subsys cpu [ 0.000000] Initializing cgroup subsys cpuacct [ 0.000000] Linux version 3.12.0+ (assert@superpipe) (gcc version 4.7.2 (Debi an 4.7.2-5) ) #111 SMP Tue Dec 3 14:54:36 EST 2013 [ 0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_loglvl=all LOGLEVEL=8 earlyprintkxen sched_debug [ 0.000000] ACPI in unprivileged domain disabled [ 0.000000] Freeing ac228-fa000 pfn range: 318936 pages freed [ 0.000000] 1-1 mapping on ac228->100000 [ 0.000000] Released 318936 pages of unused memory [ 0.000000] Set 343512 page(s) to 1-1 mapping [ 0.000000] Populating 100000-14ddd8 pfn range: 318936 pages added [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable [ 0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved [ 0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable [ 0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved [ 0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable [ 0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved [ 0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable [ 0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved [ 0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable [ 0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved [ 0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b6fff] unusable [ 0.000000] Xen: [mem 0x00000000ac6b7000-0x00000000ac7fafff] ACPI NVS [ 0.000000] Xen: [mem 
0x00000000ac7fb000-0x00000000ac80efff] unusable [ 0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data [ 0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable [ 0.000000] Xen: [mem 0x00000000ac811000-0x00000000ac812fff] ACPI data [ 0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable [ 0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved [ 0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved [ 0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved [ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved [ 0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved [ 0.000000] Xen: [mem 0x0000000100000000-0x000000014ddd7fff] usable [ 0.000000] bootconsole [xenboot0] enabled [ 0.000000] NX (Execute Disable) protection: active [ 0.000000] DMI not present or invalid. [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable [ 0.000000] No AGP bridge found [ 0.000000] e820: last_pfn = 0x14ddd8 max_arch_pfn = 0x400000000 [ 0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000 [ 0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576 [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff] [ 0.000000] [mem 0x00000000-0x000fffff] page 4k [ 0.000000] init_memory_mapping: [mem 0x14da00000-0x14dbfffff] [ 0.000000] [mem 0x14da00000-0x14dbfffff] page 4k [ 0.000000] BRK [0x019bd000, 0x019bdfff] PGTABLE [ 0.000000] BRK [0x019be000, 0x019befff] PGTABLE [ 0.000000] init_memory_mapping: [mem 0x14c000000-0x14d9fffff] [ 0.000000] [mem 0x14c000000-0x14d9fffff] page 4k [ 0.000000] BRK [0x019bf000, 0x019bffff] PGTABLE [ 0.000000] BRK [0x019c0000, 0x019c0fff] PGTABLE [ 0.000000] BRK [0x019c1000, 0x019c1fff] PGTABLE [ 0.000000] BRK [0x019c2000, 0x019c2fff] PGTABLE [ 0.000000] init_memory_mapping: [mem 0x100000000-0x14bffffff] [ 0.000000] [mem 0x100000000-0x14bffffff] page 4k [ 0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff] [ 0.000000] [mem 0x00100000-0xac227fff] page 4k [ 0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff] [ 0.000000] [mem 0x14dc00000-0x14ddd7fff] page 4k [ 0.000000] RAMDISK: [mem 0x01dc8000-0x0346ffff] [ 0.000000] NUMA: Initialized distance table, cnt=4 [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x3e7fffff] [ 0.000000] NODE_DATA [mem 0x3e7d9000-0x3e7fffff] [ 0.000000] Initmem setup node 1 [mem 0x3e800000-0x7cffffff] [ 0.000000] NODE_DATA [mem 0x7cfd9000-0x7cffffff] [ 0.000000] Initmem setup node 2 [mem 0x7d000000-0x10f5dffff] [ 0.000000] NODE_DATA [mem 0x10f5b9000-0x10f5dffff] [ 0.000000] Initmem setup node 3 [mem 0x10f800000-0x14ddd7fff] [ 0.000000] NODE_DATA [mem 0x14ddad000-0x14ddd3fff] [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] [ 0.000000] Normal [mem 0x100000000-0x14ddd7fff] [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x0009ffff] [ 0.000000] node 0: [mem 0x00100000-0x3e7fffff] [ 0.000000] node 1: [mem 0x3e800000-0x7cffffff] [ 0.000000] node 2: [mem 0x7d000000-0xac227fff] [ 0.000000] node 2: [mem 0x100000000-0x10f5dffff] [ 0.000000] node 3: [mem 0x10f5e0000-0x14ddd7fff] [ 0.000000] On node 0 totalpages: 255903 [ 0.000000] DMA zone: 56 pages used for memmap [ 0.000000] DMA zone: 21 pages reserved [ 0.000000] DMA zone: 3999 pages, LIFO batch:0 [ 0.000000] DMA32 zone: 3444 pages used for memmap [ 0.000000] DMA32 zone: 
251904 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 256000
[ 0.000000] DMA32 zone: 3500 pages used for memmap
[ 0.000000] DMA32 zone: 256000 pages, LIFO batch:31
[ 0.000000] On node 2 totalpages: 256008
[ 0.000000] DMA32 zone: 2640 pages used for memmap
[ 0.000000] DMA32 zone: 193064 pages, LIFO batch:31
[ 0.000000] Normal zone: 861 pages used for memmap
[ 0.000000] Normal zone: 62944 pages, LIFO batch:15
[ 0.000000] On node 3 totalpages: 255992
[ 0.000000] Normal zone: 3500 pages used for memmap
[ 0.000000] Normal zone: 255992 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 8 CPUs, 0 hotplug CPUs

root@heatpipe:~# numactl --ha
available: 4 nodes (0-3)
node 0 cpus: 0 4
node 0 size: 977 MB
node 0 free: 947 MB
node 1 cpus: 1 5
node 1 size: 985 MB
node 1 free: 974 MB
node 2 cpus: 2 6
node 2 size: 985 MB
node 2 free: 973 MB
node 3 cpus: 3 7
node 3 size: 969 MB
node 3 free: 958 MB
node distances:
node 0 1 2 3
0: 10 40 40 40
1: 40 10 40 40
2: 40 40 10 40
3: 40 40 40 10

root@heatpipe:~# numastat -m
Per-node system memory usage (in MBs):
Node 0 Node 1 Node 2 Node 3 Total
--------------- --------------- --------------- --------------- ---------------
MemTotal 977.14 985.50 985.44 969.91 3917.99

hypervisor: xl debug-keys u

(XEN) 'u' pressed -> dumping numa info (now-0x2A3:F7B8CB0F)
(XEN) Domain 2 (total: 1024000):
(XEN) Node 0: 415468
(XEN) Node 1: 608532
(XEN) Domain has 4 vnodes
(XEN) vnode 0 - pnode 1 1000 MB, vcpus: 0 4
(XEN) vnode 1 - pnode 0 1000 MB, vcpus: 1 5
(XEN) vnode 2 - pnode 1 2341 MB, vcpus: 2 6
(XEN) vnode 3 - pnode 0 999 MB, vcpus: 3 7

The size discrepancy is caused by the way the vnode size is calculated from the guest pfns: end - start. The hole, in this case roughly 1.3 GB, is therefore included in the reported size.
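As a back-of-the-envelope illustration of that end - start accounting (the struct mirrors vmemrange from the public interface, but the addresses and hole size below are invented, not taken from the log above):

    /* Illustration only: a vnode whose pfn range straddles a non-RAM hole
     * reports the whole range as its size. */
    #include <stdint.h>
    #include <stdio.h>

    struct vmemrange { uint64_t start, end; };   /* as in xen/include/public/vnuma.h */

    int main(void)
    {
        struct vmemrange v = { .start = 2ULL << 30, .end = 6ULL << 30 };  /* 4 GB span */
        uint64_t hole = 1340ULL << 20;           /* ~1.3 GB non-RAM gap inside the span */

        printf("reported vnode size: %llu MB (end - start)\n",
               (unsigned long long)((v.end - v.start) >> 20));
        printf("RAM actually backing it: %llu MB\n",
               (unsigned long long)((v.end - v.start - hole) >> 20));
        return 0;
    }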
3. Zero vNUMA configuration for every PV domain. There will be at least one vNUMA node even if no vNUMA topology was specified.

pv config:

memory = 4000
vcpus = 8
# The name of the domain, change this if you want more than 1 VM.
name = "null"
#vnodes = 4
vnumamem = [3000, 1000]
vdistance = [10, 40]
vnuma_vcpumap = [1, 0, 3, 2]
vnuma_vnodemap = [1, 0, 1, 0]
vnuma_autoplacement = 1
e820_host = 1

boot:

[ 0.000000] init_memory_mapping: [mem 0x14dc00000-0x14ddd7fff]
[ 0.000000] [mem 0x14dc00000-0x14ddd7fff] page 4k
[ 0.000000] RAMDISK: [mem 0x01dc8000-0x0346ffff]
[ 0.000000] NUMA: Initialized distance table, cnt=1
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x14ddd7fff]
[ 0.000000] NODE_DATA [mem 0x14ddad000-0x14ddd3fff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal [mem 0x100000000-0x14ddd7fff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xac227fff]
[ 0.000000] node 0: [mem 0x100000000-0x14ddd7fff]

root@heatpipe:~# numactl --ha
maxn: 0
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 3918 MB
node 0 free: 3853 MB
node distances:
node 0
0: 10

root@heatpipe:~# numastat -m
Per-node system memory usage (in MBs):
Node 0 Total
--------------- ---------------
MemTotal 3918.74 3918.74

hypervisor: xl debug-keys u

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 6787432):
(XEN) Node 0: 3485706
(XEN) Node 1: 3301726
(XEN) Domain 3 (total: 1024000):
(XEN) Node 0: 512000
(XEN) Node 1: 512000
(XEN) Domain has 1 vnodes
(XEN) vnode 0 - pnode any 5341 MB, vcpus: 0 1 2 3 4 5 6 7

Notes:
To enable vNUMA in a PV guest, the corresponding patch set should be applied:
https://git.gitorious.org/xenvnuma/linuxvnuma.git:v5 or
https://www.gitorious.org/xenvnuma/linuxvnuma/commit/deaa014257b99f57c76fbba12a28907786cbe17d.

Issues:
The most important issue right now is automatic NUMA balancing for the Linux PV kernel, as it corrupts user space memory. Since v3 of this patch series, Linux kernel 3.13 seemed to perform correctly, but with the recent changes the issue is back. See https://lkml.org/lkml/2013/10/31/133 for the urgent patch that presumably had NUMA balancing working. Since 3.12 there have been multiple changes to automatic NUMA balancing. I am currently back to investigating whether anything should be done on the hypervisor side and will work with the kernel maintainers.
Elena Ufimtseva (7): xen: vNUMA support for PV guests libxc: Plumb Xen with vNUMA topology for domain xl: vnuma memory parsing and supplement functions xl: vnuma distance, vcpu and pnode masks parser libxc: vnuma memory domain allocation libxl: vNUMA supporting interface xen: adds vNUMA info debug-key u docs/man/xl.cfg.pod.5 | 60 +++++++ tools/libxc/xc_dom.h | 10 ++ tools/libxc/xc_dom_x86.c | 63 +++++-- tools/libxc/xc_domain.c | 64 +++++++ tools/libxc/xenctrl.h | 9 + tools/libxc/xg_private.h | 1 + tools/libxl/libxl.c | 18 ++ tools/libxl/libxl.h | 20 +++ tools/libxl/libxl_arch.h | 6 + tools/libxl/libxl_dom.c | 158 ++++++++++++++++-- tools/libxl/libxl_internal.h | 6 + tools/libxl/libxl_numa.c | 49 ++++++ tools/libxl/libxl_types.idl | 6 +- tools/libxl/libxl_vnuma.h | 11 ++ tools/libxl/libxl_x86.c | 123 ++++++++++++++ tools/libxl/xl_cmdimpl.c | 380 ++++++++++++++++++++++++++++++++++++++++++ xen/arch/x86/numa.c | 30 +++- xen/common/domain.c | 10 ++ xen/common/domctl.c | 79 +++++++++ xen/common/memory.c | 96 +++++++++++ xen/include/public/domctl.h | 29 ++++ xen/include/public/memory.h | 17 ++ xen/include/public/vnuma.h | 59 +++++++ xen/include/xen/domain.h | 8 + xen/include/xen/sched.h | 1 + 25 files changed, 1282 insertions(+), 31 deletions(-) create mode 100644 tools/libxl/libxl_vnuma.h create mode 100644 xen/include/public/vnuma.h -- 1.7.10.4
Defines interface, structures and hypercalls for toolstack to build vnuma topology and for guests that wish to retreive it. Two subop hypercalls introduced by patch: XEN_DOMCTL_setvnumainfo to define vNUMA domain topology per domain and XENMEM_get_vnuma_info to retreive that topology by guest. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- Changes since v3: - added subop hypercall to retrive number of vnodes and vcpus for domain to make correct allocations before requesting vnuma topology. --- xen/common/domain.c | 10 +++++ xen/common/domctl.c | 79 +++++++++++++++++++++++++++++++++++ xen/common/memory.c | 96 +++++++++++++++++++++++++++++++++++++++++++ xen/include/public/domctl.h | 29 +++++++++++++ xen/include/public/memory.h | 17 ++++++++ xen/include/public/vnuma.h | 59 ++++++++++++++++++++++++++ xen/include/xen/domain.h | 8 ++++ xen/include/xen/sched.h | 1 + 8 files changed, 299 insertions(+) create mode 100644 xen/include/public/vnuma.h diff --git a/xen/common/domain.c b/xen/common/domain.c index 2cbc489..8f5c665 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -511,6 +511,15 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d) return 0; } +static void vnuma_destroy(struct vnuma_info *vnuma) +{ + vnuma->nr_vnodes = 0; + xfree(vnuma->vmemrange); + xfree(vnuma->vcpu_to_vnode); + xfree(vnuma->vdistance); + xfree(vnuma->vnode_to_pnode); +} + int domain_kill(struct domain *d) { int rc = 0; @@ -531,6 +540,7 @@ int domain_kill(struct domain *d) tmem_destroy(d->tmem); domain_set_outstanding_pages(d, 0); d->tmem = NULL; + vnuma_destroy(&d->vnuma); /* fallthrough */ case DOMDYING_dying: rc = domain_relinquish_resources(d); diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 904d27b..4f5a17c 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -29,6 +29,7 @@ #include <asm/page.h> #include <public/domctl.h> #include <xsm/xsm.h> +#include <public/vnuma.h> static DEFINE_SPINLOCK(domctl_lock); DEFINE_SPINLOCK(vcpu_alloc_lock); @@ -889,6 +890,84 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) } break; + case XEN_DOMCTL_setvnumainfo: + { + unsigned int dist_size, nr_vnodes, i; + + ret = -EINVAL; + + /* + * If number of vnodes was set before, + * dont initilize it again. 
+ */ + if ( d->vnuma.nr_vnodes > 0 ) + break; + + nr_vnodes = op->u.vnuma.nr_vnodes; + if ( nr_vnodes == 0 ) + break; + if ( nr_vnodes > (UINT_MAX / nr_vnodes) ) + break; + + ret = -EFAULT; + if ( guest_handle_is_null(op->u.vnuma.vdistance) || + guest_handle_is_null(op->u.vnuma.vmemrange) || + guest_handle_is_null(op->u.vnuma.vcpu_to_vnode) || + guest_handle_is_null(op->u.vnuma.vnode_to_pnode) ) + goto setvnumainfo_out; + + dist_size = nr_vnodes * nr_vnodes; + + d->vnuma.vdistance = xmalloc_array(unsigned int, dist_size); + d->vnuma.vmemrange = xmalloc_array(vmemrange_t, nr_vnodes); + d->vnuma.vcpu_to_vnode = xmalloc_array(unsigned int, d->max_vcpus); + d->vnuma.vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes); + + if ( d->vnuma.vdistance == NULL || + d->vnuma.vmemrange == NULL || + d->vnuma.vcpu_to_vnode == NULL || + d->vnuma.vnode_to_pnode == NULL ) + { + ret = -ENOMEM; + goto setvnumainfo_out; + } + + if ( unlikely(copy_from_guest(d->vnuma.vdistance, + op->u.vnuma.vdistance, + dist_size)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vmemrange, + op->u.vnuma.vmemrange, + nr_vnodes)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vcpu_to_vnode, + op->u.vnuma.vcpu_to_vnode, + d->max_vcpus)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vnode_to_pnode, + op->u.vnuma.vnode_to_pnode, + nr_vnodes)) ) + goto setvnumainfo_out; + + /* Everything is good, lets set the number of vnodes */ + d->vnuma.nr_vnodes = nr_vnodes; + + for ( i = 0; i < nr_vnodes; i++ ) + d->vnuma.vmemrange[i]._reserved = 0; + + ret = 0; + + setvnumainfo_out: + if ( ret != 0 ) + { + xfree(d->vnuma.vdistance); + xfree(d->vnuma.vmemrange); + xfree(d->vnuma.vcpu_to_vnode); + xfree(d->vnuma.vnode_to_pnode); + } + } + break; + default: ret = arch_do_domctl(op, d, u_domctl); break; diff --git a/xen/common/memory.c b/xen/common/memory.c index 50b740f..5bfab08 100644 --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -28,6 +28,7 @@ #include <public/memory.h> #include <xsm/xsm.h> #include <xen/trace.h> +#include <public/vnuma.h> struct memop_args { /* INPUT */ @@ -733,6 +734,101 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg) break; + case XENMEM_get_vnuma_info: + { + struct vnuma_topology_info mtopology; + struct domain *d; + + if ( copy_from_guest(&mtopology, arg, 1) ) + { + printk(KERN_INFO "COpy from guest of mtopology failed.\n"); + return -EFAULT; + } + if ( (d = rcu_lock_domain_by_any_id(mtopology.domid)) == NULL ) + return -ESRCH; + + if ( (d->vnuma.nr_vnodes == 0) || + (d->vnuma.nr_vnodes > d->max_vcpus) ) + { + rc = -EOPNOTSUPP; + goto vnumainfo_out; + } + + rc = -EFAULT; + + if ( guest_handle_is_null(mtopology.vmemrange.h) || + guest_handle_is_null(mtopology.vdistance.h) || + guest_handle_is_null(mtopology.vcpu_to_vnode.h)|| + guest_handle_is_null(mtopology.nr_vnodes.h) ) + goto vnumainfo_out; + + if ( __copy_to_guest(mtopology.vmemrange.h, + d->vnuma.vmemrange, + d->vnuma.nr_vnodes) != 0 ) + goto vnumainfo_out; + if ( __copy_to_guest(mtopology.vdistance.h, + d->vnuma.vdistance, + d->vnuma.nr_vnodes * d->vnuma.nr_vnodes) != 0 ) + goto vnumainfo_out; + if ( __copy_to_guest(mtopology.vcpu_to_vnode.h, + d->vnuma.vcpu_to_vnode, + d->max_vcpus) != 0 ) + goto vnumainfo_out; + + if ( __copy_to_guest(mtopology.nr_vnodes.h, &d->vnuma.nr_vnodes, 1) != 0 ) + goto vnumainfo_out; + + rc = 0; + + vnumainfo_out: + rcu_unlock_domain(d); + if ( rc != 0 ) { + printk(KERN_INFO "Problem with some parts of vnuma hypercall\n"); + } + break; + } 
+ + /* only two fields are used here from vnuma_topology_info: + * nr_vnodes and max_vcpus. Used by guest to allocate correct + * size of vnuma topology arrays. + */ + case XENMEM_get_vnodes_vcpus: + { + struct vnuma_topology_info mtopology; + struct domain *d; + unsigned int nr_vnodes, max_vcpus; + + if ( copy_from_guest(&mtopology, arg, 1) ) + { + printk(KERN_INFO "Null pointer vnuma_nodes.\n"); + return -EFAULT; + } + if ( (d = rcu_lock_domain_by_any_id(mtopology.domid)) == NULL ) + return -ESRCH; + + nr_vnodes = d->vnuma.nr_vnodes; + max_vcpus = d->max_vcpus; + rcu_unlock_domain(d); + + rc = -EFAULT; + + /* check if its request to get number of nodes, first one */ + if ( guest_handle_is_null(mtopology.nr_vnodes.h) || + guest_handle_is_null(mtopology.nr_vcpus.h) ) + return rc; + + rc = __copy_to_guest(mtopology.nr_vnodes.h, &nr_vnodes, 1); + if (rc) + return rc; + + rc = __copy_to_guest(mtopology.nr_vcpus.h, &max_vcpus, 1); + if (rc) + return rc; + + rc = 0; + break; + } + default: rc = arch_memory_op(op, arg); break; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index 01a3652..0157a16 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -35,6 +35,7 @@ #include "xen.h" #include "grant_table.h" #include "hvm/save.h" +#include "vnuma.h" #define XEN_DOMCTL_INTERFACE_VERSION 0x00000009 @@ -869,6 +870,32 @@ struct xen_domctl_set_max_evtchn { typedef struct xen_domctl_set_max_evtchn xen_domctl_set_max_evtchn_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_max_evtchn_t); +/* + * XEN_DOMCTL_setvnumainfo: sets the vNUMA topology + * parameters from toolstack. + */ +struct xen_domctl_vnuma { + uint32_t nr_vnodes; + uint32_t __pad; + XEN_GUEST_HANDLE_64(uint) vdistance; + XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode; + /* + * vnodes to physical NUMA nodes mask. + * This will be kept on per-domain basis + * for requests by consumers as vnuma + * aware ballooning. + */ + XEN_GUEST_HANDLE_64(uint) vnode_to_pnode; + /* + * memory rages that vNUMA node can represent + * If more than one, its a linked list. + */ + XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange; +}; + +typedef struct xen_domctl_vnuma xen_domctl_vnuma_t; +DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t); + struct xen_domctl { uint32_t cmd; #define XEN_DOMCTL_createdomain 1 @@ -938,6 +965,7 @@ struct xen_domctl { #define XEN_DOMCTL_setnodeaffinity 68 #define XEN_DOMCTL_getnodeaffinity 69 #define XEN_DOMCTL_set_max_evtchn 70 +#define XEN_DOMCTL_setvnumainfo 71 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 @@ -998,6 +1026,7 @@ struct xen_domctl { struct xen_domctl_set_broken_page_p2m set_broken_page_p2m; struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu; struct xen_domctl_gdbsx_domstatus gdbsx_domstatus; + struct xen_domctl_vnuma vnuma; uint8_t pad[128]; } u; }; diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h index 7a26dee..f5ea6af 100644 --- a/xen/include/public/memory.h +++ b/xen/include/public/memory.h @@ -339,6 +339,23 @@ struct xen_pod_target { }; typedef struct xen_pod_target xen_pod_target_t; +/* + * XENMEM_get_vnuma_info used by caller to retrieve + * vNUMA topology constructed for particular domain. + * + * The data exchanged is presented by vnuma_topology_info. + */ +#define XENMEM_get_vnuma_info 25 + +/* + * XENMEM_get_vnuma_nodes used to retreive number of nodes + * and vcpus for allocating correct amount of memory for + * vnuma topology. 
+ */ +#define XENMEM_get_vnodes_vcpus 26 + +#define XENMEM_get_vnuma_pnode 27 + #if defined(__XEN__) || defined(__XEN_TOOLS__) #ifndef uint64_aligned_t diff --git a/xen/include/public/vnuma.h b/xen/include/public/vnuma.h new file mode 100644 index 0000000..32f860f --- /dev/null +++ b/xen/include/public/vnuma.h @@ -0,0 +1,59 @@ +#ifndef _XEN_PUBLIC_VNUMA_H +#define _XEN_PUBLIC_VNUMA_H + +#include "xen.h" + +/* + * Following structures are used to represent vNUMA + * topology to guest if requested. + */ + +/* + * Memory ranges can be used to define + * vNUMA memory node boundaries by the + * linked list. As of now, only one range + * per domain is suported. + */ +struct vmemrange { + uint64_t start, end; + uint64_t _reserved; +}; + +typedef struct vmemrange vmemrange_t; +DEFINE_XEN_GUEST_HANDLE(vmemrange_t); + +/* + * vNUMA topology specifies vNUMA node + * number, distance table, memory ranges and + * vcpu mapping provided for guests. + */ + +struct vnuma_topology_info { + /* IN */ + domid_t domid; + /* OUT */ + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _pad; + } nr_vnodes; + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _pad; + } nr_vcpus; + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _pad; + } vdistance; + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _pad; + } vcpu_to_vnode; + union { + XEN_GUEST_HANDLE(vmemrange_t) h; + uint64_t _pad; + } vmemrange; +}; +typedef struct vnuma_topology_info vnuma_topology_info_t; +DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t); + +#endif diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h index a057069..ee0eeee 100644 --- a/xen/include/xen/domain.h +++ b/xen/include/xen/domain.h @@ -89,4 +89,12 @@ extern unsigned int xen_processor_pmbits; extern bool_t opt_dom0_vcpus_pin; +struct vnuma_info { + unsigned int nr_vnodes; + unsigned int *vdistance; + unsigned int *vcpu_to_vnode; + unsigned int *vnode_to_pnode; + struct vmemrange *vmemrange; +}; + #endif /* __XEN_DOMAIN_H__ */ diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index cbdf377..3765eae 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -417,6 +417,7 @@ struct domain nodemask_t node_affinity; unsigned int last_alloc_node; spinlock_t node_affinity_lock; + struct vnuma_info vnuma; }; struct domain_setup_info -- 1.7.10.4
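The interface above implies a two-stage query from the guest: XENMEM_get_vnodes_vcpus first, so the guest can size its arrays, then XENMEM_get_vnuma_info to fill them (which is why the extra subop was added in v4). Below is a rough sketch of what the guest side might look like, assuming the usual Linux Xen hypercall wrappers (HYPERVISOR_memory_op, set_xen_guest_handle) and that the public vnuma.h header is available to the guest; the real enlightenment lives in the separate Linux patch series linked in the cover letter.

    /* Sketch only; the vnuma.h include path is an assumption, and error
     * handling plus the hand-off to NUMA init are simplified. */
    #include <linux/init.h>
    #include <linux/slab.h>
    #include <asm/xen/hypercall.h>
    #include <xen/interface/memory.h>
    #include <xen/interface/vnuma.h>

    static int __init xen_query_vnuma(void)
    {
        struct vnuma_topology_info topo = { .domid = DOMID_SELF };
        unsigned int nr_vnodes = 0, nr_vcpus = 0;
        unsigned int *vdistance = NULL, *vcpu_to_vnode = NULL;
        struct vmemrange *vmemrange = NULL;
        int rc;

        /* Stage 1: ask how big the arrays must be. */
        set_xen_guest_handle(topo.nr_vnodes.h, &nr_vnodes);
        set_xen_guest_handle(topo.nr_vcpus.h, &nr_vcpus);
        rc = HYPERVISOR_memory_op(XENMEM_get_vnodes_vcpus, &topo);
        if (rc || nr_vnodes == 0)
            return rc ? rc : -ENODATA;

        vmemrange     = kcalloc(nr_vnodes, sizeof(*vmemrange), GFP_KERNEL);
        vdistance     = kcalloc(nr_vnodes * nr_vnodes, sizeof(*vdistance), GFP_KERNEL);
        vcpu_to_vnode = kcalloc(nr_vcpus, sizeof(*vcpu_to_vnode), GFP_KERNEL);
        if (!vmemrange || !vdistance || !vcpu_to_vnode) {
            rc = -ENOMEM;
            goto out;
        }

        /* Stage 2: fetch the topology itself. */
        set_xen_guest_handle(topo.vmemrange.h, vmemrange);
        set_xen_guest_handle(topo.vdistance.h, vdistance);
        set_xen_guest_handle(topo.vcpu_to_vnode.h, vcpu_to_vnode);
        rc = HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &topo);
        /* ...on success, feed vmemrange/vdistance/vcpu_to_vnode into NUMA setup... */
    out:
        kfree(vmemrange);
        kfree(vdistance);
        kfree(vcpu_to_vnode);
        return rc;
    }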
Elena Ufimtseva
2013-Dec-04 05:47 UTC
[PATCH v4 2/7] libxc: Plumb Xen with vNUMA topology for domain
Per-domain vNUMA topology initialization. domctl hypercall is used to set vNUMA topology per domU during domain build time. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxc/xc_domain.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 9 +++++++ 2 files changed, 73 insertions(+) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 1ccafc5..a436a3a 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -1776,6 +1776,70 @@ int xc_domain_set_max_evtchn(xc_interface *xch, uint32_t domid, return do_domctl(xch, &domctl); } +/* Plumbs Xen with vNUMA topology */ +int xc_domain_setvnuma(xc_interface *xch, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode) +{ + int rc; + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) * + nr_vnodes * nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) * + nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + if ( nr_vnodes == 0 ) { + errno = EINVAL; + return -1; + } + + if ( vdistance == NULL || vcpu_to_vnode == NULL || + vmemrange == NULL || vnode_to_pnode == NULL ) { + PERROR("Incorrect parameters for XEN_DOMCTL_setvnumainfo.\n"); + errno = EINVAL; + return -1; + } + + if ( xc_hypercall_bounce_pre(xch, vmemrange) || + xc_hypercall_bounce_pre(xch, vdistance) || + xc_hypercall_bounce_pre(xch, vcpu_to_vnode) || + xc_hypercall_bounce_pre(xch, vnode_to_pnode) ) { + PERROR("Could not bounce buffer for xc_domain_setvnuma.\n"); + return -1; + } + + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); + + domctl.cmd = XEN_DOMCTL_setvnumainfo; + domctl.domain = (domid_t)domid; + domctl.u.vnuma.nr_vnodes = nr_vnodes; + domctl.u.vnuma.__pad = 0; + + rc = do_domctl(xch, &domctl); + + xc_hypercall_bounce_post(xch, vmemrange); + xc_hypercall_bounce_post(xch, vdistance); + xc_hypercall_bounce_post(xch, vcpu_to_vnode); + xc_hypercall_bounce_post(xch, vnode_to_pnode); + + if ( rc ) + errno = EFAULT; + return rc; +} + /* * Local variables: * mode: C diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 4ac6b8a..f360726 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -1136,6 +1136,15 @@ int xc_domain_set_memmap_limit(xc_interface *xch, uint32_t domid, unsigned long map_limitkb); +int xc_domain_setvnuma(xc_interface *xch, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode); + #if defined(__i386__) || defined(__x86_64__) /* * PC BIOS standard E820 types and structure. -- 1.7.10.4
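As a usage illustration, this is roughly what a standalone caller of the new libxc function could look like. The domid, sizes and the byte-address convention for vmemrange start/end are assumptions made purely for the example; in this series the real caller is libxl (the libxl vNUMA patch), and xen/vnuma.h may need to be included explicitly for vmemrange_t.

    /* Hypothetical caller pushing a two-vnode topology; error handling minimal. */
    #include <stdint.h>
    #include <xenctrl.h>
    #include <xen/vnuma.h>      /* vmemrange_t; include path is an assumption */

    int set_two_vnode_topology(uint32_t domid)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        /* Two 512 MB vnodes, 4 vcpus interleaved, local distance 10, remote 20. */
        vmemrange_t vmemrange[2] = {
            { .start = 0,            .end = 512ULL << 20 },
            { .start = 512ULL << 20, .end = 1024ULL << 20 },
        };
        unsigned int vdistance[4]      = { 10, 20, 20, 10 };
        unsigned int vcpu_to_vnode[4]  = { 0, 1, 0, 1 };
        unsigned int vnode_to_pnode[2] = { 0, 1 };
        int rc;

        if (!xch)
            return -1;
        rc = xc_domain_setvnuma(xch, domid, 2 /* vnodes */, 4 /* vcpus */,
                                vmemrange, vdistance,
                                vcpu_to_vnode, vnode_to_pnode);
        xc_interface_close(xch);
        return rc;
    }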
Elena Ufimtseva
2013-Dec-04 05:47 UTC
[PATCH v4 3/7] xl: vnuma memory parsing and supplement functions
Parses vnuma topoplogy number of nodes and memory ranges. If not defined, initializes vnuma with only one node and default topology. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- Changes since v3: - added subop hypercall to retrive number of vnodes and vcpus for domain to make correct allocations before requesting vnuma topology. --- tools/libxl/libxl_types.idl | 6 +- tools/libxl/libxl_vnuma.h | 11 ++ tools/libxl/xl_cmdimpl.c | 234 +++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 250 insertions(+), 1 deletion(-) create mode 100644 tools/libxl/libxl_vnuma.h diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index cba8eff..ba46f58 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -311,7 +311,11 @@ libxl_domain_build_info = Struct("domain_build_info",[ ("disable_migrate", libxl_defbool), ("cpuid", libxl_cpuid_policy_list), ("blkdev_start", string), - + ("vnuma_memszs", Array(uint64, "nr_vnodes")), + ("vcpu_to_vnode", Array(uint32, "nr_vnodemap")), + ("vdistance", Array(uint32, "nr_vdist")), + ("vnode_to_pnode", Array(uint32, "nr_vnode_to_pnode")), + ("vnuma_placement", libxl_defbool), ("device_model_version", libxl_device_model_version), ("device_model_stubdomain", libxl_defbool), # if you set device_model you must set device_model_version too diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h new file mode 100644 index 0000000..f1568ae --- /dev/null +++ b/tools/libxl/libxl_vnuma.h @@ -0,0 +1,11 @@ +#include "libxl_osdeps.h" /* must come before any other headers */ + +#define VNUMA_NO_NODE ~((unsigned int)0) + +/* + * Max vNUMA node size in Mb is taken 64Mb even now Linux lets + * 32Mb, thus letting some slack. Will be modified to match Linux. + */ +#define MIN_VNODE_SIZE 64U + +#define MAX_VNUMA_NODES (unsigned int)1 << 10 diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 341863e..c79e73e 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -40,6 +40,7 @@ #include "libxl_json.h" #include "libxlutil.h" #include "xl.h" +#include "libxl_vnuma.h" #define CHK_ERRNO( call ) ({ \ int chk_errno = (call); \ @@ -622,6 +623,134 @@ vcpp_out: return rc; } +/* Should exit after calling this */ +static void vnuma_info_release(libxl_domain_build_info *info) +{ + free(info->vnuma_memszs); + free(info->vdistance); + free(info->vcpu_to_vnode); + free(info->vnode_to_pnode); + info->nr_vnodes = 0; +} + +static void vdistance_default(unsigned int *vdistance, + unsigned int nr_vnodes, + unsigned int samenode, + unsigned int othernode) +{ + int i, j; + for (i = 0; i < nr_vnodes; i++) + for (j = 0; j < nr_vnodes; j++) + *(vdistance + j * nr_vnodes + i) = i == j ? 
samenode : othernode; +} + +static void vcputovnode_default(unsigned int *vcpu_to_vnode, + unsigned int nr_vnodes, + unsigned int max_vcpus) +{ + int i; + for(i = 0; i < max_vcpus; i++) + vcpu_to_vnode[i] = i % nr_vnodes; +} + +/* Split domain memory between vNUMA nodes equally */ +static int split_vnumamem(libxl_domain_build_info *b_info) +{ + unsigned long long vnodemem = 0; + unsigned long n; + unsigned int i; + + /* In MBytes */ + vnodemem = (b_info->max_memkb >> 10) / b_info->nr_vnodes; + if (vnodemem < MIN_VNODE_SIZE) + return -1; + /* reminder in MBytes */ + n = (b_info->max_memkb >> 10) % b_info->nr_vnodes; + /* get final sizes in MBytes */ + for(i = 0; i < (b_info->nr_vnodes - 1); i++) + b_info->vnuma_memszs[i] = vnodemem; + /* add the reminder to the last node */ + b_info->vnuma_memszs[i] = vnodemem + n; + return 0; +} + +static void vnode_to_pnode_default(unsigned int *vnode_to_pnode, + unsigned int nr_vnodes) +{ + unsigned int i; + for (i = 0; i < nr_vnodes; i++) + vnode_to_pnode[i] = VNUMA_NO_NODE; +} + +/* + * vNUMA zero config initialization for every pv domain that has + * no vnuma defined in config file. + */ +static int vnuma_zero_config(libxl_domain_build_info *b_info) +{ + b_info->nr_vnodes = 1; + /* dont leak memory with realloc */ + unsigned int *vdist, *vntop, *vcputov; + uint64_t *memsz; + + /* all memory goes to this one vnode */ + memsz = b_info->vnuma_memszs; + b_info->vnuma_memszs = (uint64_t *)realloc(b_info->vnuma_memszs, + b_info->nr_vnodes * + sizeof(*b_info->vnuma_memszs)); + if (b_info->vnuma_memszs == NULL) { + b_info->vnuma_memszs = memsz; + goto bad_vnumazerocfg; + } + b_info->vnuma_memszs[0] = b_info->max_memkb >> 10; + + /* all vcpus assigned to this vnode */ + vcputov = b_info->vcpu_to_vnode; + b_info->vcpu_to_vnode = (unsigned int *)realloc( + b_info->vcpu_to_vnode, + b_info->max_vcpus * + sizeof(*b_info->vcpu_to_vnode)); + if (b_info->vcpu_to_vnode == NULL) { + b_info->vcpu_to_vnode = vcputov; + goto bad_vnumazerocfg; + } + vcputovnode_default(b_info->vcpu_to_vnode, + b_info->nr_vnodes, + b_info->max_vcpus); + + /* default vdistance 10 */ + vdist = b_info->vdistance; + b_info->vdistance = (unsigned int *)realloc(b_info->vdistance, + b_info->nr_vnodes * b_info->nr_vnodes * + sizeof(*b_info->vdistance)); + if (b_info->vdistance == NULL) { + b_info->vdistance = vdist; + goto bad_vnumazerocfg; + } + vdistance_default(b_info->vdistance, b_info->nr_vnodes, 10, 10); + + /* VNUMA_NO_NODE for vnode_to_pnode */ + vntop = b_info->vnode_to_pnode; + b_info->vnode_to_pnode = (unsigned int *)realloc(b_info->vnode_to_pnode, + b_info->nr_vnodes * + sizeof(*b_info->vnode_to_pnode)); + if (b_info->vnode_to_pnode == NULL) { + b_info->vnode_to_pnode = vntop; + goto bad_vnumazerocfg; + } + vnode_to_pnode_default(b_info->vnode_to_pnode, b_info->nr_vnodes); + + /* + * will be placed to some physical nodes defined by automatic + * numa placement or VNUMA_NO_NODE will not request exact node + */ + libxl_defbool_set(&b_info->vnuma_placement, true); + return 0; + +bad_vnumazerocfg: + return -1; +} + static void parse_config_data(const char *config_source, const char *config_data, int config_len, @@ -960,6 +1089,11 @@ static void parse_config_data(const char *config_source, char *cmdline = NULL; const char *root = NULL, *extra = ""; + XLU_ConfigList *vnumamemcfg; + int nr_vnuma_regions; + unsigned long long vnuma_memparsed = 0; + unsigned long ul; + xlu_cfg_replace_string (config, "kernel", &b_info->u.pv.kernel, 0); xlu_cfg_get_string (config, "root", &root, 0); @@ -977,6 
+1111,106 @@ static void parse_config_data(const char *config_source, exit(1); } + libxl_defbool_set(&b_info->vnuma_placement, false); + + if (!xlu_cfg_get_long (config, "vnodes", &l, 0)) { + if (l > MAX_VNUMA_NODES) { + fprintf(stderr, "Too many vnuma nodes, max %d is allowed.\n", MAX_VNUMA_NODES); + exit(1); + } + + b_info->nr_vnodes = l; + + xlu_cfg_get_defbool(config, "vnuma_autoplacement", &b_info->vnuma_placement, 0); + + if (b_info->nr_vnodes != 0 && b_info->max_vcpus >= b_info->nr_vnodes) { + if (!xlu_cfg_get_list(config, "vnumamem", + &vnumamemcfg, &nr_vnuma_regions, 0)) { + + if (nr_vnuma_regions != b_info->nr_vnodes) { + fprintf(stderr, "Number of numa regions is incorrect.\n"); + exit(1); + } + + b_info->vnuma_memszs = calloc(b_info->nr_vnodes, + sizeof(*b_info->vnuma_memszs)); + if (b_info->vnuma_memszs == NULL) { + fprintf(stderr, "unable to allocate memory for vnuma ranges.\n"); + exit(1); + } + + char *ep; + /* + * Will parse only nr_vnodes times, even if we have more/less regions. + * Take care of it later if less or discard if too many regions. + */ + for (i = 0; i < b_info->nr_vnodes; i++) { + buf = xlu_cfg_get_listitem(vnumamemcfg, i); + if (!buf) { + fprintf(stderr, + "xl: Unable to get element %d in vnuma memroy list.\n", i); + break; + } + ul = strtoul(buf, &ep, 10); + if (ep == buf) { + fprintf(stderr, + "xl: Invalid argument parsing vnumamem: %s.\n", buf); + break; + } + + /* 32Mb is a min size for a node, taken from Linux */ + if (ul >= UINT32_MAX || ul < MIN_VNODE_SIZE) { + fprintf(stderr, "xl: vnuma memory %lu is not withing %u - %u range.\n", + ul, MIN_VNODE_SIZE, UINT32_MAX); + break; + } + + /* memory in MBytes */ + b_info->vnuma_memszs[i] = ul; + } + + /* Total memory for vNUMA parsed to verify */ + for(i = 0; i < nr_vnuma_regions; i++) + vnuma_memparsed = vnuma_memparsed + (b_info->vnuma_memszs[i]); + + /* Amount of memory for vnodes same as total? */ + if((vnuma_memparsed << 10) != (b_info->max_memkb)) { + fprintf(stderr, "xl: vnuma memory is not the same as initial domain memory size.\n"); + vnuma_info_release(b_info); + exit(1); + } + } else { + b_info->vnuma_memszs = calloc(b_info->nr_vnodes, + sizeof(*b_info->vnuma_memszs)); + if (b_info->vnuma_memszs == NULL) { + vnuma_info_release(b_info); + fprintf(stderr, "unable to allocate memory for vnuma ranges.\n"); + exit(1); + } + + fprintf(stderr, "WARNING: vNUMA memory ranges were not specified.\n"); + fprintf(stderr, "Using default equal vnode memory size %lu Kbytes to cover %lu Kbytes.\n", + b_info->max_memkb / b_info->nr_vnodes, b_info->max_memkb); + + if (split_vnumamem(b_info) < 0) { + fprintf(stderr, "Could not split vnuma memory into equal chunks.\n"); + vnuma_info_release(b_info); + exit(1); + } + } + } + else + if (vnuma_zero_config(b_info)) { + vnuma_info_release(b_info); + exit(1); + } + } + else + if (vnuma_zero_config(b_info)) { + vnuma_info_release(b_info); + exit(1); + } + xlu_cfg_replace_string (config, "bootloader", &b_info->u.pv.bootloader, 0); switch (xlu_cfg_get_list_as_string_list(config, "bootloader_args", &b_info->u.pv.bootloader_args, 1)) -- 1.7.10.4
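When vnumamem= is omitted, the parser above falls back to an equal split of the domain memory, with the remainder going to the last vnode (split_vnumamem()). A small standalone rendition of that arithmetic, using invented numbers:

    /* Standalone rendition of the equal-split logic; 4000 MB across 3 vnodes
     * is an invented example. */
    #include <stdio.h>
    #include <stdint.h>

    #define MIN_VNODE_SIZE 64U   /* MB, same floor as libxl_vnuma.h */

    int main(void)
    {
        uint64_t memszs[3];
        unsigned int nr_vnodes = 3, i;
        uint64_t mem_mb = 4000;

        uint64_t per_node  = mem_mb / nr_vnodes;      /* 1333 MB */
        uint64_t remainder = mem_mb % nr_vnodes;      /* 1 MB */

        if (per_node < MIN_VNODE_SIZE)
            return 1;                                 /* too many vnodes for this memory */

        for (i = 0; i < nr_vnodes; i++)
            memszs[i] = per_node;
        memszs[nr_vnodes - 1] += remainder;           /* last vnode absorbs the remainder */

        for (i = 0; i < nr_vnodes; i++)
            printf("vnode %u: %llu MB\n", i, (unsigned long long)memszs[i]);
        return 0;
    }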
Elena Ufimtseva
2013-Dec-04 05:47 UTC
[PATCH v4 4/7] xl: vnuma distance, vcpu and pnode masks parser
Parses vm config options vdistance, vcpu_to_vnuma mask and vnode_to_pnode mask. If not configures, uses default settings. Includes documentation about vnuma topology config parameters in xl.cfg. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- docs/man/xl.cfg.pod.5 | 60 +++++++++++++++++++ tools/libxl/xl_cmdimpl.c | 146 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 206 insertions(+) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index 3b227b7..ccc25de 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -216,6 +216,66 @@ if the values of B<memory=> and B<maxmem=> differ. A "pre-ballooned" HVM guest needs a balloon driver, without a balloon driver it will crash. +=item B<vnuma_nodes=N> + +Number of vNUMA nodes the guest will be initialized with on boot. + +=item B<vnuma_mem=[vmem1, vmem2, ...]> + +The vnode memory sizes defined in MBytes. If the sum of all vnode memories +does not match the domain memory or not all the nodes defined here, will fail. +If not specified, memory will be equally split between vnodes. Currently +minimum vnode size is 64MB. + +Example: vnuma_mem=[1024, 1024, 2048, 2048] + +=item B<vdistance=[d1, d2]> + +Defines the distance table for vNUMA nodes. Distance for NUMA machines usually + represented by two dimensional array and all distance may be spcified in one +line here, by rows. Distance can be specified as two numbers [d1, d2], +where d1 is same node distance, d2 is a value for all other distances. +If not specified, the defaul distance will be used, e.g. [10, 20]. + +Examples: +vnodes = 3 +vdistance=[10, 20] +will expand to this distance table (this is default setting as well): +[10, 20, 20] +[20, 10, 20] +[20, 20, 10] + +=item B<vnuma_vcpumap=[vcpu1, vcpu2, ...]> + +Defines vcpu to vnode mapping as a string of integers, representing node +numbers. If not defined, the vcpus are interleaved over the virtual nodes. +Current limitation: vNUMA nodes have to have at least one vcpu, otherwise +default vcpu_to_vnode will be used. + +Example: +to map 4 vcpus to 2 nodes - 0,1 vcpu -> vnode1, 2,3 vcpu -> vnode2: +vnuma_vcpumap = [0, 0, 1, 1] + +=item B<vnuma_vnodemap=[p1, p2, ..., pn]> + +vnode to pnode mapping. Can be configured if manual vnode allocation +required. Will be only taken into effect on real NUMA machines and if +memory or other constraints do not prevent it. If the mapping is ok, +automatic NUMA placement will be disabled. If the mapping incorrect +and vnuma_autoplacement is true, automatical numa placement will be used, +otherwise fails to create domain. + +Example: +assume two node NUMA node machine: +vnuma_vndoemap=[1, 0] +first vnode will be placed on node 1, second on node0. + +=item B<vnuma_autoplacement=[0|1]> + +If enabled, automatically will find the best placement physical node candidate for +each vnode if vnuma_vnodemap is incorrect or memory requirements prevent +using it. Set to ''0'' by default. 
+ =back =head3 Event Actions diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index c79e73e..f6a7774 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -633,6 +633,23 @@ static void vnuma_info_release(libxl_domain_build_info *info) info->nr_vnodes = 0; } +static int get_list_item_uint(XLU_ConfigList *list, unsigned int i) +{ + const char *buf; + char *ep; + unsigned long ul; + int rc = -EINVAL; + buf = xlu_cfg_get_listitem(list, i); + if (!buf) + return rc; + ul = strtoul(buf, &ep, 10); + if (ep == buf) + return rc; + if (ul >= UINT16_MAX) + return rc; + return (int)ul; +} + static void vdistance_default(unsigned int *vdistance, unsigned int nr_vnodes, unsigned int samenode, @@ -1090,7 +1107,9 @@ static void parse_config_data(const char *config_source, const char *root = NULL, *extra = ""; XLU_ConfigList *vnumamemcfg; + XLU_ConfigList *vdistancecfg, *vnodemap, *vcpumap; int nr_vnuma_regions; + int nr_vdist, nr_vnodemap; unsigned long long vnuma_memparsed = 0; unsigned long ul; @@ -1198,6 +1217,133 @@ static void parse_config_data(const char *config_source, exit(1); } } + + b_info->vdistance = calloc(b_info->nr_vnodes * b_info->nr_vnodes, + sizeof(*b_info->vdistance)); + if (b_info->vdistance == NULL) { + vnuma_info_release(b_info); + exit(1); + } + + if(!xlu_cfg_get_list(config, "vdistance", &vdistancecfg, &nr_vdist, 0) && + nr_vdist == 2) { + /* + * First value is the same node distance, the second as the + * rest of distances. The following is required right now to + * avoid non-symmetrical distance table as it may break latest kernel. + * TODO: Better way to analyze extended distance table, possibly + * OS specific. + */ + int d1, d2; + d1 = get_list_item_uint(vdistancecfg, 0); + d2 = get_list_item_uint(vdistancecfg, 1); + + if (d1 >= 0 && d2 >= 0 && d1 < d2) { + vdistance_default(b_info->vdistance, b_info->nr_vnodes, d1, d2); + } else { + fprintf(stderr, "WARNING: Distances are not correct.\n"); + vnuma_info_release(b_info); + exit(1); + } + + } else + vdistance_default(b_info->vdistance, b_info->nr_vnodes, 10, 20); + + b_info->vcpu_to_vnode = (unsigned int *)calloc(b_info->max_vcpus, + sizeof(*b_info->vcpu_to_vnode)); + if (b_info->vcpu_to_vnode == NULL) { + vnuma_info_release(b_info); + exit(1); + } + nr_vnodemap = 0; + if (!xlu_cfg_get_list(config, "vnuma_vcpumap", + &vcpumap, &nr_vnodemap, 0)) { + if (nr_vnodemap == b_info->max_vcpus) { + unsigned int vnodemask = 0, vnode = 0, smask, vmask; + smask = ~(~0 << b_info->nr_vnodes); + vmask = ~(~0 << nr_vnodemap); + for (i = 0; i < nr_vnodemap; i++) { + vnode = get_list_item_uint(vcpumap, i); + if (vnode >= 0 && vnode < b_info->nr_vnodes) { + vnodemask |= (1 << vnode); + b_info->vcpu_to_vnode[i] = vnode; + } + } + /* Did it covered all vnodes in the vcpu mask? */ + if ( !(((smask & vnodemask) + 1) == (1 << b_info->nr_vnodes)) ) { + fprintf(stderr, "WARNING: Not all vnodes were covered in vnuma_vcpumap.\n"); + vnuma_info_release(b_info); + exit(1); + } + } else { + fprintf(stderr, "WARNING: Bad vnuma_vcpumap.\n"); + vnuma_info_release(b_info); + exit(1); + } + } + else + vcputovnode_default(b_info->vcpu_to_vnode, + b_info->nr_vnodes, + b_info->max_vcpus); + + /* There is mapping to NUMA physical nodes? 
*/ + b_info->vnode_to_pnode = (unsigned int *)calloc(b_info->nr_vnodes, + sizeof(*b_info->vnode_to_pnode)); + if (b_info->vnode_to_pnode == NULL) { + vnuma_info_release(b_info); + exit(1); + } + nr_vnodemap = 0; + if (!xlu_cfg_get_list(config, "vnuma_vnodemap", &vnodemap, + &nr_vnodemap, 0)) { + /* + * If not specified or incorred, will be defined + * later based on the machine architecture, configuration + * and memory availble when creating domain. + */ + if (nr_vnodemap == b_info->nr_vnodes) { + unsigned int vnodemask = 0, pnode = 0, smask; + smask = ~(~0 << b_info->nr_vnodes); + for (i = 0; i < b_info->nr_vnodes; i++) { + pnode = get_list_item_uint(vnodemap, i); + if (pnode >= 0) { + vnodemask |= (1 << i); + b_info->vnode_to_pnode[i] = pnode; + } + } + /* Did it covered all vnodes in the mask? */ + if ( !(((vnodemask & smask) + 1) == (1 << nr_vnodemap)) ) { + fprintf(stderr, "WARNING: Not all vnodes were covered vnuma_vnodemap.\n"); + + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + vnode_to_pnode_default(b_info->vnode_to_pnode, b_info->nr_vnodes); + } else { + vnuma_info_release(b_info); + exit(1); + } + } + } else { + fprintf(stderr, "WARNING: Incorrect vnuma_vnodemap.\n"); + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + vnode_to_pnode_default(b_info->vnode_to_pnode, b_info->nr_vnodes); + } else { + vnuma_info_release(b_info); + exit(1); + } + } + } else { + fprintf(stderr, "WARNING: Missing vnuma_vnodemap.\n"); + + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + vnode_to_pnode_default(b_info->vnode_to_pnode, b_info->nr_vnodes); + } else { + vnuma_info_release(b_info); + exit(1); + } + } } else if (vnuma_zero_config(b_info)) { -- 1.7.10.4
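When vdistance= or vnuma_vcpumap= are omitted, the defaults documented above are a flat [10, 20] distance matrix and round-robin interleaving of vcpus over vnodes. A minimal standalone sketch of those defaults for a hypothetical 3-vnode, 6-vcpu guest, mirroring vdistance_default() and vcputovnode_default():

    /* Prints the default distance matrix and vcpu interleave for 3 vnodes, 6 vcpus. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int nr_vnodes = 3, max_vcpus = 6, i, j;
        unsigned int vdistance[3][3], vcpu_to_vnode[6];

        for (i = 0; i < nr_vnodes; i++)
            for (j = 0; j < nr_vnodes; j++)
                vdistance[i][j] = (i == j) ? 10 : 20;   /* same node 10, remote 20 */

        for (i = 0; i < max_vcpus; i++)
            vcpu_to_vnode[i] = i % nr_vnodes;           /* vcpus interleaved over vnodes */

        for (i = 0; i < nr_vnodes; i++) {
            for (j = 0; j < nr_vnodes; j++)
                printf("%u ", vdistance[i][j]);
            printf("\n");
        }
        for (i = 0; i < max_vcpus; i++)
            printf("vcpu %u -> vnode %u\n", i, vcpu_to_vnode[i]);
        return 0;
    }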
Elena Ufimtseva
2013-Dec-04 05:47 UTC
[PATCH v4 5/7] libxc: vnuma memory domain allocation
domain memory allocation with vnuma enabled. Every pv domain has at least one vnuma node and the vnode to pnode will be taken into account. if not it works as default allocation without using XENMEMF_exact_node memflags. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxc/xc_dom.h | 10 ++++++++ tools/libxc/xc_dom_x86.c | 63 +++++++++++++++++++++++++++++++++++++--------- tools/libxc/xg_private.h | 1 + 3 files changed, 62 insertions(+), 12 deletions(-) diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h index a183e62..6d07071 100644 --- a/tools/libxc/xc_dom.h +++ b/tools/libxc/xc_dom.h @@ -114,6 +114,15 @@ struct xc_dom_image { struct xc_dom_phys *phys_pages; int realmodearea_log; + /* + * vNUMA topology and memory allocation structure. + * Defines the way to allocate memory on per NUMA + * physical nodes that is defined by vnode_to_pnode. + */ + uint32_t nr_vnodes; + uint64_t *vnuma_memszs; + unsigned int *vnode_to_pnode; + /* malloc memory pool */ struct xc_dom_mem *memblocks; @@ -369,6 +378,7 @@ static inline xen_pfn_t xc_dom_p2m_guest(struct xc_dom_image *dom, int arch_setup_meminit(struct xc_dom_image *dom); int arch_setup_bootearly(struct xc_dom_image *dom); int arch_setup_bootlate(struct xc_dom_image *dom); +int arch_boot_numa_alloc(struct xc_dom_image *dom); /* * Local variables: diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c index e034d62..803e460 100644 --- a/tools/libxc/xc_dom_x86.c +++ b/tools/libxc/xc_dom_x86.c @@ -759,7 +759,7 @@ static int x86_shadow(xc_interface *xch, domid_t domid) int arch_setup_meminit(struct xc_dom_image *dom) { int rc; - xen_pfn_t pfn, allocsz, i, j, mfn; + xen_pfn_t pfn, i, j, mfn; rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type); if ( rc ) @@ -802,6 +802,7 @@ int arch_setup_meminit(struct xc_dom_image *dom) else { /* try to claim pages for early warning of insufficient memory avail */ + rc = 0; if ( dom->claim_enabled ) { rc = xc_domain_claim_pages(dom->xch, dom->guest_domid, dom->total_pages); @@ -813,23 +814,61 @@ int arch_setup_meminit(struct xc_dom_image *dom) dom->p2m_host[pfn] = pfn; /* allocate guest memory */ - for ( i = rc = allocsz = 0; - (i < dom->total_pages) && !rc; - i += allocsz ) - { - allocsz = dom->total_pages - i; - if ( allocsz > 1024*1024 ) - allocsz = 1024*1024; - rc = xc_domain_populate_physmap_exact( - dom->xch, dom->guest_domid, allocsz, - 0, 0, &dom->p2m_host[i]); - } + rc = arch_boot_numa_alloc(dom); + if ( rc ) + return rc; /* Ensure no unclaimed pages are left unused. * OK to call if hadn''t done the earlier claim call. */ (void)xc_domain_claim_pages(dom->xch, dom->guest_domid, 0 /* cancels the claim */); } + return rc; +} + +/* + * Any pv guest will have at least one vnuma node + * with vnuma_memszs[0] = domain memory and the rest + * topology initialized with default values. 
+ */ +int arch_boot_numa_alloc(struct xc_dom_image *dom) +{ + int rc; + unsigned int n; + unsigned long long vnode_pages; + unsigned long long allocsz = 0, node_pfn_base, i; + unsigned long memflags; + + rc = allocsz = node_pfn_base = 0; + + allocsz = 0; + for ( n = 0; n < dom->nr_vnodes; n++ ) + { + memflags = 0; + if ( dom->vnode_to_pnode[n] != VNUMA_NO_NODE ) + { + memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]); + memflags |= XENMEMF_exact_node_request; + } + vnode_pages = (dom->vnuma_memszs[n] << 20) >> PAGE_SHIFT_X86; + for ( i = 0; (i < vnode_pages) && !rc; i += allocsz ) + { + allocsz = vnode_pages - i; + if ( allocsz > 1024*1024 ) + allocsz = 1024*1024; + rc = xc_domain_populate_physmap_exact( + dom->xch, dom->guest_domid, allocsz, + 0, memflags, &dom->p2m_host[node_pfn_base + i]); + } + if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed allocation of %Lu pages for vnode %d on pnode %d out of %lu\n", + __FUNCTION__, vnode_pages, n, dom->vnode_to_pnode[n], dom->total_pages); + return rc; + } + node_pfn_base += i; + } return rc; } diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h index 5ff2124..9554b71 100644 --- a/tools/libxc/xg_private.h +++ b/tools/libxc/xg_private.h @@ -127,6 +127,7 @@ typedef uint64_t l4_pgentry_64_t; #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1)) #define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT) +#define VNUMA_NO_NODE ~((unsigned int)0) /* XXX SMH: following skanky macros rely on variable p2m_size being set */ /* XXX TJD: also, "guest_width" should be the guest''s sizeof(unsigned long) */ -- 1.7.10.4
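The allocation loop above walks the vnodes, converts each vnode size from MB to pages, and populates the physmap in chunks of at most 1024*1024 pages, passing XENMEMF_exact_node when the vnode has a pnode assigned. Below is a reduced, standalone model of just the chunking and pfn-base accounting; the populate stub and the sizes are invented, and this is not the real hypercall path.

    /* Reduced model of the per-vnode chunking in arch_boot_numa_alloc(). */
    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT_X86 12
    #define CHUNK          (1024*1024)     /* max pages per populate call */

    static int populate(uint64_t base_pfn, uint64_t count, int pnode)
    {
        /* stand-in for xc_domain_populate_physmap_exact() */
        printf("populate %llu pages at pfn %llu (pnode %d)\n",
               (unsigned long long)count, (unsigned long long)base_pfn, pnode);
        return 0;
    }

    int main(void)
    {
        uint64_t vnuma_memszs[2] = { 5000, 3192 };     /* MB per vnode, invented */
        int vnode_to_pnode[2] = { 0, 1 };
        uint64_t node_pfn_base = 0, i, allocsz;
        unsigned int n;

        for (n = 0; n < 2; n++) {
            uint64_t vnode_pages = (vnuma_memszs[n] << 20) >> PAGE_SHIFT_X86;
            for (i = 0; i < vnode_pages; i += allocsz) {
                allocsz = vnode_pages - i;
                if (allocsz > CHUNK)
                    allocsz = CHUNK;
                populate(node_pfn_base + i, allocsz, vnode_to_pnode[n]);
            }
            node_pfn_base += vnode_pages;   /* next vnode starts after this one's pfns */
        }
        return 0;
    }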
* Checks and sets if vnode to physical node mask was incorrectly defined. If yes and vnuma_placement is set to 1, tries use automatic NUMA placement machanism, otherwise falls to default mask VNUMA_NO_NODE. If user define allocation map can be used based on memory requirements, disables automatic numa placement. * Verifies the correctness of memory ranges pfns for PV guest by requesting the e820 map for that domain, takes into account e820_host config option; * Provides vNUMA topology information to Xen about vNUMA topology and allocation map used for vnodes; Comment on e820 map and memory alignment: When e820_host is not set, then pv guest has fixed e820 map: [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable Means, first 4KB (0x0000 - 0x0fff) and 384K gap between 0xa0000 and 0xfffff will be reserved. Since these pfns will never appear in the pages allocations and the the beginning and end of memory blocks In this case memory ranges for guest are constructed based on sizes of vnodes. In case e820_host is set to 1, the memory holes should be taken into account. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxl/libxl.c | 18 +++++ tools/libxl/libxl.h | 20 ++++++ tools/libxl/libxl_arch.h | 6 ++ tools/libxl/libxl_dom.c | 158 +++++++++++++++++++++++++++++++++++++----- tools/libxl/libxl_internal.h | 6 ++ tools/libxl/libxl_numa.c | 49 +++++++++++++ tools/libxl/libxl_x86.c | 123 ++++++++++++++++++++++++++++++++ 7 files changed, 363 insertions(+), 17 deletions(-) diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index 9b93262..4b67640 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4658,6 +4658,24 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid, return 0; } +int libxl_domain_setvnuma(libxl_ctx *ctx, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode) +{ + int ret; + ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, + nr_vcpus, vmemrange, + vdistance, + vcpu_to_vnode, + vnode_to_pnode); + return ret; +} + int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap) { GC_INIT(ctx); diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index a9663e4..6087ddc 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -281,11 +281,14 @@ #include <netinet/in.h> #include <sys/wait.h> /* for pid_t */ +#include <xen/memory.h> #include <xentoollog.h> #include <libxl_uuid.h> #include <_libxl_list.h> +#include <xen/vnuma.h> + /* API compatibility. */ #ifdef LIBXL_API_VERSION #if LIBXL_API_VERSION != 0x040200 && LIBXL_API_VERSION != 0x040300 && \ @@ -391,6 +394,14 @@ #define LIBXL_EXTERNAL_CALLERS_ONLY /* disappears for callers outside libxl */ #endif +/* + * LIBXL_HAVE_BUILDINFO_VNUMA indicates that vnuma topology will be + * build for the guest upon request and with VM configuration. + * It will try to define best allocation for vNUMA + * nodes on real NUMA nodes. 
+ */ +#define LIBXL_HAVE_BUILDINFO_VNUMA 1 + typedef uint8_t libxl_mac[6]; #define LIBXL_MAC_FMT "%02hhx:%02hhx:%02hhx:%02hhx:%02hhx:%02hhx" #define LIBXL_MAC_FMTLEN ((2*6)+5) /* 6 hex bytes plus 5 colons */ @@ -750,6 +761,15 @@ void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr_vcpus); void libxl_device_vtpm_list_free(libxl_device_vtpm*, int nr_vtpms); void libxl_vtpminfo_list_free(libxl_vtpminfo *, int nr_vtpms); +int libxl_domain_setvnuma(libxl_ctx *ctx, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode); + /* * Devices * ======diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h index aee0a91..9caf0ae 100644 --- a/tools/libxl/libxl_arch.h +++ b/tools/libxl/libxl_arch.h @@ -22,4 +22,10 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, int libxl__arch_domain_configure(libxl__gc *gc, libxl_domain_build_info *info, struct xc_dom_image *dom); + +int libxl__vnuma_align_mem(libxl__gc *gc, + uint32_t domid, + struct libxl_domain_build_info *b_info, + vmemrange_t *memblks); + #endif diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 72489f8..5ff8218 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -23,6 +23,7 @@ #include <xc_dom.h> #include <xen/hvm/hvm_info_table.h> #include <xen/hvm/hvm_xs_strings.h> +#include <libxl_vnuma.h> libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t domid) { @@ -201,6 +202,64 @@ static int numa_place_domain(libxl__gc *gc, uint32_t domid, return rc; } +/* prepares vnode to pnode map for domain vNUMA memory allocation */ +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info) +{ + int i, n, nr_nodes = 0, rc; + uint64_t *mems; + unsigned long long *claim = NULL; + libxl_numainfo *ninfo = NULL; + + rc = ERROR_FAIL; + + /* default setting */ + for (i = 0; i < info->nr_vnodes; i++) + info->vnode_to_pnode[i] = VNUMA_NO_NODE; + + /* Get NUMA info */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL) { + rc = 0; + goto vnmapout; + } + + /* + * We dont try to build vnode_to_pnode map + * if info->cpumap is full what means that + * no nodemap was built. + */ + if (libxl_bitmap_is_full(&info->nodemap)) { + LOG(DETAIL, "No suitable NUMA candidates were found for vnuma.\n"); + rc = 0; + goto vnmapout; + } + mems = info->vnuma_memszs; + /* + * TODO: review the algorithm and imporove algorithm. + * If no p-node found, will be set to NUMA_NO_NODE + */ + claim = libxl__calloc(gc, info->nr_vnodes, sizeof(*claim)); + + libxl_for_each_set_bit(n, info->nodemap) + { + for (i = 0; i < info->nr_vnodes; i++) + { + if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) && + /*vnode was not set yet */ + (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) ) + { + info->vnode_to_pnode[i] = n; + claim[n] += (mems[i] << 20); + } + } + } + + rc = 0; + vnmapout: + return rc; +} + int libxl__build_pre(libxl__gc *gc, uint32_t domid, libxl_domain_config *d_config, libxl__domain_build_state *state) { @@ -214,27 +273,70 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, return ERROR_FAIL; } - /* - * Check if the domain has any CPU affinity. If not, try to build - * up one. In case numa_place_domain() find at least a suitable - * candidate, it will affect info->nodemap accordingly; if it - * does not, it just leaves it as it is. 
This means (unless - * some weird error manifests) the subsequent call to - * libxl_domain_set_nodeaffinity() will do the actual placement, - * whatever that turns out to be. - */ - if (libxl_defbool_val(info->numa_placement)) { + if (info->nr_vnodes > 0) { + /* The memory blocks will be formed here from sizes */ + struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes, + sizeof(*memrange)); - if (!libxl_bitmap_is_full(&info->cpumap)) { - LOG(ERROR, "Can run NUMA placement only if no vcpu " - "affinity is specified"); - return ERROR_INVAL; + if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) { + LOG(DETAIL, "Failed to align memory map.\n"); + return ERROR_FAIL; + } + + /* + * If vNUMA vnode_to_pnode map defined, determine if we + * can disable automatic numa placement and place vnodes + * on specified pnodes. + * For now, if vcpu affinity specified, we will use + * specified vnode to pnode map. + */ + + /* will be used default numa placement? */ + if (libxl_defbool_val(info->vnuma_placement)) { + /* + * Check if the domain has any CPU affinity. If not, try to build + * up one. In case numa_place_domain() find at least a suitable + * candidate, it will affect info->nodemap accordingly; if it + * does not, it just leaves it as it is. This means (unless + * some weird error manifests) the subsequent call to + * libxl_domain_set_nodeaffinity() will do the actual placement, + * whatever that turns out to be. + */ + if (libxl_defbool_val(info->numa_placement)) { + if (!libxl_bitmap_is_full(&info->cpumap)) { + LOG(ERROR, "Can run NUMA placement only if no vcpu " + "affinity is specified"); + return ERROR_INVAL; + } + + rc = numa_place_domain(gc, domid, info); + if (rc) + return rc; + /* init vnodemap to numa automatic placement */ + if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) { + LOG(DETAIL, "Failed to init vnodemap\n"); + /* vnuma_nodemap will not be used if nr_vnodes == 0 */ + return ERROR_FAIL; + } + } + } else { + if (libxl__vnodemap_is_usable(gc, info)) + libxl_defbool_set(&info->numa_placement, false); + else { + LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n"); + return ERROR_FAIL; + } } - rc = numa_place_domain(gc, domid, info); - if (rc) - return rc; + if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, + info->max_vcpus, memrange, + info->vdistance, info->vcpu_to_vnode, + info->vnode_to_pnode) < 0) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "Failed to set vnuma topology for domain from\n."); + return ERROR_FAIL; + } } + libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); @@ -382,6 +484,28 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, } } + if (info->nr_vnodes != 0) { + dom->vnode_to_pnode = (unsigned int *)malloc( + info->nr_vnodes * sizeof(*info->vnode_to_pnode)); + dom->vnuma_memszs = (uint64_t *)malloc( + info->nr_vnodes * sizeof(*info->vnuma_memszs)); + + if ( dom->vnuma_memszs == NULL || dom->vnode_to_pnode == NULL ) { + info->nr_vnodes = 0; + if (dom->vnode_to_pnode) free(dom->vnode_to_pnode); + if (dom->vnuma_memszs) free(dom->vnuma_memszs); + goto out; + } + + memcpy(dom->vnuma_memszs, info->vnuma_memszs, + sizeof(*info->vnuma_memszs) * info->nr_vnodes); + memcpy(dom->vnode_to_pnode, info->vnode_to_pnode, + sizeof(*info->vnode_to_pnode) * info->nr_vnodes); + + dom->nr_vnodes = info->nr_vnodes; + } else + goto out; + dom->flags = flags; dom->console_evtchn = state->console_port; dom->console_domid = state->console_domid; diff --git 
a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index a2d8247..c842763 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2888,6 +2888,10 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc, libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); } +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info); + + /* * Inserts "elm_new" into the sorted list "head". * @@ -2937,6 +2941,8 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc, */ #define CTYPE(isfoo,c) (isfoo((unsigned char)(c))) +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, + libxl_domain_build_info *info); #endif diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c index 20c99ac..68b53d7 100644 --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -19,6 +19,8 @@ #include "libxl_internal.h" +#include "libxl_vnuma.h" + /* * What follows are helpers for generating all the k-combinations * without repetitions of a set S with n elements in it. Formally @@ -500,6 +502,53 @@ int libxl__get_numa_candidate(libxl__gc *gc, } /* + * Checks if vnuma_nodemap defined in info can be used + * for allocation of vnodes on physical NUMA nodes by + * verifying that there is enough memory on corresponding + * NUMA nodes. + */ +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info) +{ + unsigned int i; + libxl_numainfo *ninfo = NULL; + unsigned long long *claim; + unsigned int node; + uint64_t *mems; + int rc, nr_nodes; + + rc = nr_nodes = 0; + + /* + * Cannot use specified mapping if not NUMA machine + */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL) + return rc; + + mems = info->vnuma_memszs; + claim = libxl__calloc(gc, info->nr_vnodes, sizeof(*claim)); + /* Sum memory request on per pnode basis */ + for (i = 0; i < info->nr_vnodes; i++) + { + node = info->vnode_to_pnode[i]; + /* Correct pnode number? */ + if (node < nr_nodes) + claim[node] += (mems[i] << 20); + else + goto vmapu; + } + for (i = 0; i < nr_nodes; i++) { + if (claim[i] > ninfo[i].free) + /* Cannot complete user request, falling to default */ + goto vmapu; + } + rc = 1; + + vmapu: + return rc; +} + +/* * Local variables: * mode: C * c-basic-offset: 4 diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c index e1c183f..35c4d67 100644 --- a/tools/libxl/libxl_x86.c +++ b/tools/libxl/libxl_x86.c @@ -1,5 +1,6 @@ #include "libxl_internal.h" #include "libxl_arch.h" +#include "libxl_vnuma.h" static const char *e820_names(int type) { @@ -317,3 +318,125 @@ int libxl__arch_domain_configure(libxl__gc *gc, { return 0; } + +/* + * Used for PV guest with e802_host enabled and thus + * having non-contiguous e820 memory map. + */ +static unsigned long e820_memory_hole_size(unsigned long start, + unsigned long end, + struct e820entry e820[], + int nr) +{ + int i; + unsigned long absent, start_pfn, end_pfn; + + absent = end - start; + for(i = 0; i < nr; i++) { + /* if not E820_RAM region, skip it and dont substract from absent */ + if(e820[i].type == E820_RAM) { + start_pfn = e820[i].addr; + end_pfn = e820[i].addr + e820[i].size; + /* beginning pfn is in this region? */ + if (start >= start_pfn && start <= end_pfn) { + if (end > end_pfn) + absent -= end_pfn - start; + else + /* fit the region? then no absent pages */ + absent -= end - start; + continue; + } + /* found the end of range in this region? 
*/ + if (end <= end_pfn && end >= start_pfn) { + absent -= end - start_pfn; + /* no need to look for more ranges */ + break; + } + } + } + return absent; +} + +/* + * Checks for the beginnig and end of RAM in e820 map for domain + * and aligns start of first and end of last vNUMA memory block to + * that map. vnode memory size are passed here Megabytes. + * For PV guest e820 map has fixed hole sizes. + */ +int libxl__vnuma_align_mem(libxl__gc *gc, + uint32_t domid, + libxl_domain_build_info *b_info, /* IN: mem sizes */ + vmemrange_t *memblks) /* OUT: linux numa blocks in pfn */ +{ + int i, j, rc; + uint64_t next_start_pfn, end_max = 0, size;//, mem_hole; + uint32_t nr; + struct e820entry map[E820MAX]; + + if (b_info->nr_vnodes == 0) + return -EINVAL; + libxl_ctx *ctx = libxl__gc_owner(gc); + + /* retreive e820 map for this host */ + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); + + if (rc < 0) { + errno = rc; + return -EINVAL; + } + nr = rc; + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, + (b_info->max_memkb - b_info->target_memkb) + + b_info->u.pv.slack_memkb); + if (rc) + { + errno = rc; + return -EINVAL; + } + + /* max pfn for this host */ + for (j = nr - 1; j >= 0; j--) + if (map[j].type == E820_RAM) { + end_max = map[j].addr + map[j].size; + break; + } + + memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes); + next_start_pfn = 0; + + memblks[0].start = map[0].addr; + + for(i = 0; i < b_info->nr_vnodes; i++) { + /* start can be not zero */ + memblks[i].start += next_start_pfn; + memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20); + memblks[i]._reserved = 0; + + size = memblks[i].end - memblks[i].start; + /* + * For pv host with e820_host option turned on we need + * to take into account memory holes. For pv host with + * e820_host disabled or unset, the map is contiguous + * RAM region. + */ + if (libxl_defbool_val(b_info->u.pv.e820_host)) { + while((memblks[i].end - memblks[i].start - + e820_memory_hole_size(memblks[i].start, + memblks[i].end, map, nr)) < size ) + { + memblks[i].end += MIN_VNODE_SIZE << 10; + if (memblks[i].end > end_max) { + memblks[i].end = end_max; + break; + } + } + } + next_start_pfn = memblks[i].end; + LIBXL__LOG(ctx, LIBXL__LOG_DEBUG,"i %d, start = %#010lx, end = %#010lx\n", + i, memblks[i].start, memblks[i].end); + } + if (memblks[i-1].end > end_max) + memblks[i-1].end = end_max; + + return 0; +} -- 1.7.10.4
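To make the memory-alignment logic above easier to follow, here is a small standalone sketch of the idea behind e820_memory_hole_size() and the growth loop in libxl__vnuma_align_mem(). It is purely illustrative: the two-region e820 table, the requested vnode size and the 64 MB growth step are made up, the overlap computation is simplified, and none of the e820_sanitize() handling from the real code is shown.

/* Illustrative sketch only: grow a vnode's [start, end) range until the
 * RAM it actually contains (size minus non-RAM holes) reaches the
 * requested vnode size, as libxl__vnuma_align_mem() does for e820_host=1. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define E820_RAM      1
#define E820_RESERVED 2

struct e820entry { uint64_t addr; uint64_t size; uint32_t type; };

/* Bytes in [start, end) that are NOT covered by E820_RAM regions. */
static uint64_t hole_size(uint64_t start, uint64_t end,
                          const struct e820entry *map, int nr)
{
    uint64_t absent = end - start;
    int i;

    for (i = 0; i < nr; i++) {
        uint64_t s, e, lo, hi;

        if (map[i].type != E820_RAM)
            continue;
        s = map[i].addr;
        e = map[i].addr + map[i].size;
        lo = start > s ? start : s;   /* overlap of [start, end) ...   */
        hi = end < e ? end : e;       /* ... with this RAM region      */
        if (hi > lo)
            absent -= hi - lo;        /* covered by RAM, not a hole    */
    }
    return absent;
}

int main(void)
{
    /* made-up host map: 2 GB RAM, a 256 MB reserved hole, 6 GB more RAM */
    struct e820entry map[] = {
        { 0,                             2ULL << 30,   E820_RAM      },
        { 2ULL << 30,                    256ULL << 20, E820_RESERVED },
        { (2ULL << 30) + (256ULL << 20), 6ULL << 30,   E820_RAM      },
    };
    uint64_t want = 3ULL << 30;   /* this vnode should get 3 GB of RAM */
    uint64_t start = 0, end = start + want;

    /* keep growing until the range really holds 'want' bytes of RAM */
    while (end - start - hole_size(start, end, map, 3) < want)
        end += 64ULL << 20;

    printf("vnode range [%#" PRIx64 ", %#" PRIx64 "), holes inside: %" PRIu64 " MB\n",
           start, end, hole_size(start, end, map, 3) >> 20);
    return 0;
}

The patch itself grows the block in MIN_VNODE_SIZE increments and clamps the final block against the highest E820_RAM address of the host (end_max).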
Prints basic information about the vNUMA topology of vNUMA-enabled domains when the debug-key 'u' is issued.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 xen/arch/x86/numa.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c index b141877..2adc462 100644 --- a/xen/arch/x86/numa.c +++ b/xen/arch/x86/numa.c @@ -347,9 +347,10 @@ EXPORT_SYMBOL(node_data); static void dump_numa(unsigned char key) { s_time_t now = NOW(); - int i; + int i, j, n; struct domain *d; struct page_info *page; + char tmp[6]; unsigned int page_num_node[MAX_NUMNODES]; printk("'%c' pressed -> dumping numa info (now-0x%X:%08X)\n", key, @@ -389,6 +390,33 @@ static void dump_numa(unsigned char key) for_each_online_node(i) printk(" Node %u: %u\n", i, page_num_node[i]); + if (d->vnuma.nr_vnodes > 0) + { + printk(" Domain has %d vnodes\n", d->vnuma.nr_vnodes); + for (i = 0; i < d->vnuma.nr_vnodes; i++) { + n = 0; + snprintf(tmp, 5, "%d", d->vnuma.vnode_to_pnode[i]); + printk(" vnode %d - pnode %s", i, + d->vnuma.vnode_to_pnode[i] >= MAX_NUMNODES ? "any" : tmp); + printk(" %"PRIu64" MB, ", + (d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20); + printk("vcpus: "); + for (j = 0; j < d->max_vcpus; j++) { + if (d->vnuma.vcpu_to_vnode[j] == i) { + if (!((n + 1) % 8)) + printk("%d\n", j); + else { + if ( !(n % 8) && n != 0 ) + printk("%s%d ", " ", j); + else + printk("%d ", j); + } + n++; + } + } + printk("\n"); + } + } } rcu_read_unlock(&domlist_read_lock); -- 1.7.10.4
>>> On 04.12.13 at 06:47, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> --- a/xen/arch/x86/numa.c > +++ b/xen/arch/x86/numa.c > @@ -347,9 +347,10 @@ EXPORT_SYMBOL(node_data); > static void dump_numa(unsigned char key) > { > s_time_t now = NOW(); > - int i; > + int i, j, n; > struct domain *d; > struct page_info *page; > + char tmp[6];

Indentation.

> @@ -389,6 +390,33 @@ static void dump_numa(unsigned char key) > > for_each_online_node(i) > printk(" Node %u: %u\n", i, page_num_node[i]); > + if (d->vnuma.nr_vnodes > 0) > + {

Coding style (using Linux style in this file).

> + printk(" Domain has %d vnodes\n", d->vnuma.nr_vnodes); > + for (i = 0; i < d->vnuma.nr_vnodes; i++) { > + n = 0; > + snprintf(tmp, 5, "%d", d->vnuma.vnode_to_pnode[i]);

vnode_to_pnode[] is an array of "unsigned int", so how come the formatted result is (a) signed and (b) fitting in 5 characters? Apart from that, what you pass here is the array size, not one less than it.

> + printk(" vnode %d - pnode %s", i, > + d->vnuma.vnode_to_pnode[i] >= MAX_NUMNODES ? "any" : tmp);

Indentation again.

> + printk(" %"PRIu64" MB, ", > + (d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20);

Here too.

> + printk("vcpus: "); > + for (j = 0; j < d->max_vcpus; j++) {

Setting n to zero would better be put in this loop header.

> + if (d->vnuma.vcpu_to_vnode[j] == i) { > + if (!((n + 1) % 8)) > + printk("%d\n", j); > + else { > + if ( !(n % 8) && n != 0 )

Please combine these into an "else if", lowering the level of indentation.

> + printk("%s%d ", " ", j); > + else > + printk("%d ", j); > + } > + n++; > + } > + } > + printk("\n"); > + } > + } > } > > rcu_read_unlock(&domlist_read_lock); > -- > 1.7.10.4
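Purely as an illustration of the remarks above (this is not the follow-up patch), the vcpu-listing part of the loop could end up looking roughly like the fragment below once n moves into the loop header, the nested if/else collapses into an "else if", and the pnode is formatted as unsigned into a buffer large enough for any value. The standalone harness and the made-up vnode/vcpu arrays are only there so it compiles and runs.

/* Illustration only: the dump loop reworked along the lines of the review. */
#include <stdio.h>

#define MAX_NUMNODES 64
#define NR_VCPUS     12
#define NR_VNODES    2

int main(void)
{
    /* made-up topology: vcpu j sits on vnode j % 2, vnode 1 has no pnode */
    unsigned int vcpu_to_vnode[NR_VCPUS];
    unsigned int vnode_to_pnode[NR_VNODES] = { 1, MAX_NUMNODES };
    char tmp[12];                      /* large enough for any unsigned int */
    unsigned int i;
    int j, n;

    for (j = 0; j < NR_VCPUS; j++)
        vcpu_to_vnode[j] = j % 2;

    for (i = 0; i < NR_VNODES; i++) {
        snprintf(tmp, sizeof(tmp), "%u", vnode_to_pnode[i]);
        printf("    vnode %u - pnode %s, vcpus: ", i,
               vnode_to_pnode[i] >= MAX_NUMNODES ? "any" : tmp);
        for (j = 0, n = 0; j < NR_VCPUS; j++) {
            if (vcpu_to_vnode[j] != i)
                continue;
            if (!((n + 1) % 8))        /* wrap after 8 vcpus per line  */
                printf("%d\n", j);
            else if (!(n % 8) && n)    /* first vcpu on a wrapped line */
                printf(" %d ", j);
            else
                printf("%d ", j);
            n++;
        }
        printf("\n");
    }
    return 0;
}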
>>> On 04.12.13 at 06:47, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> +/* > + * vNUMA topology specifies vNUMA node > + * number, distance table, memory ranges and > + * vcpu mapping provided for guests. > + */ > + > +struct vnuma_topology_info { > + /* IN */ > + domid_t domid; > + /* OUT */ > + union { > + XEN_GUEST_HANDLE(uint) h; > + uint64_t _pad; > + } nr_vnodes; > + union { > + XEN_GUEST_HANDLE(uint) h; > + uint64_t _pad; > + } nr_vcpus; > + union { > + XEN_GUEST_HANDLE(uint) h; > + uint64_t _pad; > + } vdistance; > + union { > + XEN_GUEST_HANDLE(uint) h; > + uint64_t _pad; > + } vcpu_to_vnode; > + union { > + XEN_GUEST_HANDLE(vmemrange_t) h; > + uint64_t _pad; > + } vmemrange; > +};

As said before - the use of a separate sub-hypercall here is pointlessly complicating things.

Furthermore I fail to see why nr_vnodes and nr_vcpus need to be guest handles - they can be simple integer fields, and _both_ must be inputs to XENMEM_get_vnuma_info (otherwise, if you - as done currently - use d->max_vcpus, there's no guarantee that this value didn't increase between retrieving the count and obtaining the full info).

Once again: The boundaries of _any_ arrays you pass in to hypercalls must be specified by further information passed into the same hypercall, with the sole exception of cases where there is a priori, immutable information on this available through other mechanisms.

Jan
On Wed, Dec 4, 2013 at 6:34 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 04.12.13 at 06:47, Elena Ufimtseva <ufimtseva@gmail.com> wrote: >> +/* >> + * vNUMA topology specifies vNUMA node >> + * number, distance table, memory ranges and >> + * vcpu mapping provided for guests. >> + */ >> + >> +struct vnuma_topology_info { >> + /* IN */ >> + domid_t domid; >> + /* OUT */ >> + union { >> + XEN_GUEST_HANDLE(uint) h; >> + uint64_t _pad; >> + } nr_vnodes; >> + union { >> + XEN_GUEST_HANDLE(uint) h; >> + uint64_t _pad; >> + } nr_vcpus; >> + union { >> + XEN_GUEST_HANDLE(uint) h; >> + uint64_t _pad; >> + } vdistance; >> + union { >> + XEN_GUEST_HANDLE(uint) h; >> + uint64_t _pad; >> + } vcpu_to_vnode; >> + union { >> + XEN_GUEST_HANDLE(vmemrange_t) h; >> + uint64_t _pad; >> + } vmemrange; >> +}; > > As said before - the use of a separate sub-hypercall here is > pointlessly complicating things. > > Furthermore I fail to see why nr_vnodes and nr_vcpus need > to be guest handles - they can be simple integer fields, and > _both_ must be inputs to XENMEM_get_vnuma_info (otherwise, > if you - as done currently - use d->max_vcpus, there's no > guarantee that this value didn't increase between retrieving > the count and obtaining the full info. > > Once again: The boundaries of _any_ arrays you pass in to > hypercalls must be specified by further information passed into > the same hypercall, with the sole exception of cases where > there is a priori, immutable information on this available through > other mechanisms. > > Jan >

Thanks Jan, makes sense. I will change this.

--
Elena
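For reference, the direction asked for in the review could look roughly like the sketch below: nr_vnodes and nr_vcpus become plain integer IN fields that bound the arrays behind the handles, so the hypercall no longer has to trust d->max_vcpus to stay constant between calls. This is only an illustration of the comment, reusing the types from the quoted header (domid_t, XEN_GUEST_HANDLE, vmemrange_t) and glossing over padding and alignment details; it is not the interface that was eventually committed.

/* Illustrative sketch only - the array sizes are inputs bounding every handle. */
struct vnuma_topology_info {
    /* IN */
    domid_t domid;
    /*
     * IN: sizes of the arrays behind the handles below:
     * vmemrange has nr_vnodes entries, vdistance nr_vnodes * nr_vnodes,
     * vcpu_to_vnode nr_vcpus entries.
     */
    uint32_t nr_vnodes;
    uint32_t nr_vcpus;
    /* OUT */
    union {
        XEN_GUEST_HANDLE(uint) h;
        uint64_t _pad;
    } vdistance;
    union {
        XEN_GUEST_HANDLE(uint) h;
        uint64_t _pad;
    } vcpu_to_vnode;
    union {
        XEN_GUEST_HANDLE(vmemrange_t) h;
        uint64_t _pad;
    } vmemrange;
};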