vNUMA introduction

This series of patches introduces vNUMA topology awareness and provides interfaces and data structures to enable vNUMA for PV guests. There is a plan to extend this support to dom0 and HVM domains.

vNUMA topology support must also be present in the PV guest kernel; the corresponding Linux patches should be applied.

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when running workloads on NUMA machines. The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on NUMA/UMA machines and to map the vNUMA topology onto physical NUMA nodes in an optimal way.

XEN vNUMA support

The current set of patches introduces a subop hypercall that is available to enlightened PV guests with the vNUMA patches applied. The domain structure was modified to keep the per-domain vNUMA topology for use by other vNUMA-aware subsystems (e.g. ballooning).

libxc

libxc provides interfaces to build PV guests with vNUMA support and, on NUMA machines, performs the initial memory allocation on physical NUMA nodes. This is implemented by utilizing the nodemap formed by automatic NUMA placement. Details are in patch #3.

libxl

libxl provides a way to predefine the vNUMA topology in the VM config: number of vnodes, memory arrangement, vcpu-to-vnode assignment and the distance map.

PV guest

As of now, only PV guests can take advantage of the vNUMA functionality. The vNUMA Linux patches should be applied and NUMA support should be compiled into the kernel.

This patchset can be pulled from
https://git.gitorious.org/xenvnuma/xenvnuma.git:v3
Linux patchset
https://git.gitorious.org/xenvnuma/linuxvnuma.git:v2

Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

1. Automatic vNUMA placement on a real NUMA machine:

VM config:

memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]

Xen:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN) Node 0: 2097152
(XEN) Node 1: 2097152
(XEN) Domain has 4 vnodes
(XEN) vnode 0 - pnode 0 (4096) MB
(XEN) vnode 1 - pnode 0 (4096) MB
(XEN) vnode 2 - pnode 1 (4096) MB
(XEN) vnode 3 - pnode 1 (4096) MB
(XEN) Domain vcpu to vnode:
(XEN) 0 1 2 3

dmesg on pv guest:

[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xffffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1ffffffff]
[ 0.000000] node 2: [mem 0x200000000-0x2ffffffff]
[ 0.000000] node 3: [mem 0x300000000-0x3ffffffff]
[ 0.000000] On node 0 totalpages: 1048479
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 1044480 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 2 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 3 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched
to apic NOOP [ 0.000000] nr_irqs_gsi: 16 [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff] [ 0.000000] e820: cannot find a gap in the 32bit address range [ 0.000000] e820: PCI devices with unassigned 32bit BARs may break! [ 0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices [ 0.000000] Booting paravirtualized kernel on Xen [ 0.000000] Xen version: 4.4-unstable (preserve-AD) [ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4 [ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152 [ 0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152 [ 0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 pv guest: numactl --hardware: root@heatpipe:~# numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 node 0 size: 4031 MB node 0 free: 3997 MB node 1 cpus: 1 node 1 size: 4039 MB node 1 free: 4022 MB node 2 cpus: 2 node 2 size: 4039 MB node 2 free: 4023 MB node 3 cpus: 3 node 3 size: 3975 MB node 3 free: 3963 MB node distances: node 0 1 2 3 0: 10 20 20 20 1: 20 10 20 20 2: 20 20 10 20 3: 20 20 20 10 Comments: None of the configuration options are correct so default values were used. Since machine is NUMA machine and there is no vcpu pinning defines, NUMA automatic node selection mechanism is used and you can see how vnodes were split across physical nodes. 2. vNUMA enabled guest, no default values, real NUMA machine Config: memory = 4096 vcpus = 4 name = "rc9" vnodes = 2 vnumamem = [2048, 2048] vdistance = [10, 40, 40, 10] vnuma_vcpumap = [1, 0, 1, 0] vnuma_vnodemap = [1, 0] Xen: (XEN) ''u'' pressed -> dumping numa info (now-0xA86:BD6C8829) (XEN) idx0 -> NODE0 start->0 size->4521984 free->131471 (XEN) phys_to_nid(0000000000001000) -> 0 should be 0 (XEN) idx1 -> NODE1 start->4521984 size->4194304 free->341610 (XEN) phys_to_nid(0000000450001000) -> 1 should be 1 (XEN) CPU0 -> NODE0 (XEN) CPU1 -> NODE0 (XEN) CPU2 -> NODE0 (XEN) CPU3 -> NODE0 (XEN) CPU4 -> NODE1 (XEN) CPU5 -> NODE1 (XEN) CPU6 -> NODE1 (XEN) CPU7 -> NODE1 (XEN) Memory location of each domain: (XEN) Domain 0 (total: 2569511): (XEN) Node 0: 1416166 (XEN) Node 1: 1153345 (XEN) Domain 6 (total: 1048576): (XEN) Node 0: 524288 (XEN) Node 1: 524288 (XEN) Domain has 2 vnodes (XEN) vnode 0 - pnode 1 (2048) MB (XEN) vnode 1 - pnode 0 (2048) MB (XEN) Domain vcpu to vnode: (XEN) 1 0 1 0 pv guest dmesg: [ 0.000000] NUMA: Initialized distance table, cnt=2 [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff] [ 0.000000] NODE_DATA [mem 0x7ffd9000-0x7fffffff] [ 0.000000] Initmem setup node 1 [mem 0x80000000-0xffffffff] [ 0.000000] NODE_DATA [mem 0xff7f8000-0xff81efff] [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] [ 0.000000] Normal empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x0009ffff] [ 0.000000] node 0: [mem 0x00100000-0x7fffffff] [ 0.000000] node 1: [mem 0x80000000-0xffffffff] [ 0.000000] On node 0 totalpages: 524191 [ 0.000000] DMA zone: 56 pages used for memmap [ 0.000000] DMA zone: 21 pages reserved [ 0.000000] DMA zone: 3999 pages, LIFO batch:0 [ 0.000000] DMA32 zone: 7112 pages used for memmap [ 0.000000] DMA32 zone: 520192 pages, LIFO batch:31 [ 0.000000] On node 1 totalpages: 524288 [ 0.000000] DMA32 zone: 7168 pages used for memmap [ 0.000000] DMA32 zone: 524288 pages, LIFO batch:31 [ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org [ 0.000000] smpboot: 
Allowing 4 CPUs, 0 hotplug CPUs [ 0.000000] No local APIC present [ 0.000000] APIC: disable apic facility [ 0.000000] APIC: switched to apic NOOP [ 0.000000] nr_irqs_gsi: 16 [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff] [ 0.000000] e820: cannot find a gap in the 32bit address range [ 0.000000] e820: PCI devices with unassigned 32bit BARs may break! [ 0.000000] e820: [mem 0x100100000-0x1004fffff] available for PCI devices [ 0.000000] Booting paravirtualized kernel on Xen [ 0.000000] Xen version: 4.4-unstable (preserve-AD) [ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:2 [ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007fc00000 s85376 r8192 d21120 u1048576 [ 0.000000] pcpu-alloc: s85376 r8192 d21120 u1048576 alloc=1*2097152 [ 0.000000] pcpu-alloc: [0] 0 2 [1] 1 3 pv guest: root@heatpipe:~# numactl --ha available: 2 nodes (0-1) node 0 cpus: 1 3 node 0 size: 2011 MB node 0 free: 1975 MB node 1 cpus: 0 2 node 1 size: 2003 MB node 1 free: 1983 MB node distances: node 0 1 0: 10 40 1: 40 10 root@heatpipe:~# numactl --ha available: 2 nodes (0-1) node 0 cpus: 1 3 node 0 size: 2011 MB node 0 free: 1975 MB node 1 cpus: 0 2 node 1 size: 2003 MB node 1 free: 1983 MB node distances: node 0 1 0: 10 40 1: 40 10 In this case every config option is correct and we have exact vNUMA topology as it in VN config file. Example with e820_host = 1 (32GB real NUMA machines, two nodes). PV config: memory = 18096 vcpus = 6 name = "rc9" vnodes = 4 #vnumamem = [1024, 1024, 1024, 1024] vdistance = [10, 40] vnuma_vcpumap = [0, 1, 1, 3, 3, 2] vnuma_vnodemap = [1, 0] vnuma_autoplacement = 1 e820_host = 1 pv guest boot: (d29) [ 0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_l (d29) oglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug numa_balancing=on (d29) [ 0.000000] ACPI in unprivileged domain disabled (d29) [ 0.000000] Freeing ac228-100000 pfn range: 343512 pages freed (d29) [ 0.000000] Released 343512 pages of unused memory (d29) [ 0.000000] Set 343512 page(s) to 1-1 mapping (d29) [ 0.000000] Populating 46b000-4bedd8 pfn range: 343512 pages added (d29) [ 0.000000] e820: BIOS-provided physical RAM map: (d29) [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable (d29) [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved (d29) [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable (d29) [ 0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b6fff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac6b7000-0x00000000ac7fafff] ACPI NVS (d29) [ 0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data (d29) [ 0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable (d29) [ 0.000000] Xen: 
[mem 0x00000000ac811000-0x00000000ac812fff] ACPI data (d29) [ 0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved (d29) [ 0.000000] Xen: [mem 0x0000000100000000-0x00000004bedd7fff] usable (d29) [ 0.000000] bootconsole [xenboot0] enabled (d29) [ 0.000000] NX (Execute Disable) protection: active (d29) [ 0.000000] DMI not present or invalid. (d29) [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved (d29) [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable (d29) [ 0.000000] No AGP bridge found (d29) [ 0.000000] e820: last_pfn = 0x4bedd8 max_arch_pfn = 0x400000000 (d29) [ 0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000 (d29) [ 0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576 (d29) [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff] (d29) [ 0.000000] [mem 0x00000000-0x000fffff] page 4k (d29) [ 0.000000] init_memory_mapping: [mem 0x4bea00000-0x4bebfffff] (d29) [ 0.000000] [mem 0x4bea00000-0x4bebfffff] page 4k (d29) [ 0.000000] BRK [0x019d8000, 0x019d8fff] PGTABLE (d29) [ 0.000000] BRK [0x019d9000, 0x019d9fff] PGTABLE (d29) [ 0.000000] init_memory_mapping: [mem 0x4bc000000-0x4be9fffff] (d29) [ 0.000000] [mem 0x4bc000000-0x4be9fffff] page 4k (d29) [ 0.000000] BRK [0x019da000, 0x019dafff] PGTABLE (d29) [ 0.000000] BRK [0x019db000, 0x019dbfff] PGTABLE (d29) [ 0.000000] BRK [0x019dc000, 0x019dcfff] PGTABLE (d29) [ 0.000000] BRK [0x019dd000, 0x019ddfff] PGTABLE (d29) [ 0.000000] init_memory_mapping: [mem 0x480000000-0x4bbffffff] (d29) [ 0.000000] [mem 0x480000000-0x4bbffffff] page 4k (d29) [ 0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff] (d29) [ 0.000000] [mem 0x00100000-0xac227fff] page 4k (d29) [ 0.000000] init_memory_mapping: [mem 0x100000000-0x47fffffff] (d29) [ 0.000000] [mem 0x100000000-0x47fffffff] page 4k (d29) [ 0.000000] init_memory_mapping: [mem 0x4bec00000-0x4bedd7fff] (d29) [ 0.000000] [mem 0x4bec00000-0x4bedd7fff] page 4k (d29) [ 0.000000] RAMDISK: [mem 0x01dc9000-0x03470fff] (d29) [ 0.000000] NUMA: Initialized distance table, cnt=4 (d29) [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x16e9d7fff] (d29) [ 0.000000] NODE_DATA [mem 0x16e9b1000-0x16e9d7fff] (d29) [ 0.000000] Initmem setup node 1 [mem 0x16f000000-0x2895d7fff] (d29) [ 0.000000] NODE_DATA [mem 0x2895b1000-0x2895d7fff] (d29) [ 0.000000] Initmem setup node 2 [mem 0x289800000-0x3a41d7fff] (d29) [ 0.000000] NODE_DATA [mem 0x3a41b1000-0x3a41d7fff] (d29) [ 0.000000] Initmem setup node 3 [mem 0x3a4800000-0x4bedd7fff] (d29) [ 0.000000] NODE_DATA [mem 0x4bedad000-0x4bedd3fff] (d29) [ 0.000000] Zone ranges: (d29) [ 0.000000] DMA [mem 0x00001000-0x00ffffff] (d29) [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] (d29) [ 0.000000] Normal [mem 0x100000000-0x4bedd7fff] (d29) [ 0.000000] Movable zone start for each node (d29) [ 0.000000] Early memory node ranges (d29) [ 0.000000] node 0: [mem 0x00001000-0x0009ffff] (d29) [ 0.000000] node 0: [mem 0x00100000-0xac227fff] (d29) [ 0.000000] node 0: [mem 0x100000000-0x16e9d7fff] (d29) [ 0.000000] node 1: [mem 0x16e9d8000-0x2895d7fff] (d29) [ 0.000000] node 2: [mem 0x2895d8000-0x3a41d7fff] (d29) [ 0.000000] node 3: [mem 
0x3a41d8000-0x4bedd7fff] (d29) [ 0.000000] On node 0 totalpages: 1158047 (d29) [ 0.000000] DMA zone: 56 pages used for memmap (d29) [ 0.000000] DMA zone: 21 pages reserved (d29) [ 0.000000] DMA zone: 3999 pages, LIFO batch:0 (d29) [ 0.000000] DMA32 zone: 9584 pages used for memmap (d29) [ 0.000000] DMA32 zone: 700968 pages, LIFO batch:31 (d29) [ 0.000000] Normal zone: 6195 pages used for memmap (d29) [ 0.000000] Normal zone: 453080 pages, LIFO batch:31 (d29) [ 0.000000] On node 1 totalpages: 1158144 (d29) [ 0.000000] Normal zone: 15834 pages used for memmap (d29) [ 0.000000] Normal zone: 1158144 pages, LIFO batch:31 (d29) [ 0.000000] On node 2 totalpages: 1158144 (d29) [ 0.000000] Normal zone: 15834 pages used for memmap (d29) [ 0.000000] Normal zone: 1158144 pages, LIFO batch:31 (d29) [ 0.000000] On node 3 totalpages: 1158144 (d29) [ 0.000000] Normal zone: 15834 pages used for memmap (d29) [ 0.000000] Normal zone: 1158144 pages, LIFO batch:31 (d29) [ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org (d29) [ 0.000000] smpboot: Allowing 6 CPUs, 0 hotplug CPUs (d29) [ 0.000000] nr_irqs_gsi: 16 numactl --ha: root@heatpipe:~# numactl --ha available: 4 nodes (0-3) node 0 cpus: 0 node 0 size: 4451 MB node 0 free: 4417 MB node 1 cpus: 1 2 node 1 size: 4460 MB node 1 free: 4425 MB node 2 cpus: 5 node 2 size: 4462 MB node 2 free: 4444 MB node 3 cpus: 3 4 node 3 size: 4391 MB node 3 free: 4362 MB node distances: node 0 1 2 3 0: 10 40 40 40 1: 40 10 40 40 2: 40 40 10 40 3: 40 40 40 10 Xen: xl debug-keys u (XEN) Memory location of each domain: (XEN) Domain 0 (total: 2656043): (XEN) Node 0: 1453063 (XEN) Node 1: 1202980 (XEN) Domain 29 (total: 4632576): (XEN) Node 0: 2144532 (XEN) Node 1: 2488044 (XEN) Domain has 4 vnodes (XEN) vnode 0 - pnode 0: 5865 MB (XEN) vnode 1 - pnode 0: 4524 MB (XEN) vnode 2 - pnode 1: 4524 MB (XEN) vnode 3 - pnode 1: 4524 MB (XEN) Domain vcpu to vnode: (XEN) 0 1 1 3 3 2 Notes: * to enable vNUMA in linux kernel the corresponding patch set should be applied; * automatic numa balancing featurue seem to be fixed in linux kernel: https://lkml.org/lkml/2013/10/31/133 TODO: * This version limits vdistance config option to only two values - same node distance and other node distance; This prevents oopses on latest (3.13-rc1) linux kernel with non-symmetric distance; * cpu siblings for Linux machine and xen cpu trap should be detected and warning should be given; Add cpuid check if set in VM config; * benchmarking; Elena Ufimtseva (7): xen: vNUMA support for PV guests libxc: Plumb Xen with vNUMA topology for domain. libxc: vnodes allocation on NUMA nodes. 
libxl: vNUMA supporting interface libxl: vNUMA configuration parser xen: adds vNUMA info debug-key u xl: docs for xl config vnuma options docs/man/xl.cfg.pod.5 | 60 +++++++++ tools/libxc/xc_dom.h | 10 ++ tools/libxc/xc_dom_x86.c | 81 +++++++++-- tools/libxc/xc_domain.c | 63 +++++++++ tools/libxc/xenctrl.h | 9 ++ tools/libxc/xg_private.h | 1 + tools/libxl/libxl.c | 20 +++ tools/libxl/libxl.h | 20 +++ tools/libxl/libxl_arch.h | 13 ++ tools/libxl/libxl_dom.c | 210 +++++++++++++++++++++++++++++ tools/libxl/libxl_internal.h | 4 + tools/libxl/libxl_types.idl | 6 +- tools/libxl/libxl_vnuma.h | 12 ++ tools/libxl/libxl_x86.c | 123 +++++++++++++++++ tools/libxl/xl_cmdimpl.c | 306 ++++++++++++++++++++++++++++++++++++++++++ xen/arch/x86/numa.c | 18 +++ xen/common/domain.c | 10 ++ xen/common/domctl.c | 79 +++++++++++ xen/common/memory.c | 50 +++++++ xen/include/public/domctl.h | 29 ++++ xen/include/public/memory.h | 8 ++ xen/include/public/vnuma.h | 56 ++++++++ xen/include/xen/domain.h | 9 ++ xen/include/xen/sched.h | 1 + 24 files changed, 1188 insertions(+), 10 deletions(-) create mode 100644 tools/libxl/libxl_vnuma.h create mode 100644 xen/include/public/vnuma.h -- 1.7.10.4
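For quick reference, below is an illustrative guest config fragment that pulls together the vNUMA options this series parses (the parser is patch 5/7, the xl.cfg documentation is patch 7/7). The values are made up and only show the expected shape of each list; see the worked examples above for real runs.

vnodes = 2                    # number of vNUMA nodes
vnumamem = [2048, 2048]       # per-vnode memory in MB; must sum to "memory"
vdistance = [10, 20]          # minimal form: same-node and other-node distance
vnuma_vcpumap = [0, 0, 1, 1]  # vcpu i runs on vnode vnuma_vcpumap[i]
vnuma_vnodemap = [0, 1]       # vnode i is allocated on pnode vnuma_vnodemap[i]
vnuma_autoplacement = 1       # fall back to automatic placement if the map is unusable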
Defines interface, structures and hypercalls for toolstack to build vnuma topology and for guests that wish to retreive it. Two subop hypercalls introduced by patch: XEN_DOMCTL_setvnumainfo to define vNUMA domain topology per domain and XENMEM_get_vnuma_info to retreive that topology by guest. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> Changes since v2: * same size of struct vnuma_topology_info on 32 and 64 pv guest; Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- xen/common/domain.c | 10 ++++++ xen/common/domctl.c | 79 +++++++++++++++++++++++++++++++++++++++++++ xen/common/memory.c | 50 +++++++++++++++++++++++++++ xen/include/public/domctl.h | 29 ++++++++++++++++ xen/include/public/memory.h | 8 +++++ xen/include/public/vnuma.h | 56 ++++++++++++++++++++++++++++++ xen/include/xen/domain.h | 9 +++++ xen/include/xen/sched.h | 1 + 8 files changed, 242 insertions(+) create mode 100644 xen/include/public/vnuma.h diff --git a/xen/common/domain.c b/xen/common/domain.c index 2cbc489..8f5c665 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -511,6 +511,15 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d) return 0; } +static void vnuma_destroy(struct vnuma_info *vnuma) +{ + vnuma->nr_vnodes = 0; + xfree(vnuma->vmemrange); + xfree(vnuma->vcpu_to_vnode); + xfree(vnuma->vdistance); + xfree(vnuma->vnode_to_pnode); +} + int domain_kill(struct domain *d) { int rc = 0; @@ -531,6 +540,7 @@ int domain_kill(struct domain *d) tmem_destroy(d->tmem); domain_set_outstanding_pages(d, 0); d->tmem = NULL; + vnuma_destroy(&d->vnuma); /* fallthrough */ case DOMDYING_dying: rc = domain_relinquish_resources(d); diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 904d27b..af6ae9b 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -29,6 +29,7 @@ #include <asm/page.h> #include <public/domctl.h> #include <xsm/xsm.h> +#include <public/vnuma.h> static DEFINE_SPINLOCK(domctl_lock); DEFINE_SPINLOCK(vcpu_alloc_lock); @@ -889,6 +890,84 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) } break; + case XEN_DOMCTL_setvnumainfo: + { + unsigned int dist_size, nr_vnodes; + + ret = -EINVAL; + + /* + * If number of vnodes was set before, + * dont initilize it again. 
+ */ + //ASSERT(d->vnuma.nr_vnodes > 0); + if ( d->vnuma.nr_vnodes > 0 ) + break; + + nr_vnodes = op->u.vnuma.nr_vnodes; + if ( nr_vnodes == 0 ) + break; + if ( nr_vnodes > (UINT_MAX / nr_vnodes) ) + break; + + ret = -EFAULT; + if ( guest_handle_is_null(op->u.vnuma.vdistance) || + guest_handle_is_null(op->u.vnuma.vmemrange) || + guest_handle_is_null(op->u.vnuma.vcpu_to_vnode) || + guest_handle_is_null(op->u.vnuma.vnode_to_pnode) ) + goto setvnumainfo_out; + + dist_size = nr_vnodes * nr_vnodes; + + d->vnuma.vdistance = xmalloc_array(unsigned int, dist_size); + d->vnuma.vmemrange = xmalloc_array(vmemrange_t, nr_vnodes); + d->vnuma.vcpu_to_vnode = xmalloc_array(unsigned int, d->max_vcpus); + d->vnuma.vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes); + + if ( d->vnuma.vdistance == NULL || + d->vnuma.vmemrange == NULL || + d->vnuma.vcpu_to_vnode == NULL || + d->vnuma.vnode_to_pnode == NULL ) + { + ret = -ENOMEM; + goto setvnumainfo_out; + } + + printk(KERN_INFO "Ready to copy from guest\n"); + if ( unlikely(copy_from_guest(d->vnuma.vdistance, + op->u.vnuma.vdistance, + dist_size)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vmemrange, + op->u.vnuma.vmemrange, + nr_vnodes)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vcpu_to_vnode, + op->u.vnuma.vcpu_to_vnode, + d->max_vcpus)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vnode_to_pnode, + op->u.vnuma.vnode_to_pnode, + nr_vnodes)) ) + goto setvnumainfo_out; + + /* Everything is good, lets set the number of vnodes */ + d->vnuma.nr_vnodes = nr_vnodes; + + ret = 0; + + setvnumainfo_out: + if ( ret != 0 ) + { + d->vnuma.nr_vnodes = 0; + xfree(d->vnuma.vdistance); + xfree(d->vnuma.vmemrange); + xfree(d->vnuma.vcpu_to_vnode); + xfree(d->vnuma.vnode_to_pnode); + } + } + break; + default: ret = arch_do_domctl(op, d, u_domctl); break; diff --git a/xen/common/memory.c b/xen/common/memory.c index 50b740f..b224c16 100644 --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -28,6 +28,7 @@ #include <public/memory.h> #include <xsm/xsm.h> #include <xen/trace.h> +#include <public/vnuma.h> struct memop_args { /* INPUT */ @@ -733,6 +734,55 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg) break; + case XENMEM_get_vnuma_info: + { + struct vnuma_topology_info mtopology; + struct domain *d; + + if ( copy_from_guest(&mtopology, arg, 1) ) + return -EFAULT; + if ( (d = rcu_lock_domain_by_any_id(mtopology.domid)) == NULL ) + return -ESRCH; + + if ( (d->vnuma.nr_vnodes == 0) || + (d->vnuma.nr_vnodes > d->max_vcpus) ) + { + rc = -EOPNOTSUPP; + goto vnumainfo_out; + } + + rc = -EFAULT; + if ( guest_handle_is_null(mtopology.vmemrange.h) || + guest_handle_is_null(mtopology.vdistance.h) || + guest_handle_is_null(mtopology.vcpu_to_vnode.h)|| + guest_handle_is_null(mtopology.nr_vnodes.h) ) + { + goto vnumainfo_out; + } + + if ( __copy_to_guest(mtopology.vmemrange.h, + d->vnuma.vmemrange, + d->vnuma.nr_vnodes) != 0 ) + goto vnumainfo_out; + if ( __copy_to_guest(mtopology.vdistance.h, + d->vnuma.vdistance, + d->vnuma.nr_vnodes * d->vnuma.nr_vnodes) != 0 ) + goto vnumainfo_out; + if ( __copy_to_guest(mtopology.vcpu_to_vnode.h, + d->vnuma.vcpu_to_vnode, + d->max_vcpus) != 0 ) + goto vnumainfo_out; + + if ( __copy_to_guest(mtopology.nr_vnodes.h, &d->vnuma.nr_vnodes, 1) != 0 ) + goto vnumainfo_out; + + rc = 0; + + vnumainfo_out: + rcu_unlock_domain(d); + break; + } + default: rc = arch_memory_op(op, arg); break; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h 
index 01a3652..e6417d4 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -35,6 +35,7 @@ #include "xen.h" #include "grant_table.h" #include "hvm/save.h" +#include "vnuma.h" #define XEN_DOMCTL_INTERFACE_VERSION 0x00000009 @@ -869,6 +870,32 @@ struct xen_domctl_set_max_evtchn { typedef struct xen_domctl_set_max_evtchn xen_domctl_set_max_evtchn_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_max_evtchn_t); +/* + * XEN_DOMCTL_setvnumainfo: sets the vNUMA topology + * parameters from toolstack. + */ +struct xen_domctl_vnuma { + uint32_t nr_vnodes; + uint32_t __pad; + XEN_GUEST_HANDLE_64(uint) vdistance; + XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode; + /* + * vnodes to physical NUMA nodes mask. + * This will be kept on per-domain basis + * for requests by consumers as vnuma + * aware ballooning. + */ + XEN_GUEST_HANDLE_64(uint) vnode_to_pnode; + /* + * memory rages that vNUMA node can represent + * If more than one, its a linked list. + */ + XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange; +}; + +typedef struct xen_domctl_vnuma xen_domctl_vnuma_t; +DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t); + struct xen_domctl { uint32_t cmd; #define XEN_DOMCTL_createdomain 1 @@ -938,6 +965,7 @@ struct xen_domctl { #define XEN_DOMCTL_setnodeaffinity 68 #define XEN_DOMCTL_getnodeaffinity 69 #define XEN_DOMCTL_set_max_evtchn 70 +#define XEN_DOMCTL_setvnumainfo 71 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 @@ -998,6 +1026,7 @@ struct xen_domctl { struct xen_domctl_set_broken_page_p2m set_broken_page_p2m; struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu; struct xen_domctl_gdbsx_domstatus gdbsx_domstatus; + struct xen_domctl_vnuma vnuma; uint8_t pad[128]; } u; }; diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h index 7a26dee..80bb0d8 100644 --- a/xen/include/public/memory.h +++ b/xen/include/public/memory.h @@ -339,6 +339,14 @@ struct xen_pod_target { }; typedef struct xen_pod_target xen_pod_target_t; +/* + * XENMEM_get_vnuma_info used by caller to retrieve + * vNUMA topology constructed for particular domain. + * + * The data exchanged is presented by vnuma_topology_info. + */ +#define XENMEM_get_vnuma_info 25 + #if defined(__XEN__) || defined(__XEN_TOOLS__) #ifndef uint64_aligned_t diff --git a/xen/include/public/vnuma.h b/xen/include/public/vnuma.h new file mode 100644 index 0000000..9238889 --- /dev/null +++ b/xen/include/public/vnuma.h @@ -0,0 +1,56 @@ +#ifndef _XEN_PUBLIC_VNUMA_H +#define _XEN_PUBLIC_VNUMA_H + +#include "xen.h" + +/* + * Following structures are used to represent vNUMA + * topology to guest if requested. + */ + +/* + * Memory ranges can be used to define + * vNUMA memory node boundaries by the + * linked list. As of now, only one range + * per domain is suported. + */ +struct vmemrange { + uint64_t start, end; + uint64_t __padm; +}; + +typedef struct vmemrange vmemrange_t; +DEFINE_XEN_GUEST_HANDLE(vmemrange_t); + +/* + * vNUMA topology specifies vNUMA node + * number, distance table, memory ranges and + * vcpu mapping provided for guests. 
+ */ + +struct vnuma_topology_info { + /* IN */ + domid_t domid; + uint32_t _pad; + /* OUT */ + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _padn; + } nr_vnodes; + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _padd; + } vdistance; + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _padv; + } vcpu_to_vnode; + union { + XEN_GUEST_HANDLE(vmemrange_t) h; + uint64_t _padm; + } vmemrange; +}; +typedef struct vnuma_topology_info vnuma_topology_info_t; +DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t); + +#endif diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h index a057069..77853e8 100644 --- a/xen/include/xen/domain.h +++ b/xen/include/xen/domain.h @@ -89,4 +89,13 @@ extern unsigned int xen_processor_pmbits; extern bool_t opt_dom0_vcpus_pin; +struct vnuma_info { + unsigned int nr_vnodes; + unsigned int pad; + unsigned int *vdistance; + unsigned int *vcpu_to_vnode; + unsigned int *vnode_to_pnode; + struct vmemrange *vmemrange; +}; + #endif /* __XEN_DOMAIN_H__ */ diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index cbdf377..3765eae 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -417,6 +417,7 @@ struct domain nodemask_t node_affinity; unsigned int last_alloc_node; spinlock_t node_affinity_lock; + struct vnuma_info vnuma; }; struct domain_setup_info -- 1.7.10.4
Elena Ufimtseva
2013-Nov-18 20:24 UTC
[PATCH v3 2/7] libxc: Plumb Xen with vNUMA topology for domain.
Per-domain vNUMA topology initialization. domctl hypercall is used to set vNUMA topology per domU during domain build time. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> Changes since v2: * modified error logic; --- tools/libxc/xc_domain.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 9 +++++++ 2 files changed, 72 insertions(+) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 1ccafc5..198694f 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -1776,6 +1776,69 @@ int xc_domain_set_max_evtchn(xc_interface *xch, uint32_t domid, return do_domctl(xch, &domctl); } +/* Plumbs Xen with vNUMA topology */ +int xc_domain_setvnuma(xc_interface *xch, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode) +{ + int rc; + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) * + nr_vnodes * nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) * + nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + if ( nr_vnodes == 0 ) { + errno = EINVAL; + return -1; + } + + if ( vdistance == NULL || vcpu_to_vnode == NULL || + vmemrange == NULL || vnode_to_pnode == NULL ) { + PERROR("Incorrect parameters for XEN_DOMCTL_setvnumainfo.\n"); + errno = EINVAL; + return -1; + } + + if ( xc_hypercall_bounce_pre(xch, vmemrange) || + xc_hypercall_bounce_pre(xch, vdistance) || + xc_hypercall_bounce_pre(xch, vcpu_to_vnode) || + xc_hypercall_bounce_pre(xch, vnode_to_pnode) ) { + PERROR("Could not bounce buffer for xc_domain_setvnuma.\n"); + errno = EFAULT; + return -1; + } + + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); + + domctl.cmd = XEN_DOMCTL_setvnumainfo; + domctl.domain = (domid_t)domid; + domctl.u.vnuma.nr_vnodes = nr_vnodes; + domctl.u.vnuma.__pad = 0; + + rc = do_domctl(xch, &domctl); + + xc_hypercall_bounce_post(xch, vmemrange); + xc_hypercall_bounce_post(xch, vdistance); + xc_hypercall_bounce_post(xch, vcpu_to_vnode); + xc_hypercall_bounce_post(xch, vnode_to_pnode); + + return rc; +} + /* * Local variables: * mode: C diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 4ac6b8a..f360726 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -1136,6 +1136,15 @@ int xc_domain_set_memmap_limit(xc_interface *xch, uint32_t domid, unsigned long map_limitkb); +int xc_domain_setvnuma(xc_interface *xch, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode); + #if defined(__i386__) || defined(__x86_64__) /* * PC BIOS standard E820 types and structure. -- 1.7.10.4
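To illustrate how a toolstack caller might use the new libxc interface, here is a minimal hypothetical example for a 2-vnode, 4-vcpu, 4 GB guest; the real caller is libxl, added later in this series (patch 4/7), and all topology values below are made up.

/* Hypothetical caller of xc_domain_setvnuma(); values are illustrative. */
#include <xenctrl.h>
#include <xen/vnuma.h>    /* vmemrange_t */

static int example_set_vnuma(xc_interface *xch, uint32_t domid)
{
    vmemrange_t vmemrange[2] = {
        { .start = 0,          .end = 2ULL << 30 },  /* vnode 0: first 2 GB  */
        { .start = 2ULL << 30, .end = 4ULL << 30 },  /* vnode 1: second 2 GB */
    };
    unsigned int vdistance[2 * 2]  = { 10, 20, 20, 10 };  /* 2x2 matrix    */
    unsigned int vcpu_to_vnode[4]  = { 0, 0, 1, 1 };      /* vcpu -> vnode */
    unsigned int vnode_to_pnode[2] = { 0, 1 };  /* pnodes must exist on host */

    return xc_domain_setvnuma(xch, domid, 2 /* nr_vnodes */, 4 /* nr_vcpus */,
                              vmemrange, vdistance,
                              vcpu_to_vnode, vnode_to_pnode);
}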
Elena Ufimtseva
2013-Nov-18 20:24 UTC
[PATCH v3 3/7] libxc: vnodes allocation on NUMA nodes.
If vnuma topology is defined and its a hardware NUMA machine, allocate vnodes on physical numa nodes based on vnode_to_pnode map. TODO: * Unify allocation for domu with vnuma and without vnuma topology initialized; Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxc/xc_dom.h | 10 ++++++ tools/libxc/xc_dom_x86.c | 81 ++++++++++++++++++++++++++++++++++++++++------ tools/libxc/xg_private.h | 1 + 3 files changed, 83 insertions(+), 9 deletions(-) diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h index 90679da..c8fa243 100644 --- a/tools/libxc/xc_dom.h +++ b/tools/libxc/xc_dom.h @@ -111,6 +111,15 @@ struct xc_dom_image { struct xc_dom_phys *phys_pages; int realmodearea_log; + /* + * vNUMA topology and memory allocation structure. + * Defines the way to allocate memory on per NUMA + * physical nodes that is defined by vnode_to_pnode. + */ + uint32_t nr_vnodes; + uint64_t *vnuma_memszs; + unsigned int *vnode_to_pnode; + /* malloc memory pool */ struct xc_dom_mem *memblocks; @@ -360,6 +369,7 @@ static inline xen_pfn_t xc_dom_p2m_guest(struct xc_dom_image *dom, int arch_setup_meminit(struct xc_dom_image *dom); int arch_setup_bootearly(struct xc_dom_image *dom); int arch_setup_bootlate(struct xc_dom_image *dom); +int arch_boot_numa_alloc(struct xc_dom_image *dom); /* * Local variables: diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c index e034d62..1839360 100644 --- a/tools/libxc/xc_dom_x86.c +++ b/tools/libxc/xc_dom_x86.c @@ -802,27 +802,47 @@ int arch_setup_meminit(struct xc_dom_image *dom) else { /* try to claim pages for early warning of insufficient memory avail */ + rc = 0; if ( dom->claim_enabled ) { rc = xc_domain_claim_pages(dom->xch, dom->guest_domid, dom->total_pages); if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed to claim mem for dom\n", + __FUNCTION__); return rc; + } } /* setup initial p2m */ for ( pfn = 0; pfn < dom->total_pages; pfn++ ) dom->p2m_host[pfn] = pfn; /* allocate guest memory */ - for ( i = rc = allocsz = 0; - (i < dom->total_pages) && !rc; - i += allocsz ) + if ( dom->nr_vnodes > 0 ) { - allocsz = dom->total_pages - i; - if ( allocsz > 1024*1024 ) - allocsz = 1024*1024; - rc = xc_domain_populate_physmap_exact( - dom->xch, dom->guest_domid, allocsz, - 0, 0, &dom->p2m_host[i]); + rc = arch_boot_numa_alloc(dom); + if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed to allocate memory on NUMA nodes\n", + __FUNCTION__); + return rc; + } + } + else + { + for ( i = rc = allocsz = 0; + (i < dom->total_pages) && !rc; + i += allocsz ) + { + allocsz = dom->total_pages - i; + if ( allocsz > 1024*1024 ) + allocsz = 1024*1024; + rc = xc_domain_populate_physmap_exact( + dom->xch, dom->guest_domid, allocsz, + 0, 0, &dom->p2m_host[i]); + } } /* Ensure no unclaimed pages are left unused. 
@@ -830,7 +850,50 @@ int arch_setup_meminit(struct xc_dom_image *dom) (void)xc_domain_claim_pages(dom->xch, dom->guest_domid, 0 /* cancels the claim */); } + return rc; +} + +int arch_boot_numa_alloc(struct xc_dom_image *dom) +{ + int rc; + unsigned int n; + unsigned long long guest_pages; + unsigned long long allocsz = 0, node_pfn_base, i; + unsigned long memflags; + + rc = allocsz = node_pfn_base = 0; + allocsz = 0; + for ( n = 0; n < dom->nr_vnodes; n++ ) + { + memflags = 0; + if ( dom->vnode_to_pnode[n] != VNUMA_NO_NODE ) + { + memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]); + memflags |= XENMEMF_exact_node_request; + } + guest_pages = (dom->vnuma_memszs[n] << 20) >> PAGE_SHIFT_X86; + for ( i = 0; + (i < guest_pages) && !rc; + i += allocsz ) + { + allocsz = guest_pages - i; + if ( allocsz > 1024*1024 ) + allocsz = 1024*1024; + rc = xc_domain_populate_physmap_exact( + dom->xch, dom->guest_domid, allocsz, + 0, memflags, &dom->p2m_host[node_pfn_base + i]); + } + if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed allocation of %Lu pages for vnode %d on pnode %d out of %lu\n", + __FUNCTION__, guest_pages, n, dom->vnode_to_pnode[n], dom->total_pages); + return rc; + } + node_pfn_base += i; + } + return rc; } diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h index 5ff2124..9554b71 100644 --- a/tools/libxc/xg_private.h +++ b/tools/libxc/xg_private.h @@ -127,6 +127,7 @@ typedef uint64_t l4_pgentry_64_t; #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1)) #define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT) +#define VNUMA_NO_NODE ~((unsigned int)0) /* XXX SMH: following skanky macros rely on variable p2m_size being set */ /* XXX TJD: also, "guest_width" should be the guest''s sizeof(unsigned long) */ -- 1.7.10.4
* Checks and sets if incorrect user defined map for physical NUMA nodes allocation. If fails, tries use automatic NUMA placement machanism, otherwise falls to default, not bound to any nodes, allocation. If user define allocation map can be used, disables automatic numa placement. * Verifies the correctness of memory ranges pfns for PV guest by requesting the e820 map for that domain; * Provides vNUMA topology information to Xen about vNUMA topology and allocation map used for vnodes; Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> Changes since v2: * Added vnuma memory pfn alignment which takes into account e820_host option and non-contiguous e820 memory map in that case; --- tools/libxl/libxl.c | 20 ++++ tools/libxl/libxl.h | 20 ++++ tools/libxl/libxl_arch.h | 13 +++ tools/libxl/libxl_dom.c | 210 ++++++++++++++++++++++++++++++++++++++++++ tools/libxl/libxl_internal.h | 4 + tools/libxl/libxl_types.idl | 6 +- tools/libxl/libxl_vnuma.h | 12 +++ tools/libxl/libxl_x86.c | 123 +++++++++++++++++++++++++ 8 files changed, 407 insertions(+), 1 deletion(-) create mode 100644 tools/libxl/libxl_vnuma.h diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index 0de1112..2bd3653 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4324,6 +4324,26 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid, return 0; } +int libxl_domain_setvnuma(libxl_ctx *ctx, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode) +{ + GC_INIT(ctx); + int ret; + ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, + nr_vcpus, vmemrange, + vdistance, + vcpu_to_vnode, + vnode_to_pnode); + GC_FREE; + return ret; +} + int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap) { GC_INIT(ctx); diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index c7dceda..58aed8f 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -281,11 +281,14 @@ #include <netinet/in.h> #include <sys/wait.h> /* for pid_t */ +#include <xen/memory.h> #include <xentoollog.h> #include <libxl_uuid.h> #include <_libxl_list.h> +#include <xen/vnuma.h> + /* API compatibility. */ #ifdef LIBXL_API_VERSION #if LIBXL_API_VERSION != 0x040200 && LIBXL_API_VERSION != 0x040300 && \ @@ -382,6 +385,14 @@ #define LIBXL_EXTERNAL_CALLERS_ONLY /* disappears for callers outside libxl */ #endif +/* + * LIBXL_HAVE_BUILDINFO_VNUMA indicates that vnuma topology will be + * build for the guest upon request and with VM configuration. + * It will try to define best allocation for vNUMA + * nodes on real NUMA nodes. 
+ */ +#define LIBXL_HAVE_BUILDINFO_VNUMA 1 + typedef uint8_t libxl_mac[6]; #define LIBXL_MAC_FMT "%02hhx:%02hhx:%02hhx:%02hhx:%02hhx:%02hhx" #define LIBXL_MAC_FMTLEN ((2*6)+5) /* 6 hex bytes plus 5 colons */ @@ -741,6 +752,15 @@ void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr_vcpus); void libxl_device_vtpm_list_free(libxl_device_vtpm*, int nr_vtpms); void libxl_vtpminfo_list_free(libxl_vtpminfo *, int nr_vtpms); +int libxl_domain_setvnuma(libxl_ctx *ctx, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode); + /* * Devices * ======diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h index abe6685..b95abab 100644 --- a/tools/libxl/libxl_arch.h +++ b/tools/libxl/libxl_arch.h @@ -19,4 +19,17 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, uint32_t domid); +int libxl__vnuma_align_mem(libxl__gc *gc, + uint32_t domid, + struct libxl_domain_build_info *b_info, + vmemrange_t *memblks); + +unsigned long e820_memory_hole_size(unsigned long start, + unsigned long end, + struct e820entry e820[], + int nr); + +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, + libxl_domain_build_info *info); + #endif diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index a1c16b0..378249e 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -23,6 +23,7 @@ #include <xc_dom.h> #include <xen/hvm/hvm_info_table.h> #include <xen/hvm/hvm_xs_strings.h> +#include <libxl_vnuma.h> libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t domid) { @@ -201,6 +202,88 @@ static int numa_place_domain(libxl__gc *gc, uint32_t domid, return rc; } +/* prepares vnode to pnode map for domain vNUMA memory allocation */ +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info) +{ + int i, n, nr_nodes, rc; + uint64_t *mems; + unsigned long long *claim = NULL; + libxl_numainfo *ninfo = NULL; + + rc = -1; + if (info->vnode_to_pnode == NULL) { + info->vnode_to_pnode = calloc(info->nr_vnodes, + sizeof(*info->vnode_to_pnode)); + if (info->vnode_to_pnode == NULL) + return rc; + } + + /* default setting */ + for (i = 0; i < info->nr_vnodes; i++) + info->vnode_to_pnode[i] = VNUMA_NO_NODE; + + /* Get NUMA info */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL || nr_nodes == 0) { + LOG(DEBUG, "No HW NUMA found.\n"); + rc = 0; + goto vnmapout; + } + + /* + * check if we have any hardware NUMA nodes selected, + * otherwise VNUMA_NO_NODE set and used default allocation + */ + if (libxl_bitmap_is_empty(&info->nodemap)) + return 0; + mems = info->vnuma_memszs; + + /* check if all vnodes will fit in one node */ + libxl_for_each_set_bit(n, info->nodemap) { + if(ninfo[n].free/1024 >= info->max_memkb && + libxl_bitmap_test(&info->nodemap, n)) + { + /* + * all domain v-nodes will fit one p-node n, + * p-node n is a best candidate selected by automatic + * NUMA placement. + */ + for (i = 0; i < info->nr_vnodes; i++) + info->vnode_to_pnode[i] = n; + /* we can exit, as we are happy with placement */ + return 0; + } + } + /* Determine the best nodes to fit vNUMA nodes */ + /* TODO: change algorithm. 
The current just fits the nodes + * Will be nice to have them also sorted by size + * If no p-node found, will be set to NUMA_NO_NODE + */ + claim = calloc(info->nr_vnodes, sizeof(*claim)); + if (claim == NULL) + return rc; + + libxl_for_each_set_bit(n, info->nodemap) + { + for (i = 0; i < info->nr_vnodes; i++) + { + if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) && + /*vnode was not set yet */ + (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) ) + { + info->vnode_to_pnode[i] = n; + claim[n] += (mems[i] << 20); + } + } + } + + rc = 0; + vnmapout: + if (claim) free(claim); + return rc; +} + int libxl__build_pre(libxl__gc *gc, uint32_t domid, libxl_domain_config *d_config, libxl__domain_build_state *state) { @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, if (rc) return rc; } + + if (info->nr_vnodes > 0) { + /* The memory blocks will be formed here from sizes */ + struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes, + sizeof(*memrange)); + if (memrange == NULL) { + LOG(DETAIL, "Failed to allocate memory for memory ranges.\n"); + return ERROR_FAIL; + } + + if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) { + LOG(DETAIL, "Failed to align memory map.\n"); + return ERROR_FAIL; + } + + /* + * If vNUMA vnode_to_pnode map defined, determine if we + * can disable automatic numa placement and place vnodes + * on specified pnodes. + * For now, if vcpu affinity specified, we will use + * specified vnode to pnode map. + */ + if (libxl__vnodemap_is_usable(gc, info) == 1) { + + /* Will use user-defined vnode to pnode mask */ + + libxl_defbool_set(&info->numa_placement, false); + } + else { + LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n"); + if (libxl_defbool_val(info->vnuma_placement)) { + + LOG(DETAIL, "Switching to automatic vnuma to pnuma placement\n."); + /* Construct the vnode to pnode mapping if possible */ + if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) { + LOG(DETAIL, "Failed to call init_vnodemap\n"); + /* vnuma_nodemap will not be used if nr_vnodes == 0 */ + return ERROR_FAIL; + } + } + else { + LOG(ERROR, "The vnodes cannot be mapped to pnodes this way\n."); + info->nr_vnodes = 0; + return ERROR_FAIL; + } + } + /* plumb domain with vNUMA topology */ + if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, + info->max_vcpus, memrange, + info->vdistance, info->vcpu_to_vnode, + info->vnode_to_pnode) < 0) { + + LOG(DETAIL, "Failed to set vnuma topology for domain from\n."); + info->nr_vnodes = 0; + return ERROR_FAIL; + } + } + else + LOG(DEBUG, "Will not construct vNUMA topology.\n"); + libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); @@ -256,6 +399,56 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, return rc; } +/* Checks if vnuma_nodemap defined in info can be used + * for allocation of vnodes on physical NUMA nodes by + * verifying that there is enough memory on corresponding + * NUMA nodes. 
+ */ +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info) +{ + unsigned int i; + libxl_numainfo *ninfo = NULL; + unsigned long long *claim; + unsigned int node; + uint64_t *mems; + int rc, nr_nodes; + + rc = nr_nodes = 0; + if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL) + return rc; + /* + * Cannot use specified mapping if not NUMA machine + */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL) { + return rc; + } + mems = info->vnuma_memszs; + claim = calloc(info->nr_vnodes, sizeof(*claim)); + if (claim == NULL) + return rc; + /* Sum memory request on per pnode basis */ + for (i = 0; i < info->nr_vnodes; i++) + { + node = info->vnode_to_pnode[i]; + /* Correct pnode number? */ + if (node < nr_nodes) + claim[node] += (mems[i] << 20); + else + goto vmapu; + } + for (i = 0; i < nr_nodes; i++) { + if (claim[i] > ninfo[i].free) + /* Cannot complete user request, falling to default */ + goto vmapu; + } + rc = 1; + + vmapu: + if(claim) free(claim); + return rc; +} + int libxl__build_post(libxl__gc *gc, uint32_t domid, libxl_domain_build_info *info, libxl__domain_build_state *state, @@ -381,7 +574,24 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, } } } + if ( info->nr_vnodes > 0 ) { + dom->vnode_to_pnode = (unsigned int *)malloc( + info->nr_vnodes * sizeof(*info->vnode_to_pnode)); + dom->vnuma_memszs = (uint64_t *)malloc( + info->nr_vnodes * sizeof(*info->vnuma_memszs)); + if ( dom->vnuma_memszs == NULL || dom->vnode_to_pnode == NULL ) { + info->nr_vnodes = 0; + if (dom->vnode_to_pnode) free(dom->vnode_to_pnode); + if (dom->vnuma_memszs) free(dom->vnuma_memszs); + goto out; + } + memcpy(dom->vnuma_memszs, info->vnuma_memszs, + sizeof(*info->vnuma_memszs) * info->nr_vnodes); + memcpy(dom->vnode_to_pnode, info->vnode_to_pnode, + sizeof(*info->vnode_to_pnode) * info->nr_vnodes); + dom->nr_vnodes = info->nr_vnodes; + } dom->flags = flags; dom->console_evtchn = state->console_port; dom->console_domid = state->console_domid; diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 23ff265..0df742b 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2882,6 +2882,10 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc, libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); } +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info); + + /* * Inserts "elm_new" into the sorted list "head". 
* diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index de5bac3..86ad14c 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -310,7 +310,11 @@ libxl_domain_build_info = Struct("domain_build_info",[ ("disable_migrate", libxl_defbool), ("cpuid", libxl_cpuid_policy_list), ("blkdev_start", string), - + ("vnuma_memszs", Array(uint64, "nr_vnodes")), + ("vcpu_to_vnode", Array(uint32, "nr_vnodemap")), + ("vdistance", Array(uint32, "nr_vdist")), + ("vnode_to_pnode", Array(uint32, "nr_vnode_to_pnode")), + ("vnuma_placement", libxl_defbool), ("device_model_version", libxl_device_model_version), ("device_model_stubdomain", libxl_defbool), # if you set device_model you must set device_model_version too diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h new file mode 100644 index 0000000..6034cd7 --- /dev/null +++ b/tools/libxl/libxl_vnuma.h @@ -0,0 +1,12 @@ +#include "libxl_osdeps.h" /* must come before any other headers */ + +#define VNUMA_NO_NODE ~((unsigned int)0) + +/* + * Max vNUMA node size in Mb is taken 64Mb even now Linux lets + * 32Mb, thus letting some slack. Will be modified to match Linux. + */ +#define MIN_VNODE_SIZE 64U + +#define MAX_VNUMA_NODES (unsigned int)1 << 10 + diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c index 87a8110..206f5be 100644 --- a/tools/libxl/libxl_x86.c +++ b/tools/libxl/libxl_x86.c @@ -310,3 +310,126 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, return ret; } + +/* + * Checks for the beginnig and end of RAM in e820 map for domain + * and aligns start of first and end of last vNUMA memory block to + * that map. vnode memory size are passed here Megabytes. + * For PV guest e820 map has fixed hole sizes. + */ +int libxl__vnuma_align_mem(libxl__gc *gc, + uint32_t domid, + libxl_domain_build_info *b_info, /* IN: mem sizes */ + vmemrange_t *memblks) /* OUT: linux numa blocks in pfn */ +{ + int i, j, rc; + uint64_t next_start_pfn, end_max = 0, size, mem_hole; + uint32_t nr; + struct e820entry map[E820MAX]; + + if (b_info->nr_vnodes == 0) + return -EINVAL; + libxl_ctx *ctx = libxl__gc_owner(gc); + + /* retreive e820 map for this host */ + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); + + if (rc < 0) { + errno = rc; + return -EINVAL; + } + nr = rc; + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, + (b_info->max_memkb - b_info->target_memkb) + + b_info->u.pv.slack_memkb); + if (rc) + { + errno = rc; + return -EINVAL; + } + + /* max pfn for this host */ + for (j = nr - 1; j >= 0; j--) + if (map[j].type == E820_RAM) { + end_max = map[j].addr + map[j].size; + break; + } + + memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes); + next_start_pfn = 0; + + memblks[0].start = map[0].addr; + + for(i = 0; i < b_info->nr_vnodes; i++) { + /* start can be not zero */ + memblks[i].start += next_start_pfn; + memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20); + + size = memblks[i].end - memblks[i].start; + /* + * For pv host with e820_host option turned on we need + * to rake into account memory holes. For pv host with + * e820_host disabled or unset, the map is contiguous + * RAM region. 
+ */ + if (libxl_defbool_val(b_info->u.pv.e820_host)) { + while (mem_hole = e820_memory_hole_size(memblks[i].start, + memblks[i].end, map, nr), + memblks[i].end - memblks[i].start - mem_hole < size) + { + memblks[i].end += mem_hole; + + if (memblks[i].end > end_max) { + memblks[i].end = end_max; + break; + } + } + } + next_start_pfn = memblks[i].end; + } + + if (memblks[i-1].end > end_max) + memblks[i-1].end = end_max; + + return 0; +} + +/* + * Used for PV guest with e802_host enabled and thus + * having non-contiguous e820 memory map. + */ +unsigned long e820_memory_hole_size(unsigned long start, + unsigned long end, + struct e820entry e820[], + int nr) +{ + int i; + unsigned long absent, start_pfn, end_pfn; + + absent = end - start; + for(i = 0; i < nr; i++) { + /* if not E820_RAM region, skip it and dont substract from absent */ + if(e820[i].type == E820_RAM) { + start_pfn = e820[i].addr; + end_pfn = e820[i].addr + e820[i].size; + /* beginning pfn is in this region? */ + if (start >= start_pfn && start <= end_pfn) { + if (end > end_pfn) + absent -= end_pfn - start; + else + /* fit the region? then no absent pages */ + absent -= end - start; + continue; + } + /* found the end of range in this region? */ + if (end <= end_pfn && end >= start_pfn) { + absent -= end - start_pfn; + /* no need to look for more ranges */ + break; + } + } + } + return absent; +} + + -- 1.7.10.4
Parses guest vNUMA related options, verifies and sets default values if needed. all possible config options are: vnodes: number of vnuma nodes: vnodes = 2 vnumamem: vnodes memory in MBytes: vnumamem = [2048, 2048] vdistance: distance table or part of it. Can use minimal form with same-node distance first and second - all other nodes distances: vdistance = [10, 20] will expand to square symmetrical matrix of any n x n: [10, 20, 20] [20, 10, 20] [20, 20, 10] If not defined or format is incorrect, default [10, 20] will be used. Latest kernel (3.13-rc1) oopses on non-symmetrical distance table and next version will introduce extended distance table in config. vnuma_vcpumap: vcpu to vnode mapping. The index is a vcpu number starting from 0, the value is the node number, starting from zero. vnuma_vcpumap =[1, 0, 1, 0] node 1 - vcpus 0 and 2 node 0 - vcpus 1 and 3 vnuma_vnodemap - vnode tp physical node mapping. Suggested mask will be verified and if the requested mapping cannot be completed (no NUMA hardware, not correct mapping, not enough memory), will try to set this with help of automatic numa placement if vnuma_autoplacement is enabled. Otherwise fails. vnuma_vnodemap = [1, 0] vnode0 -> pnode1 vnode1 -> pnode0 vnuma_autoplacement - enables automatic vnuma placement mechanism if vnode_to_pnode mask is not defined or vnodes cannot be placed on pnodes specified in the mask. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> Changes since v2: * Added option vnuma_autoplacement; * Changed to fail when incorrect values are set instead of faulting to default ones; Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxl/xl_cmdimpl.c | 306 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 306 insertions(+) diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 8690ec7..d7aa0b3 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -40,6 +40,7 @@ #include "libxl_json.h" #include "libxlutil.h" #include "xl.h" +#include "libxl_vnuma.h" #define CHK_ERRNO( call ) ({ \ int chk_errno = (call); \ @@ -572,6 +573,75 @@ vcpp_out: return rc; } +static void vnuma_info_release(libxl_domain_build_info *info) +{ + if ( info->vnuma_memszs ) free(info->vnuma_memszs); + if ( info->vdistance ) free(info->vdistance); + if ( info->vcpu_to_vnode ) free(info->vcpu_to_vnode); + if ( info->vnode_to_pnode ) free(info->vnode_to_pnode); + info->nr_vnodes = 0; +} + +static int get_list_item_uint(XLU_ConfigList *list, unsigned int i) +{ + const char *buf; + char *ep; + unsigned long ul; + int rc = -EINVAL; + buf = xlu_cfg_get_listitem(list, i); + if (!buf) + return rc; + ul = strtoul(buf, &ep, 10); + if (ep == buf) + return rc; + if (ul >= UINT16_MAX) + return rc; + return (int)ul; +} + +static void vdistance_default(unsigned int *vdistance, + unsigned int nr_vnodes, + unsigned int samenode, + unsigned int othernode) +{ + int i, j; + for (i = 0; i < nr_vnodes; i++) + for (j = 0; j < nr_vnodes; j++) + *(vdistance + j * nr_vnodes + i) = i == j ? 
samenode : othernode; +} + +static void vcputovnode_default(unsigned int *vcpu_to_vnode, + unsigned int nr_vnodes, + unsigned int max_vcpus) +{ + int i; + if (vcpu_to_vnode == NULL) + return; + for(i = 0; i < max_vcpus; i++) + vcpu_to_vnode[i] = i % nr_vnodes; +} + +/* Split domain memory between vNUMA nodes equally */ +static int split_vnumamem(libxl_domain_build_info *b_info) +{ + unsigned long long vnodemem = 0; + unsigned long n; + unsigned int i; + + /* In MBytes */ + vnodemem = (b_info->max_memkb >> 10) / b_info->nr_vnodes; + if (vnodemem < MIN_VNODE_SIZE) + return -1; + /* reminder in MBytes */ + n = (b_info->max_memkb >> 10) % b_info->nr_vnodes; + /* get final sizes in MBytes */ + for(i = 0; i < (b_info->nr_vnodes - 1); i++) + b_info->vnuma_memszs[i] = vnodemem; + /* add the reminder to the last node */ + b_info->vnuma_memszs[i] = vnodemem + n; + return 0; +} + static void parse_config_data(const char *config_source, const char *config_data, int config_len, @@ -908,6 +978,11 @@ static void parse_config_data(const char *config_source, char *cmdline = NULL; const char *root = NULL, *extra = ""; + XLU_ConfigList *vnumamemcfg, *vdistancecfg, *vnodemap, *vcpumap; + int nr_vnuma_regions, nr_vdist, nr_vnodemap; + unsigned long long vnuma_memparsed = 0; + unsigned long ul; + xlu_cfg_replace_string (config, "kernel", &b_info->u.pv.kernel, 0); xlu_cfg_get_string (config, "root", &root, 0); @@ -924,7 +999,238 @@ static void parse_config_data(const char *config_source, fprintf(stderr, "Failed to allocate memory for cmdline\n"); exit(1); } + + if (!xlu_cfg_get_long (config, "vnodes", &l, 0)) { + if (l > MAX_VNUMA_NODES) { + fprintf(stderr, "Too many vnuma nodes, max %d is allowed.\n", MAX_VNUMA_NODES); + exit(1); + } + b_info->nr_vnodes = l; + + libxl_defbool_set(&b_info->vnuma_placement, false); + xlu_cfg_get_defbool(config, "vnuma_autoplacement", &b_info->vnuma_placement, 0); + + if (b_info->nr_vnodes != 0 && b_info->max_vcpus >= b_info->nr_vnodes) { + if (!xlu_cfg_get_list(config, "vnumamem", + &vnumamemcfg, &nr_vnuma_regions, 0)) { + + if (nr_vnuma_regions != b_info->nr_vnodes) { + fprintf(stderr, "Number of numa regions is incorrect.\n"); + exit(1); + } + + b_info->vnuma_memszs = calloc(b_info->nr_vnodes, + sizeof(*b_info->vnuma_memszs)); + if (b_info->vnuma_memszs == NULL) { + fprintf(stderr, "unable to allocate memory for vnuma ranges.\n"); + exit(1); + } + + char *ep; + /* + * Will parse only nr_vnodes times, even if we have more/less regions. + * Take care of it later if less or discard if too many regions. + */ + for (i = 0; i < b_info->nr_vnodes; i++) { + buf = xlu_cfg_get_listitem(vnumamemcfg, i); + if (!buf) { + fprintf(stderr, + "xl: Unable to get element %d in vnuma memroy list.\n", i); + break; + } + ul = strtoul(buf, &ep, 10); + if (ep == buf) { + fprintf(stderr, + "xl: Invalid argument parsing vnumamem: %s.\n", buf); + break; + } + + /* 32Mb is a min size for a node, taken from Linux */ + if (ul >= UINT32_MAX || ul < MIN_VNODE_SIZE) { + fprintf(stderr, "xl: vnuma memory %lu is not withing %u - %u range.\n", + ul, MIN_VNODE_SIZE, UINT32_MAX); + break; + } + + /* memory in MBytes */ + b_info->vnuma_memszs[i] = ul; + } + + /* Total memory for vNUMA parsed to verify */ + for(i = 0; i < nr_vnuma_regions; i++) + vnuma_memparsed = vnuma_memparsed + (b_info->vnuma_memszs[i]); + + /* Amount of memory for vnodes same as total? 
*/ + if((vnuma_memparsed << 10) != (b_info->max_memkb)) { + fprintf(stderr, "xl: vnuma memory is not the same as initial domain memory size.\n"); + vnuma_info_release(b_info); + exit(1); + } + } else { + b_info->vnuma_memszs = calloc(b_info->nr_vnodes, + sizeof(*b_info->vnuma_memszs)); + if (b_info->vnuma_memszs == NULL) { + fprintf(stderr, "unable to allocate memory for vnuma ranges.\n"); + exit(1); + } + + fprintf(stderr, "WARNING: vNUMA memory ranges were not specified.\n"); + fprintf(stderr, "Using default equal vnode memory size %lu Kbytes to cover %lu Kbytes.\n", + b_info->max_memkb / b_info->nr_vnodes, b_info->max_memkb); + + if (split_vnumamem(b_info) < 0) { + fprintf(stderr, "Could not split vnuma memory into equal chunks.\n"); + vnuma_info_release(b_info); + exit(1); + } + } + + b_info->vdistance = calloc(b_info->nr_vnodes * b_info->nr_vnodes, + sizeof(*b_info->vdistance)); + if (b_info->vdistance == NULL) { + vnuma_info_release(b_info); + exit(1); + } + + if(!xlu_cfg_get_list(config, "vdistance", &vdistancecfg, &nr_vdist, 0) && + nr_vdist == 2) { + /* + * If only two elements are in the vdistance list, consider + * first as value for same node distance, the second as the + * rest of distances. + * The following is required right now to avoid non-symmetrical + * distance table as it may break latest kernel. + * TODO: Better way to analyze extended distance table, possibly + * OS specific. + */ + int d1, d2; + d1 = get_list_item_uint(vdistancecfg, 0); + d2 = get_list_item_uint(vdistancecfg, 1); + + if (d1 >= 0 && d2 >= 0 && d1 < d2) { + vdistance_default(b_info->vdistance, b_info->nr_vnodes, d1, d2); + } else { + fprintf(stderr, "WARNING: Distances are not correct.\n"); + vnuma_info_release(b_info); + exit(1); + } + + } else + vdistance_default(b_info->vdistance, b_info->nr_vnodes, 10, 20); + + b_info->vcpu_to_vnode = (unsigned int *)calloc(b_info->max_vcpus, + sizeof(*b_info->vcpu_to_vnode)); + if (b_info->vcpu_to_vnode == NULL) { + vnuma_info_release(b_info); + exit(1); + } + if (!xlu_cfg_get_list(config, "vnuma_vcpumap", + &vcpumap, &nr_vnodemap, 0)) { + if (nr_vnodemap == b_info->max_vcpus) { + unsigned int vnodemask = 0, vnode, smask, vcpumask = 0, vmask; + smask = ~(~0 << b_info->nr_vnodes); + vmask = ~(~0 << nr_vnodemap); + for (i = 0; i < nr_vnodemap; i++) { + vnode = get_list_item_uint(vcpumap, i); + if (vnode >= 0) { + vnodemask |= (1 << vnode); + vcpumask |= (1 << i); + b_info->vcpu_to_vnode[i] = vnode; + } + } + + /* Did it covered all vnodes in the vcpu mask? */ + if ( !(((smask & vnodemask) + 1) == (1 << b_info->nr_vnodes)) ) { + fprintf(stderr, "WARNING: Not all vnodes were covered in vnuma_vcpumap.\n"); + vnuma_info_release(b_info); + exit(1); + } + if ( !(((vmask & vcpumask) + 1) == (1 << nr_vnodemap)) ) { + fprintf(stderr, "WARNING: Not all vnode were covered in vnuma_vcpumap.\n"); + vnuma_info_release(b_info); + exit(1); + } + } else { + fprintf(stderr, "WARNING: Not all vcpus were covered in vnuma_vcpumap.\n"); + vnuma_info_release(b_info); + exit(1); + } + } + else + vcputovnode_default(b_info->vcpu_to_vnode, + b_info->nr_vnodes, + b_info->max_vcpus); + + /* There is mapping to NUMA physical nodes? 
*/ + b_info->vnode_to_pnode = (unsigned int *)calloc(b_info->nr_vnodes, + sizeof(*b_info->vnode_to_pnode)); + if (b_info->vnode_to_pnode == NULL) { + vnuma_info_release(b_info); + exit(1); + } + if (!xlu_cfg_get_list(config, "vnuma_vnodemap",&vnodemap, + &nr_vnodemap, 0)) { + /* + * If not specified or incorred, will be defined + * later based on the machine architecture, configuration + * and memory availble when creating domain. + */ + if (nr_vnodemap == b_info->nr_vnodes) { + unsigned int vnodemask = 0, pnode, smask; + smask = ~(~0 << b_info->nr_vnodes); + for (i = 0; i < b_info->nr_vnodes; i++) { + pnode = get_list_item_uint(vnodemap, i); + if (pnode >= 0) { + vnodemask |= (1 << i); + b_info->vnode_to_pnode[i] = pnode; + } + } + + /* Did it covered all vnodes in the mask? */ + if ( !(((vnodemask & smask) + 1) == (1 << nr_vnodemap)) ) { + fprintf(stderr, "WARNING: Not all vnodes were covered vnuma_vnodemap.\n"); + + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + free(b_info->vnode_to_pnode); + b_info->vnode_to_pnode = NULL; + } else { + vnuma_info_release(b_info); + exit(1); + } + } + } else { + fprintf(stderr, "WARNING: Incorrect vnuma_vnodemap.\n"); + + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + free(b_info->vnode_to_pnode); + b_info->vnode_to_pnode = NULL; + } else { + vnuma_info_release(b_info); + exit(1); + } + } + } else { + fprintf(stderr, "WARNING: Missing vnuma_vnodemap.\n"); + + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + free(b_info->vnode_to_pnode); + b_info->vnode_to_pnode = NULL; + } else { + vnuma_info_release(b_info); + exit(1); + } + } + } + else + b_info->nr_vnodes = 0; + } + else + b_info->nr_vnodes = 0; + xlu_cfg_replace_string (config, "bootloader", &b_info->u.pv.bootloader, 0); switch (xlu_cfg_get_list_as_string_list(config, "bootloader_args", &b_info->u.pv.bootloader_args, 1)) -- 1.7.10.4
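For reference, the defaults described in the changelog above (the two-element vdistance form, the interleaved vcpu map, and the equal memory split with the remainder going to the last vnode) can be reproduced with the small standalone C sketch below. It simply mirrors the vdistance_default(), vcputovnode_default() and split_vnumamem() helpers from the patch; the node, vcpu and memory values are made-up examples, not part of the patch.

/* Standalone sketch of the xl parser defaults described above:
 *  - vdistance = [d1, d2] expands to an n x n matrix with d1 on the
 *    diagonal and d2 everywhere else;
 *  - without vnuma_vcpumap, vcpus are interleaved over the vnodes;
 *  - without vnumamem, memory is split equally, the remainder going to
 *    the last vnode.
 */
#include <stdio.h>

#define NR_VNODES 3
#define MAX_VCPUS 4
#define MEM_MB    16384

int main(void)
{
    unsigned int vdistance[NR_VNODES][NR_VNODES];
    unsigned int vcpu_to_vnode[MAX_VCPUS];
    unsigned long vnuma_memszs[NR_VNODES];
    unsigned int i, j, d1 = 10, d2 = 20;

    /* vdistance = [10, 20] -> symmetric matrix, d1 on the diagonal */
    for (i = 0; i < NR_VNODES; i++)
        for (j = 0; j < NR_VNODES; j++)
            vdistance[i][j] = (i == j) ? d1 : d2;

    /* default vcpu to vnode map: interleave vcpus over vnodes */
    for (i = 0; i < MAX_VCPUS; i++)
        vcpu_to_vnode[i] = i % NR_VNODES;

    /* equal split in MB, remainder added to the last vnode */
    for (i = 0; i < NR_VNODES; i++)
        vnuma_memszs[i] = MEM_MB / NR_VNODES;
    vnuma_memszs[NR_VNODES - 1] += MEM_MB % NR_VNODES;

    printf("vdistance = [%u, %u] expands to:\n", d1, d2);
    for (i = 0; i < NR_VNODES; i++) {
        for (j = 0; j < NR_VNODES; j++)
            printf("%u ", vdistance[i][j]);
        printf("\n");
    }
    for (i = 0; i < MAX_VCPUS; i++)
        printf("vcpu %u -> vnode %u\n", i, vcpu_to_vnode[i]);
    for (i = 0; i < NR_VNODES; i++)
        printf("vnode %u: %lu MB\n", i, vnuma_memszs[i]);

    return 0;
}

With these example values, 16384 MB over 3 vnodes gives [5461, 5461, 5462] MB, and vcpus 0-3 land on vnodes 0, 1, 2, 0.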
Prints basic information about vNUMA topology for vNUMA enabled domains when issuing debug-key ''u''. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- xen/arch/x86/numa.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c index b141877..1cf74ee 100644 --- a/xen/arch/x86/numa.c +++ b/xen/arch/x86/numa.c @@ -389,6 +389,24 @@ static void dump_numa(unsigned char key) for_each_online_node(i) printk(" Node %u: %u\n", i, page_num_node[i]); + + if (d->vnuma.nr_vnodes > 0) + { + printk(" Domain has %d vnodes\n", d->vnuma.nr_vnodes); + for (i = 0; i < d->vnuma.nr_vnodes; i++) { + printk(" vnode %d - pnode %d:", i, + d->vnuma.vnode_to_pnode[i] >= MAX_NUMNODES ? -1 : d->vnuma.vnode_to_pnode[i]); + printk(" %"PRIu64" MB \n", + (d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20); + } + + printk(" Domain vcpu to vnode: \n"); + for (i = 0; i < d->max_vcpus; i++) + printk("%s%d %s", i % 8 == 0 ? " " : "", + d->vnuma.vcpu_to_vnode[i], + (i + 1) % 8 == 0 ? "\n" : ""); + printk("\n"); + } } rcu_read_unlock(&domlist_read_lock); -- 1.7.10.4
xl: docs for xl config vnuma options Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- docs/man/xl.cfg.pod.5 | 60 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index e6fc83f..2705eb7 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -216,6 +216,66 @@ if the values of B<memory=> and B<maxmem=> differ. A "pre-ballooned" HVM guest needs a balloon driver, without a balloon driver it will crash. +=item B<vnuma_nodes=N> + +Number of vNUMA nodes the guest will be initialized with on boot. + +=item B<vnuma_mem=[vmem1, vmem2, ...]> + +The vnode memory sizes, defined in MBytes. Domain creation will fail if the sum of +all vnode memories does not match the domain memory, or if not all nodes are +defined here. If not specified, memory will be split equally between vnodes. Currently the +minimum vnode size is 64MB. + +Example: vnuma_mem=[1024, 1024, 2048, 2048] + +=item B<vdistance=[d1, d2]> + +Defines the distance table for vNUMA nodes. Distances for NUMA machines are usually +represented by a two-dimensional array, and all distances may be specified in one +line here, by rows. The distance can be specified as two numbers [d1, d2], +where d1 is the same-node distance and d2 is the value for all other distances. +If not specified, the default distance [10, 20] will be used. + +Examples: +vnodes = 3 +vdistance=[10, 20] +will expand to this distance table (this is the default setting as well): +[10, 20, 20] +[20, 10, 20] +[20, 20, 10] + +=item B<vnuma_vcpumap=[vcpu1, vcpu2, ...]> + +Defines the vcpu to vnode mapping as a list of integers representing node +numbers. If not defined, the vcpus are interleaved over the virtual nodes. +Current limitation: every vNUMA node has to have at least one vcpu, otherwise +the default vcpu_to_vnode mapping will be used. + +Example: +to map 4 vcpus to 2 nodes - vcpus 0,1 -> vnode 0, vcpus 2,3 -> vnode 1: +vnuma_vcpumap = [0, 0, 1, 1] + +=item B<vnuma_vnodemap=[p1, p2, ..., pn]> + +vnode to pnode mapping. Can be configured if manual vnode allocation is +required. It only takes effect on real NUMA machines, and only if memory or +other constraints do not prevent it. If the mapping is valid, +automatic NUMA placement will be disabled. If the mapping is incorrect +and vnuma_autoplacement is true, automatic NUMA placement will be used; +otherwise domain creation fails. + +Example: +assume a two-node NUMA machine: +vnuma_vnodemap=[1, 0] +The first vnode will be placed on node 1, the second on node 0. + +=item B<vnuma_autoplacement=[0|1]> + +If enabled, the best physical node candidate will be found automatically for +each vnode if vnuma_vnodemap is incorrect or memory requirements prevent +using it. Set to '0' by default. + =back =head3 Event Actions -- 1.7.10.4
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>> diff --git a/xen/common/domctl.c b/xen/common/domctl.c > index 904d27b..af6ae9b 100644> @@ -889,6 +890,84 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) > } > break; > > + case XEN_DOMCTL_setvnumainfo: > + { > + unsigned int dist_size, nr_vnodes; > + > + ret = -EINVAL; > + > + /* > + * If number of vnodes was set before, > + * dont initilize it again. > + */ > + //ASSERT(d->vnuma.nr_vnodes > 0); >Kill this? :-P Apart from this, most of my comments to previous round have been addressed, but I think some of the ones Jan provided haven''t. Elena, can you double check? http://bugs.xenproject.org/xen/mid/%3C5285EE480200007800103692@nat28.tlf.novell.com%3E> diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h > index a057069..77853e8 100644 > --- a/xen/include/xen/domain.h > +++ b/xen/include/xen/domain.h > @@ -89,4 +89,13 @@ extern unsigned int xen_processor_pmbits; > > extern bool_t opt_dom0_vcpus_pin; > > +struct vnuma_info { > + unsigned int nr_vnodes; > + unsigned int pad; > + unsigned int *vdistance; > + unsigned int *vcpu_to_vnode; > + unsigned int *vnode_to_pnode; > + struct vmemrange *vmemrange; > +}; > + >And one last thing. Sorry if it''s me not remembering, but what''s ''pad'' above useful for? Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-19 08:37 UTC
Re: [PATCH v3 2/7] libxc: Plumb Xen with vNUMA topology for domain.
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Per-domain vNUMA topology initialization. > domctl hypercall is used to set vNUMA topology > per domU during domain build time. > > Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> >Ok, this one below should either go under its own "---"> Changes since v2: > * modified error logic; > ---Or here. I mean, something like the following: " bla bla changelog bal bla Signed-off-by: Elena... --- Changes since v2: * xxx * yyyy --- <diffstat> " This way the "Changes since" part will be automatically kept out of the final committed changelog by the git commands committers use to commit the patch. ISTR this happens for everything that''s below "---". If you put it _inside_ the actual changelog, they''ll have to remove it by hand, which is annoying.> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c > index 1ccafc5..198694f 100644> +/* Plumbs Xen with vNUMA topology */ > +int xc_domain_setvnuma(xc_interface *xch, > + uint32_t domid, > + uint16_t nr_vnodes, > + uint16_t nr_vcpus, > + vmemrange_t *vmemrange, > + unsigned int *vdistance, > + unsigned int *vcpu_to_vnode, > + unsigned int *vnode_to_pnode) > +{ > + int rc; > + DECLARE_DOMCTL; > + DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes, > + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); > + DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) * > + nr_vnodes * nr_vnodes, > + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); > + DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus, > + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); > + DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) * > + nr_vnodes, > + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); >So, last round I suggested that these bounce buffers should have been _BOUNCE_IN, as information only flow toward Xen, and nothing gets copied back to xc (and neither it is from xc to the application). However, Elena reported the system crashing (actually, what is it that was exactly happening, Elena?) if doing so. I gave it another look, but I don''t think I see a possible reason for that... Anyone? IanC? Andrew?> + if ( nr_vnodes == 0 ) { > + errno = EINVAL; > + return -1; > + } > + > + if ( vdistance == NULL || vcpu_to_vnode == NULL || > + vmemrange == NULL || vnode_to_pnode == NULL ) { > + PERROR("Incorrect parameters for XEN_DOMCTL_setvnumainfo.\n"); > + errno = EINVAL; > + return -1; > + } > +I think this line above has some spurious spaces.> + if ( xc_hypercall_bounce_pre(xch, vmemrange) || > + xc_hypercall_bounce_pre(xch, vdistance) || > + xc_hypercall_bounce_pre(xch, vcpu_to_vnode) || > + xc_hypercall_bounce_pre(xch, vnode_to_pnode) ) { > + PERROR("Could not bounce buffer for xc_domain_setvnuma.\n"); > + errno = EFAULT; > + return -1; > + } > +And this one too (spurious spaces). 
Also, while the setting of errno in the first two if-s is fine, in this case I think you should avoid doing it, as we rely on xc_hypercall_bounce_pre to have done so.> + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); > + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); > + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); > + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); > +^ and here> + domctl.cmd = XEN_DOMCTL_setvnumainfo; > + domctl.domain = (domid_t)domid; > + domctl.u.vnuma.nr_vnodes = nr_vnodes; > + domctl.u.vnuma.__pad = 0; > +^ here> + rc = do_domctl(xch, &domctl); > +^ here> + xc_hypercall_bounce_post(xch, vmemrange); > + xc_hypercall_bounce_post(xch, vdistance); > + xc_hypercall_bounce_post(xch, vcpu_to_vnode); > + xc_hypercall_bounce_post(xch, vnode_to_pnode); > +^ and here too. I double checked that this isn''t an issue with my MUA by importing the patch in this e-mail and I confirm it is not, it looks like there are a few trailing whitespaces around. I went checking a couple of the other patches too, and found something similar. Can you fix this? For the Linux patches, checkpatch is usually of great help. For Xen ones, I don''t think we have anything like it (yet? :-D). However, for this specific problem (trailing whitespaces), even just looking at the patch with `git show'' should make the issue evident. For instance, here I see them as big red rectangles. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
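For context, the change Dario suggests (declaring the buffers as XC_HYPERCALL_BUFFER_BOUNCE_IN, since the data only flows towards Xen) would look roughly like the sketch below. It is trimmed to two of the four arrays and is only an illustration built on the quoted patch, not the code as posted; it assumes the usual libxc internals (xc_private.h) and the XEN_DOMCTL_setvnumainfo fields introduced by this series, and the function name is made up.

/* Rough sketch of the _BOUNCE_IN variant being discussed: data only flows
 * towards Xen, so nothing needs to be copied back after the hypercall.
 */
int xc_domain_setvnuma_sketch(xc_interface *xch, uint32_t domid,
                              uint16_t nr_vnodes, uint16_t nr_vcpus,
                              unsigned int *vdistance,
                              unsigned int *vcpu_to_vnode)
{
    int rc;
    DECLARE_DOMCTL;
    DECLARE_HYPERCALL_BOUNCE(vdistance,
                             sizeof(*vdistance) * nr_vnodes * nr_vnodes,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode,
                             sizeof(*vcpu_to_vnode) * nr_vcpus,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);

    if ( xc_hypercall_bounce_pre(xch, vdistance) ||
         xc_hypercall_bounce_pre(xch, vcpu_to_vnode) )
    {
        /* errno is already set by xc_hypercall_bounce_pre() */
        PERROR("Could not bounce buffers for XEN_DOMCTL_setvnumainfo");
        return -1;
    }

    set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance);
    set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode);
    domctl.cmd = XEN_DOMCTL_setvnumainfo;
    domctl.domain = (domid_t)domid;
    domctl.u.vnuma.nr_vnodes = nr_vnodes;

    rc = do_domctl(xch, &domctl);

    xc_hypercall_bounce_post(xch, vdistance);
    xc_hypercall_bounce_post(xch, vcpu_to_vnode);

    return rc;
}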
>>> On 18.11.13 at 21:24, Elena Ufimtseva <ufimtseva@gmail.com> wrote: > @@ -889,6 +890,84 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) > } > break; > > + case XEN_DOMCTL_setvnumainfo: > + { > + unsigned int dist_size, nr_vnodes; > + > + ret = -EINVAL; > + > + /* > + * If number of vnodes was set before, > + * dont initilize it again. > + */ > + //ASSERT(d->vnuma.nr_vnodes > 0);Please, finally, clean up your patches before submitting. I''m not going to point out further obvious violations below.> --- /dev/null > +++ b/xen/include/public/vnuma.h > @@ -0,0 +1,56 @@ > +#ifndef _XEN_PUBLIC_VNUMA_H > +#define _XEN_PUBLIC_VNUMA_H > + > +#include "xen.h" > + > +/* > + * Following structures are used to represent vNUMA > + * topology to guest if requested. > + */ > + > +/* > + * Memory ranges can be used to define > + * vNUMA memory node boundaries by the > + * linked list. As of now, only one range > + * per domain is suported. > + */ > +struct vmemrange { > + uint64_t start, end; > + uint64_t __padm;What are you padding here?> +}; > + > +typedef struct vmemrange vmemrange_t; > +DEFINE_XEN_GUEST_HANDLE(vmemrange_t); > + > +/* > + * vNUMA topology specifies vNUMA node > + * number, distance table, memory ranges and > + * vcpu mapping provided for guests. > + */ > + > +struct vnuma_topology_info { > + /* IN */ > + domid_t domid; > + uint32_t _pad; > + /* OUT */ > + union { > + XEN_GUEST_HANDLE(uint) h; > + uint64_t _padn;No need for the odd trailing n (and other characters below). One thing that I only realized - what information is the guest supposed to use to know the number of virtual nodes it has (needed to size the arrays the handles here point to)?> + } nr_vnodes; > + union { > + XEN_GUEST_HANDLE(uint) h; > + uint64_t _padd; > + } vdistance; > + union { > + XEN_GUEST_HANDLE(uint) h;And this one is as problematic - I don''t think the guest necessarily knows what max_vcpus the hypervisor has set for it. Jan
Konrad Rzeszutek Wilk
2013-Nov-19 14:03 UTC
Re: [PATCH v3 2/7] libxc: Plumb Xen with vNUMA topology for domain.
On Tue, Nov 19, 2013 at 09:37:25AM +0100, Dario Faggioli wrote:> [...]
> > > + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); > > + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); > > + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); > > + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); > > + > ^ and here > > > + domctl.cmd = XEN_DOMCTL_setvnumainfo; > > + domctl.domain = (domid_t)domid; > > + domctl.u.vnuma.nr_vnodes = nr_vnodes; > > + domctl.u.vnuma.__pad = 0; > > + > ^ here > > + rc = do_domctl(xch, &domctl); > > + > ^ here > > > + xc_hypercall_bounce_post(xch, vmemrange); > > + xc_hypercall_bounce_post(xch, vdistance); > > + xc_hypercall_bounce_post(xch, vcpu_to_vnode); > > + xc_hypercall_bounce_post(xch, vnode_to_pnode); > > + > ^ and here too. > > I double checked that this isn''t an issue with my MUA by importing the > patch in this e-mail and I confirm it is not, it looks like there are a > few trailing whitespaces around. > > I went checking a couple of the other patches too, and found something > similar. Can you fix this? > > For the Linux patches, checkpatch is usually of great help. For Xen > ones, I don''t think we have anything like it (yet? :-D). However, for > this specific problem (trailing whitespaces), even just looking at the > patch with `git show'' should make the issue evident. For instance, here > I see them as big red rectangles. >Or in vim you can have syn match ErrorLeadSpace /^ \+/ " highlight any leading spaces syn match ErrorTailSpace / \+$/ " highlight any trailing spaces syn match Error80 /\%>80v.\+/ " highlight anything past 80 hi Error80 ctermbg=red hi ErrorLeadSpace ctermbg=red hi ErrorTailSpace ctermbg=red And that will nicely point it to you.> Regards, > Dario > > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >> _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-19 14:22 UTC
Re: [PATCH v3 3/7] libxc: vnodes allocation on NUMA nodes.
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>> diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c > index e034d62..1839360 100644> @@ -802,27 +802,47 @@ int arch_setup_meminit(struct xc_dom_image *dom) > else > { > /* try to claim pages for early warning of insufficient memory avail */ > + rc = 0; > if ( dom->claim_enabled ) { > rc = xc_domain_claim_pages(dom->xch, dom->guest_domid, > dom->total_pages); > if ( rc ) > + { > + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, > + "%s: Failed to claim mem for dom\n", > + __FUNCTION__); > return rc; > + } >Mmm... here you''re just adding some log output. I think it''s a minor thing, i.e., it''s not terrible to have this in this patch. However, ideally, this should happen in its own patch, as it is not much related to vNUMA, is it?> /* allocate guest memory */ > - for ( i = rc = allocsz = 0; > - (i < dom->total_pages) && !rc; > - i += allocsz ) > + if ( dom->nr_vnodes > 0 ) > { > - allocsz = dom->total_pages - i; > - if ( allocsz > 1024*1024 ) > - allocsz = 1024*1024; > - rc = xc_domain_populate_physmap_exact( > - dom->xch, dom->guest_domid, allocsz, > - 0, 0, &dom->p2m_host[i]); > + rc = arch_boot_numa_alloc(dom); > + if ( rc ) > + { > + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, > + "%s: Failed to allocate memory on NUMA nodes\n", > + __FUNCTION__); > + return rc; > + } > + } > + else > + { > + for ( i = rc = allocsz = 0; > + (i < dom->total_pages) && !rc; > + i += allocsz ) > + { > + allocsz = dom->total_pages - i; > + if ( allocsz > 1024*1024 ) > + allocsz = 1024*1024; > + rc = xc_domain_populate_physmap_exact( > + dom->xch, dom->guest_domid, allocsz, > + 0, 0, &dom->p2m_host[i]); > + } > }I think I agree with George''s comment from last round: do unify the two paths (yes, I saw it''s on the TODO for this patch... so, go ahead and do it :-) ).> +int arch_boot_numa_alloc(struct xc_dom_image *dom) > +{ > + int rc; > + unsigned int n; > + unsigned long long guest_pages; > + unsigned long long allocsz = 0, node_pfn_base, i; > + unsigned long memflags; > + > + rc = allocsz = node_pfn_base = 0; > > + allocsz = 0; > + for ( n = 0; n < dom->nr_vnodes; n++ ) > + { > + memflags = 0; > + if ( dom->vnode_to_pnode[n] != VNUMA_NO_NODE ) > + { > + memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]); > + memflags |= XENMEMF_exact_node_request; > + } > + guest_pages = (dom->vnuma_memszs[n] << 20) >> PAGE_SHIFT_X86; >I probably would call this ''pages_node'' (or ''node_pages'', or ''guest_pages_node'', etc.), but that''s also something not that important. :-) Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
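As a rough illustration of the "unify the two paths" comment, the two allocation loops in arch_setup_meminit() could collapse into something like the sketch below, where a non-vNUMA domain is simply treated as a single node with no exact-node request. This is only a sketch against the xc_dom_image fields added by this series (nr_vnodes, vnuma_memszs, vnode_to_pnode, VNUMA_NO_NODE), with a made-up function name; chunk sizing and error reporting are simplified compared to the real libxc code.

/* Sketch of one way to unify the two allocation paths discussed above. */
static int arch_boot_alloc_sketch(struct xc_dom_image *dom)
{
    unsigned int n, nr_nodes = dom->nr_vnodes ? dom->nr_vnodes : 1;
    xen_pfn_t pfn_base = 0;
    int rc = 0;

    for ( n = 0; n < nr_nodes && !rc; n++ )
    {
        unsigned long memflags = 0;
        unsigned long long pages, i, allocsz;

        if ( dom->nr_vnodes && dom->vnode_to_pnode[n] != VNUMA_NO_NODE )
        {
            memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]);
            memflags |= XENMEMF_exact_node_request;
        }

        /* per-node size in MB -> pages; the whole domain if no vNUMA */
        pages = dom->nr_vnodes ?
                (dom->vnuma_memszs[n] << 20) >> PAGE_SHIFT_X86 :
                dom->total_pages;

        for ( i = 0; i < pages && !rc; i += allocsz )
        {
            allocsz = pages - i;
            if ( allocsz > 1024 * 1024 )
                allocsz = 1024 * 1024;
            rc = xc_domain_populate_physmap_exact(
                dom->xch, dom->guest_domid, allocsz, 0, memflags,
                &dom->p2m_host[pfn_base + i]);
        }
        pfn_base += pages;
    }

    return rc;
}

The chunked xc_domain_populate_physmap_exact() calls and the 1024*1024-page cap are kept from the existing loop; only the per-node memflags selection is new.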
On mar, 2013-11-19 at 14:01 +0000, Jan Beulich wrote:> One thing that I only realized - what information is the guest > supposed to use to know the number of virtual nodes it has > (needed to size the arrays the handles here point to)? >If I understand correctly what you''re asking, the (current) answer is in the Linux patch, here: http://bugs.xenproject.org/xen/mid/%3C1384811922-14642-2-git-send-email-ufimtseva@gmail.com%3E which uses num_possible_cpus() http://lxr.linux.no/linux+v3.12/+code=num_possible_cpus to size all the three arrays. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 19.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote: > On mar, 2013-11-19 at 14:01 +0000, Jan Beulich wrote: >> One thing that I only realized - what information is the guest >> supposed to use to know the number of virtual nodes it has >> (needed to size the arrays the handles here point to)? >> > If I understand correctly what you''re asking, the (current) answer is in > the Linux patch, here: > > > http://bugs.xenproject.org/xen/mid/%3C1384811922-14642-2-git-send-email-ufimt > seva@gmail.com%3E > > which uses num_possible_cpus() > > http://lxr.linux.no/linux+v3.12/+code=num_possible_cpus > > to size all the three arrays.But what guarantees num_possible_cpus() to be larger than some hypervisor internal variables that the domain has no knowledge about? Just consider a guest with a Linux configured for just 4 CPUs and 2 nodes, but having a config file specifying 16 vCPU-s on 4 virtual nodes? Jan
On mar, 2013-11-19 at 14:48 +0000, Jan Beulich wrote:> >>> On 19.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote: > > If I understand correctly what you''re asking, the (current) answer is in > > the Linux patch, here: > > > > > > http://bugs.xenproject.org/xen/mid/%3C1384811922-14642-2-git-send-email-ufimt > > seva@gmail.com%3E > > > > which uses num_possible_cpus() > > > > http://lxr.linux.no/linux+v3.12/+code=num_possible_cpus > > > > to size all the three arrays. > > But what guarantees num_possible_cpus() to be larger than some > hypervisor internal variables that the domain has no knowledge > about? >Nothing, I guess.> Just consider a guest with a Linux configured for just 4 > CPUs and 2 nodes, but having a config file specifying 16 vCPU-s on > 4 virtual nodes? >That''s a fair point. I can''t remember the rationale behind the choice of using num_possible_cpus()... ISTR some very early version (probably not even shared on xen-devel) of the series using something like NR_CPUS, but that would suffer from the same issue, I think. Probably, we just overlooked the situation you''re describing and though that, given we don''t allow nr_vcpus > nr_vnodes, using num_possible_cpus() ought to be enough. But I see it now. So, what would the best option be? Another hypercall (or a special way of calling this one) "just" to retrieve the number of vnodes? Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 19.11.13 at 16:42, Dario Faggioli <dario.faggioli@citrix.com> wrote: > On mar, 2013-11-19 at 14:48 +0000, Jan Beulich wrote: >> Just consider a guest with a Linux configured for just 4 >> CPUs and 2 nodes, but having a config file specifying 16 vCPU-s on >> 4 virtual nodes? >> > That''s a fair point. I can''t remember the rationale behind the choice of > using num_possible_cpus()... ISTR some very early version (probably not > even shared on xen-devel) of the series using something like NR_CPUS, > but that would suffer from the same issue, I think. > > Probably, we just overlooked the situation you''re describing and though > that, given we don''t allow nr_vcpus > nr_vnodes, using > num_possible_cpus() ought to be enough. But I see it now. > > So, what would the best option be? Another hypercall (or a special way > of calling this one) "just" to retrieve the number of vnodes?Iirc there''s a padding field in the interface structure, which could be leveraged. But then again you need two counts, and hence it might be better to simply add two respective fields. Then make it/them IN/OUT, and rather than filling the arrays when they''re too small just send back the necessary values. (And of course you''ll want to also send back the actual values in case the passed in ones turned out to large, so the guest would know how many of the array elements actually have valid data). But in the end the fundamental question stands - how was a PV guest in your so far proposed model supposed to know its number of vNodes? While for HVM guests you can make this available via ACPI, that''s not an option for PV. Jan
On mar, 2013-11-19 at 15:54 +0000, Jan Beulich wrote:> >>> On 19.11.13 at 16:42, Dario Faggioli <dario.faggioli@citrix.com> wrote: > > So, what would the best option be? Another hypercall (or a special way > > of calling this one) "just" to retrieve the number of vnodes? > > Iirc there''s a padding field in the interface structure, which could > be leveraged. But then again you need two counts, and hence it > might be better to simply add two respective fields. Then make > it/them IN/OUT, and rather than filling the arrays when they''re > too small just send back the necessary values. (And of course > you''ll want to also send back the actual values in case the passed > in ones turned out to large, so the guest would know how many > of the array elements actually have valid data). > > But in the end the fundamental question stands - how was a PV > guest in your so far proposed model supposed to know its number > of vNodes? While for HVM guests you can make this available via > ACPI, that''s not an option for PV. >Wait... I''m no longer so sure I''m getting what you say. I''d be inclined to say "by the XENMEM_get_vnuma_info hcall implemented here", but then again, maybe I''m missing something. The hypercall does provide a mean for the guest to retrieve _all_ the virtual topology information, such as: - number of virtual nodes - virtual node memory ranges - virtual cpu to virtual node mapping - virtual node to physical node mapping, for use in (future) in-guest vNUMA aware subsystems (e.g., ballooning) So, if your point is (as I thought) that for properly allocating the buffers for this hypercall to work we need an information only provided by this hypercall itself, then I agree, and that''s why I asked what alternative way would be best to retrieve that bit of information. If it''s something else, then I don''t know. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 19.11.13 at 17:36, Dario Faggioli <dario.faggioli@citrix.com> wrote: > On mar, 2013-11-19 at 15:54 +0000, Jan Beulich wrote: >> >>> On 19.11.13 at 16:42, Dario Faggioli <dario.faggioli@citrix.com> wrote: >> > So, what would the best option be? Another hypercall (or a special way >> > of calling this one) "just" to retrieve the number of vnodes? >> >> Iirc there''s a padding field in the interface structure, which could >> be leveraged. But then again you need two counts, and hence it >> might be better to simply add two respective fields. Then make >> it/them IN/OUT, and rather than filling the arrays when they''re >> too small just send back the necessary values. (And of course >> you''ll want to also send back the actual values in case the passed >> in ones turned out to large, so the guest would know how many >> of the array elements actually have valid data). >> >> But in the end the fundamental question stands - how was a PV >> guest in your so far proposed model supposed to know its number >> of vNodes? While for HVM guests you can make this available via >> ACPI, that''s not an option for PV. >> > Wait... I''m no longer so sure I''m getting what you say. I''d be inclined > to say "by the XENMEM_get_vnuma_info hcall implemented here", but then > again, maybe I''m missing something. > > The hypercall does provide a mean for the guest to retrieve _all_ the > virtual topology information, such as: > - number of virtual nodes > - virtual node memory ranges > - virtual cpu to virtual node mapping > - virtual node to physical node mapping, for use in (future) in-guest > vNUMA aware subsystems (e.g., ballooning) > > So, if your point is (as I thought) that for properly allocating the > buffers for this hypercall to work we need an information only provided > by this hypercall itself, then I agree, and that''s why I asked what > alternative way would be best to retrieve that bit of information. > > If it''s something else, then I don''t know. :-)No, it is what you''re naming above. I was merely curious how you had supposed the guest would know the vNode count prior to this change request of mine. I didn''t look at the Linux patches yet (due to lack of time), hence I don''t know how you derived(?) the node count without it coming back from the hypercall here. Jan
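To make the suggestion concrete, an IN/OUT variant of the interface could take a shape along these lines. This is purely a sketch of the idea being discussed, not the structure actually proposed in the series; the struct name is made up and explicit padding for ABI alignment is left out for brevity.

/* A possible shape of the IN/OUT interface being discussed. */
struct vnuma_topology_info_sketch {
    /* IN */
    domid_t domid;
    /* IN/OUT: sizes of the arrays below; written back by the hypervisor
     * if the guest's guess does not match what Xen has for the domain. */
    uint32_t nr_vnodes;
    uint32_t nr_vcpus;
    /* OUT */
    XEN_GUEST_HANDLE(uint) vdistance;        /* nr_vnodes * nr_vnodes */
    XEN_GUEST_HANDLE(uint) vcpu_to_vnode;    /* nr_vcpus */
    XEN_GUEST_HANDLE(vmemrange_t) vmemrange; /* nr_vnodes */
};

The guest would fill nr_vnodes and nr_vcpus with the sizes of the buffers it allocated; the hypervisor would either fill the arrays or write back the counts it actually needs, so an undersized guess can be retried.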
Just one thing (for now)... On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > index 8690ec7..d7aa0b3 100644 > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c >This is the only file you modify in the patch, so the patch itself is about _xl_, rather than _libxl_, isn''t it? Please, make the subject consistent with that. It''s a big patch (and it should be even bigger, as you''ll see from my comments on patch 7/7) so, perhaps, consider splitting it up, e.g., by having one patch for each new config option introduced (if feasible and not too hard). Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-19 17:23 UTC
Re: [PATCH v3 7/7] xl: docs for xl config vnuma options
Again, only one quick note (for now). On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> xl: docs for xl config vnuma options >Changes to the documentation should happen in the same patch that introduces the feature or modifies the behavior. This means this patch should be merged with 5/7, making it even bigger and, to a certain extent, harder to review (that''s why I was suggesting breaking it down). Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
George Dunlap
2013-Nov-19 17:26 UTC
Re: [PATCH v3 7/7] xl: docs for xl config vnuma options
On 11/19/2013 05:23 PM, Dario Faggioli wrote:> Again, only one quick note (for now). > > On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote: >> xl: docs for xl config vnuma options >> > Changes to the documentation should happen in the same patch that > introduces the feature or modifies the behavior.Although there is a certain logic to that, if there are major documentation changes, I find that having the documentation in the same exact patch makes the patch a lot harder to review. I much prefer to have the documentation happen at the end of the patch series, unless it''s something really minor. (Although I generally won''t complain either way.) -George
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> > > Changes since v2: > * Added vnuma memory pfn alignment which takes into > account e820_host option and non-contiguous e820 memory map > in that case; >Wow, cool that you got this one working too! ;-P> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c > index 0de1112..2bd3653 100644> +int libxl_domain_setvnuma(libxl_ctx *ctx, > + uint32_t domid, > + uint16_t nr_vnodes, > + uint16_t nr_vcpus, > + vmemrange_t *vmemrange, > + unsigned int *vdistance, > + unsigned int *vcpu_to_vnode, > + unsigned int *vnode_to_pnode) > +{ > + GC_INIT(ctx); >Do you need this? I don''t think so: you don''t need a gc, and ctx appears to be enough for you.> + int ret; > + ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, > + nr_vcpus, vmemrange, > + vdistance, > + vcpu_to_vnode, > + vnode_to_pnode); > + GC_FREE; > + return ret; > +} > +libxl function should always return libxl error codes, such as ERROR_FAIL, ERROR_INVAL, ecc. It''s nice to show what actually went wrong in the xc_* call, but that''s usually done by: rc = xc_call(); if (rc < 0) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "error xxx"); return ERROR_FAIL; } On the grounds that libxc would properly set errno and return -1 on failure.> diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h > index abe6685..b95abab 100644> int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, > uint32_t domid); > > +int libxl__vnuma_align_mem(libxl__gc *gc, > + uint32_t domid, > + struct libxl_domain_build_info *b_info, > + vmemrange_t *memblks); > + > +unsigned long e820_memory_hole_size(unsigned long start, > + unsigned long end, > + struct e820entry e820[], > + int nr); > + > +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, > + libxl_domain_build_info *info); > + >I think we''ll better have a libxl maintainer comment on this but, FWIW, I don''t think I''d put these functions in libxl_arch.h. We do have a libxl_numa.c file, for host their implementation (rather than in libxl_dom.c) and, as libxl hidden functions, their prototype is probably fine in libxl_internal.h (unless there is some specific reason for having them here which I''m overlooking). As per e820_memory_hole_size(), it looks like it''s only called from inside e820_memory_hole_size(), which means it can be just a static (still inside libxl_numa.c).> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > index a1c16b0..378249e 100644> +/* prepares vnode to pnode map for domain vNUMA memory allocation */ > +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, > + libxl_domain_build_info *info) > +{ > + int i, n, nr_nodes, rc; >Please, initialize nr_nodes to 0. Given how libxl_get_numainfo() works, that''s the best and easiest way for being sure we won''t leak any of the memory it deals with.> + uint64_t *mems; > + unsigned long long *claim = NULL; > + libxl_numainfo *ninfo = NULL; > + > + rc = -1; >As said above, use libxl error codes.> + if (info->vnode_to_pnode == NULL) { > + info->vnode_to_pnode = calloc(info->nr_vnodes, > + sizeof(*info->vnode_to_pnode)); > + if (info->vnode_to_pnode == NULL) > + return rc; > + } > +Also, in libxl, memory allocation never fails. :-) What this really mean is that such kind of failure is handled for you by the library itself, if you use the proper wrappers (and doing so is not optional, is the way you deal with memory in libxl). 
For this specific example, look for libxl__calloc() and some usage examples of it (you will find there never is any error handling code).> + /* default setting */ > + for (i = 0; i < info->nr_vnodes; i++) > + info->vnode_to_pnode[i] = VNUMA_NO_NODE; > + > + /* Get NUMA info */ > + ninfo = libxl_get_numainfo(CTX, &nr_nodes); > + if (ninfo == NULL || nr_nodes == 0) { > + LOG(DEBUG, "No HW NUMA found.\n"); > + rc = 0; > + goto vnmapout; > + } > +I''m not so sure about this. What do you mean with "No HW NUMA found"? IIRC, on a non-NUMA box, libxl_get_numainfo returns something. I just checked on my (non-NUMA) laptop, and I do see a node 0 with all the memory and 10 as distance to itself. So, if what you get is ninfo=NULL and nr_nodes=0, it means something went wrong, and we shouldn''t return success. OTOH, if you''re interested to detect the case where the call succeeded, and there really is only one node, you should probably check for nr_nodes == 1.> + /* > + * check if we have any hardware NUMA nodes selected, > + * otherwise VNUMA_NO_NODE set and used default allocation > + */ > + if (libxl_bitmap_is_empty(&info->nodemap)) > + return 0; >Why _is_empty()? I mean, under what circumstances info->nodemap is empty? I''m asking because info->nodemap is initialized as full in libxl__domain_build_info_setdefault() and can only change in libxl__build_pre() in a way that makes it really unlikely for it to become empty... so...> + mems = info->vnuma_memszs; > + > + /* check if all vnodes will fit in one node */ > + libxl_for_each_set_bit(n, info->nodemap) { > + if(ninfo[n].free/1024 >= info->max_memkb && > + libxl_bitmap_test(&info->nodemap, n)) > + { > + /* > + * all domain v-nodes will fit one p-node n, > + * p-node n is a best candidate selected by automatic > + * NUMA placement. > + */ > + for (i = 0; i < info->nr_vnodes; i++) > + info->vnode_to_pnode[i] = n; > + /* we can exit, as we are happy with placement */ > + return 0; > + } > + } >Wait, why do we want to do that? If info->nodemap is not fully set (which BTW is the actual situation used to exemplify "auto", or "do this for me", rather than it being empty) it already contains some instructions that, I think, everyone in the downstream chain --be it other parts of libxl, libxc and Xen-- should follow. What I mean is, if info->nodemap has two bits sets, that means the domain should be "created on" the (two) nodes corresponding to the bits themselves. OTOH, if info->nodemap has only one bit set, then the whole domain should stay there and in that case there isn''t much choice, is it? (basically, your "libxl_for_each_set_bit()" will only take one step.) Here, it looks like, no matter what you we have in info->nodemap, you look for one of the nodes that can contain the whole domain, which doesn''t look right. Below...> + /* Determine the best nodes to fit vNUMA nodes */ > + /* TODO: change algorithm. 
The current just fits the nodes > + * Will be nice to have them also sorted by size > + * If no p-node found, will be set to NUMA_NO_NODE > + */ > + claim = calloc(info->nr_vnodes, sizeof(*claim)); > + if (claim == NULL) > + return rc; > + > + libxl_for_each_set_bit(n, info->nodemap) > + { > + for (i = 0; i < info->nr_vnodes; i++) > + { > + if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) && > + /*vnode was not set yet */ > + (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) ) > + { > + info->vnode_to_pnode[i] = n; > + claim[n] += (mems[i] << 20); > + } > + } > + } > +...you''re trying to do something similar, i.e., pack the domain within a subset of what you find in info->cpumap instead of just use it. Anyway, I know that automatic placement code may look tricky... Let me see if I can hack something together quickly, to show what I mean and what I''d have put here instead of all this. :-)> int libxl__build_pre(libxl__gc *gc, uint32_t domid, > libxl_domain_config *d_config, libxl__domain_build_state *state) > { > @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, > if (rc) > return rc; > } > + > + if (info->nr_vnodes > 0) { > + /* The memory blocks will be formed here from sizes */ > + struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes, > + sizeof(*memrange)); > + if (memrange == NULL) { > + LOG(DETAIL, "Failed to allocate memory for memory ranges.\n"); > + return ERROR_FAIL; > + } >AhA, here you''re using the correct memory allocation wrapper, so you can ditch the error handling.> + if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) { > + LOG(DETAIL, "Failed to align memory map.\n"); > + return ERROR_FAIL; > + } > + > + /* > + * If vNUMA vnode_to_pnode map defined, determine if we > + * can disable automatic numa placement and place vnodes > + * on specified pnodes. > + * For now, if vcpu affinity specified, we will use > + * specified vnode to pnode map. > + */ > + if (libxl__vnodemap_is_usable(gc, info) == 1) { > +Extra blank line.> + /* Will use user-defined vnode to pnode mask */ > +Here too, and with spurious spaces.> + libxl_defbool_set(&info->numa_placement, false); >Setting info->numa_placement to false only makes sense if you do it _before_ the following block of code: if (libxl_defbool_val(info->numa_placement)) { if (!libxl_bitmap_is_full(&info->cpumap)) { LOG(ERROR, "Can run NUMA placement only if no vcpu " "affinity is specified"); return ERROR_INVAL; } rc = numa_place_domain(gc, domid, info); if (rc) return rc; } There isn''t any other place where that flag is checked and, if you get past this, automatic placement either already happened (if numa_placement was true and the other condition is met), or it never will. The right thing to do would probably be having libxl__vnodemap_is_usable() called *before* (even immediately above is fine) we get here, and have _it_ disable info->numa_placement, if that reveals to be the case. 
At this point, most of the code below (until ''***'') can just live inside the above "if (libxl_defbool_val(info->numa_placement))" above, without the need of introducing another identical block.> + } > + else { > + LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n"); > + if (libxl_defbool_val(info->vnuma_placement)) { > + > + LOG(DETAIL, "Switching to automatic vnuma to pnuma placement\n."); > + /* Construct the vnode to pnode mapping if possible */ > + if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) { > + LOG(DETAIL, "Failed to call init_vnodemap\n"); > + /* vnuma_nodemap will not be used if nr_vnodes == 0 */ > + return ERROR_FAIL; > + } > + } > + else { > + LOG(ERROR, "The vnodes cannot be mapped to pnodes this way\n."); > + info->nr_vnodes = 0; > + return ERROR_FAIL; > + } > + } > + /* plumb domain with vNUMA topology */ > + if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, > + info->max_vcpus, memrange, > + info->vdistance, info->vcpu_to_vnode, > + info->vnode_to_pnode) < 0) { > + > + LOG(DETAIL, "Failed to set vnuma topology for domain from\n."); > + info->nr_vnodes = 0; > + return ERROR_FAIL; > + } > + } > + else > + LOG(DEBUG, "Will not construct vNUMA topology.\n"); > +*** <-- until here> +/* Checks if vnuma_nodemap defined in info can be used > + * for allocation of vnodes on physical NUMA nodes by > + * verifying that there is enough memory on corresponding > + * NUMA nodes. > + */ > +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info) > +{ > + unsigned int i; > + libxl_numainfo *ninfo = NULL; > + unsigned long long *claim; > + unsigned int node; > + uint64_t *mems; > + int rc, nr_nodes; > + > + rc = nr_nodes = 0; > + if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL) > + return rc; > + /* > + * Cannot use specified mapping if not NUMA machine > + */ > + ninfo = libxl_get_numainfo(CTX, &nr_nodes); > + if (ninfo == NULL) { > + return rc; > + } > + mems = info->vnuma_memszs; > + claim = calloc(info->nr_vnodes, sizeof(*claim)); > + if (claim == NULL) > + return rc; > + /* Sum memory request on per pnode basis */ > + for (i = 0; i < info->nr_vnodes; i++) > + { > + node = info->vnode_to_pnode[i]; > + /* Correct pnode number? */ > + if (node < nr_nodes) > + claim[node] += (mems[i] << 20); > + else > + goto vmapu; > + } > + for (i = 0; i < nr_nodes; i++) { > + if (claim[i] > ninfo[i].free) > + /* Cannot complete user request, falling to default */ > + goto vmapu; > + } > + rc = 1; > + > + vmapu: > + if(claim) free(claim); > + return rc; > +} >This function looks conceptually right. Despite that, it has comes with coding style, trailing whitespaces, inconsistent (wrt libxl conventions) error and memory allocation/failure handling. :-D :-D> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c > index 87a8110..206f5be 100644> + > +/* > + * Checks for the beginnig and end of RAM in e820 map for domain > + * and aligns start of first and end of last vNUMA memory block to > + * that map. vnode memory size are passed here Megabytes. > + * For PV guest e820 map has fixed hole sizes. 
> + */ > +int libxl__vnuma_align_mem(libxl__gc *gc, > + uint32_t domid, > + libxl_domain_build_info *b_info, /* IN: mem sizes */ > + vmemrange_t *memblks) /* OUT: linux numa blocks in pfn */ > +{ > + int i, j, rc; > + uint64_t next_start_pfn, end_max = 0, size, mem_hole; > + uint32_t nr; > + struct e820entry map[E820MAX]; > + > + if (b_info->nr_vnodes == 0) > + return -EINVAL; > + libxl_ctx *ctx = libxl__gc_owner(gc); > + > + /* retreive e820 map for this host */ > + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); > + > + if (rc < 0) { > + errno = rc; > + return -EINVAL; > + } > + nr = rc; > + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > + (b_info->max_memkb - b_info->target_memkb) + > + b_info->u.pv.slack_memkb); > + if (rc) > + { > + errno = rc; > + return -EINVAL; > + } > + > + /* max pfn for this host */ > + for (j = nr - 1; j >= 0; j--) > + if (map[j].type == E820_RAM) { > + end_max = map[j].addr + map[j].size; > + break; > + } > + > + memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes); > + next_start_pfn = 0; > + > + memblks[0].start = map[0].addr; > + > + for(i = 0; i < b_info->nr_vnodes; i++) { > + /* start can be not zero */ > + memblks[i].start += next_start_pfn; > + memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20); > + > + size = memblks[i].end - memblks[i].start; > + /* > + * For pv host with e820_host option turned on we need > + * to rake into account memory holes. For pv host with > + * e820_host disabled or unset, the map is contiguous > + * RAM region. > + */ > + if (libxl_defbool_val(b_info->u.pv.e820_host)) { > + while (mem_hole = e820_memory_hole_size(memblks[i].start, > + memblks[i].end, map, nr), > + memblks[i].end - memblks[i].start - mem_hole < size) > + { > + memblks[i].end += mem_hole; > + > + if (memblks[i].end > end_max) { > + memblks[i].end = end_max; > + break; > + } > + } > + } > + next_start_pfn = memblks[i].end; > + } > + > + if (memblks[i-1].end > end_max) > + memblks[i-1].end = end_max; > + > + return 0; > +} > +All this code here would require for someone who knows more about e820 to chime in (or for me to learn something more about it, which I definitely want to, but will take a bit). Perhaps you can help me at leas a bit. Can you comment on/describe what is it that you do when you find an hole? It looks from the above that you just need to move the end of the node(s) memory range(s)... Is that the case? For some reason, I was assuming that dealing with holes would have required to add support for more than one memory block per node, but it''s apparently not the case, is it? Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
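As a reference for the conventions Dario points out (libxl error codes, errno-aware logging, gc-backed allocations that never fail from the caller's point of view), a caller of the new libxc function would typically look like the sketch below. The function name is made up and the vmemrange construction is elided; it only illustrates the idiom, not the actual libxl_domain_setvnuma()/libxl__build_pre() code.

/* Sketch of the libxl idioms referred to above: libxc failures are turned
 * into libxl error codes, logging goes through the errno-aware macro, and
 * allocations use the gc wrappers (no NULL check, no explicit free).
 */
static int libxl__vnuma_set_example(libxl__gc *gc, uint32_t domid,
                                    libxl_domain_build_info *info)
{
    libxl_ctx *ctx = libxl__gc_owner(gc);
    vmemrange_t *ranges;
    int rc;

    /* gc-backed allocation: freed together with the gc, never NULL here */
    ranges = libxl__calloc(gc, info->nr_vnodes, sizeof(*ranges));

    /* ... fill ranges from info->vnuma_memszs (elided) ... */

    rc = xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, info->max_vcpus,
                            ranges, info->vdistance, info->vcpu_to_vnode,
                            info->vnode_to_pnode);
    if (rc < 0) {
        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                         "setting vNUMA topology failed");
        return ERROR_FAIL;
    }

    return 0;
}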
Elena Ufimtseva
2013-Nov-19 22:06 UTC
Re: [PATCH v3 2/7] libxc: Plumb Xen with vNUMA topology for domain.
On Tue, Nov 19, 2013 at 9:03 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> [...]
>> >> > + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); >> > + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); >> > + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); >> > + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); >> > + >> ^ and here >> >> > + domctl.cmd = XEN_DOMCTL_setvnumainfo; >> > + domctl.domain = (domid_t)domid; >> > + domctl.u.vnuma.nr_vnodes = nr_vnodes; >> > + domctl.u.vnuma.__pad = 0; >> > + >> ^ here >> > + rc = do_domctl(xch, &domctl); >> > + >> ^ here >> >> > + xc_hypercall_bounce_post(xch, vmemrange); >> > + xc_hypercall_bounce_post(xch, vdistance); >> > + xc_hypercall_bounce_post(xch, vcpu_to_vnode); >> > + xc_hypercall_bounce_post(xch, vnode_to_pnode); >> > + >> ^ and here too. >> >> I double checked that this isn''t an issue with my MUA by importing the >> patch in this e-mail and I confirm it is not, it looks like there are a >> few trailing whitespaces around. >> >> I went checking a couple of the other patches too, and found something >> similar. Can you fix this? >> >> For the Linux patches, checkpatch is usually of great help. For Xen >> ones, I don''t think we have anything like it (yet? :-D). However, for >> this specific problem (trailing whitespaces), even just looking at the >> patch with `git show'' should make the issue evident. For instance, here >> I see them as big red rectangles. >> > > Or in vim you can have > > syn match ErrorLeadSpace /^ \+/ " highlight any leading spaces > syn match ErrorTailSpace / \+$/ " highlight any trailing spaces > syn match Error80 /\%>80v.\+/ " highlight anything past 80 > > hi Error80 ctermbg=red > hi ErrorLeadSpace ctermbg=red > hi ErrorTailSpace ctermbg=redTrue ) checkpatch also works with --types TRAILING_WHITESPACE or others options. Ill clean these.> > > And that will nicely point it to you. > >> Regards, >> Dario >> >> -- >> <<This happens because I choose it to happen!>> (Raistlin Majere) >> ----------------------------------------------------------------- >> Dario Faggioli, Ph.D, http://about.me/dario.faggioli >> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >> > > > >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel >-- Elena
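For reference, the _BOUNCE_IN variant being discussed would only change the last argument of each declaration; whether the crash Elena saw actually rules this out is the open question above. A fragment-only sketch, based on the declarations in the quoted patch:

    /* Sketch: same bounce declarations, but copying toward Xen only.
     * XC_HYPERCALL_BUFFER_BOUNCE_IN copies the caller's buffer into the
     * hypercall buffer before the call and does not copy it back after. */
    DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
    DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) *
                             nr_vnodes * nr_vnodes,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
    DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) *
                             nr_vnodes,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);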
On Tue, Nov 19, 2013 at 9:24 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 8690ec7..d7aa0b3 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -40,6 +40,7 @@
>  #include "libxl_json.h"
>  #include "libxlutil.h"
>  #include "xl.h"
> +#include "libxl_vnuma.h"
>
>  #define CHK_ERRNO( call ) ({ \
>          int chk_errno = (call); \
> @@ -572,6 +573,75 @@ vcpp_out:
>      return rc;
>  }
>
> +static void vnuma_info_release(libxl_domain_build_info *info)
> +{
> +    if ( info->vnuma_memszs ) free(info->vnuma_memszs);
> +    if ( info->vdistance ) free(info->vdistance);
> +    if ( info->vcpu_to_vnode ) free(info->vcpu_to_vnode);
> +    if ( info->vnode_to_pnode ) free(info->vnode_to_pnode);

You don't need to check for NULL before passing a pointer to free. I
just removed most instances of this pattern Xen-wide and I'd hate to
have some reintroduced :)

- Matthew
Elena Ufimtseva
2013-Nov-21 03:20 UTC
Re: [PATCH v3 5/7] libxl: vNUMA configuration parser
On Wed, Nov 20, 2013 at 5:48 PM, Matthew Daley <mattd@bugfuzz.com> wrote:
> On Tue, Nov 19, 2013 at 9:24 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>> index 8690ec7..d7aa0b3 100644
>> --- a/tools/libxl/xl_cmdimpl.c
>> +++ b/tools/libxl/xl_cmdimpl.c
>> @@ -40,6 +40,7 @@
>>  #include "libxl_json.h"
>>  #include "libxlutil.h"
>>  #include "xl.h"
>> +#include "libxl_vnuma.h"
>>
>>  #define CHK_ERRNO( call ) ({ \
>>          int chk_errno = (call); \
>> @@ -572,6 +573,75 @@ vcpp_out:
>>      return rc;
>>  }
>>
>> +static void vnuma_info_release(libxl_domain_build_info *info)
>> +{
>> +    if ( info->vnuma_memszs ) free(info->vnuma_memszs);
>> +    if ( info->vdistance ) free(info->vdistance);
>> +    if ( info->vcpu_to_vnode ) free(info->vcpu_to_vnode);
>> +    if ( info->vnode_to_pnode ) free(info->vnode_to_pnode);
>
> You don't need to check for NULL before passing a pointer to free. I
> just removed most instances of this pattern Xen-wide and I'd hate to
> have some reintroduced :)

Thank you Matthew for pointing this out :)

> - Matthew

-- 
Elena
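For reference, the simplification Matthew is asking for is simply dropping the checks (field names taken from the quoted hunk):

    static void vnuma_info_release(libxl_domain_build_info *info)
    {
        /* free(NULL) is defined to be a no-op, so no NULL checks are needed */
        free(info->vnuma_memszs);
        free(info->vdistance);
        free(info->vcpu_to_vnode);
        free(info->vnode_to_pnode);
    }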
Hi Elena and Dario,

> +    /* Determine the best nodes to fit vNUMA nodes */
> +    /* TODO: change algorithm. The current just fits the nodes
> +     * Will be nice to have them also sorted by size
> +     * If no p-node found, will be set to NUMA_NO_NODE
> +     */
> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
> +    if (claim == NULL)
> +        return rc;
> +
> +    libxl_for_each_set_bit(n, info->nodemap)
> +    {
> +        for (i = 0; i < info->nr_vnodes; i++)
> +        {
> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
> +                /*vnode was not set yet */
> +                (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
> +            {
> +                info->vnode_to_pnode[i] = n;
> +                claim[n] += (mems[i] << 20);
> +            }
> +        }
> +    }
> +

So in the initial part, each vnode is set to EXACTLY ONE pnode, right?

-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science,
Peking University, China

Nothing is impossible because impossible itself says: " I'm possible "
lccycc From PKU
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>

> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index b141877..1cf74ee 100644
> @@ -389,6 +389,24 @@ static void dump_numa(unsigned char key)
>
>      for_each_online_node(i)
>          printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +    if (d->vnuma.nr_vnodes > 0)
> +    {
> +        printk("    Domain has %d vnodes\n", d->vnuma.nr_vnodes);
> +        for (i = 0; i < d->vnuma.nr_vnodes; i++) {
> +            printk("        vnode %d - pnode %d:", i,
>
^ I'd indent this a bit less, but that's certainly a minor thing.

> +                d->vnuma.vnode_to_pnode[i] >= MAX_NUMNODES ? -1 : d->vnuma.vnode_to_pnode[i]);
>
How long is this line above? :-P

> +            printk(" %"PRIu64" MB \n",
> +                (d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20);
> +        }
> +
> +        printk("    Domain vcpu to vnode: \n");
> +        for (i = 0; i < d->max_vcpus; i++)
> +            printk("%s%d %s", i % 8 == 0 ? "    " : "",
> +                d->vnuma.vcpu_to_vnode[i],
> +                (i + 1) % 8 == 0 ? "\n" : "");
>
Do you have an example of the output for this? It looks like you're just
listing the vcpus, although in a specific order... How is one supposed
to tell which set of vcpus belongs to which vnode?

Wouldn't it be clearer to print the list of vcpus belonging to each node
right after having printed its memory (i.e., in the for loop above)?

Also, regarding the '(i + 1) % 8 == 0 ? "\n" : ""', I found it rather
ugly. It may be a matter of personal taste, but I think something along
the lines of:

    if ( !(i % 8) )
        printk("\n");

would be much more readable.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
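Taking the two remarks together, the dump could list the vcpus of each vnode inside the existing per-vnode loop, roughly like the sketch below. This is only an illustration of the suggestion, not code from the series; j and n are extra local variables.

    /* Sketch: inside the existing per-vnode loop, right after the
     * memory size has been printed for vnode i. */
    printk("        vcpus:");
    for ( j = 0, n = 0; j < d->max_vcpus; j++ )
    {
        if ( d->vnuma.vcpu_to_vnode[j] != i )
            continue;
        if ( n && !(n % 8) )          /* break the line every 8 vcpus */
            printk("\n              ");
        printk(" %d", j);
        n++;
    }
    printk("\n");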
On Tue, Nov 19, 2013 at 11:43 AM, Jan Beulich <JBeulich@suse.com> wrote:>>>> On 19.11.13 at 17:36, Dario Faggioli <dario.faggioli@citrix.com> wrote: >> On mar, 2013-11-19 at 15:54 +0000, Jan Beulich wrote: >>> >>> On 19.11.13 at 16:42, Dario Faggioli <dario.faggioli@citrix.com> wrote: >>> > So, what would the best option be? Another hypercall (or a special way >>> > of calling this one) "just" to retrieve the number of vnodes? >>> >>> Iirc there''s a padding field in the interface structure, which could >>> be leveraged. But then again you need two counts, and hence it >>> might be better to simply add two respective fields. Then make >>> it/them IN/OUT, and rather than filling the arrays when they''re >>> too small just send back the necessary values. (And of course >>> you''ll want to also send back the actual values in case the passed >>> in ones turned out to large, so the guest would know how many >>> of the array elements actually have valid data). >>> >>> But in the end the fundamental question stands - how was a PV >>> guest in your so far proposed model supposed to know its number >>> of vNodes? While for HVM guests you can make this available via >>> ACPI, that''s not an option for PV. >>> >> Wait... I''m no longer so sure I''m getting what you say. I''d be inclined >> to say "by the XENMEM_get_vnuma_info hcall implemented here", but then >> again, maybe I''m missing something. >> >> The hypercall does provide a mean for the guest to retrieve _all_ the >> virtual topology information, such as: >> - number of virtual nodes >> - virtual node memory ranges >> - virtual cpu to virtual node mapping >> - virtual node to physical node mapping, for use in (future) in-guest >> vNUMA aware subsystems (e.g., ballooning) >> >> So, if your point is (as I thought) that for properly allocating the >> buffers for this hypercall to work we need an information only provided >> by this hypercall itself, then I agree, and that''s why I asked what >> alternative way would be best to retrieve that bit of information. >> >> If it''s something else, then I don''t know. :-) > > No, it is what you''re naming above. I was merely curious how you > had supposed the guest would know the vNode count prior to this > change request of mine. I didn''t look at the Linux patches yet (due > to lack of time), hence I don''t know how you derived(?) the node > count without it coming back from the hypercall here. > > Jan >Hello Jan and Dario. I have looked at what Jan asked and wanted to see if that can be resolved. Jan is right, if the guest is running with Linux configured with maxcpus less than vcpus in VM config, there is a problem. On this boot stage where xen_numa_init is called xen_smp_prepare_cpus equals to vcpus in config; It only will be reduced to maxcpus (from kernel boot args) after xen_numa_init during xen_smp_prepare. In xen_numa_init I have all values I need to make a decision in regards to initialize vnuma or not, or modify. These are the numbers I have in xen_numa_init: num_possible_cpus() = hypervisor provided guest vcpus; setup_max_cpus = boot kernel param maxcpus; When setup_max_cpus > num_possible_cpus, num_possible_cpus will be brought up; I can detect that setup_max_cpus < num_possible_cpus and do not init vnuma at all, and just do a fake node. I can also make sure that hypervisor is aware of it (by calling same subop with NULL, lets suppose). Then hypervisor have to make some decision regarding vnuma topology for this domain. 
Thus this will be as before, when the guest is not aware of the
underlying NUMA topology. It will have to fix the vcpu_to_vnode mask,
and possibly adjust the pinning of vcpus to cpus. The memory, if
allocated on different nodes, will remain as it is.

Does it sound like a sensible solution? Or maybe you have some other
ideas?

-- 
Elena
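For the record, the check Elena describes would boil down to something like this inside the guest's xen_numa_init(). This is only a sketch, assuming xen_numa_init() follows the usual x86 convention of returning an error so that the generic NUMA code falls back to a single fake node; setup_max_cpus and num_possible_cpus() are the quantities named above.

    /* Sketch: early in xen_numa_init(), skip vNUMA setup when the
     * kernel will not bring up all of the vcpus the toolstack
     * configured for the guest. */
    if (setup_max_cpus < num_possible_cpus()) {
        pr_info("xen: maxcpus=%u < %u possible vcpus, not enabling vNUMA\n",
                setup_max_cpus, num_possible_cpus());
        return -EINVAL;   /* generic code then falls back to one node */
    }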
Elena Ufimtseva
2013-Nov-26 22:14 UTC
Re: [PATCH v3 4/7] libxl: vNUMA supporting interface
On Thu, Nov 21, 2013 at 4:59 AM, Li Yechen <lccycc123@gmail.com> wrote:
> Hi Elena and Dario,
>> +    /* Determine the best nodes to fit vNUMA nodes */
>> +    /* TODO: change algorithm. The current just fits the nodes
>> +     * Will be nice to have them also sorted by size
>> +     * If no p-node found, will be set to NUMA_NO_NODE
>> +     */
>> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
>> +    if (claim == NULL)
>> +        return rc;
>> +
>> +    libxl_for_each_set_bit(n, info->nodemap)
>> +    {
>> +        for (i = 0; i < info->nr_vnodes; i++)
>> +        {
>> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
>> +                /*vnode was not set yet */
>> +                (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
>> +            {
>> +                info->vnode_to_pnode[i] = n;
>> +                claim[n] += (mems[i] << 20);
>> +            }
>> +        }
>> +    }
>> +
>
> So in the initial part, each vnode is set to EXACTLY ONE pnode, right?

Hi Yechen

Yes, one node or VNUMA_NO_NODE (any node);

-- 
Elena
On mar, 2013-11-26 at 17:14 -0500, Elena Ufimtseva wrote:
> On Thu, Nov 21, 2013 at 4:59 AM, Li Yechen <lccycc123@gmail.com> wrote:
> > Hi Elena and Dario,
> >> +    /* Determine the best nodes to fit vNUMA nodes */
> >> +    /* TODO: change algorithm. The current just fits the nodes
> >> +     * Will be nice to have them also sorted by size
> >> +     * If no p-node found, will be set to NUMA_NO_NODE
> >> +     */
> >> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
> >> +    if (claim == NULL)
> >> +        return rc;
> >> +
> >> +    libxl_for_each_set_bit(n, info->nodemap)
> >> +    {
> >> +        for (i = 0; i < info->nr_vnodes; i++)
> >> +        {
> >> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
> >> +                /*vnode was not set yet */
> >> +                (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
> >> +            {
> >> +                info->vnode_to_pnode[i] = n;
> >> +                claim[n] += (mems[i] << 20);
> >> +            }
> >> +        }
> >> +    }
> >> +
> >
> > So in the initial part, each vnode is set to EXACTLY ONE pnode, right?
>
> Hi Yechen
>
> Yes, one node or VNUMA_NO_NODE (any node);
>
Yes but, at the same time, this particular chunk of code is most likely
going to change, in order to better integrate with automatic NUMA
placement.

In fact, if I'm reading this right (as I said when reviewing this
patch), this code kind of tries to compact more vnodes into the same
pnode, if they fit. Although I see the reason why you may want to do
that, I'm still of the opinion that, if info->nodemap carries some
information coming from the placement algorithm, we should just honor
it... That thing (the placement algorithm) is too complicated to just
throw away its results! :-P

I'm sorry I haven't gotten around to writing some code for this yet; I
hope to find some time to do that soon enough.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
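For what it's worth, "just honoring" the placement result could be as simple as spreading the vnodes across the pnodes selected in info->nodemap. Below is a rough sketch of one such policy (plain round-robin), offered only as an illustration and not necessarily what Dario has in mind:

    /* Sketch: assign vnodes to the pnodes chosen by automatic placement
     * (the set bits of info->nodemap), round-robin, instead of trying
     * to compact them into fewer pnodes. */
    int n, v = 0, nr_set = libxl_bitmap_count_set(&info->nodemap);

    if (!nr_set)
        return ERROR_INVAL;

    libxl_for_each_set_bit(n, info->nodemap) {
        int i;
        /* vnode v, v + nr_set, v + 2*nr_set, ... go to pnode n */
        for (i = v; i < info->nr_vnodes; i += nr_set)
            info->vnode_to_pnode[i] = n;
        v++;
    }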
On mar, 2013-11-26 at 16:59 -0500, Elena Ufimtseva wrote:
> Hello Jan and Dario.
>
Hi,

> I have looked at what Jan asked and wanted to see if that can be resolved.
>
> Jan is right, if the guest is running with Linux configured with
> maxcpus less than vcpus in VM config,
> there is a problem.
>
Yep, there is indeed.

> On this boot stage where xen_numa_init is called xen_smp_prepare_cpus
> equals to vcpus in config;
> It only will be reduced to maxcpus (from kernel boot args) after
> xen_numa_init during xen_smp_prepare.
>
> In xen_numa_init I have all values I need to make a decision in
> regards to initialize vnuma or not, or modify.
>
> These are the numbers I have in xen_numa_init:
> num_possible_cpus() = hypervisor provided guest vcpus;
> setup_max_cpus = boot kernel param maxcpus;
>
> When setup_max_cpus > num_possible_cpus, num_possible_cpus will be brought up;
>
Wait. I think I see what you mean, but that's probably a different
issue from what (at least as far as I understood it) Jan is reporting.

So, in Linux you can specify NR_CPUS at compile time, right? Imagine
doing that and setting it to 1. Then, you use the resulting kernel for
a VM, and in the VM's config file you specify 4 vcpus and 2 vnodes.

Now, apart from what you say above, i.e., only num_possible_cpus()=1
will be brought up, that would also mean that, in xen_numa_init(), you
allocate an array only 1 element big for hosting vnuma data and pass it
to Xen, which will overflow it when trying to put info for 2 vnodes
there! That's the thing we want to avoid...

> I can detect that setup_max_cpus < num_possible_cpus and do not init
> vnuma at all, and just do a fake node.
> I can also make sure that hypervisor is aware of it (by calling same
> subop with NULL, lets suppose).
>
Right, but I'm not sure I'm getting how detecting that would be
relevant to the issue above... What we'd need is something like
nr_vnodes<num_possible_cpus(), which means we need some mechanism to
retrieve the number of nodes _before_ allocating the arrays.

As for what to do, well, once we have a way to know the number of
vnodes, you can just allocate arrays of the correct size and avoid this
issue altogether. Of course, as we do not support guests with more
nodes than vcpus, at that point you can also check whether
nr_vnodes>num_possible_cpus() and, if yes, have xen_numa_init() take
some error path, but that is a different thing.

Does this make sense?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
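In other words, the guest-side flow being asked for is a two-step query, roughly like the sketch below. The structure and field names here are illustrative assumptions, not necessarily those used by the actual patches.

    /* Sketch: learn nr_vnodes first, then size the buffers for the
     * real topology query.  All structure/field names are illustrative. */
    struct vnuma_topology_info info = { .domid = DOMID_SELF };
    int rc;

    /* Step 1: a probe call that only fills in the counts. */
    rc = HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &info);
    if (rc || info.nr_vnodes == 0 || info.nr_vnodes > num_possible_cpus())
        return -EINVAL;                 /* no usable vNUMA topology */

    /* Step 2: allocate correctly sized arrays and repeat the call with
     * the array handles filled in. */
    vdistance = kcalloc(info.nr_vnodes * info.nr_vnodes,
                        sizeof(*vdistance), GFP_KERNEL);
    vmemrange = kcalloc(info.nr_vnodes, sizeof(*vmemrange), GFP_KERNEL);
    vcpu_to_vnode = kcalloc(num_possible_cpus(),
                            sizeof(*vcpu_to_vnode), GFP_KERNEL);
    /* ... then issue the full XENMEM_get_vnuma_info call. */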
>>> On 26.11.13 at 22:59, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> Jan is right, if the guest is running with Linux configured with
> maxcpus less than vcpus in VM config,
> there is a problem.
>
> On this boot stage where xen_numa_init is called xen_smp_prepare_cpus
> equals to vcpus in config;
> It only will be reduced to maxcpus (from kernel boot args) after
> xen_numa_init during xen_smp_prepare.
>
> In xen_numa_init I have all values I need to make a decision in
> regards to initialize vnuma or not, or modify.
>
> These are the numbers I have in xen_numa_init:
> num_possible_cpus() = hypervisor provided guest vcpus;
> setup_max_cpus = boot kernel param maxcpus;
>
> When setup_max_cpus > num_possible_cpus, num_possible_cpus will be brought up;
>
> I can detect that setup_max_cpus < num_possible_cpus and do not init
> vnuma at all, and just do a fake node.
> I can also make sure that hypervisor is aware of it (by calling same
> subop with NULL, lets suppose).
>
> Then hypervisor have to make some decision regarding vnuma topology
> for this domain. Thus this will be as before, when
> guest is not aware of underlying NUMA. It will have to fix
> vcpu_to_vnode mask, and possibly adjust pinned vcpus to cpus.
> The memory, if allocated on different nodes, will remain as it is.
>
> Does it sound like sensible solution? Or maybe you have some other ideas?

Imo there's nothing the hypervisor should do in reaction to the guest
not utilizing all of its assigned vCPU-s - it can't really know whether
they're just not being brought up at boot, but may get started later.

All we need here is a way for the guest to learn its number of virtual
nodes in order to sensibly use the hypercall to obtain its virtual
topology information.

Jan
On Wed, Nov 27, 2013 at 3:14 AM, Jan Beulich <JBeulich@suse.com> wrote:>>>> On 26.11.13 at 22:59, Elena Ufimtseva <ufimtseva@gmail.com> wrote: >> Jan is right, if the guest is running with Linux configured with >> maxcpus less than vcpus in VM config, >> there is a problem. >> >> On this boot stage where xen_numa_init is called xen_smp_prepare_cpus >> equals to vcpus in config; >> It only will be reduced to maxcpus (from kernel boot args) after >> xen_numa_init during xen_smp_prepare. >> >> In xen_numa_init I have all values I need to make a decision in >> regards to initialize vnuma or not, or modify. >> >> These are the numbers I have in xen_numa_init: >> num_possible_cpus() = hypervisor provided guest vcpus; >> setup_max_cpus = boot kernel param maxcpus; >> >> When setup_max_cpus > num_possible_cpus, num_possible_cpus will bebrought>> up; >> >> I can detect that setup_max_cpus < num_possible_cpus and do not init >> vnuma at all, and just do a fake node. >> I can also make sure that hypervisor is aware of it (by calling same >> subop with NULL, lets suppose). >> >> Then hypervisor have to make some decision regarding vnuma topology >> for this domain. Thus this will be as before, when >> guest is not aware of underlying NUMA. It will have to fix >> vcpu_to_vnode mask, and possibly adjust pinned vcpus to cpus. >> The memory, if allocated on different nodes, will remain as it is. >> >> Does it sound like sensible solution? Or maybe you have some other ideas? > > Imo there''s nothing the hypervisor should do in reaction to the > guest not utilizing all of its assigned vCPU-s - it can''t really know > whether they''re just not being brought up at boot, but may get > started later. > > All we need here is a way for the guest to learn its number of > virtual nodes in order to sensibly use the hypercall to obtain its > virtual topology information. >I worked on this and now I have another subop hypercall to retreive number of nodes before doing any allocations. Not sure the second number you were talking about, but looks like number of nodes is enough to make correct allocations.> Jan >-- Elena _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 02.12.13 at 18:06, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> On Wed, Nov 27, 2013 at 3:14 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> All we need here is a way for the guest to learn its number of
>> virtual nodes in order to sensibly use the hypercall to obtain its
>> virtual topology information.
>
> I worked on this and now I have another subop hypercall to retreive number
> of nodes
> before doing any allocations.

Good, albeit from the previous code I doubt you really need a new
sub-op.

Jan
On Mon, Dec 2, 2013 at 12:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 02.12.13 at 18:06, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>> On Wed, Nov 27, 2013 at 3:14 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>> All we need here is a way for the guest to learn its number of
>>> virtual nodes in order to sensibly use the hypercall to obtain its
>>> virtual topology information.
>>
>> I worked on this and now I have another subop hypercall to retreive number
>> of nodes
>> before doing any allocations.
>
> Good, albeit from the previous code I doubt you really need a new
> sub-op.

True, I did not want to obscure the code, but it can be done as one
sub-op.

> Jan

-- 
Elena
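Conceptually, the alternative Jan sketched earlier (keeping a single sub-op and making the counts IN/OUT) would amount to something like this on the hypervisor side. This is only a sketch, with made-up field names and an arbitrary choice of error code:

    /* Sketch: if the guest's buffers are too small, report the real
     * counts back instead of filling the arrays. */
    if ( topo.nr_vnodes < d->vnuma.nr_vnodes ||
         topo.nr_vcpus < d->max_vcpus )
    {
        topo.nr_vnodes = d->vnuma.nr_vnodes;
        topo.nr_vcpus = d->max_vcpus;
        rc = __copy_to_guest(arg, &topo, 1) ? -EFAULT : -ENOBUFS;
        goto out;
    }
    /* ... otherwise copy the arrays and write back the actual counts. */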
Elena Ufimtseva
2013-Dec-02 18:14 UTC
Re: [PATCH v3 4/7] libxl: vNUMA supporting interface
On Tue, Nov 19, 2013 at 1:37 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote: >> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> >> >> Changes since v2: >> * Added vnuma memory pfn alignment which takes into >> account e820_host option and non-contiguous e820 memory map >> in that case; >> > Wow, cool that you got this one working too! ;-P > >> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c >> index 0de1112..2bd3653 100644 > >> +int libxl_domain_setvnuma(libxl_ctx *ctx, >> + uint32_t domid, >> + uint16_t nr_vnodes, >> + uint16_t nr_vcpus, >> + vmemrange_t *vmemrange, >> + unsigned int *vdistance, >> + unsigned int *vcpu_to_vnode, >> + unsigned int *vnode_to_pnode) >> +{ >> + GC_INIT(ctx); >> > Do you need this? I don''t think so: you don''t need a gc, and ctx appears > to be enough for you. > >> + int ret; >> + ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, >> + nr_vcpus, vmemrange, >> + vdistance, >> + vcpu_to_vnode, >> + vnode_to_pnode); >> + GC_FREE; >> + return ret; >> +} >> + > libxl function should always return libxl error codes, such as > ERROR_FAIL, ERROR_INVAL, ecc. It''s nice to show what actually went wrong > in the xc_* call, but that''s usually done by: > > rc = xc_call(); > if (rc < 0) { > LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "error xxx"); > return ERROR_FAIL; > } > > On the grounds that libxc would properly set errno and return -1 on > failure. > > >> diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h >> index abe6685..b95abab 100644 > >> int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, >> uint32_t domid); >> >> +int libxl__vnuma_align_mem(libxl__gc *gc, >> + uint32_t domid, >> + struct libxl_domain_build_info *b_info, >> + vmemrange_t *memblks); >> + >> +unsigned long e820_memory_hole_size(unsigned long start, >> + unsigned long end, >> + struct e820entry e820[], >> + int nr); >> + >> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, >> + libxl_domain_build_info *info); >> + >> > I think we''ll better have a libxl maintainer comment on this but, FWIW, > I don''t think I''d put these functions in libxl_arch.h. > > We do have a libxl_numa.c file, for host their implementation (rather > than in libxl_dom.c) and, as libxl hidden functions, their prototype is > probably fine in libxl_internal.h (unless there is some specific reason > for having them here which I''m overlooking). > > As per e820_memory_hole_size(), it looks like it''s only called from > inside e820_memory_hole_size(), which means it can be just a static > (still inside libxl_numa.c). > >> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c >> index a1c16b0..378249e 100644 > >> +/* prepares vnode to pnode map for domain vNUMA memory allocation */ >> +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, >> + libxl_domain_build_info *info) >> +{ >> + int i, n, nr_nodes, rc; >> > Please, initialize nr_nodes to 0. Given how libxl_get_numainfo() works, > that''s the best and easiest way for being sure we won''t leak any of the > memory it deals with. > >> + uint64_t *mems; >> + unsigned long long *claim = NULL; >> + libxl_numainfo *ninfo = NULL; >> + >> + rc = -1; >> > As said above, use libxl error codes. > >> + if (info->vnode_to_pnode == NULL) { >> + info->vnode_to_pnode = calloc(info->nr_vnodes, >> + sizeof(*info->vnode_to_pnode)); >> + if (info->vnode_to_pnode == NULL) >> + return rc; >> + } >> + > Also, in libxl, memory allocation never fails. 
:-) What this really mean > is that such kind of failure is handled for you by the library itself, > if you use the proper wrappers (and doing so is not optional, is the way > you deal with memory in libxl). > > For this specific example, look for libxl__calloc() and some usage > examples of it (you will find there never is any error handling code). > >> + /* default setting */ >> + for (i = 0; i < info->nr_vnodes; i++) >> + info->vnode_to_pnode[i] = VNUMA_NO_NODE; >> + >> + /* Get NUMA info */ >> + ninfo = libxl_get_numainfo(CTX, &nr_nodes); >> + if (ninfo == NULL || nr_nodes == 0) { >> + LOG(DEBUG, "No HW NUMA found.\n"); >> + rc = 0; >> + goto vnmapout; >> + } >> + > I''m not so sure about this. What do you mean with "No HW NUMA found"? > IIRC, on a non-NUMA box, libxl_get_numainfo returns something. I just > checked on my (non-NUMA) laptop, and I do see a node 0 with all the > memory and 10 as distance to itself. > > So, if what you get is ninfo=NULL and nr_nodes=0, it means something > went wrong, and we shouldn''t return success. OTOH, if you''re interested > to detect the case where the call succeeded, and there really is only > one node, you should probably check for nr_nodes == 1.If thats the case and there will be at least one node present, then I dont care about this )> >> + /* >> + * check if we have any hardware NUMA nodes selected, >> + * otherwise VNUMA_NO_NODE set and used default allocation >> + */ >> + if (libxl_bitmap_is_empty(&info->nodemap)) >> + return 0; >> > Why _is_empty()? I mean, under what circumstances info->nodemap is > empty? > > I''m asking because info->nodemap is initialized as full in > libxl__domain_build_info_setdefault() and can only change in > libxl__build_pre() in a way that makes it really unlikely for it to > become empty... so... > >> + mems = info->vnuma_memszs; >> + >> + /* check if all vnodes will fit in one node */ >> + libxl_for_each_set_bit(n, info->nodemap) { >> + if(ninfo[n].free/1024 >= info->max_memkb && >> + libxl_bitmap_test(&info->nodemap, n)) >> + { >> + /* >> + * all domain v-nodes will fit one p-node n, >> + * p-node n is a best candidate selected by automatic >> + * NUMA placement. >> + */ >> + for (i = 0; i < info->nr_vnodes; i++) >> + info->vnode_to_pnode[i] = n; >> + /* we can exit, as we are happy with placement */ >> + return 0; >> + } >> + } >> > Wait, why do we want to do that? If info->nodemap is not fully set > (which BTW is the actual situation used to exemplify "auto", or "do this > for me", rather than it being empty) it already contains some > instructions that, I think, everyone in the downstream chain --be it > other parts of libxl, libxc and Xen-- should follow. > > What I mean is, if info->nodemap has two bits sets, that means the > domain should be "created on" the (two) nodes corresponding to the bits > themselves. OTOH, if info->nodemap has only one bit set, then the whole > domain should stay there and in that case there isn''t much choice, is > it? (basically, your "libxl_for_each_set_bit()" will only take one > step.) > > Here, it looks like, no matter what you we have in info->nodemap, you > look for one of the nodes that can contain the whole domain, which > doesn''t look right.> > Below... >> + /* Determine the best nodes to fit vNUMA nodes */ >> + /* TODO: change algorithm. 
The current just fits the nodes >> + * Will be nice to have them also sorted by size >> + * If no p-node found, will be set to NUMA_NO_NODE >> + */ >> + claim = calloc(info->nr_vnodes, sizeof(*claim)); >> + if (claim == NULL) >> + return rc; >> + >> + libxl_for_each_set_bit(n, info->nodemap) >> + { >> + for (i = 0; i < info->nr_vnodes; i++) >> + { >> + if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) && >> + /*vnode was not set yet */ >> + (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) ) >> + { >> + info->vnode_to_pnode[i] = n; >> + claim[n] += (mems[i] << 20); >> + } >> + } >> + } >> + > ...you''re trying to do something similar, i.e., pack the domain within a > subset of what you find in info->cpumap instead of just use it. > > Anyway, I know that automatic placement code may look tricky... Let me > see if I can hack something together quickly, to show what I mean and > what I''d have put here instead of all this. :-) > >> int libxl__build_pre(libxl__gc *gc, uint32_t domid, >> libxl_domain_config *d_config, libxl__domain_build_state *state) >> { >> @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, >> if (rc) >> return rc; >> } >> + >> + if (info->nr_vnodes > 0) { >> + /* The memory blocks will be formed here from sizes */ >> + struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes, >> + sizeof(*memrange)); >> + if (memrange == NULL) { >> + LOG(DETAIL, "Failed to allocate memory for memory ranges.\n"); >> + return ERROR_FAIL; >> + } >> > AhA, here you''re using the correct memory allocation wrapper, so you can > ditch the error handling. > >> + if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) { >> + LOG(DETAIL, "Failed to align memory map.\n"); >> + return ERROR_FAIL; >> + } >> + >> + /* >> + * If vNUMA vnode_to_pnode map defined, determine if we >> + * can disable automatic numa placement and place vnodes >> + * on specified pnodes. >> + * For now, if vcpu affinity specified, we will use >> + * specified vnode to pnode map. >> + */ >> + if (libxl__vnodemap_is_usable(gc, info) == 1) { >> + > Extra blank line. >> + /* Will use user-defined vnode to pnode mask */ >> + > Here too, and with spurious spaces. > >> + libxl_defbool_set(&info->numa_placement, false); >> > Setting info->numa_placement to false only makes sense if you do it > _before_ the following block of code: > > if (libxl_defbool_val(info->numa_placement)) { > > if (!libxl_bitmap_is_full(&info->cpumap)) { > LOG(ERROR, "Can run NUMA placement only if no vcpu " > "affinity is specified"); > return ERROR_INVAL; > } > > rc = numa_place_domain(gc, domid, info); > if (rc) > return rc; > } > > There isn''t any other place where that flag is checked and, if you get > past this, automatic placement either already happened (if > numa_placement was true and the other condition is met), or it never > will. > > The right thing to do would probably be having > libxl__vnodemap_is_usable() called *before* (even immediately above is > fine) we get here, and have _it_ disable info->numa_placement, if that > reveals to be the case. 
> > At this point, most of the code below (until ''***'') can just live inside > the above "if (libxl_defbool_val(info->numa_placement))" above, without > the need of introducing another identical block.Yes, I worked on this piece and looks now as it does the right thing.> >> + } >> + else { >> + LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n"); >> + if (libxl_defbool_val(info->vnuma_placement)) { >> + >> + LOG(DETAIL, "Switching to automatic vnuma to pnuma placement\n."); >> + /* Construct the vnode to pnode mapping if possible */ >> + if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) { >> + LOG(DETAIL, "Failed to call init_vnodemap\n"); >> + /* vnuma_nodemap will not be used if nr_vnodes == 0 */ >> + return ERROR_FAIL; >> + } >> + } >> + else { >> + LOG(ERROR, "The vnodes cannot be mapped to pnodes this way\n."); >> + info->nr_vnodes = 0; >> + return ERROR_FAIL; >> + } >> + } >> + /* plumb domain with vNUMA topology */ >> + if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, >> + info->max_vcpus, memrange, >> + info->vdistance, info->vcpu_to_vnode, >> + info->vnode_to_pnode) < 0) { >> + >> + LOG(DETAIL, "Failed to set vnuma topology for domain from\n."); >> + info->nr_vnodes = 0; >> + return ERROR_FAIL; >> + } >> + } >> + else >> + LOG(DEBUG, "Will not construct vNUMA topology.\n"); >> + > *** <-- until here > >> +/* Checks if vnuma_nodemap defined in info can be used >> + * for allocation of vnodes on physical NUMA nodes by >> + * verifying that there is enough memory on corresponding >> + * NUMA nodes. >> + */ >> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info) >> +{ >> + unsigned int i; >> + libxl_numainfo *ninfo = NULL; >> + unsigned long long *claim; >> + unsigned int node; >> + uint64_t *mems; >> + int rc, nr_nodes; >> + >> + rc = nr_nodes = 0; >> + if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL) >> + return rc; >> + /* >> + * Cannot use specified mapping if not NUMA machine >> + */ >> + ninfo = libxl_get_numainfo(CTX, &nr_nodes); >> + if (ninfo == NULL) { >> + return rc; >> + } >> + mems = info->vnuma_memszs; >> + claim = calloc(info->nr_vnodes, sizeof(*claim)); >> + if (claim == NULL) >> + return rc; >> + /* Sum memory request on per pnode basis */ >> + for (i = 0; i < info->nr_vnodes; i++) >> + { >> + node = info->vnode_to_pnode[i]; >> + /* Correct pnode number? */ >> + if (node < nr_nodes) >> + claim[node] += (mems[i] << 20); >> + else >> + goto vmapu; >> + } >> + for (i = 0; i < nr_nodes; i++) { >> + if (claim[i] > ninfo[i].free) >> + /* Cannot complete user request, falling to default */ >> + goto vmapu; >> + } >> + rc = 1; >> + >> + vmapu: >> + if(claim) free(claim); >> + return rc; >> +} >> > This function looks conceptually right. Despite that, it has comes with > coding style, trailing whitespaces, inconsistent (wrt libxl conventions) > error and memory allocation/failure handling. :-D :-D > >> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c >> index 87a8110..206f5be 100644 > >> + >> +/* >> + * Checks for the beginnig and end of RAM in e820 map for domain >> + * and aligns start of first and end of last vNUMA memory block to >> + * that map. vnode memory size are passed here Megabytes. >> + * For PV guest e820 map has fixed hole sizes. 
>> + */ >> +int libxl__vnuma_align_mem(libxl__gc *gc, >> + uint32_t domid, >> + libxl_domain_build_info *b_info, /* IN: mem sizes */ >> + vmemrange_t *memblks) /* OUT: linux numa blocks in pfn */ >> +{ >> + int i, j, rc; >> + uint64_t next_start_pfn, end_max = 0, size, mem_hole; >> + uint32_t nr; >> + struct e820entry map[E820MAX]; >> + >> + if (b_info->nr_vnodes == 0) >> + return -EINVAL; >> + libxl_ctx *ctx = libxl__gc_owner(gc); >> + >> + /* retreive e820 map for this host */ >> + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); >> + >> + if (rc < 0) { >> + errno = rc; >> + return -EINVAL; >> + } >> + nr = rc; >> + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >> + (b_info->max_memkb - b_info->target_memkb) + >> + b_info->u.pv.slack_memkb); >> + if (rc) >> + { >> + errno = rc; >> + return -EINVAL; >> + } >> + >> + /* max pfn for this host */ >> + for (j = nr - 1; j >= 0; j--) >> + if (map[j].type == E820_RAM) { >> + end_max = map[j].addr + map[j].size; >> + break; >> + } >> + >> + memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes); >> + next_start_pfn = 0; >> + >> + memblks[0].start = map[0].addr; >> + >> + for(i = 0; i < b_info->nr_vnodes; i++) { >> + /* start can be not zero */ >> + memblks[i].start += next_start_pfn; >> + memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20); >> + >> + size = memblks[i].end - memblks[i].start; >> + /* >> + * For pv host with e820_host option turned on we need >> + * to rake into account memory holes. For pv host with >> + * e820_host disabled or unset, the map is contiguous >> + * RAM region. >> + */ >> + if (libxl_defbool_val(b_info->u.pv.e820_host)) { >> + while (mem_hole = e820_memory_hole_size(memblks[i].start, >> + memblks[i].end, map, nr), >> + memblks[i].end - memblks[i].start - mem_hole < size) >> + { >> + memblks[i].end += mem_hole; >> + >> + if (memblks[i].end > end_max) { >> + memblks[i].end = end_max; >> + break; >> + } >> + } >> + } >> + next_start_pfn = memblks[i].end; >> + } >> + >> + if (memblks[i-1].end > end_max) >> + memblks[i-1].end = end_max; >> + >> + return 0; >> +} >> + > All this code here would require for someone who knows more about e820 > to chime in (or for me to learn something more about it, which I > definitely want to, but will take a bit). > > Perhaps you can help me at leas a bit. Can you comment on/describe what > is it that you do when you find an hole? > > It looks from the above that you just need to move the end of the > node(s) memory range(s)... Is that the case? For some reason, I was > assuming that dealing with holes would have required to add support for > more than one memory block per node, but it''s apparently not the case, > is it?The code looks right now a bit different as I found some problems there. But here is the general idea. When e820_host is not set, then pv guest has fixed e820 map and I dont care about the holes much since its always looks like this: [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable Means, first 4KB (0x0000 - 0x0fff) and 384K gap between 0xa0000 and 0xfffff will be reserved. 
Since these pfns will never appear in the page allocations, and the
beginning and end of the memory blocks I construct are exactly as in the
guest, in this case I form the memory blocks based only on the sizes of
the vnodes.

In case e820_host is set to 1, I have to account for memory holes and
build correct memory ranges.

As for memory blocks: a second memory block of a certain type (memory,
in this case) would be needed if a node spanned non-contiguous memory
ranges. For example, if node 0 had 0x000000 - 0x100000 and
0x400000 - 0x800000, it would look like this (hypothetical addresses
here :)):

node 0, memblock 0: 0x000000 - 0x0fffff
node 1, memblock 0: 0x100000 - 0x3fffff
node 0, memblock 1: 0x400000 - 0x7fffff

If the memory blocks can be merged to form a contiguous block, that will
be done and there will be one block per node. So at this point I do not
see a need for two memory blocks per node; gaps in the e820 are
accounted for in Linux in the memory range list of the corresponding
memory block.

Elena

> Regards,
> Dario

-- 
Elena
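To make the e820_host=1 case concrete, here is a small worked trace of the extension loop from the patch, using the fixed PV layout quoted above. The numbers are illustrative and assume the hole helper reports the legacy 384 KiB gap as the only hole below 1 GiB.

    /*
     * Say vnode 0 is 1 GiB, starting at 0, and the only hole below
     * 1 GiB is the legacy gap at 0xa0000-0xfffff (0x60000 bytes):
     *
     *   size                   = 0x40000000
     *   start, end             = 0x0, 0x40000000
     *   mem_hole               = 0x60000
     *   end - start - mem_hole = 0x3ffa0000 < size  =>  end += mem_hole
     *   end                    = 0x40060000
     *
     * On the next iteration the hole in [0, 0x40060000) is still
     * 0x60000, so end - start - mem_hole == size and the loop stops.
     * vnode 1 then starts at 0x40060000.  A hole therefore only pushes
     * the end of the affected vnode (and the start of the following
     * ones) outwards; no second per-node memory block is introduced by
     * this path.
     */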