vNUMA introduction
This series of patches introduces vNUMA topology awareness and
provides interfaces and data structures to enable vNUMA for
PV guests. There is a plan to extend this support for dom0 and
HVM domains.
The PV guest kernel must support vNUMA topology; the corresponding
patches should be applied.
Introduction
-------------
vNUMA topology is exposed to the PV guest to improve performance when running
workloads on NUMA machines.
The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on
NUMA/UMA machines and to map the vNUMA topology to physical NUMA nodes in an
optimal way.
XEN vNUMA support
The current set of patches introduces a subop hypercall that is available to
enlightened PV guests with the vNUMA patches applied.
The domain structure was modified to reflect the per-domain vNUMA topology for
use in other vNUMA-aware subsystems (e.g. ballooning).
libxc
libxc provides interfaces to build PV guests with vNUMA support and, on NUMA
machines, performs the initial memory allocation on physical NUMA nodes. This is
implemented by using the nodemap formed by automatic NUMA placement. Details
are in patch #3.
libxl
libxl provides a way to predefine the vNUMA topology in the VM config: number
of vnodes, memory arrangement, vcpu to vnode assignment, and distance map.
PV guest
As of now, only PV guests can take advantage of vNUMA functionality. The vNUMA
Linux patches should be applied and NUMA support should be compiled into the
kernel.
This patchset can be pulled from https://git.gitorious.org/xenvnuma/xenvnuma.git
The Linux patchset can be pulled from https://git.gitorious.org/xenvnuma/linuxvnuma.git
Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:
1. Automatic vNUMA placement on real NUMA machine:
VM config:
memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]
Xen:
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN) Node 0: 2097152
(XEN) Node 1: 2097152
(XEN) Domain has 4 vnodes
(XEN) vnode 0 - pnode 0 (4096) MB
(XEN) vnode 1 - pnode 0 (4096) MB
(XEN) vnode 2 - pnode 1 (4096) MB
(XEN) vnode 3 - pnode 1 (4096) MB
(XEN) Domain vcpu to vnode:
(XEN) 0 1 2 3
dmesg on pv guest:
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xffffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1ffffffff]
[ 0.000000] node 2: [mem 0x200000000-0x2ffffffff]
[ 0.000000] node 3: [mem 0x300000000-0x3ffffffff]
[ 0.000000] On node 0 totalpages: 1048479
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 1044480 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 2 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 3 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched to apic NOOP
[ 0.000000] nr_irqs_gsi: 16
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[ 0.000000] e820: cannot find a gap in the 32bit address range
[ 0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[ 0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 4.4-unstable (preserve-AD)
[ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4
[ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152
[ 0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152
[ 0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3
pv guest: numactl --hardware:
root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 20
2: 20 20 10 20
3: 20 20 20 10
Comments:
None of the configuration options are correct, so default values were used.
Since the machine is a NUMA machine and no vcpu pinning is defined, the
automatic NUMA node selection mechanism is used, and you can see how the vnodes
were split across physical nodes.
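The fallback visible in the output above can be sketched as follows. This is
only an illustration, not the libxl code: the function name default_vnuma is
hypothetical, and the rules it applies (equal memory split, vcpus assigned
round-robin, distance 10 to the same node and 20 to any other node) are
assumptions inferred from the Xen and numactl output of this example.

```python
def default_vnuma(memory_mb, vcpus, vnodes):
    """Hypothetical fallback topology, inferred from the example output."""
    # Split memory evenly; any remainder goes to the last vnode (assumption).
    base = memory_mb // vnodes
    vnumamem = [base] * vnodes
    vnumamem[-1] += memory_mb - base * vnodes
    # Assign vcpus to vnodes round-robin (assumption).
    vcpu_to_vnode = [vcpu % vnodes for vcpu in range(vcpus)]
    # Distance 10 to the same node, 20 to every other node.
    distance = [[10 if i == j else 20 for j in range(vnodes)]
                for i in range(vnodes)]
    return vnumamem, vcpu_to_vnode, distance


mem, cpumap, dist = default_vnuma(memory_mb=16384, vcpus=4, vnodes=4)
print(mem)      # [4096, 4096, 4096, 4096]
print(cpumap)   # [0, 1, 2, 3]
print(dist[0])  # [10, 20, 20, 20]
```

With memory = 16384, vcpus = 4 and vnodes = 4 this reproduces the 4096 MB
vnodes, the "0 1 2 3" vcpu to vnode line and the 10/20 distance table shown
above.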
2. vNUMA enabled guest, no default values, real NUMA machine
Config:
memory = 4096
vcpus = 4
name = "rc9"
vnodes = 2
vnumamem = [2048, 2048]
vdistance = [10, 40, 40, 10]
vnuma_vcpumap = [1, 0, 1, 0]
vnuma_vnodemap = [1, 0]
Xen:
(XEN) 'u' pressed -> dumping numa info (now-0xA86:BD6C8829)
(XEN) idx0 -> NODE0 start->0 size->4521984 free->131471
(XEN) phys_to_nid(0000000000001000) -> 0 should be 0
(XEN) idx1 -> NODE1 start->4521984 size->4194304 free->341610
(XEN) phys_to_nid(0000000450001000) -> 1 should be 1
(XEN) CPU0 -> NODE0
(XEN) CPU1 -> NODE0
(XEN) CPU2 -> NODE0
(XEN) CPU3 -> NODE0
(XEN) CPU4 -> NODE1
(XEN) CPU5 -> NODE1
(XEN) CPU6 -> NODE1
(XEN) CPU7 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 6 (total: 1048576):
(XEN) Node 0: 524288
(XEN) Node 1: 524288
(XEN) Domain has 2 vnodes
(XEN) vnode 0 - pnode 1 (2048) MB
(XEN) vnode 1 - pnode 0 (2048) MB
(XEN) Domain vcpu to vnode:
(XEN) 1 0 1 0
pv guest dmesg:
[ 0.000000] NUMA: Initialized distance table, cnt=2
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] NODE_DATA [mem 0x7ffd9000-0x7fffffff]
[ 0.000000] Initmem setup node 1 [mem 0x80000000-0xffffffff]
[ 0.000000] NODE_DATA [mem 0xff7f8000-0xff81efff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0x7fffffff]
[ 0.000000] node 1: [mem 0x80000000-0xffffffff]
[ 0.000000] On node 0 totalpages: 524191
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 7112 pages used for memmap
[ 0.000000] DMA32 zone: 520192 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 524288
[ 0.000000] DMA32 zone: 7168 pages used for memmap
[ 0.000000] DMA32 zone: 524288 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched to apic NOOP
[ 0.000000] nr_irqs_gsi: 16
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[ 0.000000] e820: cannot find a gap in the 32bit address range
[ 0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[ 0.000000] e820: [mem 0x100100000-0x1004fffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 4.4-unstable (preserve-AD)
[ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:2
[ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007fc00000 s85376 r8192 d21120 u1048576
[ 0.000000] pcpu-alloc: s85376 r8192 d21120 u1048576 alloc=1*2097152
[ 0.000000] pcpu-alloc: [0] 0 2 [1] 1 3
pv guest:
root@heatpipe:~# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 1 3
node 0 size: 2011 MB
node 0 free: 1975 MB
node 1 cpus: 0 2
node 1 size: 2003 MB
node 1 free: 1983 MB
node distances:
node 0 1
0: 10 40
1: 40 10
In this case every config option is correct, and the guest sees exactly the
vNUMA topology given in the VM config file.
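To make the correspondence between the config and the guest output easier to
check, here is a small sketch of how the flat lists from this example can be
read. The helper names are hypothetical and are not part of libxl or xl; the
row-major layout of the distance list is an assumption (the example matrix is
symmetric, so it cannot confirm it), and the page count assumes 4 KiB pages.

```python
def cpus_per_vnode(vcpumap):
    """Invert a vcpu -> vnode list into a vnode -> [vcpus] mapping."""
    nodes = {}
    for vcpu, vnode in enumerate(vcpumap):
        nodes.setdefault(vnode, []).append(vcpu)
    return dict(sorted(nodes.items()))


def distance_matrix(flat, vnodes):
    """Reshape a flat distance list into a vnodes x vnodes table.

    Row-major order is an assumption of this sketch.
    """
    if len(flat) != vnodes * vnodes:
        raise ValueError("expected vnodes * vnodes distance entries")
    return [flat[r * vnodes:(r + 1) * vnodes] for r in range(vnodes)]


def mb_to_pages(mb, page_kib=4):
    """Convert a size in MB to a page count (4 KiB pages assumed)."""
    return mb * 1024 // page_kib


print(cpus_per_vnode([1, 0, 1, 0]))          # {0: [1, 3], 1: [0, 2]}
print(distance_matrix([10, 40, 40, 10], 2))  # [[10, 40], [40, 10]]
print(mb_to_pages(2048))                     # 524288
```

The results line up with the output above: node 0 holds cpus 1 and 3, the
distance table is 10/40, and each 2048 MB vnode accounts for the 524288 pages
reported per physical node.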
Notes:
* to enable vNUMA in the Linux kernel, the corresponding patch set should be
applied;
* the automatic NUMA balancing feature seems to be fixed in the Linux kernel:
https://lkml.org/lkml/2013/7/31/647
TODO:
* This version limits the vdistance config option to only two values: the
same-node distance and the other-node distance. This prevents oopses on the
latest (3.13-rc1) Linux kernel with non-symmetric distances;
* CPU siblings for the Linux machine and the Xen cpu trap should be detected
and a warning should be given; add a cpuid check if set in the VM config;
* benchmarking;
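As an illustration of the two-value vdistance restriction from the first TODO
item, the following sketch checks whether a distance table uses a single
same-node value and a single other-node value. The function name is
hypothetical and the check is not taken from the patches; it only shows what
such a table looks like.

```python
def is_two_value_distance(matrix):
    """Return True if the table has one same-node and one other-node value."""
    n = len(matrix)
    # The table must be square.
    if any(len(row) != n for row in matrix):
        return False
    same = {matrix[i][i] for i in range(n)}
    other = {matrix[i][j] for i in range(n) for j in range(n) if i != j}
    # A single-node table has no other-node entries at all.
    return len(same) == 1 and len(other) <= 1


print(is_two_value_distance([[10, 40], [40, 10]]))  # True
print(is_two_value_distance([[10, 20], [30, 10]]))  # False
```

A table that passes this check is symmetric by construction, which is the
property the restriction relies on.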
Elena Ufimtseva (7):
xen: vNUMA support for guests.
libxc: Plumb Xen with vNUMA topology for domain.
libxc: vnodes allocation on NUMA nodes.
libxl: vNUMA supporting interface.
libxl: vNUMA configuration parser
xen: adds vNUMA info debug-key u
xl: docs for xl config vnuma options
docs/man/xl.cfg.pod.5 | 55 +++++++++
tools/libxc/xc_dom.h | 10 ++
tools/libxc/xc_dom_x86.c | 85 ++++++++++++--
tools/libxc/xc_domain.c | 61 ++++++++++
tools/libxc/xenctrl.h | 9 ++
tools/libxc/xg_private.h | 1 +
tools/libxl/libxl.c | 20 ++++
tools/libxl/libxl.h | 20 ++++
tools/libxl/libxl_arch.h | 8 ++
tools/libxl/libxl_dom.c | 189 ++++++++++++++++++++++++++++-
tools/libxl/libxl_internal.h | 3 +
tools/libxl/libxl_types.idl | 5 +-
tools/libxl/libxl_vnuma.h | 7 ++
tools/libxl/libxl_x86.c | 58 +++++++++
tools/libxl/xl_cmdimpl.c | 268 +++++++++++++++++++++++++++++++++++++++++-
xen/arch/x86/numa.c | 19 +++
xen/common/domain.c | 10 ++
xen/common/domctl.c | 82 +++++++++++++
xen/common/memory.c | 36 ++++++
xen/include/public/domctl.h | 24 ++++
xen/include/public/memory.h | 8 ++
xen/include/public/vnuma.h | 44 +++++++
xen/include/xen/domain.h | 10 ++
xen/include/xen/sched.h | 1 +
24 files changed, 1020 insertions(+), 13 deletions(-)
create mode 100644 tools/libxl/libxl_vnuma.h
create mode 100644 xen/include/public/vnuma.h
--
1.7.10.4