vNUMA introduction

This series of patches introduces vNUMA topology awareness and provides interfaces and data structures to enable vNUMA for PV guests. There is a plan to extend this support to dom0 and HVM domains.

vNUMA topology support must also be present in the PV guest kernel; the corresponding Linux patches should be applied.

Introduction
-------------

vNUMA topology is exposed to the PV guest to improve performance when running workloads on NUMA machines. The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on NUMA/UMA machines and to map the vNUMA topology onto physical NUMA nodes in an optimal way.

XEN vNUMA support

The current set of patches introduces a subop hypercall that is available to enlightened PV guests with the vNUMA patches applied. The domain structure was modified to keep the per-domain vNUMA topology for use by other vNUMA-aware subsystems (e.g. ballooning).

libxc

libxc provides interfaces to build PV guests with vNUMA support and, on NUMA machines, performs the initial memory allocation on physical NUMA nodes. This is implemented by utilizing the nodemap formed by automatic NUMA placement. Details are in patch #3.

libxl

libxl provides a way to predefine the vNUMA topology in the VM config: number of vnodes, memory arrangement, vcpu-to-vnode assignment and the distance map.

PV guest

As of now, only PV guests can take advantage of the vNUMA functionality. The vNUMA Linux patches should be applied and NUMA support should be compiled into the kernel.

This patchset can be pulled from
https://git.gitorious.org/xenvnuma/xenvnuma.git:v3
Linux patchset
https://git.gitorious.org/xenvnuma/linuxvnuma.git:v2

Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:

1. Automatic vNUMA placement on a real NUMA machine:

VM config:

memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]

Xen:

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN) Node 0: 2097152
(XEN) Node 1: 2097152
(XEN) Domain has 4 vnodes
(XEN) vnode 0 - pnode 0 (4096) MB
(XEN) vnode 1 - pnode 0 (4096) MB
(XEN) vnode 2 - pnode 1 (4096) MB
(XEN) vnode 3 - pnode 1 (4096) MB
(XEN) Domain vcpu to vnode:
(XEN) 0 1 2 3

dmesg on pv guest:

[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xffffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1ffffffff]
[ 0.000000] node 2: [mem 0x200000000-0x2ffffffff]
[ 0.000000] node 3: [mem 0x300000000-0x3ffffffff]
[ 0.000000] On node 0 totalpages: 1048479
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 1044480 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 2 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 3 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched
to apic NOOP [ 0.000000] nr_irqs_gsi: 16 [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff] [ 0.000000] e820: cannot find a gap in the 32bit address range [ 0.000000] e820: PCI devices with unassigned 32bit BARs may break! [ 0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices [ 0.000000] Booting paravirtualized kernel on Xen [ 0.000000] Xen version: 4.4-unstable (preserve-AD) [ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4 [ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152 [ 0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152 [ 0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3 pv guest: numactl --hardware: root@heatpipe:~# numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 node 0 size: 4031 MB node 0 free: 3997 MB node 1 cpus: 1 node 1 size: 4039 MB node 1 free: 4022 MB node 2 cpus: 2 node 2 size: 4039 MB node 2 free: 4023 MB node 3 cpus: 3 node 3 size: 3975 MB node 3 free: 3963 MB node distances: node 0 1 2 3 0: 10 20 20 20 1: 20 10 20 20 2: 20 20 10 20 3: 20 20 20 10 Comments: None of the configuration options are correct so default values were used. Since machine is NUMA machine and there is no vcpu pinning defines, NUMA automatic node selection mechanism is used and you can see how vnodes were split across physical nodes. 2. vNUMA enabled guest, no default values, real NUMA machine Config: memory = 4096 vcpus = 4 name = "rc9" vnodes = 2 vnumamem = [2048, 2048] vdistance = [10, 40, 40, 10] vnuma_vcpumap = [1, 0, 1, 0] vnuma_vnodemap = [1, 0] Xen: (XEN) ''u'' pressed -> dumping numa info (now-0xA86:BD6C8829) (XEN) idx0 -> NODE0 start->0 size->4521984 free->131471 (XEN) phys_to_nid(0000000000001000) -> 0 should be 0 (XEN) idx1 -> NODE1 start->4521984 size->4194304 free->341610 (XEN) phys_to_nid(0000000450001000) -> 1 should be 1 (XEN) CPU0 -> NODE0 (XEN) CPU1 -> NODE0 (XEN) CPU2 -> NODE0 (XEN) CPU3 -> NODE0 (XEN) CPU4 -> NODE1 (XEN) CPU5 -> NODE1 (XEN) CPU6 -> NODE1 (XEN) CPU7 -> NODE1 (XEN) Memory location of each domain: (XEN) Domain 0 (total: 2569511): (XEN) Node 0: 1416166 (XEN) Node 1: 1153345 (XEN) Domain 6 (total: 1048576): (XEN) Node 0: 524288 (XEN) Node 1: 524288 (XEN) Domain has 2 vnodes (XEN) vnode 0 - pnode 1 (2048) MB (XEN) vnode 1 - pnode 0 (2048) MB (XEN) Domain vcpu to vnode: (XEN) 1 0 1 0 pv guest dmesg: [ 0.000000] NUMA: Initialized distance table, cnt=2 [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff] [ 0.000000] NODE_DATA [mem 0x7ffd9000-0x7fffffff] [ 0.000000] Initmem setup node 1 [mem 0x80000000-0xffffffff] [ 0.000000] NODE_DATA [mem 0xff7f8000-0xff81efff] [ 0.000000] Zone ranges: [ 0.000000] DMA [mem 0x00001000-0x00ffffff] [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] [ 0.000000] Normal empty [ 0.000000] Movable zone start for each node [ 0.000000] Early memory node ranges [ 0.000000] node 0: [mem 0x00001000-0x0009ffff] [ 0.000000] node 0: [mem 0x00100000-0x7fffffff] [ 0.000000] node 1: [mem 0x80000000-0xffffffff] [ 0.000000] On node 0 totalpages: 524191 [ 0.000000] DMA zone: 56 pages used for memmap [ 0.000000] DMA zone: 21 pages reserved [ 0.000000] DMA zone: 3999 pages, LIFO batch:0 [ 0.000000] DMA32 zone: 7112 pages used for memmap [ 0.000000] DMA32 zone: 520192 pages, LIFO batch:31 [ 0.000000] On node 1 totalpages: 524288 [ 0.000000] DMA32 zone: 7168 pages used for memmap [ 0.000000] DMA32 zone: 524288 pages, LIFO batch:31 [ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org [ 0.000000] smpboot: 
Allowing 4 CPUs, 0 hotplug CPUs [ 0.000000] No local APIC present [ 0.000000] APIC: disable apic facility [ 0.000000] APIC: switched to apic NOOP [ 0.000000] nr_irqs_gsi: 16 [ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff] [ 0.000000] e820: cannot find a gap in the 32bit address range [ 0.000000] e820: PCI devices with unassigned 32bit BARs may break! [ 0.000000] e820: [mem 0x100100000-0x1004fffff] available for PCI devices [ 0.000000] Booting paravirtualized kernel on Xen [ 0.000000] Xen version: 4.4-unstable (preserve-AD) [ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:2 [ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007fc00000 s85376 r8192 d21120 u1048576 [ 0.000000] pcpu-alloc: s85376 r8192 d21120 u1048576 alloc=1*2097152 [ 0.000000] pcpu-alloc: [0] 0 2 [1] 1 3 pv guest: root@heatpipe:~# numactl --ha available: 2 nodes (0-1) node 0 cpus: 1 3 node 0 size: 2011 MB node 0 free: 1975 MB node 1 cpus: 0 2 node 1 size: 2003 MB node 1 free: 1983 MB node distances: node 0 1 0: 10 40 1: 40 10 root@heatpipe:~# numactl --ha available: 2 nodes (0-1) node 0 cpus: 1 3 node 0 size: 2011 MB node 0 free: 1975 MB node 1 cpus: 0 2 node 1 size: 2003 MB node 1 free: 1983 MB node distances: node 0 1 0: 10 40 1: 40 10 In this case every config option is correct and we have exact vNUMA topology as it in VN config file. Example with e820_host = 1 (32GB real NUMA machines, two nodes). PV config: memory = 18096 vcpus = 6 name = "rc9" vnodes = 4 #vnumamem = [1024, 1024, 1024, 1024] vdistance = [10, 40] vnuma_vcpumap = [0, 1, 1, 3, 3, 2] vnuma_vnodemap = [1, 0] vnuma_autoplacement = 1 e820_host = 1 pv guest boot: (d29) [ 0.000000] Command line: root=/dev/xvda1 ro earlyprintk=xen debug loglevel=8 debug print_fatal_signals=1 loglvl=all guest_l (d29) oglvl=all LOGLEVEL=8 earlyprintk=xen sched_debug numa_balancing=on (d29) [ 0.000000] ACPI in unprivileged domain disabled (d29) [ 0.000000] Freeing ac228-100000 pfn range: 343512 pages freed (d29) [ 0.000000] Released 343512 pages of unused memory (d29) [ 0.000000] Set 343512 page(s) to 1-1 mapping (d29) [ 0.000000] Populating 46b000-4bedd8 pfn range: 343512 pages added (d29) [ 0.000000] e820: BIOS-provided physical RAM map: (d29) [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable (d29) [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved (d29) [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000ac227fff] usable (d29) [ 0.000000] Xen: [mem 0x00000000ac228000-0x00000000ac26bfff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac26c000-0x00000000ac57ffff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac580000-0x00000000ac5a0fff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac5a1000-0x00000000ac5bbfff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac5bc000-0x00000000ac5bdfff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac5be000-0x00000000ac5befff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac5bf000-0x00000000ac5cafff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac5cb000-0x00000000ac5d9fff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac5da000-0x00000000ac5fafff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ac5fb000-0x00000000ac6b6fff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac6b7000-0x00000000ac7fafff] ACPI NVS (d29) [ 0.000000] Xen: [mem 0x00000000ac7fb000-0x00000000ac80efff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000ac80f000-0x00000000ac80ffff] ACPI data (d29) [ 0.000000] Xen: [mem 0x00000000ac810000-0x00000000ac810fff] unusable (d29) [ 0.000000] Xen: 
[mem 0x00000000ac811000-0x00000000ac812fff] ACPI data (d29) [ 0.000000] Xen: [mem 0x00000000ac813000-0x00000000ad7fffff] unusable (d29) [ 0.000000] Xen: [mem 0x00000000b0000000-0x00000000b3ffffff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000fed20000-0x00000000fed3ffff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000fed50000-0x00000000fed8ffff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000fee00000-0x00000000feefffff] reserved (d29) [ 0.000000] Xen: [mem 0x00000000ffa00000-0x00000000ffa3ffff] reserved (d29) [ 0.000000] Xen: [mem 0x0000000100000000-0x00000004bedd7fff] usable (d29) [ 0.000000] bootconsole [xenboot0] enabled (d29) [ 0.000000] NX (Execute Disable) protection: active (d29) [ 0.000000] DMI not present or invalid. (d29) [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved (d29) [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable (d29) [ 0.000000] No AGP bridge found (d29) [ 0.000000] e820: last_pfn = 0x4bedd8 max_arch_pfn = 0x400000000 (d29) [ 0.000000] e820: last_pfn = 0xac228 max_arch_pfn = 0x400000000 (d29) [ 0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576 (d29) [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff] (d29) [ 0.000000] [mem 0x00000000-0x000fffff] page 4k (d29) [ 0.000000] init_memory_mapping: [mem 0x4bea00000-0x4bebfffff] (d29) [ 0.000000] [mem 0x4bea00000-0x4bebfffff] page 4k (d29) [ 0.000000] BRK [0x019d8000, 0x019d8fff] PGTABLE (d29) [ 0.000000] BRK [0x019d9000, 0x019d9fff] PGTABLE (d29) [ 0.000000] init_memory_mapping: [mem 0x4bc000000-0x4be9fffff] (d29) [ 0.000000] [mem 0x4bc000000-0x4be9fffff] page 4k (d29) [ 0.000000] BRK [0x019da000, 0x019dafff] PGTABLE (d29) [ 0.000000] BRK [0x019db000, 0x019dbfff] PGTABLE (d29) [ 0.000000] BRK [0x019dc000, 0x019dcfff] PGTABLE (d29) [ 0.000000] BRK [0x019dd000, 0x019ddfff] PGTABLE (d29) [ 0.000000] init_memory_mapping: [mem 0x480000000-0x4bbffffff] (d29) [ 0.000000] [mem 0x480000000-0x4bbffffff] page 4k (d29) [ 0.000000] init_memory_mapping: [mem 0x00100000-0xac227fff] (d29) [ 0.000000] [mem 0x00100000-0xac227fff] page 4k (d29) [ 0.000000] init_memory_mapping: [mem 0x100000000-0x47fffffff] (d29) [ 0.000000] [mem 0x100000000-0x47fffffff] page 4k (d29) [ 0.000000] init_memory_mapping: [mem 0x4bec00000-0x4bedd7fff] (d29) [ 0.000000] [mem 0x4bec00000-0x4bedd7fff] page 4k (d29) [ 0.000000] RAMDISK: [mem 0x01dc9000-0x03470fff] (d29) [ 0.000000] NUMA: Initialized distance table, cnt=4 (d29) [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x16e9d7fff] (d29) [ 0.000000] NODE_DATA [mem 0x16e9b1000-0x16e9d7fff] (d29) [ 0.000000] Initmem setup node 1 [mem 0x16f000000-0x2895d7fff] (d29) [ 0.000000] NODE_DATA [mem 0x2895b1000-0x2895d7fff] (d29) [ 0.000000] Initmem setup node 2 [mem 0x289800000-0x3a41d7fff] (d29) [ 0.000000] NODE_DATA [mem 0x3a41b1000-0x3a41d7fff] (d29) [ 0.000000] Initmem setup node 3 [mem 0x3a4800000-0x4bedd7fff] (d29) [ 0.000000] NODE_DATA [mem 0x4bedad000-0x4bedd3fff] (d29) [ 0.000000] Zone ranges: (d29) [ 0.000000] DMA [mem 0x00001000-0x00ffffff] (d29) [ 0.000000] DMA32 [mem 0x01000000-0xffffffff] (d29) [ 0.000000] Normal [mem 0x100000000-0x4bedd7fff] (d29) [ 0.000000] Movable zone start for each node (d29) [ 0.000000] Early memory node ranges (d29) [ 0.000000] node 0: [mem 0x00001000-0x0009ffff] (d29) [ 0.000000] node 0: [mem 0x00100000-0xac227fff] (d29) [ 0.000000] node 0: [mem 0x100000000-0x16e9d7fff] (d29) [ 0.000000] node 1: [mem 0x16e9d8000-0x2895d7fff] (d29) [ 0.000000] node 2: [mem 0x2895d8000-0x3a41d7fff] (d29) [ 0.000000] node 3: [mem 
0x3a41d8000-0x4bedd7fff] (d29) [ 0.000000] On node 0 totalpages: 1158047 (d29) [ 0.000000] DMA zone: 56 pages used for memmap (d29) [ 0.000000] DMA zone: 21 pages reserved (d29) [ 0.000000] DMA zone: 3999 pages, LIFO batch:0 (d29) [ 0.000000] DMA32 zone: 9584 pages used for memmap (d29) [ 0.000000] DMA32 zone: 700968 pages, LIFO batch:31 (d29) [ 0.000000] Normal zone: 6195 pages used for memmap (d29) [ 0.000000] Normal zone: 453080 pages, LIFO batch:31 (d29) [ 0.000000] On node 1 totalpages: 1158144 (d29) [ 0.000000] Normal zone: 15834 pages used for memmap (d29) [ 0.000000] Normal zone: 1158144 pages, LIFO batch:31 (d29) [ 0.000000] On node 2 totalpages: 1158144 (d29) [ 0.000000] Normal zone: 15834 pages used for memmap (d29) [ 0.000000] Normal zone: 1158144 pages, LIFO batch:31 (d29) [ 0.000000] On node 3 totalpages: 1158144 (d29) [ 0.000000] Normal zone: 15834 pages used for memmap (d29) [ 0.000000] Normal zone: 1158144 pages, LIFO batch:31 (d29) [ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org (d29) [ 0.000000] smpboot: Allowing 6 CPUs, 0 hotplug CPUs (d29) [ 0.000000] nr_irqs_gsi: 16 numactl --ha: root@heatpipe:~# numactl --ha available: 4 nodes (0-3) node 0 cpus: 0 node 0 size: 4451 MB node 0 free: 4417 MB node 1 cpus: 1 2 node 1 size: 4460 MB node 1 free: 4425 MB node 2 cpus: 5 node 2 size: 4462 MB node 2 free: 4444 MB node 3 cpus: 3 4 node 3 size: 4391 MB node 3 free: 4362 MB node distances: node 0 1 2 3 0: 10 40 40 40 1: 40 10 40 40 2: 40 40 10 40 3: 40 40 40 10 Xen: xl debug-keys u (XEN) Memory location of each domain: (XEN) Domain 0 (total: 2656043): (XEN) Node 0: 1453063 (XEN) Node 1: 1202980 (XEN) Domain 29 (total: 4632576): (XEN) Node 0: 2144532 (XEN) Node 1: 2488044 (XEN) Domain has 4 vnodes (XEN) vnode 0 - pnode 0: 5865 MB (XEN) vnode 1 - pnode 0: 4524 MB (XEN) vnode 2 - pnode 1: 4524 MB (XEN) vnode 3 - pnode 1: 4524 MB (XEN) Domain vcpu to vnode: (XEN) 0 1 1 3 3 2 Notes: * to enable vNUMA in linux kernel the corresponding patch set should be applied; * automatic numa balancing featurue seem to be fixed in linux kernel: https://lkml.org/lkml/2013/10/31/133 TODO: * This version limits vdistance config option to only two values - same node distance and other node distance; This prevents oopses on latest (3.13-rc1) linux kernel with non-symmetric distance; * cpu siblings for Linux machine and xen cpu trap should be detected and warning should be given; Add cpuid check if set in VM config; * benchmarking; Elena Ufimtseva (7): xen: vNUMA support for PV guests libxc: Plumb Xen with vNUMA topology for domain. libxc: vnodes allocation on NUMA nodes. 
libxl: vNUMA supporting interface libxl: vNUMA configuration parser xen: adds vNUMA info debug-key u xl: docs for xl config vnuma options docs/man/xl.cfg.pod.5 | 60 +++++++++ tools/libxc/xc_dom.h | 10 ++ tools/libxc/xc_dom_x86.c | 81 +++++++++-- tools/libxc/xc_domain.c | 63 +++++++++ tools/libxc/xenctrl.h | 9 ++ tools/libxc/xg_private.h | 1 + tools/libxl/libxl.c | 20 +++ tools/libxl/libxl.h | 20 +++ tools/libxl/libxl_arch.h | 13 ++ tools/libxl/libxl_dom.c | 210 +++++++++++++++++++++++++++++ tools/libxl/libxl_internal.h | 4 + tools/libxl/libxl_types.idl | 6 +- tools/libxl/libxl_vnuma.h | 12 ++ tools/libxl/libxl_x86.c | 123 +++++++++++++++++ tools/libxl/xl_cmdimpl.c | 306 ++++++++++++++++++++++++++++++++++++++++++ xen/arch/x86/numa.c | 18 +++ xen/common/domain.c | 10 ++ xen/common/domctl.c | 79 +++++++++++ xen/common/memory.c | 50 +++++++ xen/include/public/domctl.h | 29 ++++ xen/include/public/memory.h | 8 ++ xen/include/public/vnuma.h | 56 ++++++++ xen/include/xen/domain.h | 9 ++ xen/include/xen/sched.h | 1 + 24 files changed, 1188 insertions(+), 10 deletions(-) create mode 100644 tools/libxl/libxl_vnuma.h create mode 100644 xen/include/public/vnuma.h -- 1.7.10.4
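For quick reference, below is an illustrative guest config fragment that pulls together the vNUMA options this series parses (the parser is patch 5/7, the xl.cfg documentation is patch 7/7). The values are made up and only show the expected shape of each list; see the worked examples above for real runs.

vnodes = 2                    # number of vNUMA nodes
vnumamem = [2048, 2048]       # per-vnode memory in MB; must sum to "memory"
vdistance = [10, 20]          # minimal form: same-node and other-node distance
vnuma_vcpumap = [0, 0, 1, 1]  # vcpu i runs on vnode vnuma_vcpumap[i]
vnuma_vnodemap = [0, 1]       # vnode i is allocated on pnode vnuma_vnodemap[i]
vnuma_autoplacement = 1       # fall back to automatic placement if the map is unusable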
Defines interface, structures and hypercalls for toolstack to build vnuma topology and for guests that wish to retreive it. Two subop hypercalls introduced by patch: XEN_DOMCTL_setvnumainfo to define vNUMA domain topology per domain and XENMEM_get_vnuma_info to retreive that topology by guest. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> Changes since v2: * same size of struct vnuma_topology_info on 32 and 64 pv guest; Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- xen/common/domain.c | 10 ++++++ xen/common/domctl.c | 79 +++++++++++++++++++++++++++++++++++++++++++ xen/common/memory.c | 50 +++++++++++++++++++++++++++ xen/include/public/domctl.h | 29 ++++++++++++++++ xen/include/public/memory.h | 8 +++++ xen/include/public/vnuma.h | 56 ++++++++++++++++++++++++++++++ xen/include/xen/domain.h | 9 +++++ xen/include/xen/sched.h | 1 + 8 files changed, 242 insertions(+) create mode 100644 xen/include/public/vnuma.h diff --git a/xen/common/domain.c b/xen/common/domain.c index 2cbc489..8f5c665 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -511,6 +511,15 @@ int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d) return 0; } +static void vnuma_destroy(struct vnuma_info *vnuma) +{ + vnuma->nr_vnodes = 0; + xfree(vnuma->vmemrange); + xfree(vnuma->vcpu_to_vnode); + xfree(vnuma->vdistance); + xfree(vnuma->vnode_to_pnode); +} + int domain_kill(struct domain *d) { int rc = 0; @@ -531,6 +540,7 @@ int domain_kill(struct domain *d) tmem_destroy(d->tmem); domain_set_outstanding_pages(d, 0); d->tmem = NULL; + vnuma_destroy(&d->vnuma); /* fallthrough */ case DOMDYING_dying: rc = domain_relinquish_resources(d); diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 904d27b..af6ae9b 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -29,6 +29,7 @@ #include <asm/page.h> #include <public/domctl.h> #include <xsm/xsm.h> +#include <public/vnuma.h> static DEFINE_SPINLOCK(domctl_lock); DEFINE_SPINLOCK(vcpu_alloc_lock); @@ -889,6 +890,84 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) } break; + case XEN_DOMCTL_setvnumainfo: + { + unsigned int dist_size, nr_vnodes; + + ret = -EINVAL; + + /* + * If number of vnodes was set before, + * dont initilize it again. 
+ */ + //ASSERT(d->vnuma.nr_vnodes > 0); + if ( d->vnuma.nr_vnodes > 0 ) + break; + + nr_vnodes = op->u.vnuma.nr_vnodes; + if ( nr_vnodes == 0 ) + break; + if ( nr_vnodes > (UINT_MAX / nr_vnodes) ) + break; + + ret = -EFAULT; + if ( guest_handle_is_null(op->u.vnuma.vdistance) || + guest_handle_is_null(op->u.vnuma.vmemrange) || + guest_handle_is_null(op->u.vnuma.vcpu_to_vnode) || + guest_handle_is_null(op->u.vnuma.vnode_to_pnode) ) + goto setvnumainfo_out; + + dist_size = nr_vnodes * nr_vnodes; + + d->vnuma.vdistance = xmalloc_array(unsigned int, dist_size); + d->vnuma.vmemrange = xmalloc_array(vmemrange_t, nr_vnodes); + d->vnuma.vcpu_to_vnode = xmalloc_array(unsigned int, d->max_vcpus); + d->vnuma.vnode_to_pnode = xmalloc_array(unsigned int, nr_vnodes); + + if ( d->vnuma.vdistance == NULL || + d->vnuma.vmemrange == NULL || + d->vnuma.vcpu_to_vnode == NULL || + d->vnuma.vnode_to_pnode == NULL ) + { + ret = -ENOMEM; + goto setvnumainfo_out; + } + + printk(KERN_INFO "Ready to copy from guest\n"); + if ( unlikely(copy_from_guest(d->vnuma.vdistance, + op->u.vnuma.vdistance, + dist_size)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vmemrange, + op->u.vnuma.vmemrange, + nr_vnodes)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vcpu_to_vnode, + op->u.vnuma.vcpu_to_vnode, + d->max_vcpus)) ) + goto setvnumainfo_out; + if ( unlikely(copy_from_guest(d->vnuma.vnode_to_pnode, + op->u.vnuma.vnode_to_pnode, + nr_vnodes)) ) + goto setvnumainfo_out; + + /* Everything is good, lets set the number of vnodes */ + d->vnuma.nr_vnodes = nr_vnodes; + + ret = 0; + + setvnumainfo_out: + if ( ret != 0 ) + { + d->vnuma.nr_vnodes = 0; + xfree(d->vnuma.vdistance); + xfree(d->vnuma.vmemrange); + xfree(d->vnuma.vcpu_to_vnode); + xfree(d->vnuma.vnode_to_pnode); + } + } + break; + default: ret = arch_do_domctl(op, d, u_domctl); break; diff --git a/xen/common/memory.c b/xen/common/memory.c index 50b740f..b224c16 100644 --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -28,6 +28,7 @@ #include <public/memory.h> #include <xsm/xsm.h> #include <xen/trace.h> +#include <public/vnuma.h> struct memop_args { /* INPUT */ @@ -733,6 +734,55 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg) break; + case XENMEM_get_vnuma_info: + { + struct vnuma_topology_info mtopology; + struct domain *d; + + if ( copy_from_guest(&mtopology, arg, 1) ) + return -EFAULT; + if ( (d = rcu_lock_domain_by_any_id(mtopology.domid)) == NULL ) + return -ESRCH; + + if ( (d->vnuma.nr_vnodes == 0) || + (d->vnuma.nr_vnodes > d->max_vcpus) ) + { + rc = -EOPNOTSUPP; + goto vnumainfo_out; + } + + rc = -EFAULT; + if ( guest_handle_is_null(mtopology.vmemrange.h) || + guest_handle_is_null(mtopology.vdistance.h) || + guest_handle_is_null(mtopology.vcpu_to_vnode.h)|| + guest_handle_is_null(mtopology.nr_vnodes.h) ) + { + goto vnumainfo_out; + } + + if ( __copy_to_guest(mtopology.vmemrange.h, + d->vnuma.vmemrange, + d->vnuma.nr_vnodes) != 0 ) + goto vnumainfo_out; + if ( __copy_to_guest(mtopology.vdistance.h, + d->vnuma.vdistance, + d->vnuma.nr_vnodes * d->vnuma.nr_vnodes) != 0 ) + goto vnumainfo_out; + if ( __copy_to_guest(mtopology.vcpu_to_vnode.h, + d->vnuma.vcpu_to_vnode, + d->max_vcpus) != 0 ) + goto vnumainfo_out; + + if ( __copy_to_guest(mtopology.nr_vnodes.h, &d->vnuma.nr_vnodes, 1) != 0 ) + goto vnumainfo_out; + + rc = 0; + + vnumainfo_out: + rcu_unlock_domain(d); + break; + } + default: rc = arch_memory_op(op, arg); break; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h 
index 01a3652..e6417d4 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -35,6 +35,7 @@ #include "xen.h" #include "grant_table.h" #include "hvm/save.h" +#include "vnuma.h" #define XEN_DOMCTL_INTERFACE_VERSION 0x00000009 @@ -869,6 +870,32 @@ struct xen_domctl_set_max_evtchn { typedef struct xen_domctl_set_max_evtchn xen_domctl_set_max_evtchn_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_max_evtchn_t); +/* + * XEN_DOMCTL_setvnumainfo: sets the vNUMA topology + * parameters from toolstack. + */ +struct xen_domctl_vnuma { + uint32_t nr_vnodes; + uint32_t __pad; + XEN_GUEST_HANDLE_64(uint) vdistance; + XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode; + /* + * vnodes to physical NUMA nodes mask. + * This will be kept on per-domain basis + * for requests by consumers as vnuma + * aware ballooning. + */ + XEN_GUEST_HANDLE_64(uint) vnode_to_pnode; + /* + * memory rages that vNUMA node can represent + * If more than one, its a linked list. + */ + XEN_GUEST_HANDLE_64(vmemrange_t) vmemrange; +}; + +typedef struct xen_domctl_vnuma xen_domctl_vnuma_t; +DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t); + struct xen_domctl { uint32_t cmd; #define XEN_DOMCTL_createdomain 1 @@ -938,6 +965,7 @@ struct xen_domctl { #define XEN_DOMCTL_setnodeaffinity 68 #define XEN_DOMCTL_getnodeaffinity 69 #define XEN_DOMCTL_set_max_evtchn 70 +#define XEN_DOMCTL_setvnumainfo 71 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 @@ -998,6 +1026,7 @@ struct xen_domctl { struct xen_domctl_set_broken_page_p2m set_broken_page_p2m; struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu; struct xen_domctl_gdbsx_domstatus gdbsx_domstatus; + struct xen_domctl_vnuma vnuma; uint8_t pad[128]; } u; }; diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h index 7a26dee..80bb0d8 100644 --- a/xen/include/public/memory.h +++ b/xen/include/public/memory.h @@ -339,6 +339,14 @@ struct xen_pod_target { }; typedef struct xen_pod_target xen_pod_target_t; +/* + * XENMEM_get_vnuma_info used by caller to retrieve + * vNUMA topology constructed for particular domain. + * + * The data exchanged is presented by vnuma_topology_info. + */ +#define XENMEM_get_vnuma_info 25 + #if defined(__XEN__) || defined(__XEN_TOOLS__) #ifndef uint64_aligned_t diff --git a/xen/include/public/vnuma.h b/xen/include/public/vnuma.h new file mode 100644 index 0000000..9238889 --- /dev/null +++ b/xen/include/public/vnuma.h @@ -0,0 +1,56 @@ +#ifndef _XEN_PUBLIC_VNUMA_H +#define _XEN_PUBLIC_VNUMA_H + +#include "xen.h" + +/* + * Following structures are used to represent vNUMA + * topology to guest if requested. + */ + +/* + * Memory ranges can be used to define + * vNUMA memory node boundaries by the + * linked list. As of now, only one range + * per domain is suported. + */ +struct vmemrange { + uint64_t start, end; + uint64_t __padm; +}; + +typedef struct vmemrange vmemrange_t; +DEFINE_XEN_GUEST_HANDLE(vmemrange_t); + +/* + * vNUMA topology specifies vNUMA node + * number, distance table, memory ranges and + * vcpu mapping provided for guests. 
+ */ + +struct vnuma_topology_info { + /* IN */ + domid_t domid; + uint32_t _pad; + /* OUT */ + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _padn; + } nr_vnodes; + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _padd; + } vdistance; + union { + XEN_GUEST_HANDLE(uint) h; + uint64_t _padv; + } vcpu_to_vnode; + union { + XEN_GUEST_HANDLE(vmemrange_t) h; + uint64_t _padm; + } vmemrange; +}; +typedef struct vnuma_topology_info vnuma_topology_info_t; +DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t); + +#endif diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h index a057069..77853e8 100644 --- a/xen/include/xen/domain.h +++ b/xen/include/xen/domain.h @@ -89,4 +89,13 @@ extern unsigned int xen_processor_pmbits; extern bool_t opt_dom0_vcpus_pin; +struct vnuma_info { + unsigned int nr_vnodes; + unsigned int pad; + unsigned int *vdistance; + unsigned int *vcpu_to_vnode; + unsigned int *vnode_to_pnode; + struct vmemrange *vmemrange; +}; + #endif /* __XEN_DOMAIN_H__ */ diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index cbdf377..3765eae 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -417,6 +417,7 @@ struct domain nodemask_t node_affinity; unsigned int last_alloc_node; spinlock_t node_affinity_lock; + struct vnuma_info vnuma; }; struct domain_setup_info -- 1.7.10.4
Elena Ufimtseva
2013-Nov-18 20:24 UTC
[PATCH v3 2/7] libxc: Plumb Xen with vNUMA topology for domain.
Per-domain vNUMA topology initialization. domctl hypercall is used to set vNUMA topology per domU during domain build time. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> Changes since v2: * modified error logic; --- tools/libxc/xc_domain.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 9 +++++++ 2 files changed, 72 insertions(+) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 1ccafc5..198694f 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -1776,6 +1776,69 @@ int xc_domain_set_max_evtchn(xc_interface *xch, uint32_t domid, return do_domctl(xch, &domctl); } +/* Plumbs Xen with vNUMA topology */ +int xc_domain_setvnuma(xc_interface *xch, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode) +{ + int rc; + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) * + nr_vnodes * nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) * + nr_vnodes, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + if ( nr_vnodes == 0 ) { + errno = EINVAL; + return -1; + } + + if ( vdistance == NULL || vcpu_to_vnode == NULL || + vmemrange == NULL || vnode_to_pnode == NULL ) { + PERROR("Incorrect parameters for XEN_DOMCTL_setvnumainfo.\n"); + errno = EINVAL; + return -1; + } + + if ( xc_hypercall_bounce_pre(xch, vmemrange) || + xc_hypercall_bounce_pre(xch, vdistance) || + xc_hypercall_bounce_pre(xch, vcpu_to_vnode) || + xc_hypercall_bounce_pre(xch, vnode_to_pnode) ) { + PERROR("Could not bounce buffer for xc_domain_setvnuma.\n"); + errno = EFAULT; + return -1; + } + + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); + + domctl.cmd = XEN_DOMCTL_setvnumainfo; + domctl.domain = (domid_t)domid; + domctl.u.vnuma.nr_vnodes = nr_vnodes; + domctl.u.vnuma.__pad = 0; + + rc = do_domctl(xch, &domctl); + + xc_hypercall_bounce_post(xch, vmemrange); + xc_hypercall_bounce_post(xch, vdistance); + xc_hypercall_bounce_post(xch, vcpu_to_vnode); + xc_hypercall_bounce_post(xch, vnode_to_pnode); + + return rc; +} + /* * Local variables: * mode: C diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 4ac6b8a..f360726 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -1136,6 +1136,15 @@ int xc_domain_set_memmap_limit(xc_interface *xch, uint32_t domid, unsigned long map_limitkb); +int xc_domain_setvnuma(xc_interface *xch, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode); + #if defined(__i386__) || defined(__x86_64__) /* * PC BIOS standard E820 types and structure. -- 1.7.10.4
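To illustrate how a toolstack caller might use the new libxc interface, here is a minimal hypothetical example for a 2-vnode, 4-vcpu, 4 GB guest; the real caller is libxl, added later in this series (patch 4/7), and all topology values below are made up.

/* Hypothetical caller of xc_domain_setvnuma(); values are illustrative. */
#include <xenctrl.h>
#include <xen/vnuma.h>    /* vmemrange_t */

static int example_set_vnuma(xc_interface *xch, uint32_t domid)
{
    vmemrange_t vmemrange[2] = {
        { .start = 0,          .end = 2ULL << 30 },  /* vnode 0: first 2 GB  */
        { .start = 2ULL << 30, .end = 4ULL << 30 },  /* vnode 1: second 2 GB */
    };
    unsigned int vdistance[2 * 2]  = { 10, 20, 20, 10 };  /* 2x2 matrix    */
    unsigned int vcpu_to_vnode[4]  = { 0, 0, 1, 1 };      /* vcpu -> vnode */
    unsigned int vnode_to_pnode[2] = { 0, 1 };  /* pnodes must exist on host */

    return xc_domain_setvnuma(xch, domid, 2 /* nr_vnodes */, 4 /* nr_vcpus */,
                              vmemrange, vdistance,
                              vcpu_to_vnode, vnode_to_pnode);
}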
Elena Ufimtseva
2013-Nov-18 20:24 UTC
[PATCH v3 3/7] libxc: vnodes allocation on NUMA nodes.
If vnuma topology is defined and its a hardware NUMA machine, allocate vnodes on physical numa nodes based on vnode_to_pnode map. TODO: * Unify allocation for domu with vnuma and without vnuma topology initialized; Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxc/xc_dom.h | 10 ++++++ tools/libxc/xc_dom_x86.c | 81 ++++++++++++++++++++++++++++++++++++++++------ tools/libxc/xg_private.h | 1 + 3 files changed, 83 insertions(+), 9 deletions(-) diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h index 90679da..c8fa243 100644 --- a/tools/libxc/xc_dom.h +++ b/tools/libxc/xc_dom.h @@ -111,6 +111,15 @@ struct xc_dom_image { struct xc_dom_phys *phys_pages; int realmodearea_log; + /* + * vNUMA topology and memory allocation structure. + * Defines the way to allocate memory on per NUMA + * physical nodes that is defined by vnode_to_pnode. + */ + uint32_t nr_vnodes; + uint64_t *vnuma_memszs; + unsigned int *vnode_to_pnode; + /* malloc memory pool */ struct xc_dom_mem *memblocks; @@ -360,6 +369,7 @@ static inline xen_pfn_t xc_dom_p2m_guest(struct xc_dom_image *dom, int arch_setup_meminit(struct xc_dom_image *dom); int arch_setup_bootearly(struct xc_dom_image *dom); int arch_setup_bootlate(struct xc_dom_image *dom); +int arch_boot_numa_alloc(struct xc_dom_image *dom); /* * Local variables: diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c index e034d62..1839360 100644 --- a/tools/libxc/xc_dom_x86.c +++ b/tools/libxc/xc_dom_x86.c @@ -802,27 +802,47 @@ int arch_setup_meminit(struct xc_dom_image *dom) else { /* try to claim pages for early warning of insufficient memory avail */ + rc = 0; if ( dom->claim_enabled ) { rc = xc_domain_claim_pages(dom->xch, dom->guest_domid, dom->total_pages); if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed to claim mem for dom\n", + __FUNCTION__); return rc; + } } /* setup initial p2m */ for ( pfn = 0; pfn < dom->total_pages; pfn++ ) dom->p2m_host[pfn] = pfn; /* allocate guest memory */ - for ( i = rc = allocsz = 0; - (i < dom->total_pages) && !rc; - i += allocsz ) + if ( dom->nr_vnodes > 0 ) { - allocsz = dom->total_pages - i; - if ( allocsz > 1024*1024 ) - allocsz = 1024*1024; - rc = xc_domain_populate_physmap_exact( - dom->xch, dom->guest_domid, allocsz, - 0, 0, &dom->p2m_host[i]); + rc = arch_boot_numa_alloc(dom); + if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed to allocate memory on NUMA nodes\n", + __FUNCTION__); + return rc; + } + } + else + { + for ( i = rc = allocsz = 0; + (i < dom->total_pages) && !rc; + i += allocsz ) + { + allocsz = dom->total_pages - i; + if ( allocsz > 1024*1024 ) + allocsz = 1024*1024; + rc = xc_domain_populate_physmap_exact( + dom->xch, dom->guest_domid, allocsz, + 0, 0, &dom->p2m_host[i]); + } } /* Ensure no unclaimed pages are left unused. 
@@ -830,7 +850,50 @@ int arch_setup_meminit(struct xc_dom_image *dom) (void)xc_domain_claim_pages(dom->xch, dom->guest_domid, 0 /* cancels the claim */); } + return rc; +} + +int arch_boot_numa_alloc(struct xc_dom_image *dom) +{ + int rc; + unsigned int n; + unsigned long long guest_pages; + unsigned long long allocsz = 0, node_pfn_base, i; + unsigned long memflags; + + rc = allocsz = node_pfn_base = 0; + allocsz = 0; + for ( n = 0; n < dom->nr_vnodes; n++ ) + { + memflags = 0; + if ( dom->vnode_to_pnode[n] != VNUMA_NO_NODE ) + { + memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]); + memflags |= XENMEMF_exact_node_request; + } + guest_pages = (dom->vnuma_memszs[n] << 20) >> PAGE_SHIFT_X86; + for ( i = 0; + (i < guest_pages) && !rc; + i += allocsz ) + { + allocsz = guest_pages - i; + if ( allocsz > 1024*1024 ) + allocsz = 1024*1024; + rc = xc_domain_populate_physmap_exact( + dom->xch, dom->guest_domid, allocsz, + 0, memflags, &dom->p2m_host[node_pfn_base + i]); + } + if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed allocation of %Lu pages for vnode %d on pnode %d out of %lu\n", + __FUNCTION__, guest_pages, n, dom->vnode_to_pnode[n], dom->total_pages); + return rc; + } + node_pfn_base += i; + } + return rc; } diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h index 5ff2124..9554b71 100644 --- a/tools/libxc/xg_private.h +++ b/tools/libxc/xg_private.h @@ -127,6 +127,7 @@ typedef uint64_t l4_pgentry_64_t; #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1)) #define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT) +#define VNUMA_NO_NODE ~((unsigned int)0) /* XXX SMH: following skanky macros rely on variable p2m_size being set */ /* XXX TJD: also, "guest_width" should be the guest''s sizeof(unsigned long) */ -- 1.7.10.4
* Checks and sets if incorrect user defined map for physical NUMA nodes allocation. If fails, tries use automatic NUMA placement machanism, otherwise falls to default, not bound to any nodes, allocation. If user define allocation map can be used, disables automatic numa placement. * Verifies the correctness of memory ranges pfns for PV guest by requesting the e820 map for that domain; * Provides vNUMA topology information to Xen about vNUMA topology and allocation map used for vnodes; Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> Changes since v2: * Added vnuma memory pfn alignment which takes into account e820_host option and non-contiguous e820 memory map in that case; --- tools/libxl/libxl.c | 20 ++++ tools/libxl/libxl.h | 20 ++++ tools/libxl/libxl_arch.h | 13 +++ tools/libxl/libxl_dom.c | 210 ++++++++++++++++++++++++++++++++++++++++++ tools/libxl/libxl_internal.h | 4 + tools/libxl/libxl_types.idl | 6 +- tools/libxl/libxl_vnuma.h | 12 +++ tools/libxl/libxl_x86.c | 123 +++++++++++++++++++++++++ 8 files changed, 407 insertions(+), 1 deletion(-) create mode 100644 tools/libxl/libxl_vnuma.h diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index 0de1112..2bd3653 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4324,6 +4324,26 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid, return 0; } +int libxl_domain_setvnuma(libxl_ctx *ctx, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode) +{ + GC_INIT(ctx); + int ret; + ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, + nr_vcpus, vmemrange, + vdistance, + vcpu_to_vnode, + vnode_to_pnode); + GC_FREE; + return ret; +} + int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap) { GC_INIT(ctx); diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index c7dceda..58aed8f 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -281,11 +281,14 @@ #include <netinet/in.h> #include <sys/wait.h> /* for pid_t */ +#include <xen/memory.h> #include <xentoollog.h> #include <libxl_uuid.h> #include <_libxl_list.h> +#include <xen/vnuma.h> + /* API compatibility. */ #ifdef LIBXL_API_VERSION #if LIBXL_API_VERSION != 0x040200 && LIBXL_API_VERSION != 0x040300 && \ @@ -382,6 +385,14 @@ #define LIBXL_EXTERNAL_CALLERS_ONLY /* disappears for callers outside libxl */ #endif +/* + * LIBXL_HAVE_BUILDINFO_VNUMA indicates that vnuma topology will be + * build for the guest upon request and with VM configuration. + * It will try to define best allocation for vNUMA + * nodes on real NUMA nodes. 
+ */ +#define LIBXL_HAVE_BUILDINFO_VNUMA 1 + typedef uint8_t libxl_mac[6]; #define LIBXL_MAC_FMT "%02hhx:%02hhx:%02hhx:%02hhx:%02hhx:%02hhx" #define LIBXL_MAC_FMTLEN ((2*6)+5) /* 6 hex bytes plus 5 colons */ @@ -741,6 +752,15 @@ void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr_vcpus); void libxl_device_vtpm_list_free(libxl_device_vtpm*, int nr_vtpms); void libxl_vtpminfo_list_free(libxl_vtpminfo *, int nr_vtpms); +int libxl_domain_setvnuma(libxl_ctx *ctx, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vmemrange_t *vmemrange, + unsigned int *vdistance, + unsigned int *vcpu_to_vnode, + unsigned int *vnode_to_pnode); + /* * Devices * ======diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h index abe6685..b95abab 100644 --- a/tools/libxl/libxl_arch.h +++ b/tools/libxl/libxl_arch.h @@ -19,4 +19,17 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, uint32_t domid); +int libxl__vnuma_align_mem(libxl__gc *gc, + uint32_t domid, + struct libxl_domain_build_info *b_info, + vmemrange_t *memblks); + +unsigned long e820_memory_hole_size(unsigned long start, + unsigned long end, + struct e820entry e820[], + int nr); + +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, + libxl_domain_build_info *info); + #endif diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index a1c16b0..378249e 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -23,6 +23,7 @@ #include <xc_dom.h> #include <xen/hvm/hvm_info_table.h> #include <xen/hvm/hvm_xs_strings.h> +#include <libxl_vnuma.h> libxl_domain_type libxl__domain_type(libxl__gc *gc, uint32_t domid) { @@ -201,6 +202,88 @@ static int numa_place_domain(libxl__gc *gc, uint32_t domid, return rc; } +/* prepares vnode to pnode map for domain vNUMA memory allocation */ +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info) +{ + int i, n, nr_nodes, rc; + uint64_t *mems; + unsigned long long *claim = NULL; + libxl_numainfo *ninfo = NULL; + + rc = -1; + if (info->vnode_to_pnode == NULL) { + info->vnode_to_pnode = calloc(info->nr_vnodes, + sizeof(*info->vnode_to_pnode)); + if (info->vnode_to_pnode == NULL) + return rc; + } + + /* default setting */ + for (i = 0; i < info->nr_vnodes; i++) + info->vnode_to_pnode[i] = VNUMA_NO_NODE; + + /* Get NUMA info */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL || nr_nodes == 0) { + LOG(DEBUG, "No HW NUMA found.\n"); + rc = 0; + goto vnmapout; + } + + /* + * check if we have any hardware NUMA nodes selected, + * otherwise VNUMA_NO_NODE set and used default allocation + */ + if (libxl_bitmap_is_empty(&info->nodemap)) + return 0; + mems = info->vnuma_memszs; + + /* check if all vnodes will fit in one node */ + libxl_for_each_set_bit(n, info->nodemap) { + if(ninfo[n].free/1024 >= info->max_memkb && + libxl_bitmap_test(&info->nodemap, n)) + { + /* + * all domain v-nodes will fit one p-node n, + * p-node n is a best candidate selected by automatic + * NUMA placement. + */ + for (i = 0; i < info->nr_vnodes; i++) + info->vnode_to_pnode[i] = n; + /* we can exit, as we are happy with placement */ + return 0; + } + } + /* Determine the best nodes to fit vNUMA nodes */ + /* TODO: change algorithm. 
The current just fits the nodes + * Will be nice to have them also sorted by size + * If no p-node found, will be set to NUMA_NO_NODE + */ + claim = calloc(info->nr_vnodes, sizeof(*claim)); + if (claim == NULL) + return rc; + + libxl_for_each_set_bit(n, info->nodemap) + { + for (i = 0; i < info->nr_vnodes; i++) + { + if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) && + /*vnode was not set yet */ + (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) ) + { + info->vnode_to_pnode[i] = n; + claim[n] += (mems[i] << 20); + } + } + } + + rc = 0; + vnmapout: + if (claim) free(claim); + return rc; +} + int libxl__build_pre(libxl__gc *gc, uint32_t domid, libxl_domain_config *d_config, libxl__domain_build_state *state) { @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, if (rc) return rc; } + + if (info->nr_vnodes > 0) { + /* The memory blocks will be formed here from sizes */ + struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes, + sizeof(*memrange)); + if (memrange == NULL) { + LOG(DETAIL, "Failed to allocate memory for memory ranges.\n"); + return ERROR_FAIL; + } + + if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) { + LOG(DETAIL, "Failed to align memory map.\n"); + return ERROR_FAIL; + } + + /* + * If vNUMA vnode_to_pnode map defined, determine if we + * can disable automatic numa placement and place vnodes + * on specified pnodes. + * For now, if vcpu affinity specified, we will use + * specified vnode to pnode map. + */ + if (libxl__vnodemap_is_usable(gc, info) == 1) { + + /* Will use user-defined vnode to pnode mask */ + + libxl_defbool_set(&info->numa_placement, false); + } + else { + LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n"); + if (libxl_defbool_val(info->vnuma_placement)) { + + LOG(DETAIL, "Switching to automatic vnuma to pnuma placement\n."); + /* Construct the vnode to pnode mapping if possible */ + if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) { + LOG(DETAIL, "Failed to call init_vnodemap\n"); + /* vnuma_nodemap will not be used if nr_vnodes == 0 */ + return ERROR_FAIL; + } + } + else { + LOG(ERROR, "The vnodes cannot be mapped to pnodes this way\n."); + info->nr_vnodes = 0; + return ERROR_FAIL; + } + } + /* plumb domain with vNUMA topology */ + if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, + info->max_vcpus, memrange, + info->vdistance, info->vcpu_to_vnode, + info->vnode_to_pnode) < 0) { + + LOG(DETAIL, "Failed to set vnuma topology for domain from\n."); + info->nr_vnodes = 0; + return ERROR_FAIL; + } + } + else + LOG(DEBUG, "Will not construct vNUMA topology.\n"); + libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); @@ -256,6 +399,56 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, return rc; } +/* Checks if vnuma_nodemap defined in info can be used + * for allocation of vnodes on physical NUMA nodes by + * verifying that there is enough memory on corresponding + * NUMA nodes. 
+ */ +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info) +{ + unsigned int i; + libxl_numainfo *ninfo = NULL; + unsigned long long *claim; + unsigned int node; + uint64_t *mems; + int rc, nr_nodes; + + rc = nr_nodes = 0; + if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL) + return rc; + /* + * Cannot use specified mapping if not NUMA machine + */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL) { + return rc; + } + mems = info->vnuma_memszs; + claim = calloc(info->nr_vnodes, sizeof(*claim)); + if (claim == NULL) + return rc; + /* Sum memory request on per pnode basis */ + for (i = 0; i < info->nr_vnodes; i++) + { + node = info->vnode_to_pnode[i]; + /* Correct pnode number? */ + if (node < nr_nodes) + claim[node] += (mems[i] << 20); + else + goto vmapu; + } + for (i = 0; i < nr_nodes; i++) { + if (claim[i] > ninfo[i].free) + /* Cannot complete user request, falling to default */ + goto vmapu; + } + rc = 1; + + vmapu: + if(claim) free(claim); + return rc; +} + int libxl__build_post(libxl__gc *gc, uint32_t domid, libxl_domain_build_info *info, libxl__domain_build_state *state, @@ -381,7 +574,24 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, } } } + if ( info->nr_vnodes > 0 ) { + dom->vnode_to_pnode = (unsigned int *)malloc( + info->nr_vnodes * sizeof(*info->vnode_to_pnode)); + dom->vnuma_memszs = (uint64_t *)malloc( + info->nr_vnodes * sizeof(*info->vnuma_memszs)); + if ( dom->vnuma_memszs == NULL || dom->vnode_to_pnode == NULL ) { + info->nr_vnodes = 0; + if (dom->vnode_to_pnode) free(dom->vnode_to_pnode); + if (dom->vnuma_memszs) free(dom->vnuma_memszs); + goto out; + } + memcpy(dom->vnuma_memszs, info->vnuma_memszs, + sizeof(*info->vnuma_memszs) * info->nr_vnodes); + memcpy(dom->vnode_to_pnode, info->vnode_to_pnode, + sizeof(*info->vnode_to_pnode) * info->nr_vnodes); + dom->nr_vnodes = info->nr_vnodes; + } dom->flags = flags; dom->console_evtchn = state->console_port; dom->console_domid = state->console_domid; diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index 23ff265..0df742b 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2882,6 +2882,10 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc, libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); } +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info); + + /* * Inserts "elm_new" into the sorted list "head". 
* diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index de5bac3..86ad14c 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -310,7 +310,11 @@ libxl_domain_build_info = Struct("domain_build_info",[ ("disable_migrate", libxl_defbool), ("cpuid", libxl_cpuid_policy_list), ("blkdev_start", string), - + ("vnuma_memszs", Array(uint64, "nr_vnodes")), + ("vcpu_to_vnode", Array(uint32, "nr_vnodemap")), + ("vdistance", Array(uint32, "nr_vdist")), + ("vnode_to_pnode", Array(uint32, "nr_vnode_to_pnode")), + ("vnuma_placement", libxl_defbool), ("device_model_version", libxl_device_model_version), ("device_model_stubdomain", libxl_defbool), # if you set device_model you must set device_model_version too diff --git a/tools/libxl/libxl_vnuma.h b/tools/libxl/libxl_vnuma.h new file mode 100644 index 0000000..6034cd7 --- /dev/null +++ b/tools/libxl/libxl_vnuma.h @@ -0,0 +1,12 @@ +#include "libxl_osdeps.h" /* must come before any other headers */ + +#define VNUMA_NO_NODE ~((unsigned int)0) + +/* + * Max vNUMA node size in Mb is taken 64Mb even now Linux lets + * 32Mb, thus letting some slack. Will be modified to match Linux. + */ +#define MIN_VNODE_SIZE 64U + +#define MAX_VNUMA_NODES (unsigned int)1 << 10 + diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c index 87a8110..206f5be 100644 --- a/tools/libxl/libxl_x86.c +++ b/tools/libxl/libxl_x86.c @@ -310,3 +310,126 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, return ret; } + +/* + * Checks for the beginnig and end of RAM in e820 map for domain + * and aligns start of first and end of last vNUMA memory block to + * that map. vnode memory size are passed here Megabytes. + * For PV guest e820 map has fixed hole sizes. + */ +int libxl__vnuma_align_mem(libxl__gc *gc, + uint32_t domid, + libxl_domain_build_info *b_info, /* IN: mem sizes */ + vmemrange_t *memblks) /* OUT: linux numa blocks in pfn */ +{ + int i, j, rc; + uint64_t next_start_pfn, end_max = 0, size, mem_hole; + uint32_t nr; + struct e820entry map[E820MAX]; + + if (b_info->nr_vnodes == 0) + return -EINVAL; + libxl_ctx *ctx = libxl__gc_owner(gc); + + /* retreive e820 map for this host */ + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); + + if (rc < 0) { + errno = rc; + return -EINVAL; + } + nr = rc; + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, + (b_info->max_memkb - b_info->target_memkb) + + b_info->u.pv.slack_memkb); + if (rc) + { + errno = rc; + return -EINVAL; + } + + /* max pfn for this host */ + for (j = nr - 1; j >= 0; j--) + if (map[j].type == E820_RAM) { + end_max = map[j].addr + map[j].size; + break; + } + + memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes); + next_start_pfn = 0; + + memblks[0].start = map[0].addr; + + for(i = 0; i < b_info->nr_vnodes; i++) { + /* start can be not zero */ + memblks[i].start += next_start_pfn; + memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20); + + size = memblks[i].end - memblks[i].start; + /* + * For pv host with e820_host option turned on we need + * to rake into account memory holes. For pv host with + * e820_host disabled or unset, the map is contiguous + * RAM region. 
+ */ + if (libxl_defbool_val(b_info->u.pv.e820_host)) { + while (mem_hole = e820_memory_hole_size(memblks[i].start, + memblks[i].end, map, nr), + memblks[i].end - memblks[i].start - mem_hole < size) + { + memblks[i].end += mem_hole; + + if (memblks[i].end > end_max) { + memblks[i].end = end_max; + break; + } + } + } + next_start_pfn = memblks[i].end; + } + + if (memblks[i-1].end > end_max) + memblks[i-1].end = end_max; + + return 0; +} + +/* + * Used for PV guest with e802_host enabled and thus + * having non-contiguous e820 memory map. + */ +unsigned long e820_memory_hole_size(unsigned long start, + unsigned long end, + struct e820entry e820[], + int nr) +{ + int i; + unsigned long absent, start_pfn, end_pfn; + + absent = end - start; + for(i = 0; i < nr; i++) { + /* if not E820_RAM region, skip it and dont substract from absent */ + if(e820[i].type == E820_RAM) { + start_pfn = e820[i].addr; + end_pfn = e820[i].addr + e820[i].size; + /* beginning pfn is in this region? */ + if (start >= start_pfn && start <= end_pfn) { + if (end > end_pfn) + absent -= end_pfn - start; + else + /* fit the region? then no absent pages */ + absent -= end - start; + continue; + } + /* found the end of range in this region? */ + if (end <= end_pfn && end >= start_pfn) { + absent -= end - start_pfn; + /* no need to look for more ranges */ + break; + } + } + } + return absent; +} + + -- 1.7.10.4
Parses guest vNUMA related options, verifies and sets default values if needed. all possible config options are: vnodes: number of vnuma nodes: vnodes = 2 vnumamem: vnodes memory in MBytes: vnumamem = [2048, 2048] vdistance: distance table or part of it. Can use minimal form with same-node distance first and second - all other nodes distances: vdistance = [10, 20] will expand to square symmetrical matrix of any n x n: [10, 20, 20] [20, 10, 20] [20, 20, 10] If not defined or format is incorrect, default [10, 20] will be used. Latest kernel (3.13-rc1) oopses on non-symmetrical distance table and next version will introduce extended distance table in config. vnuma_vcpumap: vcpu to vnode mapping. The index is a vcpu number starting from 0, the value is the node number, starting from zero. vnuma_vcpumap =[1, 0, 1, 0] node 1 - vcpus 0 and 2 node 0 - vcpus 1 and 3 vnuma_vnodemap - vnode tp physical node mapping. Suggested mask will be verified and if the requested mapping cannot be completed (no NUMA hardware, not correct mapping, not enough memory), will try to set this with help of automatic numa placement if vnuma_autoplacement is enabled. Otherwise fails. vnuma_vnodemap = [1, 0] vnode0 -> pnode1 vnode1 -> pnode0 vnuma_autoplacement - enables automatic vnuma placement mechanism if vnode_to_pnode mask is not defined or vnodes cannot be placed on pnodes specified in the mask. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> Changes since v2: * Added option vnuma_autoplacement; * Changed to fail when incorrect values are set instead of faulting to default ones; Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxl/xl_cmdimpl.c | 306 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 306 insertions(+) diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 8690ec7..d7aa0b3 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -40,6 +40,7 @@ #include "libxl_json.h" #include "libxlutil.h" #include "xl.h" +#include "libxl_vnuma.h" #define CHK_ERRNO( call ) ({ \ int chk_errno = (call); \ @@ -572,6 +573,75 @@ vcpp_out: return rc; } +static void vnuma_info_release(libxl_domain_build_info *info) +{ + if ( info->vnuma_memszs ) free(info->vnuma_memszs); + if ( info->vdistance ) free(info->vdistance); + if ( info->vcpu_to_vnode ) free(info->vcpu_to_vnode); + if ( info->vnode_to_pnode ) free(info->vnode_to_pnode); + info->nr_vnodes = 0; +} + +static int get_list_item_uint(XLU_ConfigList *list, unsigned int i) +{ + const char *buf; + char *ep; + unsigned long ul; + int rc = -EINVAL; + buf = xlu_cfg_get_listitem(list, i); + if (!buf) + return rc; + ul = strtoul(buf, &ep, 10); + if (ep == buf) + return rc; + if (ul >= UINT16_MAX) + return rc; + return (int)ul; +} + +static void vdistance_default(unsigned int *vdistance, + unsigned int nr_vnodes, + unsigned int samenode, + unsigned int othernode) +{ + int i, j; + for (i = 0; i < nr_vnodes; i++) + for (j = 0; j < nr_vnodes; j++) + *(vdistance + j * nr_vnodes + i) = i == j ? 
samenode : othernode; +} + +static void vcputovnode_default(unsigned int *vcpu_to_vnode, + unsigned int nr_vnodes, + unsigned int max_vcpus) +{ + int i; + if (vcpu_to_vnode == NULL) + return; + for(i = 0; i < max_vcpus; i++) + vcpu_to_vnode[i] = i % nr_vnodes; +} + +/* Split domain memory between vNUMA nodes equally */ +static int split_vnumamem(libxl_domain_build_info *b_info) +{ + unsigned long long vnodemem = 0; + unsigned long n; + unsigned int i; + + /* In MBytes */ + vnodemem = (b_info->max_memkb >> 10) / b_info->nr_vnodes; + if (vnodemem < MIN_VNODE_SIZE) + return -1; + /* reminder in MBytes */ + n = (b_info->max_memkb >> 10) % b_info->nr_vnodes; + /* get final sizes in MBytes */ + for(i = 0; i < (b_info->nr_vnodes - 1); i++) + b_info->vnuma_memszs[i] = vnodemem; + /* add the reminder to the last node */ + b_info->vnuma_memszs[i] = vnodemem + n; + return 0; +} + static void parse_config_data(const char *config_source, const char *config_data, int config_len, @@ -908,6 +978,11 @@ static void parse_config_data(const char *config_source, char *cmdline = NULL; const char *root = NULL, *extra = ""; + XLU_ConfigList *vnumamemcfg, *vdistancecfg, *vnodemap, *vcpumap; + int nr_vnuma_regions, nr_vdist, nr_vnodemap; + unsigned long long vnuma_memparsed = 0; + unsigned long ul; + xlu_cfg_replace_string (config, "kernel", &b_info->u.pv.kernel, 0); xlu_cfg_get_string (config, "root", &root, 0); @@ -924,7 +999,238 @@ static void parse_config_data(const char *config_source, fprintf(stderr, "Failed to allocate memory for cmdline\n"); exit(1); } + + if (!xlu_cfg_get_long (config, "vnodes", &l, 0)) { + if (l > MAX_VNUMA_NODES) { + fprintf(stderr, "Too many vnuma nodes, max %d is allowed.\n", MAX_VNUMA_NODES); + exit(1); + } + b_info->nr_vnodes = l; + + libxl_defbool_set(&b_info->vnuma_placement, false); + xlu_cfg_get_defbool(config, "vnuma_autoplacement", &b_info->vnuma_placement, 0); + + if (b_info->nr_vnodes != 0 && b_info->max_vcpus >= b_info->nr_vnodes) { + if (!xlu_cfg_get_list(config, "vnumamem", + &vnumamemcfg, &nr_vnuma_regions, 0)) { + + if (nr_vnuma_regions != b_info->nr_vnodes) { + fprintf(stderr, "Number of numa regions is incorrect.\n"); + exit(1); + } + + b_info->vnuma_memszs = calloc(b_info->nr_vnodes, + sizeof(*b_info->vnuma_memszs)); + if (b_info->vnuma_memszs == NULL) { + fprintf(stderr, "unable to allocate memory for vnuma ranges.\n"); + exit(1); + } + + char *ep; + /* + * Will parse only nr_vnodes times, even if we have more/less regions. + * Take care of it later if less or discard if too many regions. + */ + for (i = 0; i < b_info->nr_vnodes; i++) { + buf = xlu_cfg_get_listitem(vnumamemcfg, i); + if (!buf) { + fprintf(stderr, + "xl: Unable to get element %d in vnuma memroy list.\n", i); + break; + } + ul = strtoul(buf, &ep, 10); + if (ep == buf) { + fprintf(stderr, + "xl: Invalid argument parsing vnumamem: %s.\n", buf); + break; + } + + /* 32Mb is a min size for a node, taken from Linux */ + if (ul >= UINT32_MAX || ul < MIN_VNODE_SIZE) { + fprintf(stderr, "xl: vnuma memory %lu is not withing %u - %u range.\n", + ul, MIN_VNODE_SIZE, UINT32_MAX); + break; + } + + /* memory in MBytes */ + b_info->vnuma_memszs[i] = ul; + } + + /* Total memory for vNUMA parsed to verify */ + for(i = 0; i < nr_vnuma_regions; i++) + vnuma_memparsed = vnuma_memparsed + (b_info->vnuma_memszs[i]); + + /* Amount of memory for vnodes same as total? 
*/ + if((vnuma_memparsed << 10) != (b_info->max_memkb)) { + fprintf(stderr, "xl: vnuma memory is not the same as initial domain memory size.\n"); + vnuma_info_release(b_info); + exit(1); + } + } else { + b_info->vnuma_memszs = calloc(b_info->nr_vnodes, + sizeof(*b_info->vnuma_memszs)); + if (b_info->vnuma_memszs == NULL) { + fprintf(stderr, "unable to allocate memory for vnuma ranges.\n"); + exit(1); + } + + fprintf(stderr, "WARNING: vNUMA memory ranges were not specified.\n"); + fprintf(stderr, "Using default equal vnode memory size %lu Kbytes to cover %lu Kbytes.\n", + b_info->max_memkb / b_info->nr_vnodes, b_info->max_memkb); + + if (split_vnumamem(b_info) < 0) { + fprintf(stderr, "Could not split vnuma memory into equal chunks.\n"); + vnuma_info_release(b_info); + exit(1); + } + } + + b_info->vdistance = calloc(b_info->nr_vnodes * b_info->nr_vnodes, + sizeof(*b_info->vdistance)); + if (b_info->vdistance == NULL) { + vnuma_info_release(b_info); + exit(1); + } + + if(!xlu_cfg_get_list(config, "vdistance", &vdistancecfg, &nr_vdist, 0) && + nr_vdist == 2) { + /* + * If only two elements are in the vdistance list, consider + * first as value for same node distance, the second as the + * rest of distances. + * The following is required right now to avoid non-symmetrical + * distance table as it may break latest kernel. + * TODO: Better way to analyze extended distance table, possibly + * OS specific. + */ + int d1, d2; + d1 = get_list_item_uint(vdistancecfg, 0); + d2 = get_list_item_uint(vdistancecfg, 1); + + if (d1 >= 0 && d2 >= 0 && d1 < d2) { + vdistance_default(b_info->vdistance, b_info->nr_vnodes, d1, d2); + } else { + fprintf(stderr, "WARNING: Distances are not correct.\n"); + vnuma_info_release(b_info); + exit(1); + } + + } else + vdistance_default(b_info->vdistance, b_info->nr_vnodes, 10, 20); + + b_info->vcpu_to_vnode = (unsigned int *)calloc(b_info->max_vcpus, + sizeof(*b_info->vcpu_to_vnode)); + if (b_info->vcpu_to_vnode == NULL) { + vnuma_info_release(b_info); + exit(1); + } + if (!xlu_cfg_get_list(config, "vnuma_vcpumap", + &vcpumap, &nr_vnodemap, 0)) { + if (nr_vnodemap == b_info->max_vcpus) { + unsigned int vnodemask = 0, vnode, smask, vcpumask = 0, vmask; + smask = ~(~0 << b_info->nr_vnodes); + vmask = ~(~0 << nr_vnodemap); + for (i = 0; i < nr_vnodemap; i++) { + vnode = get_list_item_uint(vcpumap, i); + if (vnode >= 0) { + vnodemask |= (1 << vnode); + vcpumask |= (1 << i); + b_info->vcpu_to_vnode[i] = vnode; + } + } + + /* Did it covered all vnodes in the vcpu mask? */ + if ( !(((smask & vnodemask) + 1) == (1 << b_info->nr_vnodes)) ) { + fprintf(stderr, "WARNING: Not all vnodes were covered in vnuma_vcpumap.\n"); + vnuma_info_release(b_info); + exit(1); + } + if ( !(((vmask & vcpumask) + 1) == (1 << nr_vnodemap)) ) { + fprintf(stderr, "WARNING: Not all vnode were covered in vnuma_vcpumap.\n"); + vnuma_info_release(b_info); + exit(1); + } + } else { + fprintf(stderr, "WARNING: Not all vcpus were covered in vnuma_vcpumap.\n"); + vnuma_info_release(b_info); + exit(1); + } + } + else + vcputovnode_default(b_info->vcpu_to_vnode, + b_info->nr_vnodes, + b_info->max_vcpus); + + /* There is mapping to NUMA physical nodes? 
*/ + b_info->vnode_to_pnode = (unsigned int *)calloc(b_info->nr_vnodes, + sizeof(*b_info->vnode_to_pnode)); + if (b_info->vnode_to_pnode == NULL) { + vnuma_info_release(b_info); + exit(1); + } + if (!xlu_cfg_get_list(config, "vnuma_vnodemap",&vnodemap, + &nr_vnodemap, 0)) { + /* + * If not specified or incorred, will be defined + * later based on the machine architecture, configuration + * and memory availble when creating domain. + */ + if (nr_vnodemap == b_info->nr_vnodes) { + unsigned int vnodemask = 0, pnode, smask; + smask = ~(~0 << b_info->nr_vnodes); + for (i = 0; i < b_info->nr_vnodes; i++) { + pnode = get_list_item_uint(vnodemap, i); + if (pnode >= 0) { + vnodemask |= (1 << i); + b_info->vnode_to_pnode[i] = pnode; + } + } + + /* Did it covered all vnodes in the mask? */ + if ( !(((vnodemask & smask) + 1) == (1 << nr_vnodemap)) ) { + fprintf(stderr, "WARNING: Not all vnodes were covered vnuma_vnodemap.\n"); + + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + free(b_info->vnode_to_pnode); + b_info->vnode_to_pnode = NULL; + } else { + vnuma_info_release(b_info); + exit(1); + } + } + } else { + fprintf(stderr, "WARNING: Incorrect vnuma_vnodemap.\n"); + + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + free(b_info->vnode_to_pnode); + b_info->vnode_to_pnode = NULL; + } else { + vnuma_info_release(b_info); + exit(1); + } + } + } else { + fprintf(stderr, "WARNING: Missing vnuma_vnodemap.\n"); + + if (libxl_defbool_val(b_info->vnuma_placement)) { + fprintf(stderr, "Automatic placement will be used for vnodes.\n"); + free(b_info->vnode_to_pnode); + b_info->vnode_to_pnode = NULL; + } else { + vnuma_info_release(b_info); + exit(1); + } + } + } + else + b_info->nr_vnodes = 0; + } + else + b_info->nr_vnodes = 0; + xlu_cfg_replace_string (config, "bootloader", &b_info->u.pv.bootloader, 0); switch (xlu_cfg_get_list_as_string_list(config, "bootloader_args", &b_info->u.pv.bootloader_args, 1)) -- 1.7.10.4
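For reference, the defaults described in the changelog above (the two-element vdistance form, the interleaved vcpu map, and the equal memory split with the remainder going to the last vnode) can be reproduced with the small standalone C sketch below. It simply mirrors the vdistance_default(), vcputovnode_default() and split_vnumamem() helpers from the patch; the node, vcpu and memory values are made-up examples, not part of the patch.

/* Standalone sketch of the xl parser defaults described above:
 *  - vdistance = [d1, d2] expands to an n x n matrix with d1 on the
 *    diagonal and d2 everywhere else;
 *  - without vnuma_vcpumap, vcpus are interleaved over the vnodes;
 *  - without vnumamem, memory is split equally, the remainder going to
 *    the last vnode.
 */
#include <stdio.h>

#define NR_VNODES 3
#define MAX_VCPUS 4
#define MEM_MB    16384

int main(void)
{
    unsigned int vdistance[NR_VNODES][NR_VNODES];
    unsigned int vcpu_to_vnode[MAX_VCPUS];
    unsigned long vnuma_memszs[NR_VNODES];
    unsigned int i, j, d1 = 10, d2 = 20;

    /* vdistance = [10, 20] -> symmetric matrix, d1 on the diagonal */
    for (i = 0; i < NR_VNODES; i++)
        for (j = 0; j < NR_VNODES; j++)
            vdistance[i][j] = (i == j) ? d1 : d2;

    /* default vcpu to vnode map: interleave vcpus over vnodes */
    for (i = 0; i < MAX_VCPUS; i++)
        vcpu_to_vnode[i] = i % NR_VNODES;

    /* equal split in MB, remainder added to the last vnode */
    for (i = 0; i < NR_VNODES; i++)
        vnuma_memszs[i] = MEM_MB / NR_VNODES;
    vnuma_memszs[NR_VNODES - 1] += MEM_MB % NR_VNODES;

    printf("vdistance = [%u, %u] expands to:\n", d1, d2);
    for (i = 0; i < NR_VNODES; i++) {
        for (j = 0; j < NR_VNODES; j++)
            printf("%u ", vdistance[i][j]);
        printf("\n");
    }
    for (i = 0; i < MAX_VCPUS; i++)
        printf("vcpu %u -> vnode %u\n", i, vcpu_to_vnode[i]);
    for (i = 0; i < NR_VNODES; i++)
        printf("vnode %u: %lu MB\n", i, vnuma_memszs[i]);

    return 0;
}

With these example values, 16384 MB over 3 vnodes gives [5461, 5461, 5462] MB, and vcpus 0-3 land on vnodes 0, 1, 2, 0.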
Prints basic information about vNUMA topology for vNUMA enabled domains when issuing debug-key ''u''. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- xen/arch/x86/numa.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c index b141877..1cf74ee 100644 --- a/xen/arch/x86/numa.c +++ b/xen/arch/x86/numa.c @@ -389,6 +389,24 @@ static void dump_numa(unsigned char key) for_each_online_node(i) printk(" Node %u: %u\n", i, page_num_node[i]); + + if (d->vnuma.nr_vnodes > 0) + { + printk(" Domain has %d vnodes\n", d->vnuma.nr_vnodes); + for (i = 0; i < d->vnuma.nr_vnodes; i++) { + printk(" vnode %d - pnode %d:", i, + d->vnuma.vnode_to_pnode[i] >= MAX_NUMNODES ? -1 : d->vnuma.vnode_to_pnode[i]); + printk(" %"PRIu64" MB \n", + (d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20); + } + + printk(" Domain vcpu to vnode: \n"); + for (i = 0; i < d->max_vcpus; i++) + printk("%s%d %s", i % 8 == 0 ? " " : "", + d->vnuma.vcpu_to_vnode[i], + (i + 1) % 8 == 0 ? "\n" : ""); + printk("\n"); + } } rcu_read_unlock(&domlist_read_lock); -- 1.7.10.4
xl: docs for xl config vnuma options Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- docs/man/xl.cfg.pod.5 | 60 +++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 60 insertions(+) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index e6fc83f..2705eb7 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -216,6 +216,66 @@ if the values of B<memory=> and B<maxmem=> differ. A "pre-ballooned" HVM guest needs a balloon driver, without a balloon driver it will crash. +=item B<vnuma_nodes=N> + +Number of vNUMA nodes the guest will be initialized with on boot. + +=item B<vnuma_mem=[vmem1, vmem2, ...]> + +The vnode memory sizes, defined in MBytes. Domain creation will fail if the sum of +all vnode memories does not match the domain memory, or if not all nodes are +defined here. If not specified, memory will be split equally between vnodes. Currently the +minimum vnode size is 64MB. + +Example: vnuma_mem=[1024, 1024, 2048, 2048] + +=item B<vdistance=[d1, d2]> + +Defines the distance table for vNUMA nodes. Distances for NUMA machines are usually +represented by a two-dimensional array, and all distances may be specified in one +line here, by rows. The distance can be specified as two numbers [d1, d2], +where d1 is the same-node distance and d2 is the value for all other distances. +If not specified, the default distance [10, 20] will be used. + +Examples: +vnodes = 3 +vdistance=[10, 20] +will expand to this distance table (this is the default setting as well): +[10, 20, 20] +[20, 10, 20] +[20, 20, 10] + +=item B<vnuma_vcpumap=[vcpu1, vcpu2, ...]> + +Defines the vcpu to vnode mapping as a list of integers representing node +numbers. If not defined, the vcpus are interleaved over the virtual nodes. +Current limitation: every vNUMA node has to have at least one vcpu, otherwise +the default vcpu_to_vnode mapping will be used. + +Example: +to map 4 vcpus to 2 nodes - vcpus 0,1 -> vnode 0, vcpus 2,3 -> vnode 1: +vnuma_vcpumap = [0, 0, 1, 1] + +=item B<vnuma_vnodemap=[p1, p2, ..., pn]> + +vnode to pnode mapping. Can be configured if manual vnode allocation is +required. It only takes effect on real NUMA machines, and only if memory or +other constraints do not prevent it. If the mapping is valid, +automatic NUMA placement will be disabled. If the mapping is incorrect +and vnuma_autoplacement is true, automatic NUMA placement will be used; +otherwise domain creation fails. + +Example: +assume a two-node NUMA machine: +vnuma_vnodemap=[1, 0] +The first vnode will be placed on node 1, the second on node 0. + +=item B<vnuma_autoplacement=[0|1]> + +If enabled, the best physical node candidate will be found automatically for +each vnode if vnuma_vnodemap is incorrect or memory requirements prevent +using it. Set to '0' by default. + =back =head3 Event Actions -- 1.7.10.4
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>> diff --git a/xen/common/domctl.c b/xen/common/domctl.c > index 904d27b..af6ae9b 100644> @@ -889,6 +890,84 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) > } > break; > > + case XEN_DOMCTL_setvnumainfo: > + { > + unsigned int dist_size, nr_vnodes; > + > + ret = -EINVAL; > + > + /* > + * If number of vnodes was set before, > + * dont initilize it again. > + */ > + //ASSERT(d->vnuma.nr_vnodes > 0); >Kill this? :-P Apart from this, most of my comments to previous round have been addressed, but I think some of the ones Jan provided haven''t. Elena, can you double check? http://bugs.xenproject.org/xen/mid/%3C5285EE480200007800103692@nat28.tlf.novell.com%3E> diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h > index a057069..77853e8 100644 > --- a/xen/include/xen/domain.h > +++ b/xen/include/xen/domain.h > @@ -89,4 +89,13 @@ extern unsigned int xen_processor_pmbits; > > extern bool_t opt_dom0_vcpus_pin; > > +struct vnuma_info { > + unsigned int nr_vnodes; > + unsigned int pad; > + unsigned int *vdistance; > + unsigned int *vcpu_to_vnode; > + unsigned int *vnode_to_pnode; > + struct vmemrange *vmemrange; > +}; > + >And one last thing. Sorry if it''s me not remembering, but what''s ''pad'' above useful for? Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-19 08:37 UTC
Re: [PATCH v3 2/7] libxc: Plumb Xen with vNUMA topology for domain.
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Per-domain vNUMA topology initialization. > domctl hypercall is used to set vNUMA topology > per domU during domain build time. > > Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> >Ok, this one below should either go under its own "---"> Changes since v2: > * modified error logic; > ---Or here. I mean, something like the following: " bla bla changelog bal bla Signed-off-by: Elena... --- Changes since v2: * xxx * yyyy --- <diffstat> " This way the "Changes since" part will be automatically kept out of the final committed changelog by the git commands committers use to commit the patch. ISTR this happens for everything that''s below "---". If you put it _inside_ the actual changelog, they''ll have to remove it by hand, which is annoying.> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c > index 1ccafc5..198694f 100644> +/* Plumbs Xen with vNUMA topology */ > +int xc_domain_setvnuma(xc_interface *xch, > + uint32_t domid, > + uint16_t nr_vnodes, > + uint16_t nr_vcpus, > + vmemrange_t *vmemrange, > + unsigned int *vdistance, > + unsigned int *vcpu_to_vnode, > + unsigned int *vnode_to_pnode) > +{ > + int rc; > + DECLARE_DOMCTL; > + DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes, > + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); > + DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) * > + nr_vnodes * nr_vnodes, > + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); > + DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus, > + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); > + DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) * > + nr_vnodes, > + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); >So, last round I suggested that these bounce buffers should have been _BOUNCE_IN, as information only flow toward Xen, and nothing gets copied back to xc (and neither it is from xc to the application). However, Elena reported the system crashing (actually, what is it that was exactly happening, Elena?) if doing so. I gave it another look, but I don''t think I see a possible reason for that... Anyone? IanC? Andrew?> + if ( nr_vnodes == 0 ) { > + errno = EINVAL; > + return -1; > + } > + > + if ( vdistance == NULL || vcpu_to_vnode == NULL || > + vmemrange == NULL || vnode_to_pnode == NULL ) { > + PERROR("Incorrect parameters for XEN_DOMCTL_setvnumainfo.\n"); > + errno = EINVAL; > + return -1; > + } > +I think this line above has some spurious spaces.> + if ( xc_hypercall_bounce_pre(xch, vmemrange) || > + xc_hypercall_bounce_pre(xch, vdistance) || > + xc_hypercall_bounce_pre(xch, vcpu_to_vnode) || > + xc_hypercall_bounce_pre(xch, vnode_to_pnode) ) { > + PERROR("Could not bounce buffer for xc_domain_setvnuma.\n"); > + errno = EFAULT; > + return -1; > + } > +And this one too (spurious spaces). 
Also, while the setting of errno in the first two if-s is fine, in this case I think you should avoid doing it, as we rely on xc_hypercall_bounce_pre to have done so.> + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); > + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); > + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); > + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); > +^ and here> + domctl.cmd = XEN_DOMCTL_setvnumainfo; > + domctl.domain = (domid_t)domid; > + domctl.u.vnuma.nr_vnodes = nr_vnodes; > + domctl.u.vnuma.__pad = 0; > +^ here> + rc = do_domctl(xch, &domctl); > +^ here> + xc_hypercall_bounce_post(xch, vmemrange); > + xc_hypercall_bounce_post(xch, vdistance); > + xc_hypercall_bounce_post(xch, vcpu_to_vnode); > + xc_hypercall_bounce_post(xch, vnode_to_pnode); > +^ and here too. I double checked that this isn''t an issue with my MUA by importing the patch in this e-mail and I confirm it is not, it looks like there are a few trailing whitespaces around. I went checking a couple of the other patches too, and found something similar. Can you fix this? For the Linux patches, checkpatch is usually of great help. For Xen ones, I don''t think we have anything like it (yet? :-D). However, for this specific problem (trailing whitespaces), even just looking at the patch with `git show'' should make the issue evident. For instance, here I see them as big red rectangles. Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
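For context, the change Dario suggests (declaring the buffers as XC_HYPERCALL_BUFFER_BOUNCE_IN, since the data only flows towards Xen) would look roughly like the sketch below. It is trimmed to two of the four arrays and is only an illustration built on the quoted patch, not the code as posted; it assumes the usual libxc internals (xc_private.h) and the XEN_DOMCTL_setvnumainfo fields introduced by this series, and the function name is made up.

/* Rough sketch of the _BOUNCE_IN variant being discussed: data only flows
 * towards Xen, so nothing needs to be copied back after the hypercall.
 */
int xc_domain_setvnuma_sketch(xc_interface *xch, uint32_t domid,
                              uint16_t nr_vnodes, uint16_t nr_vcpus,
                              unsigned int *vdistance,
                              unsigned int *vcpu_to_vnode)
{
    int rc;
    DECLARE_DOMCTL;
    DECLARE_HYPERCALL_BOUNCE(vdistance,
                             sizeof(*vdistance) * nr_vnodes * nr_vnodes,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode,
                             sizeof(*vcpu_to_vnode) * nr_vcpus,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);

    if ( xc_hypercall_bounce_pre(xch, vdistance) ||
         xc_hypercall_bounce_pre(xch, vcpu_to_vnode) )
    {
        /* errno is already set by xc_hypercall_bounce_pre() */
        PERROR("Could not bounce buffers for XEN_DOMCTL_setvnumainfo");
        return -1;
    }

    set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance);
    set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode);
    domctl.cmd = XEN_DOMCTL_setvnumainfo;
    domctl.domain = (domid_t)domid;
    domctl.u.vnuma.nr_vnodes = nr_vnodes;

    rc = do_domctl(xch, &domctl);

    xc_hypercall_bounce_post(xch, vdistance);
    xc_hypercall_bounce_post(xch, vcpu_to_vnode);

    return rc;
}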
>>> On 18.11.13 at 21:24, Elena Ufimtseva <ufimtseva@gmail.com> wrote: > @@ -889,6 +890,84 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) > } > break; > > + case XEN_DOMCTL_setvnumainfo: > + { > + unsigned int dist_size, nr_vnodes; > + > + ret = -EINVAL; > + > + /* > + * If number of vnodes was set before, > + * dont initilize it again. > + */ > + //ASSERT(d->vnuma.nr_vnodes > 0);Please, finally, clean up your patches before submitting. I''m not going to point out further obvious violations below.> --- /dev/null > +++ b/xen/include/public/vnuma.h > @@ -0,0 +1,56 @@ > +#ifndef _XEN_PUBLIC_VNUMA_H > +#define _XEN_PUBLIC_VNUMA_H > + > +#include "xen.h" > + > +/* > + * Following structures are used to represent vNUMA > + * topology to guest if requested. > + */ > + > +/* > + * Memory ranges can be used to define > + * vNUMA memory node boundaries by the > + * linked list. As of now, only one range > + * per domain is suported. > + */ > +struct vmemrange { > + uint64_t start, end; > + uint64_t __padm;What are you padding here?> +}; > + > +typedef struct vmemrange vmemrange_t; > +DEFINE_XEN_GUEST_HANDLE(vmemrange_t); > + > +/* > + * vNUMA topology specifies vNUMA node > + * number, distance table, memory ranges and > + * vcpu mapping provided for guests. > + */ > + > +struct vnuma_topology_info { > + /* IN */ > + domid_t domid; > + uint32_t _pad; > + /* OUT */ > + union { > + XEN_GUEST_HANDLE(uint) h; > + uint64_t _padn;No need for the odd trailing n (and other characters below). One thing that I only realized - what information is the guest supposed to use to know the number of virtual nodes it has (needed to size the arrays the handles here point to)?> + } nr_vnodes; > + union { > + XEN_GUEST_HANDLE(uint) h; > + uint64_t _padd; > + } vdistance; > + union { > + XEN_GUEST_HANDLE(uint) h;And this one is as problematic - I don''t think the guest necessarily knows what max_vcpus the hypervisor has set for it. Jan
Konrad Rzeszutek Wilk
2013-Nov-19 14:03 UTC
Re: [PATCH v3 2/7] libxc: Plumb Xen with vNUMA topology for domain.
On Tue, Nov 19, 2013 at 09:37:25AM +0100, Dario Faggioli wrote:> [...]
> > > + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); > > + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); > > + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); > > + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); > > + > ^ and here > > > + domctl.cmd = XEN_DOMCTL_setvnumainfo; > > + domctl.domain = (domid_t)domid; > > + domctl.u.vnuma.nr_vnodes = nr_vnodes; > > + domctl.u.vnuma.__pad = 0; > > + > ^ here > > + rc = do_domctl(xch, &domctl); > > + > ^ here > > > + xc_hypercall_bounce_post(xch, vmemrange); > > + xc_hypercall_bounce_post(xch, vdistance); > > + xc_hypercall_bounce_post(xch, vcpu_to_vnode); > > + xc_hypercall_bounce_post(xch, vnode_to_pnode); > > + > ^ and here too. > > I double checked that this isn''t an issue with my MUA by importing the > patch in this e-mail and I confirm it is not, it looks like there are a > few trailing whitespaces around. > > I went checking a couple of the other patches too, and found something > similar. Can you fix this? > > For the Linux patches, checkpatch is usually of great help. For Xen > ones, I don''t think we have anything like it (yet? :-D). However, for > this specific problem (trailing whitespaces), even just looking at the > patch with `git show'' should make the issue evident. For instance, here > I see them as big red rectangles. >Or in vim you can have syn match ErrorLeadSpace /^ \+/ " highlight any leading spaces syn match ErrorTailSpace / \+$/ " highlight any trailing spaces syn match Error80 /\%>80v.\+/ " highlight anything past 80 hi Error80 ctermbg=red hi ErrorLeadSpace ctermbg=red hi ErrorTailSpace ctermbg=red And that will nicely point it to you.> Regards, > Dario > > -- > <<This happens because I choose it to happen!>> (Raistlin Majere) > ----------------------------------------------------------------- > Dario Faggioli, Ph.D, http://about.me/dario.faggioli > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >> _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-19 14:22 UTC
Re: [PATCH v3 3/7] libxc: vnodes allocation on NUMA nodes.
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>> diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c > index e034d62..1839360 100644> @@ -802,27 +802,47 @@ int arch_setup_meminit(struct xc_dom_image *dom) > else > { > /* try to claim pages for early warning of insufficient memory avail */ > + rc = 0; > if ( dom->claim_enabled ) { > rc = xc_domain_claim_pages(dom->xch, dom->guest_domid, > dom->total_pages); > if ( rc ) > + { > + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, > + "%s: Failed to claim mem for dom\n", > + __FUNCTION__); > return rc; > + } >Mmm... here you''re just adding some log output. I think it''s a minor thing, i.e., it''s not terrible to have this in this patch. However, ideally, this should happen in its own patch, as it is not much related to vNUMA, is it?> /* allocate guest memory */ > - for ( i = rc = allocsz = 0; > - (i < dom->total_pages) && !rc; > - i += allocsz ) > + if ( dom->nr_vnodes > 0 ) > { > - allocsz = dom->total_pages - i; > - if ( allocsz > 1024*1024 ) > - allocsz = 1024*1024; > - rc = xc_domain_populate_physmap_exact( > - dom->xch, dom->guest_domid, allocsz, > - 0, 0, &dom->p2m_host[i]); > + rc = arch_boot_numa_alloc(dom); > + if ( rc ) > + { > + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, > + "%s: Failed to allocate memory on NUMA nodes\n", > + __FUNCTION__); > + return rc; > + } > + } > + else > + { > + for ( i = rc = allocsz = 0; > + (i < dom->total_pages) && !rc; > + i += allocsz ) > + { > + allocsz = dom->total_pages - i; > + if ( allocsz > 1024*1024 ) > + allocsz = 1024*1024; > + rc = xc_domain_populate_physmap_exact( > + dom->xch, dom->guest_domid, allocsz, > + 0, 0, &dom->p2m_host[i]); > + } > }I think I agree with George''s comment from last round: do unify the two paths (yes, I saw it''s on the TODO for this patch... so, go ahead and do it :-) ).> +int arch_boot_numa_alloc(struct xc_dom_image *dom) > +{ > + int rc; > + unsigned int n; > + unsigned long long guest_pages; > + unsigned long long allocsz = 0, node_pfn_base, i; > + unsigned long memflags; > + > + rc = allocsz = node_pfn_base = 0; > > + allocsz = 0; > + for ( n = 0; n < dom->nr_vnodes; n++ ) > + { > + memflags = 0; > + if ( dom->vnode_to_pnode[n] != VNUMA_NO_NODE ) > + { > + memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]); > + memflags |= XENMEMF_exact_node_request; > + } > + guest_pages = (dom->vnuma_memszs[n] << 20) >> PAGE_SHIFT_X86; >I probably would call this ''pages_node'' (or ''node_pages'', or ''guest_pages_node'', etc.), but that''s also something not that important. :-) Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
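As a rough illustration of the "unify the two paths" comment, the two allocation loops in arch_setup_meminit() could collapse into something like the sketch below, where a non-vNUMA domain is simply treated as a single node with no exact-node request. This is only a sketch against the xc_dom_image fields added by this series (nr_vnodes, vnuma_memszs, vnode_to_pnode, VNUMA_NO_NODE), with a made-up function name; chunk sizing and error reporting are simplified compared to the real libxc code.

/* Sketch of one way to unify the two allocation paths discussed above. */
static int arch_boot_alloc_sketch(struct xc_dom_image *dom)
{
    unsigned int n, nr_nodes = dom->nr_vnodes ? dom->nr_vnodes : 1;
    xen_pfn_t pfn_base = 0;
    int rc = 0;

    for ( n = 0; n < nr_nodes && !rc; n++ )
    {
        unsigned long memflags = 0;
        unsigned long long pages, i, allocsz;

        if ( dom->nr_vnodes && dom->vnode_to_pnode[n] != VNUMA_NO_NODE )
        {
            memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]);
            memflags |= XENMEMF_exact_node_request;
        }

        /* per-node size in MB -> pages; the whole domain if no vNUMA */
        pages = dom->nr_vnodes ?
                (dom->vnuma_memszs[n] << 20) >> PAGE_SHIFT_X86 :
                dom->total_pages;

        for ( i = 0; i < pages && !rc; i += allocsz )
        {
            allocsz = pages - i;
            if ( allocsz > 1024 * 1024 )
                allocsz = 1024 * 1024;
            rc = xc_domain_populate_physmap_exact(
                dom->xch, dom->guest_domid, allocsz, 0, memflags,
                &dom->p2m_host[pfn_base + i]);
        }
        pfn_base += pages;
    }

    return rc;
}

The chunked xc_domain_populate_physmap_exact() calls and the 1024*1024-page cap are kept from the existing loop; only the per-node memflags selection is new.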
On mar, 2013-11-19 at 14:01 +0000, Jan Beulich wrote:> One thing that I only realized - what information is the guest > supposed to use to know the number of virtual nodes it has > (needed to size the arrays the handles here point to)? >If I understand correctly what you''re asking, the (current) answer is in the Linux patch, here: http://bugs.xenproject.org/xen/mid/%3C1384811922-14642-2-git-send-email-ufimtseva@gmail.com%3E which uses num_possible_cpus() http://lxr.linux.no/linux+v3.12/+code=num_possible_cpus to size all the three arrays. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 19.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote: > On mar, 2013-11-19 at 14:01 +0000, Jan Beulich wrote: >> One thing that I only realized - what information is the guest >> supposed to use to know the number of virtual nodes it has >> (needed to size the arrays the handles here point to)? >> > If I understand correctly what you''re asking, the (current) answer is in > the Linux patch, here: > > > http://bugs.xenproject.org/xen/mid/%3C1384811922-14642-2-git-send-email-ufimt > seva@gmail.com%3E > > which uses num_possible_cpus() > > http://lxr.linux.no/linux+v3.12/+code=num_possible_cpus > > to size all the three arrays.But what guarantees num_possible_cpus() to be larger than some hypervisor internal variables that the domain has no knowledge about? Just consider a guest with a Linux configured for just 4 CPUs and 2 nodes, but having a config file specifying 16 vCPU-s on 4 virtual nodes? Jan
On mar, 2013-11-19 at 14:48 +0000, Jan Beulich wrote:> >>> On 19.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote: > > If I understand correctly what you''re asking, the (current) answer is in > > the Linux patch, here: > > > > > > http://bugs.xenproject.org/xen/mid/%3C1384811922-14642-2-git-send-email-ufimt > > seva@gmail.com%3E > > > > which uses num_possible_cpus() > > > > http://lxr.linux.no/linux+v3.12/+code=num_possible_cpus > > > > to size all the three arrays. > > But what guarantees num_possible_cpus() to be larger than some > hypervisor internal variables that the domain has no knowledge > about? >Nothing, I guess.> Just consider a guest with a Linux configured for just 4 > CPUs and 2 nodes, but having a config file specifying 16 vCPU-s on > 4 virtual nodes? >That''s a fair point. I can''t remember the rationale behind the choice of using num_possible_cpus()... ISTR some very early version (probably not even shared on xen-devel) of the series using something like NR_CPUS, but that would suffer from the same issue, I think. Probably, we just overlooked the situation you''re describing and though that, given we don''t allow nr_vcpus > nr_vnodes, using num_possible_cpus() ought to be enough. But I see it now. So, what would the best option be? Another hypercall (or a special way of calling this one) "just" to retrieve the number of vnodes? Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 19.11.13 at 16:42, Dario Faggioli <dario.faggioli@citrix.com> wrote: > On mar, 2013-11-19 at 14:48 +0000, Jan Beulich wrote: >> Just consider a guest with a Linux configured for just 4 >> CPUs and 2 nodes, but having a config file specifying 16 vCPU-s on >> 4 virtual nodes? >> > That''s a fair point. I can''t remember the rationale behind the choice of > using num_possible_cpus()... ISTR some very early version (probably not > even shared on xen-devel) of the series using something like NR_CPUS, > but that would suffer from the same issue, I think. > > Probably, we just overlooked the situation you''re describing and though > that, given we don''t allow nr_vcpus > nr_vnodes, using > num_possible_cpus() ought to be enough. But I see it now. > > So, what would the best option be? Another hypercall (or a special way > of calling this one) "just" to retrieve the number of vnodes?Iirc there''s a padding field in the interface structure, which could be leveraged. But then again you need two counts, and hence it might be better to simply add two respective fields. Then make it/them IN/OUT, and rather than filling the arrays when they''re too small just send back the necessary values. (And of course you''ll want to also send back the actual values in case the passed in ones turned out to large, so the guest would know how many of the array elements actually have valid data). But in the end the fundamental question stands - how was a PV guest in your so far proposed model supposed to know its number of vNodes? While for HVM guests you can make this available via ACPI, that''s not an option for PV. Jan
On mar, 2013-11-19 at 15:54 +0000, Jan Beulich wrote:> >>> On 19.11.13 at 16:42, Dario Faggioli <dario.faggioli@citrix.com> wrote: > > So, what would the best option be? Another hypercall (or a special way > > of calling this one) "just" to retrieve the number of vnodes? > > Iirc there''s a padding field in the interface structure, which could > be leveraged. But then again you need two counts, and hence it > might be better to simply add two respective fields. Then make > it/them IN/OUT, and rather than filling the arrays when they''re > too small just send back the necessary values. (And of course > you''ll want to also send back the actual values in case the passed > in ones turned out to large, so the guest would know how many > of the array elements actually have valid data). > > But in the end the fundamental question stands - how was a PV > guest in your so far proposed model supposed to know its number > of vNodes? While for HVM guests you can make this available via > ACPI, that''s not an option for PV. >Wait... I''m no longer so sure I''m getting what you say. I''d be inclined to say "by the XENMEM_get_vnuma_info hcall implemented here", but then again, maybe I''m missing something. The hypercall does provide a mean for the guest to retrieve _all_ the virtual topology information, such as: - number of virtual nodes - virtual node memory ranges - virtual cpu to virtual node mapping - virtual node to physical node mapping, for use in (future) in-guest vNUMA aware subsystems (e.g., ballooning) So, if your point is (as I thought) that for properly allocating the buffers for this hypercall to work we need an information only provided by this hypercall itself, then I agree, and that''s why I asked what alternative way would be best to retrieve that bit of information. If it''s something else, then I don''t know. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 19.11.13 at 17:36, Dario Faggioli <dario.faggioli@citrix.com> wrote: > On mar, 2013-11-19 at 15:54 +0000, Jan Beulich wrote: >> >>> On 19.11.13 at 16:42, Dario Faggioli <dario.faggioli@citrix.com> wrote: >> > So, what would the best option be? Another hypercall (or a special way >> > of calling this one) "just" to retrieve the number of vnodes? >> >> Iirc there''s a padding field in the interface structure, which could >> be leveraged. But then again you need two counts, and hence it >> might be better to simply add two respective fields. Then make >> it/them IN/OUT, and rather than filling the arrays when they''re >> too small just send back the necessary values. (And of course >> you''ll want to also send back the actual values in case the passed >> in ones turned out to large, so the guest would know how many >> of the array elements actually have valid data). >> >> But in the end the fundamental question stands - how was a PV >> guest in your so far proposed model supposed to know its number >> of vNodes? While for HVM guests you can make this available via >> ACPI, that''s not an option for PV. >> > Wait... I''m no longer so sure I''m getting what you say. I''d be inclined > to say "by the XENMEM_get_vnuma_info hcall implemented here", but then > again, maybe I''m missing something. > > The hypercall does provide a mean for the guest to retrieve _all_ the > virtual topology information, such as: > - number of virtual nodes > - virtual node memory ranges > - virtual cpu to virtual node mapping > - virtual node to physical node mapping, for use in (future) in-guest > vNUMA aware subsystems (e.g., ballooning) > > So, if your point is (as I thought) that for properly allocating the > buffers for this hypercall to work we need an information only provided > by this hypercall itself, then I agree, and that''s why I asked what > alternative way would be best to retrieve that bit of information. > > If it''s something else, then I don''t know. :-)No, it is what you''re naming above. I was merely curious how you had supposed the guest would know the vNode count prior to this change request of mine. I didn''t look at the Linux patches yet (due to lack of time), hence I don''t know how you derived(?) the node count without it coming back from the hypercall here. Jan
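To make the suggestion concrete, an IN/OUT variant of the interface could take a shape along these lines. This is purely a sketch of the idea being discussed, not the structure actually proposed in the series; the struct name is made up and explicit padding for ABI alignment is left out for brevity.

/* A possible shape of the IN/OUT interface being discussed. */
struct vnuma_topology_info_sketch {
    /* IN */
    domid_t domid;
    /* IN/OUT: sizes of the arrays below; written back by the hypervisor
     * if the guest's guess does not match what Xen has for the domain. */
    uint32_t nr_vnodes;
    uint32_t nr_vcpus;
    /* OUT */
    XEN_GUEST_HANDLE(uint) vdistance;        /* nr_vnodes * nr_vnodes */
    XEN_GUEST_HANDLE(uint) vcpu_to_vnode;    /* nr_vcpus */
    XEN_GUEST_HANDLE(vmemrange_t) vmemrange; /* nr_vnodes */
};

The guest would fill nr_vnodes and nr_vcpus with the sizes of the buffers it allocated; the hypervisor would either fill the arrays or write back the counts it actually needs, so an undersized guess can be retried.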
Just one thing (for now)... On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > index 8690ec7..d7aa0b3 100644 > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c >This is the only file you modify in the patch, so the patch itself is about _xl_, rather than _libxl_, isn''t it? Please, make the subject consistent with that. It''s a big patch (and it should be even bigger, as you''ll see from my comments on patch 7/7) so, perhaps, consider splitting it up, e.g., by having one patch for each new config option introduced (if feasible and not too hard). Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-19 17:23 UTC
Re: [PATCH v3 7/7] xl: docs for xl config vnuma options
Again, only one quick note (for now). On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> xl: docs for xl config vnuma options >Changes to the documentation should happen in the same patch that introduces the feature or modifies the behavior. This means this patch should be merged with 5/7, making it even bigger and, to a certain extent, harder to review (that''s why I was suggesting breaking it down). Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
George Dunlap
2013-Nov-19 17:26 UTC
Re: [PATCH v3 7/7] xl: docs for xl config vnuma options
On 11/19/2013 05:23 PM, Dario Faggioli wrote:> Again, only one quick note (for now). > > On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote: >> xl: docs for xl config vnuma options >> > Changes to the documentation should happen in the same patch that > introduces the feature or modifies the behavior.Although there is a certain logic to that, if there are major documentation changes, I find that having the documentation in the same exact patch makes the patch a lot harder to review. I much prefer to have the documentation happen at the end of the patch series, unless it''s something really minor. (Although I generally won''t complain either way.) -George
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> > > Changes since v2: > * Added vnuma memory pfn alignment which takes into > account e820_host option and non-contiguous e820 memory map > in that case; >Wow, cool that you got this one working too! ;-P> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c > index 0de1112..2bd3653 100644> +int libxl_domain_setvnuma(libxl_ctx *ctx, > + uint32_t domid, > + uint16_t nr_vnodes, > + uint16_t nr_vcpus, > + vmemrange_t *vmemrange, > + unsigned int *vdistance, > + unsigned int *vcpu_to_vnode, > + unsigned int *vnode_to_pnode) > +{ > + GC_INIT(ctx); >Do you need this? I don''t think so: you don''t need a gc, and ctx appears to be enough for you.> + int ret; > + ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, > + nr_vcpus, vmemrange, > + vdistance, > + vcpu_to_vnode, > + vnode_to_pnode); > + GC_FREE; > + return ret; > +} > +libxl function should always return libxl error codes, such as ERROR_FAIL, ERROR_INVAL, ecc. It''s nice to show what actually went wrong in the xc_* call, but that''s usually done by: rc = xc_call(); if (rc < 0) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "error xxx"); return ERROR_FAIL; } On the grounds that libxc would properly set errno and return -1 on failure.> diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h > index abe6685..b95abab 100644> int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, > uint32_t domid); > > +int libxl__vnuma_align_mem(libxl__gc *gc, > + uint32_t domid, > + struct libxl_domain_build_info *b_info, > + vmemrange_t *memblks); > + > +unsigned long e820_memory_hole_size(unsigned long start, > + unsigned long end, > + struct e820entry e820[], > + int nr); > + > +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, > + libxl_domain_build_info *info); > + >I think we''ll better have a libxl maintainer comment on this but, FWIW, I don''t think I''d put these functions in libxl_arch.h. We do have a libxl_numa.c file, for host their implementation (rather than in libxl_dom.c) and, as libxl hidden functions, their prototype is probably fine in libxl_internal.h (unless there is some specific reason for having them here which I''m overlooking). As per e820_memory_hole_size(), it looks like it''s only called from inside e820_memory_hole_size(), which means it can be just a static (still inside libxl_numa.c).> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > index a1c16b0..378249e 100644> +/* prepares vnode to pnode map for domain vNUMA memory allocation */ > +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, > + libxl_domain_build_info *info) > +{ > + int i, n, nr_nodes, rc; >Please, initialize nr_nodes to 0. Given how libxl_get_numainfo() works, that''s the best and easiest way for being sure we won''t leak any of the memory it deals with.> + uint64_t *mems; > + unsigned long long *claim = NULL; > + libxl_numainfo *ninfo = NULL; > + > + rc = -1; >As said above, use libxl error codes.> + if (info->vnode_to_pnode == NULL) { > + info->vnode_to_pnode = calloc(info->nr_vnodes, > + sizeof(*info->vnode_to_pnode)); > + if (info->vnode_to_pnode == NULL) > + return rc; > + } > +Also, in libxl, memory allocation never fails. :-) What this really mean is that such kind of failure is handled for you by the library itself, if you use the proper wrappers (and doing so is not optional, is the way you deal with memory in libxl). 
For this specific example, look for libxl__calloc() and some usage examples of it (you will find there never is any error handling code).> + /* default setting */ > + for (i = 0; i < info->nr_vnodes; i++) > + info->vnode_to_pnode[i] = VNUMA_NO_NODE; > + > + /* Get NUMA info */ > + ninfo = libxl_get_numainfo(CTX, &nr_nodes); > + if (ninfo == NULL || nr_nodes == 0) { > + LOG(DEBUG, "No HW NUMA found.\n"); > + rc = 0; > + goto vnmapout; > + } > +I''m not so sure about this. What do you mean with "No HW NUMA found"? IIRC, on a non-NUMA box, libxl_get_numainfo returns something. I just checked on my (non-NUMA) laptop, and I do see a node 0 with all the memory and 10 as distance to itself. So, if what you get is ninfo=NULL and nr_nodes=0, it means something went wrong, and we shouldn''t return success. OTOH, if you''re interested to detect the case where the call succeeded, and there really is only one node, you should probably check for nr_nodes == 1.> + /* > + * check if we have any hardware NUMA nodes selected, > + * otherwise VNUMA_NO_NODE set and used default allocation > + */ > + if (libxl_bitmap_is_empty(&info->nodemap)) > + return 0; >Why _is_empty()? I mean, under what circumstances info->nodemap is empty? I''m asking because info->nodemap is initialized as full in libxl__domain_build_info_setdefault() and can only change in libxl__build_pre() in a way that makes it really unlikely for it to become empty... so...> + mems = info->vnuma_memszs; > + > + /* check if all vnodes will fit in one node */ > + libxl_for_each_set_bit(n, info->nodemap) { > + if(ninfo[n].free/1024 >= info->max_memkb && > + libxl_bitmap_test(&info->nodemap, n)) > + { > + /* > + * all domain v-nodes will fit one p-node n, > + * p-node n is a best candidate selected by automatic > + * NUMA placement. > + */ > + for (i = 0; i < info->nr_vnodes; i++) > + info->vnode_to_pnode[i] = n; > + /* we can exit, as we are happy with placement */ > + return 0; > + } > + } >Wait, why do we want to do that? If info->nodemap is not fully set (which BTW is the actual situation used to exemplify "auto", or "do this for me", rather than it being empty) it already contains some instructions that, I think, everyone in the downstream chain --be it other parts of libxl, libxc and Xen-- should follow. What I mean is, if info->nodemap has two bits sets, that means the domain should be "created on" the (two) nodes corresponding to the bits themselves. OTOH, if info->nodemap has only one bit set, then the whole domain should stay there and in that case there isn''t much choice, is it? (basically, your "libxl_for_each_set_bit()" will only take one step.) Here, it looks like, no matter what you we have in info->nodemap, you look for one of the nodes that can contain the whole domain, which doesn''t look right. Below...> + /* Determine the best nodes to fit vNUMA nodes */ > + /* TODO: change algorithm. 
The current just fits the nodes > + * Will be nice to have them also sorted by size > + * If no p-node found, will be set to NUMA_NO_NODE > + */ > + claim = calloc(info->nr_vnodes, sizeof(*claim)); > + if (claim == NULL) > + return rc; > + > + libxl_for_each_set_bit(n, info->nodemap) > + { > + for (i = 0; i < info->nr_vnodes; i++) > + { > + if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) && > + /*vnode was not set yet */ > + (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) ) > + { > + info->vnode_to_pnode[i] = n; > + claim[n] += (mems[i] << 20); > + } > + } > + } > +...you''re trying to do something similar, i.e., pack the domain within a subset of what you find in info->cpumap instead of just use it. Anyway, I know that automatic placement code may look tricky... Let me see if I can hack something together quickly, to show what I mean and what I''d have put here instead of all this. :-)> int libxl__build_pre(libxl__gc *gc, uint32_t domid, > libxl_domain_config *d_config, libxl__domain_build_state *state) > { > @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, > if (rc) > return rc; > } > + > + if (info->nr_vnodes > 0) { > + /* The memory blocks will be formed here from sizes */ > + struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes, > + sizeof(*memrange)); > + if (memrange == NULL) { > + LOG(DETAIL, "Failed to allocate memory for memory ranges.\n"); > + return ERROR_FAIL; > + } >AhA, here you''re using the correct memory allocation wrapper, so you can ditch the error handling.> + if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) { > + LOG(DETAIL, "Failed to align memory map.\n"); > + return ERROR_FAIL; > + } > + > + /* > + * If vNUMA vnode_to_pnode map defined, determine if we > + * can disable automatic numa placement and place vnodes > + * on specified pnodes. > + * For now, if vcpu affinity specified, we will use > + * specified vnode to pnode map. > + */ > + if (libxl__vnodemap_is_usable(gc, info) == 1) { > +Extra blank line.> + /* Will use user-defined vnode to pnode mask */ > +Here too, and with spurious spaces.> + libxl_defbool_set(&info->numa_placement, false); >Setting info->numa_placement to false only makes sense if you do it _before_ the following block of code: if (libxl_defbool_val(info->numa_placement)) { if (!libxl_bitmap_is_full(&info->cpumap)) { LOG(ERROR, "Can run NUMA placement only if no vcpu " "affinity is specified"); return ERROR_INVAL; } rc = numa_place_domain(gc, domid, info); if (rc) return rc; } There isn''t any other place where that flag is checked and, if you get past this, automatic placement either already happened (if numa_placement was true and the other condition is met), or it never will. The right thing to do would probably be having libxl__vnodemap_is_usable() called *before* (even immediately above is fine) we get here, and have _it_ disable info->numa_placement, if that reveals to be the case. 
At this point, most of the code below (until ''***'') can just live inside the above "if (libxl_defbool_val(info->numa_placement))" above, without the need of introducing another identical block.> + } > + else { > + LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n"); > + if (libxl_defbool_val(info->vnuma_placement)) { > + > + LOG(DETAIL, "Switching to automatic vnuma to pnuma placement\n."); > + /* Construct the vnode to pnode mapping if possible */ > + if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) { > + LOG(DETAIL, "Failed to call init_vnodemap\n"); > + /* vnuma_nodemap will not be used if nr_vnodes == 0 */ > + return ERROR_FAIL; > + } > + } > + else { > + LOG(ERROR, "The vnodes cannot be mapped to pnodes this way\n."); > + info->nr_vnodes = 0; > + return ERROR_FAIL; > + } > + } > + /* plumb domain with vNUMA topology */ > + if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, > + info->max_vcpus, memrange, > + info->vdistance, info->vcpu_to_vnode, > + info->vnode_to_pnode) < 0) { > + > + LOG(DETAIL, "Failed to set vnuma topology for domain from\n."); > + info->nr_vnodes = 0; > + return ERROR_FAIL; > + } > + } > + else > + LOG(DEBUG, "Will not construct vNUMA topology.\n"); > +*** <-- until here> +/* Checks if vnuma_nodemap defined in info can be used > + * for allocation of vnodes on physical NUMA nodes by > + * verifying that there is enough memory on corresponding > + * NUMA nodes. > + */ > +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info) > +{ > + unsigned int i; > + libxl_numainfo *ninfo = NULL; > + unsigned long long *claim; > + unsigned int node; > + uint64_t *mems; > + int rc, nr_nodes; > + > + rc = nr_nodes = 0; > + if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL) > + return rc; > + /* > + * Cannot use specified mapping if not NUMA machine > + */ > + ninfo = libxl_get_numainfo(CTX, &nr_nodes); > + if (ninfo == NULL) { > + return rc; > + } > + mems = info->vnuma_memszs; > + claim = calloc(info->nr_vnodes, sizeof(*claim)); > + if (claim == NULL) > + return rc; > + /* Sum memory request on per pnode basis */ > + for (i = 0; i < info->nr_vnodes; i++) > + { > + node = info->vnode_to_pnode[i]; > + /* Correct pnode number? */ > + if (node < nr_nodes) > + claim[node] += (mems[i] << 20); > + else > + goto vmapu; > + } > + for (i = 0; i < nr_nodes; i++) { > + if (claim[i] > ninfo[i].free) > + /* Cannot complete user request, falling to default */ > + goto vmapu; > + } > + rc = 1; > + > + vmapu: > + if(claim) free(claim); > + return rc; > +} >This function looks conceptually right. Despite that, it has comes with coding style, trailing whitespaces, inconsistent (wrt libxl conventions) error and memory allocation/failure handling. :-D :-D> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c > index 87a8110..206f5be 100644> + > +/* > + * Checks for the beginnig and end of RAM in e820 map for domain > + * and aligns start of first and end of last vNUMA memory block to > + * that map. vnode memory size are passed here Megabytes. > + * For PV guest e820 map has fixed hole sizes. 
> + */ > +int libxl__vnuma_align_mem(libxl__gc *gc, > + uint32_t domid, > + libxl_domain_build_info *b_info, /* IN: mem sizes */ > + vmemrange_t *memblks) /* OUT: linux numa blocks in pfn */ > +{ > + int i, j, rc; > + uint64_t next_start_pfn, end_max = 0, size, mem_hole; > + uint32_t nr; > + struct e820entry map[E820MAX]; > + > + if (b_info->nr_vnodes == 0) > + return -EINVAL; > + libxl_ctx *ctx = libxl__gc_owner(gc); > + > + /* retreive e820 map for this host */ > + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); > + > + if (rc < 0) { > + errno = rc; > + return -EINVAL; > + } > + nr = rc; > + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > + (b_info->max_memkb - b_info->target_memkb) + > + b_info->u.pv.slack_memkb); > + if (rc) > + { > + errno = rc; > + return -EINVAL; > + } > + > + /* max pfn for this host */ > + for (j = nr - 1; j >= 0; j--) > + if (map[j].type == E820_RAM) { > + end_max = map[j].addr + map[j].size; > + break; > + } > + > + memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes); > + next_start_pfn = 0; > + > + memblks[0].start = map[0].addr; > + > + for(i = 0; i < b_info->nr_vnodes; i++) { > + /* start can be not zero */ > + memblks[i].start += next_start_pfn; > + memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20); > + > + size = memblks[i].end - memblks[i].start; > + /* > + * For pv host with e820_host option turned on we need > + * to rake into account memory holes. For pv host with > + * e820_host disabled or unset, the map is contiguous > + * RAM region. > + */ > + if (libxl_defbool_val(b_info->u.pv.e820_host)) { > + while (mem_hole = e820_memory_hole_size(memblks[i].start, > + memblks[i].end, map, nr), > + memblks[i].end - memblks[i].start - mem_hole < size) > + { > + memblks[i].end += mem_hole; > + > + if (memblks[i].end > end_max) { > + memblks[i].end = end_max; > + break; > + } > + } > + } > + next_start_pfn = memblks[i].end; > + } > + > + if (memblks[i-1].end > end_max) > + memblks[i-1].end = end_max; > + > + return 0; > +} > +All this code here would require for someone who knows more about e820 to chime in (or for me to learn something more about it, which I definitely want to, but will take a bit). Perhaps you can help me at leas a bit. Can you comment on/describe what is it that you do when you find an hole? It looks from the above that you just need to move the end of the node(s) memory range(s)... Is that the case? For some reason, I was assuming that dealing with holes would have required to add support for more than one memory block per node, but it''s apparently not the case, is it? Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
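As a reference for the conventions Dario points out (libxl error codes, errno-aware logging, gc-backed allocations that never fail from the caller's point of view), a caller of the new libxc function would typically look like the sketch below. The function name is made up and the vmemrange construction is elided; it only illustrates the idiom, not the actual libxl_domain_setvnuma()/libxl__build_pre() code.

/* Sketch of the libxl idioms referred to above: libxc failures are turned
 * into libxl error codes, logging goes through the errno-aware macro, and
 * allocations use the gc wrappers (no NULL check, no explicit free).
 */
static int libxl__vnuma_set_example(libxl__gc *gc, uint32_t domid,
                                    libxl_domain_build_info *info)
{
    libxl_ctx *ctx = libxl__gc_owner(gc);
    vmemrange_t *ranges;
    int rc;

    /* gc-backed allocation: freed together with the gc, never NULL here */
    ranges = libxl__calloc(gc, info->nr_vnodes, sizeof(*ranges));

    /* ... fill ranges from info->vnuma_memszs (elided) ... */

    rc = xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, info->max_vcpus,
                            ranges, info->vdistance, info->vcpu_to_vnode,
                            info->vnode_to_pnode);
    if (rc < 0) {
        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR,
                         "setting vNUMA topology failed");
        return ERROR_FAIL;
    }

    return 0;
}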
Elena Ufimtseva
2013-Nov-19 22:06 UTC
Re: [PATCH v3 2/7] libxc: Plumb Xen with vNUMA topology for domain.
On Tue, Nov 19, 2013 at 9:03 AM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:> [...]
>> >> > + set_xen_guest_handle(domctl.u.vnuma.vmemrange, vmemrange); >> > + set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance); >> > + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode); >> > + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode); >> > + >> ^ and here >> >> > + domctl.cmd = XEN_DOMCTL_setvnumainfo; >> > + domctl.domain = (domid_t)domid; >> > + domctl.u.vnuma.nr_vnodes = nr_vnodes; >> > + domctl.u.vnuma.__pad = 0; >> > + >> ^ here >> > + rc = do_domctl(xch, &domctl); >> > + >> ^ here >> >> > + xc_hypercall_bounce_post(xch, vmemrange); >> > + xc_hypercall_bounce_post(xch, vdistance); >> > + xc_hypercall_bounce_post(xch, vcpu_to_vnode); >> > + xc_hypercall_bounce_post(xch, vnode_to_pnode); >> > + >> ^ and here too. >> >> I double checked that this isn''t an issue with my MUA by importing the >> patch in this e-mail and I confirm it is not, it looks like there are a >> few trailing whitespaces around. >> >> I went checking a couple of the other patches too, and found something >> similar. Can you fix this? >> >> For the Linux patches, checkpatch is usually of great help. For Xen >> ones, I don''t think we have anything like it (yet? :-D). However, for >> this specific problem (trailing whitespaces), even just looking at the >> patch with `git show'' should make the issue evident. For instance, here >> I see them as big red rectangles. >> > > Or in vim you can have > > syn match ErrorLeadSpace /^ \+/ " highlight any leading spaces > syn match ErrorTailSpace / \+$/ " highlight any trailing spaces > syn match Error80 /\%>80v.\+/ " highlight anything past 80 > > hi Error80 ctermbg=red > hi ErrorLeadSpace ctermbg=red > hi ErrorTailSpace ctermbg=redTrue ) checkpatch also works with --types TRAILING_WHITESPACE or others options. Ill clean these.> > > And that will nicely point it to you. > >> Regards, >> Dario >> >> -- >> <<This happens because I choose it to happen!>> (Raistlin Majere) >> ----------------------------------------------------------------- >> Dario Faggioli, Ph.D, http://about.me/dario.faggioli >> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >> > > > >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel >-- Elena
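For reference, the _BOUNCE_IN variant being discussed would only change the last argument of each declaration; whether the crash Elena saw actually rules this out is the open question above. A fragment-only sketch, based on the declarations in the quoted patch:

    /* Sketch: same bounce declarations, but copying toward Xen only.
     * XC_HYPERCALL_BUFFER_BOUNCE_IN copies the caller's buffer into the
     * hypercall buffer before the call and does not copy it back after. */
    DECLARE_HYPERCALL_BOUNCE(vmemrange, sizeof(*vmemrange) * nr_vnodes,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
    DECLARE_HYPERCALL_BOUNCE(vdistance, sizeof(*vdistance) *
                             nr_vnodes * nr_vnodes,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
    DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);
    DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode, sizeof(*vnode_to_pnode) *
                             nr_vnodes,
                             XC_HYPERCALL_BUFFER_BOUNCE_IN);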
On Tue, Nov 19, 2013 at 9:24 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 8690ec7..d7aa0b3 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -40,6 +40,7 @@
>  #include "libxl_json.h"
>  #include "libxlutil.h"
>  #include "xl.h"
> +#include "libxl_vnuma.h"
>
>  #define CHK_ERRNO( call ) ({ \
>          int chk_errno = (call); \
> @@ -572,6 +573,75 @@ vcpp_out:
>      return rc;
>  }
>
> +static void vnuma_info_release(libxl_domain_build_info *info)
> +{
> +    if ( info->vnuma_memszs ) free(info->vnuma_memszs);
> +    if ( info->vdistance ) free(info->vdistance);
> +    if ( info->vcpu_to_vnode ) free(info->vcpu_to_vnode);
> +    if ( info->vnode_to_pnode ) free(info->vnode_to_pnode);

You don't need to check for NULL before passing a pointer to free. I
just removed most instances of this pattern Xen-wide and I'd hate to
have some reintroduced :)

- Matthew
Elena Ufimtseva
2013-Nov-21 03:20 UTC
Re: [PATCH v3 5/7] libxl: vNUMA configuration parser
On Wed, Nov 20, 2013 at 5:48 PM, Matthew Daley <mattd@bugfuzz.com> wrote:
> On Tue, Nov 19, 2013 at 9:24 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
>> index 8690ec7..d7aa0b3 100644
>> --- a/tools/libxl/xl_cmdimpl.c
>> +++ b/tools/libxl/xl_cmdimpl.c
>> @@ -40,6 +40,7 @@
>>  #include "libxl_json.h"
>>  #include "libxlutil.h"
>>  #include "xl.h"
>> +#include "libxl_vnuma.h"
>>
>>  #define CHK_ERRNO( call ) ({ \
>>          int chk_errno = (call); \
>> @@ -572,6 +573,75 @@ vcpp_out:
>>      return rc;
>>  }
>>
>> +static void vnuma_info_release(libxl_domain_build_info *info)
>> +{
>> +    if ( info->vnuma_memszs ) free(info->vnuma_memszs);
>> +    if ( info->vdistance ) free(info->vdistance);
>> +    if ( info->vcpu_to_vnode ) free(info->vcpu_to_vnode);
>> +    if ( info->vnode_to_pnode ) free(info->vnode_to_pnode);
>
> You don't need to check for NULL before passing a pointer to free. I
> just removed most instances of this pattern Xen-wide and I'd hate to
> have some reintroduced :)

Thank you Matthew for pointing this out :)

> - Matthew

-- 
Elena
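For reference, the simplification Matthew is asking for is simply dropping the checks (field names taken from the quoted hunk):

    static void vnuma_info_release(libxl_domain_build_info *info)
    {
        /* free(NULL) is defined to be a no-op, so no NULL checks are needed */
        free(info->vnuma_memszs);
        free(info->vdistance);
        free(info->vcpu_to_vnode);
        free(info->vnode_to_pnode);
    }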
Hi Elena and Dario,

> +    /* Determine the best nodes to fit vNUMA nodes */
> +    /* TODO: change algorithm. The current just fits the nodes
> +     * Will be nice to have them also sorted by size
> +     * If no p-node found, will be set to NUMA_NO_NODE
> +     */
> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
> +    if (claim == NULL)
> +        return rc;
> +
> +    libxl_for_each_set_bit(n, info->nodemap)
> +    {
> +        for (i = 0; i < info->nr_vnodes; i++)
> +        {
> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
> +                /*vnode was not set yet */
> +                (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
> +            {
> +                info->vnode_to_pnode[i] = n;
> +                claim[n] += (mems[i] << 20);
> +            }
> +        }
> +    }
> +

So in the initial part, each vnode is set to EXACTLY ONE pnode, right?

-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science,
Peking University, China

Nothing is impossible because impossible itself says: " I'm possible "
lccycc From PKU
On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote:
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>

> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> index b141877..1cf74ee 100644
> @@ -389,6 +389,24 @@ static void dump_numa(unsigned char key)
>
>      for_each_online_node(i)
>          printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +    if (d->vnuma.nr_vnodes > 0)
> +    {
> +        printk("    Domain has %d vnodes\n", d->vnuma.nr_vnodes);
> +        for (i = 0; i < d->vnuma.nr_vnodes; i++) {
> +            printk("        vnode %d - pnode %d:", i,
>
^ I'd indent this a bit less, but that's certainly a minor thing.

> +                d->vnuma.vnode_to_pnode[i] >= MAX_NUMNODES ? -1 : d->vnuma.vnode_to_pnode[i]);
>
How long is this line above? :-P

> +            printk(" %"PRIu64" MB \n",
> +                (d->vnuma.vmemrange[i].end - d->vnuma.vmemrange[i].start) >> 20);
> +        }
> +
> +        printk("    Domain vcpu to vnode: \n");
> +        for (i = 0; i < d->max_vcpus; i++)
> +            printk("%s%d %s", i % 8 == 0 ? "    " : "",
> +                d->vnuma.vcpu_to_vnode[i],
> +                (i + 1) % 8 == 0 ? "\n" : "");
>
Do you have an example of the output for this? It looks like you're just
listing the vcpus, although in a specific order... How is one supposed
to tell which set of vcpus belongs to which vnode?

Wouldn't it be clearer to print the list of vcpus belonging to each node
right after having printed its memory (i.e., in the for loop above)?

Also, regarding the '(i + 1) % 8 == 0 ? "\n" : ""', I found it rather
ugly. It may be a matter of personal taste, but I think something along
the lines of:

    if ( !(i % 8) )
        printk("\n");

would be much more readable.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
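Taking the two remarks together, the dump could list the vcpus of each vnode inside the existing per-vnode loop, roughly like the sketch below. This is only an illustration of the suggestion, not code from the series; j and n are extra local variables.

    /* Sketch: inside the existing per-vnode loop, right after the
     * memory size has been printed for vnode i. */
    printk("        vcpus:");
    for ( j = 0, n = 0; j < d->max_vcpus; j++ )
    {
        if ( d->vnuma.vcpu_to_vnode[j] != i )
            continue;
        if ( n && !(n % 8) )          /* break the line every 8 vcpus */
            printk("\n              ");
        printk(" %d", j);
        n++;
    }
    printk("\n");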
On Tue, Nov 19, 2013 at 11:43 AM, Jan Beulich <JBeulich@suse.com> wrote:>>>> On 19.11.13 at 17:36, Dario Faggioli <dario.faggioli@citrix.com> wrote: >> On mar, 2013-11-19 at 15:54 +0000, Jan Beulich wrote: >>> >>> On 19.11.13 at 16:42, Dario Faggioli <dario.faggioli@citrix.com> wrote: >>> > So, what would the best option be? Another hypercall (or a special way >>> > of calling this one) "just" to retrieve the number of vnodes? >>> >>> Iirc there''s a padding field in the interface structure, which could >>> be leveraged. But then again you need two counts, and hence it >>> might be better to simply add two respective fields. Then make >>> it/them IN/OUT, and rather than filling the arrays when they''re >>> too small just send back the necessary values. (And of course >>> you''ll want to also send back the actual values in case the passed >>> in ones turned out to large, so the guest would know how many >>> of the array elements actually have valid data). >>> >>> But in the end the fundamental question stands - how was a PV >>> guest in your so far proposed model supposed to know its number >>> of vNodes? While for HVM guests you can make this available via >>> ACPI, that''s not an option for PV. >>> >> Wait... I''m no longer so sure I''m getting what you say. I''d be inclined >> to say "by the XENMEM_get_vnuma_info hcall implemented here", but then >> again, maybe I''m missing something. >> >> The hypercall does provide a mean for the guest to retrieve _all_ the >> virtual topology information, such as: >> - number of virtual nodes >> - virtual node memory ranges >> - virtual cpu to virtual node mapping >> - virtual node to physical node mapping, for use in (future) in-guest >> vNUMA aware subsystems (e.g., ballooning) >> >> So, if your point is (as I thought) that for properly allocating the >> buffers for this hypercall to work we need an information only provided >> by this hypercall itself, then I agree, and that''s why I asked what >> alternative way would be best to retrieve that bit of information. >> >> If it''s something else, then I don''t know. :-) > > No, it is what you''re naming above. I was merely curious how you > had supposed the guest would know the vNode count prior to this > change request of mine. I didn''t look at the Linux patches yet (due > to lack of time), hence I don''t know how you derived(?) the node > count without it coming back from the hypercall here. > > Jan >Hello Jan and Dario. I have looked at what Jan asked and wanted to see if that can be resolved. Jan is right, if the guest is running with Linux configured with maxcpus less than vcpus in VM config, there is a problem. On this boot stage where xen_numa_init is called xen_smp_prepare_cpus equals to vcpus in config; It only will be reduced to maxcpus (from kernel boot args) after xen_numa_init during xen_smp_prepare. In xen_numa_init I have all values I need to make a decision in regards to initialize vnuma or not, or modify. These are the numbers I have in xen_numa_init: num_possible_cpus() = hypervisor provided guest vcpus; setup_max_cpus = boot kernel param maxcpus; When setup_max_cpus > num_possible_cpus, num_possible_cpus will be brought up; I can detect that setup_max_cpus < num_possible_cpus and do not init vnuma at all, and just do a fake node. I can also make sure that hypervisor is aware of it (by calling same subop with NULL, lets suppose). Then hypervisor have to make some decision regarding vnuma topology for this domain. 
Thus this will be as before, when the guest is not aware of the
underlying NUMA topology. It will have to fix the vcpu_to_vnode mask,
and possibly adjust the pinning of vcpus to cpus. The memory, if
allocated on different nodes, will remain as it is.

Does it sound like a sensible solution? Or maybe you have some other
ideas?

-- 
Elena
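For the record, the check Elena describes would boil down to something like this inside the guest's xen_numa_init(). This is only a sketch, assuming xen_numa_init() follows the usual x86 convention of returning an error so that the generic NUMA code falls back to a single fake node; setup_max_cpus and num_possible_cpus() are the quantities named above.

    /* Sketch: early in xen_numa_init(), skip vNUMA setup when the
     * kernel will not bring up all of the vcpus the toolstack
     * configured for the guest. */
    if (setup_max_cpus < num_possible_cpus()) {
        pr_info("xen: maxcpus=%u < %u possible vcpus, not enabling vNUMA\n",
                setup_max_cpus, num_possible_cpus());
        return -EINVAL;   /* generic code then falls back to one node */
    }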
Elena Ufimtseva
2013-Nov-26 22:14 UTC
Re: [PATCH v3 4/7] libxl: vNUMA supporting interface
On Thu, Nov 21, 2013 at 4:59 AM, Li Yechen <lccycc123@gmail.com> wrote:
> Hi Elena and Dario,
>> +    /* Determine the best nodes to fit vNUMA nodes */
>> +    /* TODO: change algorithm. The current just fits the nodes
>> +     * Will be nice to have them also sorted by size
>> +     * If no p-node found, will be set to NUMA_NO_NODE
>> +     */
>> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
>> +    if (claim == NULL)
>> +        return rc;
>> +
>> +    libxl_for_each_set_bit(n, info->nodemap)
>> +    {
>> +        for (i = 0; i < info->nr_vnodes; i++)
>> +        {
>> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
>> +                /*vnode was not set yet */
>> +                (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
>> +            {
>> +                info->vnode_to_pnode[i] = n;
>> +                claim[n] += (mems[i] << 20);
>> +            }
>> +        }
>> +    }
>> +
>
> So in the initial part, each vnode is set to EXACTLY ONE pnode, right?

Hi Yechen

Yes, one node or VNUMA_NO_NODE (any node);

-- 
Elena
On mar, 2013-11-26 at 17:14 -0500, Elena Ufimtseva wrote:
> On Thu, Nov 21, 2013 at 4:59 AM, Li Yechen <lccycc123@gmail.com> wrote:
> > Hi Elena and Dario,
> >> +    /* Determine the best nodes to fit vNUMA nodes */
> >> +    /* TODO: change algorithm. The current just fits the nodes
> >> +     * Will be nice to have them also sorted by size
> >> +     * If no p-node found, will be set to NUMA_NO_NODE
> >> +     */
> >> +    claim = calloc(info->nr_vnodes, sizeof(*claim));
> >> +    if (claim == NULL)
> >> +        return rc;
> >> +
> >> +    libxl_for_each_set_bit(n, info->nodemap)
> >> +    {
> >> +        for (i = 0; i < info->nr_vnodes; i++)
> >> +        {
> >> +            if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) &&
> >> +                /*vnode was not set yet */
> >> +                (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) )
> >> +            {
> >> +                info->vnode_to_pnode[i] = n;
> >> +                claim[n] += (mems[i] << 20);
> >> +            }
> >> +        }
> >> +    }
> >> +
> >
> > So in the initial part, each vnode is set to EXACTLY ONE pnode, right?
>
> Hi Yechen
>
> Yes, one node or VNUMA_NO_NODE (any node);
>
Yes but, at the same time, this particular chunk of code is most likely
going to change, in order to better integrate with automatic NUMA
placement.

In fact, if I'm reading this right (as I said when reviewing this
patch), this code kind of tries to compact more vnodes into the same
pnode, if they fit. Although I see the reason why you may want to do
that, I'm still of the opinion that, if info->nodemap carries some
information coming from the placement algorithm, we should just honor
it... That thing (the placement algorithm) is too complicated to just
throw away its results! :-P

I'm sorry I haven't gotten around to writing some code for this yet; I
hope to find some time to do that soon enough.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
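For what it's worth, "just honoring" the placement result could be as simple as spreading the vnodes across the pnodes selected in info->nodemap. Below is a rough sketch of one such policy (plain round-robin), offered only as an illustration and not necessarily what Dario has in mind:

    /* Sketch: assign vnodes to the pnodes chosen by automatic placement
     * (the set bits of info->nodemap), round-robin, instead of trying
     * to compact them into fewer pnodes. */
    int n, v = 0, nr_set = libxl_bitmap_count_set(&info->nodemap);

    if (!nr_set)
        return ERROR_INVAL;

    libxl_for_each_set_bit(n, info->nodemap) {
        int i;
        /* vnode v, v + nr_set, v + 2*nr_set, ... go to pnode n */
        for (i = v; i < info->nr_vnodes; i += nr_set)
            info->vnode_to_pnode[i] = n;
        v++;
    }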
On mar, 2013-11-26 at 16:59 -0500, Elena Ufimtseva wrote:
> Hello Jan and Dario.
>
Hi,

> I have looked at what Jan asked and wanted to see if that can be resolved.
>
> Jan is right, if the guest is running with Linux configured with
> maxcpus less than vcpus in VM config,
> there is a problem.
>
Yep, there is indeed.

> On this boot stage where xen_numa_init is called xen_smp_prepare_cpus
> equals to vcpus in config;
> It only will be reduced to maxcpus (from kernel boot args) after
> xen_numa_init during xen_smp_prepare.
>
> In xen_numa_init I have all values I need to make a decision in
> regards to initialize vnuma or not, or modify.
>
> These are the numbers I have in xen_numa_init:
> num_possible_cpus() = hypervisor provided guest vcpus;
> setup_max_cpus = boot kernel param maxcpus;
>
> When setup_max_cpus > num_possible_cpus, num_possible_cpus will be brought up;
>
Wait. I think I see what you mean, but that's probably a different
issue from what (at least as far as I understood it) Jan is reporting.

So, in Linux you can specify NR_CPUS at compile time, right? Imagine
doing that and setting it to 1. Then, you use the resulting kernel for
a VM, and in the VM's config file you specify 4 vcpus and 2 vnodes.

Now, apart from what you say above, i.e., only num_possible_cpus()=1
will be brought up, that would also mean that, in xen_numa_init(), you
allocate an array only 1 element big for hosting vnuma data and pass it
to Xen, which will overflow it when trying to put info for 2 vnodes
there! That's the thing we want to avoid...

> I can detect that setup_max_cpus < num_possible_cpus and do not init
> vnuma at all, and just do a fake node.
> I can also make sure that hypervisor is aware of it (by calling same
> subop with NULL, lets suppose).
>
Right, but I'm not sure I'm getting how detecting that would be
relevant to the issue above... What we'd need is something like
nr_vnodes<num_possible_cpus(), which means we need some mechanism to
retrieve the number of nodes _before_ allocating the arrays.

As for what to do, well, once we have a way to know the number of
vnodes, you can just allocate arrays of the correct size and avoid this
issue altogether. Of course, as we do not support guests with more
nodes than vcpus, at that point you can also check whether
nr_vnodes>num_possible_cpus() and, if yes, have xen_numa_init() take
some error path, but that is a different thing.

Does this make sense?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
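In other words, the guest-side flow being asked for is a two-step query, roughly like the sketch below. The structure and field names here are illustrative assumptions, not necessarily those used by the actual patches.

    /* Sketch: learn nr_vnodes first, then size the buffers for the
     * real topology query.  All structure/field names are illustrative. */
    struct vnuma_topology_info info = { .domid = DOMID_SELF };
    int rc;

    /* Step 1: a probe call that only fills in the counts. */
    rc = HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &info);
    if (rc || info.nr_vnodes == 0 || info.nr_vnodes > num_possible_cpus())
        return -EINVAL;                 /* no usable vNUMA topology */

    /* Step 2: allocate correctly sized arrays and repeat the call with
     * the array handles filled in. */
    vdistance = kcalloc(info.nr_vnodes * info.nr_vnodes,
                        sizeof(*vdistance), GFP_KERNEL);
    vmemrange = kcalloc(info.nr_vnodes, sizeof(*vmemrange), GFP_KERNEL);
    vcpu_to_vnode = kcalloc(num_possible_cpus(),
                            sizeof(*vcpu_to_vnode), GFP_KERNEL);
    /* ... then issue the full XENMEM_get_vnuma_info call. */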
>>> On 26.11.13 at 22:59, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> Jan is right, if the guest is running with Linux configured with
> maxcpus less than vcpus in VM config,
> there is a problem.
>
> On this boot stage where xen_numa_init is called xen_smp_prepare_cpus
> equals to vcpus in config;
> It only will be reduced to maxcpus (from kernel boot args) after
> xen_numa_init during xen_smp_prepare.
>
> In xen_numa_init I have all values I need to make a decision in
> regards to initialize vnuma or not, or modify.
>
> These are the numbers I have in xen_numa_init:
> num_possible_cpus() = hypervisor provided guest vcpus;
> setup_max_cpus = boot kernel param maxcpus;
>
> When setup_max_cpus > num_possible_cpus, num_possible_cpus will be brought up;
>
> I can detect that setup_max_cpus < num_possible_cpus and do not init
> vnuma at all, and just do a fake node.
> I can also make sure that hypervisor is aware of it (by calling same
> subop with NULL, lets suppose).
>
> Then hypervisor have to make some decision regarding vnuma topology
> for this domain. Thus this will be as before, when
> guest is not aware of underlying NUMA. It will have to fix
> vcpu_to_vnode mask, and possibly adjust pinned vcpus to cpus.
> The memory, if allocated on different nodes, will remain as it is.
>
> Does it sound like sensible solution? Or maybe you have some other ideas?

Imo there's nothing the hypervisor should do in reaction to the guest
not utilizing all of its assigned vCPU-s - it can't really know whether
they're just not being brought up at boot, but may get started later.

All we need here is a way for the guest to learn its number of virtual
nodes in order to sensibly use the hypercall to obtain its virtual
topology information.

Jan
On Wed, Nov 27, 2013 at 3:14 AM, Jan Beulich <JBeulich@suse.com> wrote:>>>> On 26.11.13 at 22:59, Elena Ufimtseva <ufimtseva@gmail.com> wrote: >> Jan is right, if the guest is running with Linux configured with >> maxcpus less than vcpus in VM config, >> there is a problem. >> >> On this boot stage where xen_numa_init is called xen_smp_prepare_cpus >> equals to vcpus in config; >> It only will be reduced to maxcpus (from kernel boot args) after >> xen_numa_init during xen_smp_prepare. >> >> In xen_numa_init I have all values I need to make a decision in >> regards to initialize vnuma or not, or modify. >> >> These are the numbers I have in xen_numa_init: >> num_possible_cpus() = hypervisor provided guest vcpus; >> setup_max_cpus = boot kernel param maxcpus; >> >> When setup_max_cpus > num_possible_cpus, num_possible_cpus will bebrought>> up; >> >> I can detect that setup_max_cpus < num_possible_cpus and do not init >> vnuma at all, and just do a fake node. >> I can also make sure that hypervisor is aware of it (by calling same >> subop with NULL, lets suppose). >> >> Then hypervisor have to make some decision regarding vnuma topology >> for this domain. Thus this will be as before, when >> guest is not aware of underlying NUMA. It will have to fix >> vcpu_to_vnode mask, and possibly adjust pinned vcpus to cpus. >> The memory, if allocated on different nodes, will remain as it is. >> >> Does it sound like sensible solution? Or maybe you have some other ideas? > > Imo there''s nothing the hypervisor should do in reaction to the > guest not utilizing all of its assigned vCPU-s - it can''t really know > whether they''re just not being brought up at boot, but may get > started later. > > All we need here is a way for the guest to learn its number of > virtual nodes in order to sensibly use the hypercall to obtain its > virtual topology information. >I worked on this and now I have another subop hypercall to retreive number of nodes before doing any allocations. Not sure the second number you were talking about, but looks like number of nodes is enough to make correct allocations.> Jan >-- Elena _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
>>> On 02.12.13 at 18:06, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> On Wed, Nov 27, 2013 at 3:14 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> All we need here is a way for the guest to learn its number of
>> virtual nodes in order to sensibly use the hypercall to obtain its
>> virtual topology information.
>
> I worked on this and now I have another subop hypercall to retreive number
> of nodes
> before doing any allocations.

Good, albeit from the previous code I doubt you really need a new
sub-op.

Jan
On Mon, Dec 2, 2013 at 12:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 02.12.13 at 18:06, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>> On Wed, Nov 27, 2013 at 3:14 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>> All we need here is a way for the guest to learn its number of
>>> virtual nodes in order to sensibly use the hypercall to obtain its
>>> virtual topology information.
>>
>> I worked on this and now I have another subop hypercall to retreive number
>> of nodes
>> before doing any allocations.
>
> Good, albeit from the previous code I doubt you really need a new
> sub-op.

True, I did not want to obscure the code, but it can be done as one
sub-op.

> Jan

-- 
Elena
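Conceptually, the alternative Jan sketched earlier (keeping a single sub-op and making the counts IN/OUT) would amount to something like this on the hypervisor side. This is only a sketch, with made-up field names and an arbitrary choice of error code:

    /* Sketch: if the guest's buffers are too small, report the real
     * counts back instead of filling the arrays. */
    if ( topo.nr_vnodes < d->vnuma.nr_vnodes ||
         topo.nr_vcpus < d->max_vcpus )
    {
        topo.nr_vnodes = d->vnuma.nr_vnodes;
        topo.nr_vcpus = d->max_vcpus;
        rc = __copy_to_guest(arg, &topo, 1) ? -EFAULT : -ENOBUFS;
        goto out;
    }
    /* ... otherwise copy the arrays and write back the actual counts. */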
Elena Ufimtseva
2013-Dec-02 18:14 UTC
Re: [PATCH v3 4/7] libxl: vNUMA supporting interface
On Tue, Nov 19, 2013 at 1:37 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> On lun, 2013-11-18 at 15:24 -0500, Elena Ufimtseva wrote: >> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> >> >> Changes since v2: >> * Added vnuma memory pfn alignment which takes into >> account e820_host option and non-contiguous e820 memory map >> in that case; >> > Wow, cool that you got this one working too! ;-P > >> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c >> index 0de1112..2bd3653 100644 > >> +int libxl_domain_setvnuma(libxl_ctx *ctx, >> + uint32_t domid, >> + uint16_t nr_vnodes, >> + uint16_t nr_vcpus, >> + vmemrange_t *vmemrange, >> + unsigned int *vdistance, >> + unsigned int *vcpu_to_vnode, >> + unsigned int *vnode_to_pnode) >> +{ >> + GC_INIT(ctx); >> > Do you need this? I don''t think so: you don''t need a gc, and ctx appears > to be enough for you. > >> + int ret; >> + ret = xc_domain_setvnuma(ctx->xch, domid, nr_vnodes, >> + nr_vcpus, vmemrange, >> + vdistance, >> + vcpu_to_vnode, >> + vnode_to_pnode); >> + GC_FREE; >> + return ret; >> +} >> + > libxl function should always return libxl error codes, such as > ERROR_FAIL, ERROR_INVAL, ecc. It''s nice to show what actually went wrong > in the xc_* call, but that''s usually done by: > > rc = xc_call(); > if (rc < 0) { > LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "error xxx"); > return ERROR_FAIL; > } > > On the grounds that libxc would properly set errno and return -1 on > failure. > > >> diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h >> index abe6685..b95abab 100644 > >> int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, >> uint32_t domid); >> >> +int libxl__vnuma_align_mem(libxl__gc *gc, >> + uint32_t domid, >> + struct libxl_domain_build_info *b_info, >> + vmemrange_t *memblks); >> + >> +unsigned long e820_memory_hole_size(unsigned long start, >> + unsigned long end, >> + struct e820entry e820[], >> + int nr); >> + >> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, >> + libxl_domain_build_info *info); >> + >> > I think we''ll better have a libxl maintainer comment on this but, FWIW, > I don''t think I''d put these functions in libxl_arch.h. > > We do have a libxl_numa.c file, for host their implementation (rather > than in libxl_dom.c) and, as libxl hidden functions, their prototype is > probably fine in libxl_internal.h (unless there is some specific reason > for having them here which I''m overlooking). > > As per e820_memory_hole_size(), it looks like it''s only called from > inside e820_memory_hole_size(), which means it can be just a static > (still inside libxl_numa.c). > >> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c >> index a1c16b0..378249e 100644 > >> +/* prepares vnode to pnode map for domain vNUMA memory allocation */ >> +int libxl__init_vnode_to_pnode(libxl__gc *gc, uint32_t domid, >> + libxl_domain_build_info *info) >> +{ >> + int i, n, nr_nodes, rc; >> > Please, initialize nr_nodes to 0. Given how libxl_get_numainfo() works, > that''s the best and easiest way for being sure we won''t leak any of the > memory it deals with. > >> + uint64_t *mems; >> + unsigned long long *claim = NULL; >> + libxl_numainfo *ninfo = NULL; >> + >> + rc = -1; >> > As said above, use libxl error codes. > >> + if (info->vnode_to_pnode == NULL) { >> + info->vnode_to_pnode = calloc(info->nr_vnodes, >> + sizeof(*info->vnode_to_pnode)); >> + if (info->vnode_to_pnode == NULL) >> + return rc; >> + } >> + > Also, in libxl, memory allocation never fails. 
:-) What this really mean > is that such kind of failure is handled for you by the library itself, > if you use the proper wrappers (and doing so is not optional, is the way > you deal with memory in libxl). > > For this specific example, look for libxl__calloc() and some usage > examples of it (you will find there never is any error handling code). > >> + /* default setting */ >> + for (i = 0; i < info->nr_vnodes; i++) >> + info->vnode_to_pnode[i] = VNUMA_NO_NODE; >> + >> + /* Get NUMA info */ >> + ninfo = libxl_get_numainfo(CTX, &nr_nodes); >> + if (ninfo == NULL || nr_nodes == 0) { >> + LOG(DEBUG, "No HW NUMA found.\n"); >> + rc = 0; >> + goto vnmapout; >> + } >> + > I''m not so sure about this. What do you mean with "No HW NUMA found"? > IIRC, on a non-NUMA box, libxl_get_numainfo returns something. I just > checked on my (non-NUMA) laptop, and I do see a node 0 with all the > memory and 10 as distance to itself. > > So, if what you get is ninfo=NULL and nr_nodes=0, it means something > went wrong, and we shouldn''t return success. OTOH, if you''re interested > to detect the case where the call succeeded, and there really is only > one node, you should probably check for nr_nodes == 1.If thats the case and there will be at least one node present, then I dont care about this )> >> + /* >> + * check if we have any hardware NUMA nodes selected, >> + * otherwise VNUMA_NO_NODE set and used default allocation >> + */ >> + if (libxl_bitmap_is_empty(&info->nodemap)) >> + return 0; >> > Why _is_empty()? I mean, under what circumstances info->nodemap is > empty? > > I''m asking because info->nodemap is initialized as full in > libxl__domain_build_info_setdefault() and can only change in > libxl__build_pre() in a way that makes it really unlikely for it to > become empty... so... > >> + mems = info->vnuma_memszs; >> + >> + /* check if all vnodes will fit in one node */ >> + libxl_for_each_set_bit(n, info->nodemap) { >> + if(ninfo[n].free/1024 >= info->max_memkb && >> + libxl_bitmap_test(&info->nodemap, n)) >> + { >> + /* >> + * all domain v-nodes will fit one p-node n, >> + * p-node n is a best candidate selected by automatic >> + * NUMA placement. >> + */ >> + for (i = 0; i < info->nr_vnodes; i++) >> + info->vnode_to_pnode[i] = n; >> + /* we can exit, as we are happy with placement */ >> + return 0; >> + } >> + } >> > Wait, why do we want to do that? If info->nodemap is not fully set > (which BTW is the actual situation used to exemplify "auto", or "do this > for me", rather than it being empty) it already contains some > instructions that, I think, everyone in the downstream chain --be it > other parts of libxl, libxc and Xen-- should follow. > > What I mean is, if info->nodemap has two bits sets, that means the > domain should be "created on" the (two) nodes corresponding to the bits > themselves. OTOH, if info->nodemap has only one bit set, then the whole > domain should stay there and in that case there isn''t much choice, is > it? (basically, your "libxl_for_each_set_bit()" will only take one > step.) > > Here, it looks like, no matter what you we have in info->nodemap, you > look for one of the nodes that can contain the whole domain, which > doesn''t look right.> > Below... >> + /* Determine the best nodes to fit vNUMA nodes */ >> + /* TODO: change algorithm. 
The current just fits the nodes >> + * Will be nice to have them also sorted by size >> + * If no p-node found, will be set to NUMA_NO_NODE >> + */ >> + claim = calloc(info->nr_vnodes, sizeof(*claim)); >> + if (claim == NULL) >> + return rc; >> + >> + libxl_for_each_set_bit(n, info->nodemap) >> + { >> + for (i = 0; i < info->nr_vnodes; i++) >> + { >> + if (((claim[n] + (mems[i] << 20)) <= ninfo[n].free) && >> + /*vnode was not set yet */ >> + (info->vnode_to_pnode[i] == VNUMA_NO_NODE ) ) >> + { >> + info->vnode_to_pnode[i] = n; >> + claim[n] += (mems[i] << 20); >> + } >> + } >> + } >> + > ...you''re trying to do something similar, i.e., pack the domain within a > subset of what you find in info->cpumap instead of just use it. > > Anyway, I know that automatic placement code may look tricky... Let me > see if I can hack something together quickly, to show what I mean and > what I''d have put here instead of all this. :-) > >> int libxl__build_pre(libxl__gc *gc, uint32_t domid, >> libxl_domain_config *d_config, libxl__domain_build_state *state) >> { >> @@ -235,6 +318,66 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, >> if (rc) >> return rc; >> } >> + >> + if (info->nr_vnodes > 0) { >> + /* The memory blocks will be formed here from sizes */ >> + struct vmemrange *memrange = libxl__calloc(gc, info->nr_vnodes, >> + sizeof(*memrange)); >> + if (memrange == NULL) { >> + LOG(DETAIL, "Failed to allocate memory for memory ranges.\n"); >> + return ERROR_FAIL; >> + } >> > AhA, here you''re using the correct memory allocation wrapper, so you can > ditch the error handling. > >> + if (libxl__vnuma_align_mem(gc, domid, info, memrange) < 0) { >> + LOG(DETAIL, "Failed to align memory map.\n"); >> + return ERROR_FAIL; >> + } >> + >> + /* >> + * If vNUMA vnode_to_pnode map defined, determine if we >> + * can disable automatic numa placement and place vnodes >> + * on specified pnodes. >> + * For now, if vcpu affinity specified, we will use >> + * specified vnode to pnode map. >> + */ >> + if (libxl__vnodemap_is_usable(gc, info) == 1) { >> + > Extra blank line. >> + /* Will use user-defined vnode to pnode mask */ >> + > Here too, and with spurious spaces. > >> + libxl_defbool_set(&info->numa_placement, false); >> > Setting info->numa_placement to false only makes sense if you do it > _before_ the following block of code: > > if (libxl_defbool_val(info->numa_placement)) { > > if (!libxl_bitmap_is_full(&info->cpumap)) { > LOG(ERROR, "Can run NUMA placement only if no vcpu " > "affinity is specified"); > return ERROR_INVAL; > } > > rc = numa_place_domain(gc, domid, info); > if (rc) > return rc; > } > > There isn''t any other place where that flag is checked and, if you get > past this, automatic placement either already happened (if > numa_placement was true and the other condition is met), or it never > will. > > The right thing to do would probably be having > libxl__vnodemap_is_usable() called *before* (even immediately above is > fine) we get here, and have _it_ disable info->numa_placement, if that > reveals to be the case. 
> > At this point, most of the code below (until ''***'') can just live inside > the above "if (libxl_defbool_val(info->numa_placement))" above, without > the need of introducing another identical block.Yes, I worked on this piece and looks now as it does the right thing.> >> + } >> + else { >> + LOG(ERROR, "The allocation mask for vnuma nodes cannot be used.\n"); >> + if (libxl_defbool_val(info->vnuma_placement)) { >> + >> + LOG(DETAIL, "Switching to automatic vnuma to pnuma placement\n."); >> + /* Construct the vnode to pnode mapping if possible */ >> + if (libxl__init_vnode_to_pnode(gc, domid, info) < 0) { >> + LOG(DETAIL, "Failed to call init_vnodemap\n"); >> + /* vnuma_nodemap will not be used if nr_vnodes == 0 */ >> + return ERROR_FAIL; >> + } >> + } >> + else { >> + LOG(ERROR, "The vnodes cannot be mapped to pnodes this way\n."); >> + info->nr_vnodes = 0; >> + return ERROR_FAIL; >> + } >> + } >> + /* plumb domain with vNUMA topology */ >> + if (xc_domain_setvnuma(ctx->xch, domid, info->nr_vnodes, >> + info->max_vcpus, memrange, >> + info->vdistance, info->vcpu_to_vnode, >> + info->vnode_to_pnode) < 0) { >> + >> + LOG(DETAIL, "Failed to set vnuma topology for domain from\n."); >> + info->nr_vnodes = 0; >> + return ERROR_FAIL; >> + } >> + } >> + else >> + LOG(DEBUG, "Will not construct vNUMA topology.\n"); >> + > *** <-- until here > >> +/* Checks if vnuma_nodemap defined in info can be used >> + * for allocation of vnodes on physical NUMA nodes by >> + * verifying that there is enough memory on corresponding >> + * NUMA nodes. >> + */ >> +unsigned int libxl__vnodemap_is_usable(libxl__gc *gc, libxl_domain_build_info *info) >> +{ >> + unsigned int i; >> + libxl_numainfo *ninfo = NULL; >> + unsigned long long *claim; >> + unsigned int node; >> + uint64_t *mems; >> + int rc, nr_nodes; >> + >> + rc = nr_nodes = 0; >> + if (info->vnode_to_pnode == NULL || info->vnuma_memszs == NULL) >> + return rc; >> + /* >> + * Cannot use specified mapping if not NUMA machine >> + */ >> + ninfo = libxl_get_numainfo(CTX, &nr_nodes); >> + if (ninfo == NULL) { >> + return rc; >> + } >> + mems = info->vnuma_memszs; >> + claim = calloc(info->nr_vnodes, sizeof(*claim)); >> + if (claim == NULL) >> + return rc; >> + /* Sum memory request on per pnode basis */ >> + for (i = 0; i < info->nr_vnodes; i++) >> + { >> + node = info->vnode_to_pnode[i]; >> + /* Correct pnode number? */ >> + if (node < nr_nodes) >> + claim[node] += (mems[i] << 20); >> + else >> + goto vmapu; >> + } >> + for (i = 0; i < nr_nodes; i++) { >> + if (claim[i] > ninfo[i].free) >> + /* Cannot complete user request, falling to default */ >> + goto vmapu; >> + } >> + rc = 1; >> + >> + vmapu: >> + if(claim) free(claim); >> + return rc; >> +} >> > This function looks conceptually right. Despite that, it has comes with > coding style, trailing whitespaces, inconsistent (wrt libxl conventions) > error and memory allocation/failure handling. :-D :-D > >> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c >> index 87a8110..206f5be 100644 > >> + >> +/* >> + * Checks for the beginnig and end of RAM in e820 map for domain >> + * and aligns start of first and end of last vNUMA memory block to >> + * that map. vnode memory size are passed here Megabytes. >> + * For PV guest e820 map has fixed hole sizes. 
>> + */ >> +int libxl__vnuma_align_mem(libxl__gc *gc, >> + uint32_t domid, >> + libxl_domain_build_info *b_info, /* IN: mem sizes */ >> + vmemrange_t *memblks) /* OUT: linux numa blocks in pfn */ >> +{ >> + int i, j, rc; >> + uint64_t next_start_pfn, end_max = 0, size, mem_hole; >> + uint32_t nr; >> + struct e820entry map[E820MAX]; >> + >> + if (b_info->nr_vnodes == 0) >> + return -EINVAL; >> + libxl_ctx *ctx = libxl__gc_owner(gc); >> + >> + /* retreive e820 map for this host */ >> + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); >> + >> + if (rc < 0) { >> + errno = rc; >> + return -EINVAL; >> + } >> + nr = rc; >> + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, >> + (b_info->max_memkb - b_info->target_memkb) + >> + b_info->u.pv.slack_memkb); >> + if (rc) >> + { >> + errno = rc; >> + return -EINVAL; >> + } >> + >> + /* max pfn for this host */ >> + for (j = nr - 1; j >= 0; j--) >> + if (map[j].type == E820_RAM) { >> + end_max = map[j].addr + map[j].size; >> + break; >> + } >> + >> + memset(memblks, 0, sizeof(*memblks) * b_info->nr_vnodes); >> + next_start_pfn = 0; >> + >> + memblks[0].start = map[0].addr; >> + >> + for(i = 0; i < b_info->nr_vnodes; i++) { >> + /* start can be not zero */ >> + memblks[i].start += next_start_pfn; >> + memblks[i].end = memblks[i].start + (b_info->vnuma_memszs[i] << 20); >> + >> + size = memblks[i].end - memblks[i].start; >> + /* >> + * For pv host with e820_host option turned on we need >> + * to rake into account memory holes. For pv host with >> + * e820_host disabled or unset, the map is contiguous >> + * RAM region. >> + */ >> + if (libxl_defbool_val(b_info->u.pv.e820_host)) { >> + while (mem_hole = e820_memory_hole_size(memblks[i].start, >> + memblks[i].end, map, nr), >> + memblks[i].end - memblks[i].start - mem_hole < size) >> + { >> + memblks[i].end += mem_hole; >> + >> + if (memblks[i].end > end_max) { >> + memblks[i].end = end_max; >> + break; >> + } >> + } >> + } >> + next_start_pfn = memblks[i].end; >> + } >> + >> + if (memblks[i-1].end > end_max) >> + memblks[i-1].end = end_max; >> + >> + return 0; >> +} >> + > All this code here would require for someone who knows more about e820 > to chime in (or for me to learn something more about it, which I > definitely want to, but will take a bit). > > Perhaps you can help me at leas a bit. Can you comment on/describe what > is it that you do when you find an hole? > > It looks from the above that you just need to move the end of the > node(s) memory range(s)... Is that the case? For some reason, I was > assuming that dealing with holes would have required to add support for > more than one memory block per node, but it''s apparently not the case, > is it?The code looks right now a bit different as I found some problems there. But here is the general idea. When e820_host is not set, then pv guest has fixed e820 map and I dont care about the holes much since its always looks like this: [ 0.000000] e820: BIOS-provided physical RAM map: [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved [ 0.000000] Xen: [mem 0x0000000000100000-0x00000000f9ffffff] usable [ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable Means, first 4KB (0x0000 - 0x0fff) and 384K gap between 0xa0000 and 0xfffff will be reserved. 
Since these pfns will never appear in the page allocations, and the
beginning and end of the memory blocks I construct are exactly as in the
guest, in this case I form the memory blocks based only on the sizes of
the vnodes.

In case e820_host is set to 1, I have to account for memory holes and
build correct memory ranges.

As for memory blocks: a second memory block of a certain type (memory,
in this case) would be needed if a node spanned non-contiguous memory
ranges. For example, if node 0 had 0x000000 - 0x100000 and
0x400000 - 0x800000, it would look like this (hypothetical addresses
here :)):

node 0, memblock 0: 0x000000 - 0x0fffff
node 1, memblock 0: 0x100000 - 0x3fffff
node 0, memblock 1: 0x400000 - 0x7fffff

If the memory blocks can be merged to form a contiguous block, that will
be done and there will be one block per node. So at this point I do not
see a need for two memory blocks per node; gaps in the e820 are
accounted for in Linux in the memory range list of the corresponding
memory block.

Elena

> Regards,
> Dario

-- 
Elena
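To make the e820_host=1 case concrete, here is a small worked trace of the extension loop from the patch, using the fixed PV layout quoted above. The numbers are illustrative and assume the hole helper reports the legacy 384 KiB gap as the only hole below 1 GiB.

    /*
     * Say vnode 0 is 1 GiB, starting at 0, and the only hole below
     * 1 GiB is the legacy gap at 0xa0000-0xfffff (0x60000 bytes):
     *
     *   size                   = 0x40000000
     *   start, end             = 0x0, 0x40000000
     *   mem_hole               = 0x60000
     *   end - start - mem_hole = 0x3ffa0000 < size  =>  end += mem_hole
     *   end                    = 0x40060000
     *
     * On the next iteration the hole in [0, 0x40060000) is still
     * 0x60000, so end - start - mem_hole == size and the loop stops.
     * vnode 1 then starts at 0x40060000.  A hole therefore only pushes
     * the end of the affected vnode (and the start of the following
     * ones) outwards; no second per-node memory block is introduced by
     * this path.
     */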