This series of patches introduces a vNUMA topology implementation and provides the interfaces and data structures that expose a virtual topology to PV guests, enabling the guest OS to use its own NUMA placement mechanisms. vNUMA topology support for Linux PV guests comes in a separate patch. Please review and send your comments.

Introduction
-------------
vNUMA topology is exposed to the PV guest to improve performance when running workloads on NUMA machines. The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on NUMA/UMA machines and to map the vNUMA topology onto physical NUMA in an optimal way.

Xen vNUMA support
The current set of patches introduces a subop hypercall that is available to enlightened PV guests with the vNUMA patches applied. The domain structure was modified to hold the per-domain vNUMA topology for use by other vNUMA-aware subsystems (e.g. ballooning).

libxc
libxc provides interfaces to build PV guests with vNUMA support and, on NUMA machines, performs the initial memory allocation on physical NUMA nodes. This is implemented by trying to allocate all vnodes on one NUMA node; in case of insufficient memory, vnodes are allocated on other physical nodes with enough memory. If no physical node has enough memory, the allocation is done using the default mechanism and vnodes may have pages allocated on different nodes.

libxl
libxl provides a way to predefine the vNUMA topology in the VM config: number of vnodes, memory arrangement, vcpu-to-vnode assignment, and the distance map.

PV guest
As of now, only PV guests can take advantage of the vNUMA functionality. The vNUMA Linux patches should be applied and NUMA support should be compiled into the kernel.

Patchset v. 0.1
---------------

[PATCH RFC 1/7] xen/vnuma: subop hypercall and vnuma topology structures
Defines the XENMEM subop hypercall for PV vNUMA-enabled guests and provides vNUMA topology information from the per-domain vnuma topology build info.

[PATCH RFC 2/7] xen/vnuma: domctl subop for vnuma setup
Defines the domctl subop hypercall for the per-domain vNUMA topology construct.

[PATCH RFC 3/7] libxc/vnuma: per-domain vnuma structures
Makes use of the domctl vnuma subop and initializes the per-domain vnuma topology.

[PATCH RFC 4/7] libxl/vnuma: vnuma domain config and pre-build
Defines VM config options for vNUMA PV domain creation (vnodes, vnumamem, vnuma_distance, vcpu_to_vnode).

[PATCH RFC 5/7] libxl/vnuma: vnuma enabler
Enables the libxl vnuma ABI via LIBXL_HAVE_BUILDINFO_VNUMA.

[PATCH RFC 6/7] libxc/vnuma: vnuma per phys NUMA allocation
Allows vNUMA-enabled domains to allocate vnodes on physical NUMA nodes. Tries to allocate all vnodes on one NUMA node, or on the next one if not all vnodes fit. If no suitable physical NUMA node is found, lets Xen decide.

[PATCH RFC 7/7] xen/vnuma: basic vnuma debug info
Prints basic vnuma info per domain on 'debug-keys u'.

TODO:
---------------
- initial boot memory allocation algorithm for vNUMA nodes on physical nodes;
- Linux vnuma memblocks and e820 holes need testing;
- move the XENMEM subop hypercall in Xen to a sysctl subop;
- some kind of statistics for vnuma-enabled guests, xl info/list;
- take into account cpu pinning if defined in the VM config;
- take into account the automatic NUMA placement mechanism;
- tests for arch-dependent pieces;
- help files;

Elena Ufimtseva (7):
  xen/vnuma: subop hypercall and vnuma topology structures.
  xen/vnuma: domctl subop for vnuma setup.
  libxc/vnuma: per-domain vnuma structures.
  libxl/vnuma: vnuma domain config
  libxl/vnuma: vnuma enabler.
  libxc/vnuma: vnuma per phys NUMA allocation.
  xen/vnuma: basic vnuma debug info

 tools/libxc/xc_dom.h         |   10 +++
 tools/libxc/xc_dom_x86.c     |   79 +++++++++++++++--
 tools/libxc/xc_domain.c      |   63 ++++++++++++++
 tools/libxc/xenctrl.h        |   17 ++++
 tools/libxc/xg_private.h     |    4 +
 tools/libxl/libxl.c          |   28 ++++++
 tools/libxl/libxl.h          |   23 +++++
 tools/libxl/libxl_arch.h     |    6 ++
 tools/libxl/libxl_dom.c      |  115 ++++++++++++++++++++++--
 tools/libxl/libxl_internal.h |    3 +
 tools/libxl/libxl_types.idl  |    6 +-
 tools/libxl/libxl_x86.c      |   91 +++++++++++++++++++
 tools/libxl/xl_cmdimpl.c     |  197 +++++++++++++++++++++++++++++++++++++++++-
 xen/arch/x86/numa.c          |   16 +++-
 xen/common/domain.c          |    6 ++
 xen/common/domctl.c          |   72 ++++++++++++++-
 xen/common/memory.c          |   90 ++++++++++++++++++-
 xen/include/public/domctl.h  |   15 +++-
 xen/include/public/memory.h  |    1 +
 xen/include/public/vnuma.h   |   12 +++
 xen/include/xen/domain.h     |    9 ++
 xen/include/xen/sched.h      |    1 +
 xen/include/xen/vnuma.h      |   27 ++++++
 23 files changed, 871 insertions(+), 20 deletions(-)
 create mode 100644 xen/include/public/vnuma.h
 create mode 100644 xen/include/xen/vnuma.h

-- 
1.7.10.4
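For readers skimming the series, the initial placement policy described in the cover letter (all vnodes on one pnode if it has enough free memory, otherwise spread vnodes across pnodes that have room, otherwise fall back to Xen's default allocation) boils down to something like the sketch below. It is illustrative only; the function name and parameters are made up and do not appear in the patches.

#include <stdint.h>

/* Illustrative only: rough shape of the vnode -> pnode placement policy
 * described in the cover letter. NUMA_NO_NODE means "let Xen decide". */
#define NUMA_NO_NODE 0xFF

static void place_vnodes(unsigned int nr_vnodes, const uint64_t *vmem_sz,
                         unsigned int nr_pnodes, uint64_t *pnode_free,
                         int *vnode_to_pnode)
{
    unsigned int i, n;
    uint64_t total = 0;

    for ( i = 0; i < nr_vnodes; i++ )
    {
        vnode_to_pnode[i] = NUMA_NO_NODE;
        total += vmem_sz[i];
    }

    /* 1. Prefer a single pnode that can hold the whole guest. */
    for ( n = 0; n < nr_pnodes; n++ )
        if ( pnode_free[n] >= total )
        {
            for ( i = 0; i < nr_vnodes; i++ )
                vnode_to_pnode[i] = n;
            return;
        }

    /* 2. Otherwise, first-fit each vnode on any pnode with room.
     *    Unplaced vnodes keep NUMA_NO_NODE and fall back to the
     *    default allocator. */
    for ( i = 0; i < nr_vnodes; i++ )
        for ( n = 0; n < nr_pnodes; n++ )
            if ( pnode_free[n] >= vmem_sz[i] )
            {
                vnode_to_pnode[i] = n;
                pnode_free[n] -= vmem_sz[i];
                break;
            }
}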
Elena Ufimtseva
2013-Aug-27 07:54 UTC
[PATCH RFC 1/7] xen/vnuma: subop hypercall and vnuma topology structures.
Defines XENMEM subop hypercall for PV vNUMA enabled guests and provides vNUMA topology information from per-domain vnuma topology build info. TODO: subop XENMEM hypercall is subject to change to sysctl subop. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- xen/common/memory.c | 90 ++++++++++++++++++++++++++++++++++++++++++- xen/include/public/memory.h | 1 + xen/include/public/vnuma.h | 12 ++++++ xen/include/xen/domain.h | 9 +++++ xen/include/xen/sched.h | 1 + xen/include/xen/vnuma.h | 27 +++++++++++++ 6 files changed, 139 insertions(+), 1 deletion(-) create mode 100644 xen/include/public/vnuma.h create mode 100644 xen/include/xen/vnuma.h diff --git a/xen/common/memory.c b/xen/common/memory.c index 50b740f..c7fbe11 100644 --- a/xen/common/memory.c +++ b/xen/common/memory.c @@ -28,6 +28,7 @@ #include <public/memory.h> #include <xsm/xsm.h> #include <xen/trace.h> +#include <xen/vnuma.h> struct memop_args { /* INPUT */ @@ -732,7 +733,94 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg) rcu_unlock_domain(d); break; - + case XENMEM_get_vnuma_info: + { + int i; + struct vnuma_topology_info mtopology; + struct vnuma_topology_info touser_topo; + struct domain *d; + unsigned int max_pages; + vnuma_memblk_t *vblks; + XEN_GUEST_HANDLE(int) vdistance; + XEN_GUEST_HANDLE_PARAM(int) vdist_param; + XEN_GUEST_HANDLE(vnuma_memblk_t) buf; + XEN_GUEST_HANDLE_PARAM(vnuma_memblk_t) buf_param; + XEN_GUEST_HANDLE(int) vcpu_to_vnode; + XEN_GUEST_HANDLE_PARAM(int) vmap_param; + + rc = -1; + if ( guest_handle_is_null(arg) ) + return rc; + if( copy_from_guest(&mtopology, arg, 1) ) + { + gdprintk(XENLOG_INFO, "Cannot get copy_from_guest..\n"); + return -EFAULT; + } + gdprintk(XENLOG_INFO, "Domain id is %d\n",mtopology.domid); + if ( (d = rcu_lock_domain_by_any_id(mtopology.domid)) == NULL ) + { + gdprintk(XENLOG_INFO, "Numa: Could not get domain id.\n"); + return -ESRCH; + } + rcu_unlock_domain(d); + touser_topo.nr_vnodes = d->vnuma.nr_vnodes; + rc = copy_to_guest(arg, &touser_topo, 1); + if ( rc ) + { + gdprintk(XENLOG_INFO, "Bad news, could not copy to guest NUMA info\n"); + return -EFAULT; + } + max_pages = d->max_pages; + if ( touser_topo.nr_vnodes == 0 || touser_topo.nr_vnodes > d->max_vcpus ) + { + gdprintk(XENLOG_INFO, "vNUMA: Error in block creation - vnodes %d, vcpus %d \n", touser_topo.nr_vnodes, d->max_vcpus); + return -EFAULT; + } + vblks = (vnuma_memblk_t *)xmalloc_array(struct vnuma_memblk, touser_topo.nr_vnodes); + if ( vblks == NULL ) + { + gdprintk(XENLOG_INFO, "vNUMA: Could not get memory for memblocks\n"); + return -1; + } + buf_param = guest_handle_cast(mtopology.vnuma_memblks, vnuma_memblk_t); + buf = guest_handle_from_param(buf_param, vnuma_memblk_t); + for ( i = 0; i < touser_topo.nr_vnodes; i++ ) + { + gdprintk(XENLOG_INFO, "vmemblk[%d] start %#lx end %#lx\n", i, d->vnuma.vnuma_memblks[i].start, d->vnuma.vnuma_memblks[i].end); + if ( copy_to_guest_offset(buf, i, &d->vnuma.vnuma_memblks[i], 1) ) + { + gdprintk(XENLOG_INFO, "Failed to copy to guest vmemblk[%d]\n", i); + goto out; + } + } + vdist_param = guest_handle_cast(mtopology.vdistance, int); + vdistance = guest_handle_from_param(vdist_param, int); + for ( i = 0; i < touser_topo.nr_vnodes * touser_topo.nr_vnodes; i++ ) + { + if ( copy_to_guest_offset(vdistance, i, &d->vnuma.vdistance[i], 1) ) + { + gdprintk(XENLOG_INFO, "Failed to copy to guest vdistance[%d]\n", i); + goto out; + } + } + vmap_param = guest_handle_cast(mtopology.vcpu_to_vnode, int); + vcpu_to_vnode = guest_handle_from_param(vmap_param, int); + 
for ( i = 0; i < d->max_vcpus ; i++ ) + { + if ( copy_to_guest_offset(vcpu_to_vnode, i, &d->vnuma.vcpu_to_vnode[i], 1) ) + { + gdprintk(XENLOG_INFO, "Failed to copy to guest vcputonode[%d]\n", i); + goto out; + } + else + gdprintk(XENLOG_INFO, "Copied map [%d] = %x\n", i, d->vnuma.vcpu_to_vnode[i]); + } + return rc; +out: + if ( vblks ) xfree(vblks); + return rc; + break; + } default: rc = arch_memory_op(op, arg); break; diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h index 7a26dee..30cb8af 100644 --- a/xen/include/public/memory.h +++ b/xen/include/public/memory.h @@ -453,6 +453,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t); * Caller must be privileged or the hypercall fails. */ #define XENMEM_claim_pages 24 +#define XENMEM_get_vnuma_info 25 /* * XENMEM_claim_pages flags - the are no flags at this time. diff --git a/xen/include/public/vnuma.h b/xen/include/public/vnuma.h new file mode 100644 index 0000000..a88dfe2 --- /dev/null +++ b/xen/include/public/vnuma.h @@ -0,0 +1,12 @@ +#ifndef __XEN_PUBLIC_VNUMA_H +#define __XEN_PUBLIC_VNUMA_H + +#include "xen.h" + +struct vnuma_memblk { + uint64_t start, end; +}; +typedef struct vnuma_memblk vnuma_memblk_t; +DEFINE_XEN_GUEST_HANDLE(vnuma_memblk_t); + +#endif diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h index a057069..3d39218 100644 --- a/xen/include/xen/domain.h +++ b/xen/include/xen/domain.h @@ -4,6 +4,7 @@ #include <public/xen.h> #include <asm/domain.h> +#include <public/vnuma.h> typedef union { struct vcpu_guest_context *nat; @@ -89,4 +90,12 @@ extern unsigned int xen_processor_pmbits; extern bool_t opt_dom0_vcpus_pin; +struct domain_vnuma_info { + uint16_t nr_vnodes; + int *vdistance; + vnuma_memblk_t *vnuma_memblks; + int *vcpu_to_vnode; + int *vnode_to_pnode; +}; + #endif /* __XEN_DOMAIN_H__ */ diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index ae6a3b8..cb023cf 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -377,6 +377,7 @@ struct domain nodemask_t node_affinity; unsigned int last_alloc_node; spinlock_t node_affinity_lock; + struct domain_vnuma_info vnuma; }; struct domain_setup_info diff --git a/xen/include/xen/vnuma.h b/xen/include/xen/vnuma.h new file mode 100644 index 0000000..f1ab531 --- /dev/null +++ b/xen/include/xen/vnuma.h @@ -0,0 +1,27 @@ +#ifndef _VNUMA_H +#define _VNUMA_H +#include <public/vnuma.h> + +/* DEFINE_XEN_GUEST_HANDLE(vnuma_memblk_t); */ + +struct vnuma_topology_info { + domid_t domid; + uint16_t nr_vnodes; + XEN_GUEST_HANDLE_64(vnuma_memblk_t) vnuma_memblks; + XEN_GUEST_HANDLE_64(int) vdistance; + XEN_GUEST_HANDLE_64(int) vcpu_to_vnode; + XEN_GUEST_HANDLE_64(int) vnode_to_pnode; +}; +typedef struct vnuma_topology_info vnuma_topology_info_t; +DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t); + +#define __vnode_distance_offset(_dom, _i, _j) \ + ( ((_j)*((_dom)->vnuma.nr_vnodes)) + (_i) ) + +#define __vnode_distance(_dom, _i, _j) \ + ( (_dom)->vnuma.vdistance[__vnode_distance_offset((_dom), (_i), (_j))] ) + +#define __vnode_distance_set(_dom, _i, _j, _v) \ + do { __vnode_distance((_dom), (_i), (_j)) = (_v); } while(0) + +#endif -- 1.7.10.4
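For context, a PV guest consumer of this subop (posted separately as the Linux vNUMA patches) would look roughly like the sketch below. It assumes the guest has imported the new vnuma_topology_info/vnuma_memblk definitions and it glosses over the exact guest-handle macros available in the kernel headers; the function name and buffer sizing are illustrative, and error unwinding is elided.

/* Rough guest-side sketch (not part of this series): fetch the vNUMA
 * topology via XENMEM_get_vnuma_info. Buffers are assumed to be sized
 * from known bounds (e.g. possible cpus / a node limit). */
static int fetch_vnuma_topology(unsigned int max_nodes, unsigned int max_cpus)
{
    struct vnuma_topology_info topo = { .domid = DOMID_SELF };
    struct vnuma_memblk *memblks;
    int *vdistance, *vcpu_to_vnode;
    int rc;

    memblks = kcalloc(max_nodes, sizeof(*memblks), GFP_KERNEL);
    vdistance = kcalloc(max_nodes * max_nodes, sizeof(*vdistance), GFP_KERNEL);
    vcpu_to_vnode = kcalloc(max_cpus, sizeof(*vcpu_to_vnode), GFP_KERNEL);
    if ( !memblks || !vdistance || !vcpu_to_vnode )
        return -ENOMEM;

    set_xen_guest_handle(topo.vnuma_memblks, memblks);
    set_xen_guest_handle(topo.vdistance, vdistance);
    set_xen_guest_handle(topo.vcpu_to_vnode, vcpu_to_vnode);

    rc = HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &topo);
    if ( rc == 0 )
        pr_info("vNUMA: %u virtual nodes\n", topo.nr_vnodes);
    /* ... register nodes, memblocks and cpu map with the NUMA code ... */
    return rc;
}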
Elena Ufimtseva
2013-Aug-27 07:54 UTC
[PATCH RFC 2/7] xen/vnuma: domctl subop for vnuma setup.
Defines domctl subop hypercall for per-domain vNUMA topology construct. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- xen/common/domain.c | 6 ++++ xen/common/domctl.c | 72 ++++++++++++++++++++++++++++++++++++++++++- xen/include/public/domctl.h | 15 ++++++++- 3 files changed, 91 insertions(+), 2 deletions(-) diff --git a/xen/common/domain.c b/xen/common/domain.c index 9390a22..f0c0a79 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -227,6 +227,11 @@ struct domain *domain_create( spin_lock_init(&d->node_affinity_lock); d->node_affinity = NODE_MASK_ALL; d->auto_node_affinity = 1; + d->vnuma.vnuma_memblks = NULL; + d->vnuma.vnode_to_pnode = NULL; + d->vnuma.vcpu_to_vnode = NULL; + d->vnuma.vdistance = NULL; + d->vnuma.nr_vnodes = 0; spin_lock_init(&d->shutdown_lock); d->shutdown_code = -1; @@ -532,6 +537,7 @@ int domain_kill(struct domain *d) tmem_destroy(d->tmem); domain_set_outstanding_pages(d, 0); d->tmem = NULL; + /* TODO: vnuma_destroy(d->vnuma); */ /* fallthrough */ case DOMDYING_dying: rc = domain_relinquish_resources(d); diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 9760d50..b552e60 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -29,6 +29,7 @@ #include <asm/page.h> #include <public/domctl.h> #include <xsm/xsm.h> +#include <xen/vnuma.h> static DEFINE_SPINLOCK(domctl_lock); DEFINE_SPINLOCK(vcpu_alloc_lock); @@ -862,7 +863,76 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) ret = set_global_virq_handler(d, virq); } break; - + case XEN_DOMCTL_setvnumainfo: + { + int i, j; + int dist_size; + int dist, vmap, vntop; + vnuma_memblk_t vmemblk; + + ret = -EFAULT; + dist = i = j = 0; + if (op->u.vnuma.nr_vnodes <= 0 || op->u.vnuma.nr_vnodes > NR_CPUS) + break; + d->vnuma.nr_vnodes = op->u.vnuma.nr_vnodes; + dist_size = d->vnuma.nr_vnodes * d->vnuma.nr_vnodes; + if ( (d->vnuma.vdistance = xmalloc_bytes(sizeof(*d->vnuma.vdistance) * dist_size) ) == NULL) + break; + for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) + for ( j = 0; j < d->vnuma.nr_vnodes; j++ ) + { + if ( unlikely(__copy_from_guest_offset(&dist, op->u.vnuma.vdistance, __vnode_distance_offset(d, i, j), 1)) ) + { + gdprintk(XENLOG_INFO, "vNUMA: Copy distance table error\n"); + goto err_dom; + } + __vnode_distance_set(d, i, j, dist); + } + if ( (d->vnuma.vnuma_memblks = xmalloc_bytes(sizeof(*d->vnuma.vnuma_memblks) * d->vnuma.nr_vnodes)) == NULL ) + goto err_dom; + for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) + { + if ( unlikely(__copy_from_guest_offset(&vmemblk, op->u.vnuma.vnuma_memblks, i, 1)) ) + { + gdprintk(XENLOG_INFO, "vNUMA: memory size error\n"); + goto err_dom; + } + d->vnuma.vnuma_memblks[i].start = vmemblk.start; + d->vnuma.vnuma_memblks[i].end = vmemblk.end; + } + if ( (d->vnuma.vcpu_to_vnode = xmalloc_bytes(sizeof(*d->vnuma.vcpu_to_vnode) * d->max_vcpus)) == NULL ) + goto err_dom; + for ( i = 0; i < d->max_vcpus; i++ ) + { + if ( unlikely(__copy_from_guest_offset(&vmap, op->u.vnuma.vcpu_to_vnode, i, 1)) ) + { + gdprintk(XENLOG_INFO, "vNUMA: vcputovnode map error\n"); + goto err_dom; + } + d->vnuma.vcpu_to_vnode[i] = vmap; + } + if ( !guest_handle_is_null(op->u.vnuma.vnode_to_pnode) ) + { + if ( (d->vnuma.vnode_to_pnode = xmalloc_bytes(sizeof(*d->vnuma.vnode_to_pnode) * d->vnuma.nr_vnodes)) == NULL ) + goto err_dom; + for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) + { + if ( unlikely(__copy_from_guest_offset(&vntop, op->u.vnuma.vnode_to_pnode, i, 1)) ) + { + gdprintk(XENLOG_INFO, "vNUMA: vnode_t_pnode map error\n"); + goto err_dom; + } + d->vnuma.vnode_to_pnode[i] = 
vntop; + } + } + else + d->vnuma.vnode_to_pnode = NULL; + ret = 0; + break; +err_dom: + ret = -EINVAL; + } + break; default: ret = arch_do_domctl(op, d, u_domctl); break; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index 4c5b2bb..a034688 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -35,6 +35,7 @@ #include "xen.h" #include "grant_table.h" #include "hvm/save.h" +#include "xen/vnuma.h" #define XEN_DOMCTL_INTERFACE_VERSION 0x00000009 @@ -852,6 +853,17 @@ struct xen_domctl_set_broken_page_p2m { typedef struct xen_domctl_set_broken_page_p2m xen_domctl_set_broken_page_p2m_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_broken_page_p2m_t); +struct xen_domctl_vnuma { + uint16_t nr_vnodes; + XEN_GUEST_HANDLE_64(int) vdistance; + XEN_GUEST_HANDLE_64(vnuma_memblk_t) vnuma_memblks; + XEN_GUEST_HANDLE_64(int) vcpu_to_vnode; + XEN_GUEST_HANDLE_64(int) vnode_to_pnode; +}; + +typedef struct xen_domctl_vnuma xen_domctl_vnuma_t; +DEFINE_XEN_GUEST_HANDLE(xen_domctl_vnuma_t); + struct xen_domctl { uint32_t cmd; #define XEN_DOMCTL_createdomain 1 @@ -920,6 +932,7 @@ struct xen_domctl { #define XEN_DOMCTL_set_broken_page_p2m 67 #define XEN_DOMCTL_setnodeaffinity 68 #define XEN_DOMCTL_getnodeaffinity 69 +#define XEN_DOMCTL_setvnumainfo 70 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 @@ -979,6 +992,7 @@ struct xen_domctl { struct xen_domctl_set_broken_page_p2m set_broken_page_p2m; struct xen_domctl_gdbsx_pauseunp_vcpu gdbsx_pauseunp_vcpu; struct xen_domctl_gdbsx_domstatus gdbsx_domstatus; + struct xen_domctl_vnuma vnuma; uint8_t pad[128]; } u; }; @@ -986,7 +1000,6 @@ typedef struct xen_domctl xen_domctl_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_t); #endif /* __XEN_PUBLIC_DOMCTL_H__ */ - /* * Local variables: * mode: C -- 1.7.10.4
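The domain_kill() hunk above leaves vnuma_destroy() as a TODO; given the allocations made by XEN_DOMCTL_setvnumainfo, a minimal sketch of that helper might be:

/* Minimal sketch of the vnuma_destroy() helper referenced in the
 * domain_kill() TODO: free what XEN_DOMCTL_setvnumainfo allocated.
 * xfree() tolerates NULL, so partially built topologies are fine. */
static void vnuma_destroy(struct domain_vnuma_info *v)
{
    v->nr_vnodes = 0;
    xfree(v->vdistance);
    v->vdistance = NULL;
    xfree(v->vnuma_memblks);
    v->vnuma_memblks = NULL;
    xfree(v->vcpu_to_vnode);
    v->vcpu_to_vnode = NULL;
    xfree(v->vnode_to_pnode);
    v->vnode_to_pnode = NULL;
}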
Elena Ufimtseva
2013-Aug-27 07:54 UTC
[PATCH RFC 3/7] libxc/vnuma: per-domain vnuma structures.
Makes use of domctl vnuma subop and initializes per-domain vnuma topology. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxc/xc_dom.h | 9 +++++++ tools/libxc/xc_domain.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 17 +++++++++++++ 3 files changed, 89 insertions(+) diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h index 86e23ee..4375f25 100644 --- a/tools/libxc/xc_dom.h +++ b/tools/libxc/xc_dom.h @@ -114,6 +114,15 @@ struct xc_dom_image { struct xc_dom_phys *phys_pages; int realmodearea_log; + /* vNUMA topology and memory allocation structure + * Defines the way to allocate XEN + * memory from phys NUMA nodes by providing mask + * vnuma_to_pnuma */ + int nr_vnodes; + struct vnuma_memblk *vnumablocks; + uint64_t *vmemsizes; + int *vnode_to_pnode; + /* malloc memory pool */ struct xc_dom_mem *memblocks; diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 3257e2a..98445e3 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -24,6 +24,7 @@ #include "xg_save_restore.h" #include <xen/memory.h> #include <xen/hvm/hvm_op.h> +#include "xg_private.h" int xc_domain_create(xc_interface *xch, uint32_t ssidref, @@ -1629,6 +1630,68 @@ int xc_domain_set_virq_handler(xc_interface *xch, uint32_t domid, int virq) return do_domctl(xch, &domctl); } +/* Informs XEN that domain is vNUMA aware */ +int xc_domain_setvnodes(xc_interface *xch, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vnuma_memblk_t *vmemblks, + int *vdistance, + int *vcpu_to_vnode, + int *vnode_to_pnode) +{ + int rc; + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(int, distbuf); + DECLARE_HYPERCALL_BUFFER(vnuma_memblk_t, membuf); + DECLARE_HYPERCALL_BUFFER(int, vcpumapbuf); + DECLARE_HYPERCALL_BUFFER(int, vntopbuf); + + rc = -EINVAL; + memset(&domctl, 0, sizeof(domctl)); + if ( vdistance == NULL || vcpu_to_vnode == NULL || vmemblks == NULL ) + /* vnode_to_pnode can be null on non-NUMA machines */ + { + PERROR("Parameters are wrong XEN_DOMCTL_setvnumainfo\n"); + return -EINVAL; + } + distbuf = xc_hypercall_buffer_alloc + (xch, distbuf, sizeof(*vdistance) * nr_vnodes * nr_vnodes); + membuf = xc_hypercall_buffer_alloc + (xch, membuf, sizeof(*membuf) * nr_vnodes); + vcpumapbuf = xc_hypercall_buffer_alloc + (xch, vcpumapbuf, sizeof(*vcpu_to_vnode) * nr_vcpus); + vntopbuf = xc_hypercall_buffer_alloc + (xch, vntopbuf, sizeof(*vnode_to_pnode) * nr_vnodes); + + if (distbuf == NULL || membuf == NULL || vcpumapbuf == NULL || vntopbuf == NULL ) + { + PERROR("Could not allocate memory for xc hypercall XEN_DOMCTL_setvnumainfo\n"); + goto fail; + } + memcpy(distbuf, vdistance, sizeof(*vdistance) * nr_vnodes * nr_vnodes); + memcpy(vntopbuf, vnode_to_pnode, sizeof(*vnode_to_pnode) * nr_vnodes); + memcpy(vcpumapbuf, vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus); + memcpy(membuf, vmemblks, sizeof(*vmemblks) * nr_vnodes); + + set_xen_guest_handle(domctl.u.vnuma.vdistance, distbuf); + set_xen_guest_handle(domctl.u.vnuma.vnuma_memblks, membuf); + set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpumapbuf); + set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vntopbuf); + + domctl.cmd = XEN_DOMCTL_setvnumainfo; + domctl.domain = (domid_t)domid; + domctl.u.vnuma.nr_vnodes = nr_vnodes; + rc = do_domctl(xch, &domctl); +fail: + xc_hypercall_buffer_free(xch, distbuf); + xc_hypercall_buffer_free(xch, membuf); + xc_hypercall_buffer_free(xch, vcpumapbuf); + xc_hypercall_buffer_free(xch, vntopbuf); + + return rc; +} + /* * Local variables: * mode: C diff --git 
a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index f2cebaf..fb66cfa 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -1083,6 +1083,23 @@ int xc_domain_set_memmap_limit(xc_interface *xch, uint32_t domid, unsigned long map_limitkb); +/*unsigned long xc_get_memory_hole_size(unsigned long start, unsigned long end); + +int xc_domain_align_vnodes(xc_interface *xch, + uint32_t domid, + uint64_t *vmemareas, + vnuma_memblk_t *vnuma_memblks, + uint16_t nr_vnodes); +*/ +int xc_domain_setvnodes(xc_interface *xch, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vnuma_memblk_t *vmemareas, + int *vdistance, + int *vcpu_to_vnode, + int *vnode_to_pnode); + #if defined(__i386__) || defined(__x86_64__) /* * PC BIOS standard E820 types and structure. -- 1.7.10.4
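As a usage illustration (not part of the patch), a toolstack caller describing a two-vnode, four-vcpu guest to Xen could look like this; the array contents and the function name example_set_vnuma are made up:

#include <xenctrl.h>

/* Illustrative caller: describe a 2-vnode guest with 4 vcpus to Xen. */
int example_set_vnuma(xc_interface *xch, uint32_t domid)
{
    vnuma_memblk_t blks[2] = {
        { .start = 0,          .end = 1ULL << 30 },   /* vnode 0: 1G */
        { .start = 1ULL << 30, .end = 2ULL << 30 },   /* vnode 1: 1G */
    };
    int vdistance[4]      = { 10, 20, 20, 10 };       /* row-major 2x2 */
    int vcpu_to_vnode[4]  = { 0, 0, 1, 1 };
    int vnode_to_pnode[2] = { 0, 0 };                 /* both on pnode 0 */

    return xc_domain_setvnodes(xch, domid, 2 /* vnodes */, 4 /* vcpus */,
                               blks, vdistance, vcpu_to_vnode,
                               vnode_to_pnode);
}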
Defines VM config options for vNUMA PV domain creation as follows: vnodes - number of nodes and enables vnuma vnumamem - vnuma nodes memory sizes vnuma_distance - vnuma distance table (may be omitted) vcpu_to_vnode - vcpu to vnode mask (may be omitted) sum of all numamem should be equal to memory option. Number of vcpus should not be less that number of vnodes. VM config Examples: memory = 16384 vcpus = 8 name = "rc" vnodes = 8 vnumamem = "2g, 2g, 2g, 2g, 2g, 2g, 2g, 2g" vcpu_to_vnode ="5 6 7 4 3 2 1 0" memory = 2048 vcpus = 4 name = "rc9" vnodes = 2 vnumamem = "1g, 1g" vnuma_distance = "10 20, 10 20" vcpu_to_vnode ="1, 3, 2, 0" Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxl/libxl.c | 28 ++++++ tools/libxl/libxl.h | 15 ++++ tools/libxl/libxl_arch.h | 6 ++ tools/libxl/libxl_dom.c | 115 ++++++++++++++++++++++-- tools/libxl/libxl_internal.h | 3 + tools/libxl/libxl_types.idl | 6 +- tools/libxl/libxl_x86.c | 91 +++++++++++++++++++ tools/libxl/xl_cmdimpl.c | 197 +++++++++++++++++++++++++++++++++++++++++- 8 files changed, 454 insertions(+), 7 deletions(-) diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index 81785df..cd25474 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4293,6 +4293,34 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid, } return 0; } +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA +int libxl_domain_setvnodes(libxl_ctx *ctx, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vnuma_memblk_t *vnuma_memblks, + int *vdistance, + int *vcpu_to_vnode, + int *vnode_to_pnode) +{ + GC_INIT(ctx); + int ret; + ret = xc_domain_setvnodes(ctx->xch, domid, nr_vnodes, + nr_vcpus, vnuma_memblks, + vdistance, vcpu_to_vnode, + vnode_to_pnode); + GC_FREE; + return ret; +} + +int libxl_default_vcpu_to_vnuma(libxl_domain_build_info *info) +{ + int i; + for(i = 0; i < info->max_vcpus; i++) + info->vcpu_to_vnode[i] = i % info->nr_vnodes; + return 0; +} +#endif int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap) { diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index be19bf5..a1a5e33 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -706,6 +706,21 @@ void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr_vcpus); void libxl_device_vtpm_list_free(libxl_device_vtpm*, int nr_vtpms); void libxl_vtpminfo_list_free(libxl_vtpminfo *, int nr_vtpms); +/* vNUMA topology */ + +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA +#include <xen/vnuma.h> +int libxl_domain_setvnodes(libxl_ctx *ctx, + uint32_t domid, + uint16_t nr_vnodes, + uint16_t nr_vcpus, + vnuma_memblk_t *vnuma_memblks, + int *vdistance, + int *vcpu_to_vnode, + int *vnode_to_pnode); + +int libxl_default_vcpu_to_vnuma(libxl_domain_build_info *info); +#endif /* * Devices * ======diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h index abe6685..76c1975 100644 --- a/tools/libxl/libxl_arch.h +++ b/tools/libxl/libxl_arch.h @@ -18,5 +18,11 @@ /* arch specific internal domain creation function */ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, uint32_t domid); +int libxl_vnuma_align_mem(libxl__gc *gc, + uint32_t domid, + struct libxl_domain_build_info *b_info, + vnuma_memblk_t *memblks); /* linux specific memory blocks: out */ + +unsigned long e820_memory_hole_size(unsigned long start, unsigned long end, struct e820entry e820[], int nr); #endif diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 6e2252a..8bbbd18 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -200,6 +200,63 @@ static int 
numa_place_domain(libxl__gc *gc, uint32_t domid, libxl_cpupoolinfo_dispose(&cpupool_info); return rc; } +#define set_all_vnodes(n) for(i=0; i< info->nr_vnodes; i++) \ + info->vnode_to_pnode[i] = n + +int libxl_init_vnodemap(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info) +{ + int i, n, start, nr_nodes; + uint64_t *mems; + unsigned long long claim[16]; + libxl_numainfo *ninfo = NULL; + + if (info->vnode_to_pnode == NULL) + info->vnode_to_pnode = calloc(info->nr_vnodes, sizeof(*info->vnode_to_pnode)); + + set_all_vnodes(NUMA_NO_NODE); + mems = info->vnuma_memszs; + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL) { + LOG(INFO, "No HW NUMA found\n"); + return -EINVAL; + } + /* lets check if all vnodes will fit in one node */ + for(n = 0; n < nr_nodes; n++) { + if(ninfo[n].free/1024 >= info->max_memkb) { + /* all fit on one node, fill the mask */ + set_all_vnodes(n); + LOG(INFO, "Setting all vnodes to node %d, free = %lu, need =%lu Kb\n", n, ninfo[n].free/1024, info->max_memkb); + return 0; + } + } + /* TODO: change algorithm. The current just fits the nodes + * Will be nice to have them also sorted by size */ + /* If no p-node found, will be set to NUMA_NO_NODE and allocation will fail */ + LOG(INFO, "Found %d physical NUMA nodes\n", nr_nodes); + memset(claim, 0, sizeof(*claim) * 16); + start = 0; + for ( n = 0; n < nr_nodes; n++ ) + { + for ( i = start; i < info->nr_vnodes; i++ ) + { + LOG(INFO, "Compare %Lx for vnode[%d] size %lx with free space on pnode[%d], free %lx\n", + claim[n] + mems[i], i, mems[i], n, ninfo[n].free); + if ( ((claim[n] + mems[i]) <= ninfo[n].free) && (info->vnode_to_pnode[i] == NUMA_NO_NODE) ) + { + info->vnode_to_pnode[i] = n; + LOG(INFO, "Set vnode[%d] to pnode [%d]\n", i, n); + claim[n] += mems[i]; + } + else { + /* Will have another chance at other pnode */ + start = i; + continue; + } + } + } + return 0; +} int libxl__build_pre(libxl__gc *gc, uint32_t domid, libxl_domain_config *d_config, libxl__domain_build_state *state) @@ -232,9 +289,36 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, if (rc) return rc; } +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA + if (info->nr_vnodes <= info->max_vcpus && info->nr_vnodes != 0) { + vnuma_memblk_t *memblks = libxl__calloc(gc, info->nr_vnodes, sizeof(*memblks)); + libxl_vnuma_align_mem(gc, domid, info, memblks); + if (libxl_init_vnodemap(gc, domid, info) != 0) { + LOG(INFO, "Failed to call init_vnodemap\n"); + rc = libxl_domain_setvnodes(ctx, domid, info->nr_vnodes, + info->max_vcpus, memblks, + info->vdistance, info->vcpu_to_vnode, + NULL); + } + else + rc = libxl_domain_setvnodes(ctx, domid, info->nr_vnodes, + info->max_vcpus, memblks, + info->vdistance, info->vcpu_to_vnode, + info->vnode_to_pnode); + if (rc < 0 ) LOG(INFO, "Failed to call xc_domain_setvnodes\n"); + for(int i=0; i<info->nr_vnodes; i++) + LOG(INFO, "Mapping vnode %d to pnode %d\n", i, info->vnode_to_pnode[i]); + libxl_bitmap_set_none(&info->nodemap); + libxl_bitmap_set(&info->nodemap, 0); + } + else { + LOG(INFO, "NOT Calling vNUMA construct with nr_nodes = %d\n", info->nr_vnodes); + info->nr_vnodes = 0; + } +#endif libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); - + xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); xs_domid = xs_read(ctx->xsh, XBT_NULL, "/tool/xenstored/domid", NULL); state->store_domid = xs_domid ? 
atoi(xs_domid) : 0; @@ -368,7 +452,20 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, } } } - +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA + if (info->nr_vnodes != 0 && info->vnuma_memszs != NULL && info->vnode_to_pnode != NULL) { + dom->nr_vnodes = info->nr_vnodes; + dom->vnumablocks = malloc(info->nr_vnodes * sizeof(*dom->vnumablocks)); + dom->vnode_to_pnode = (int *)malloc(info->nr_vnodes * sizeof(*info->vnode_to_pnode)); + dom->vmemsizes = malloc(info->nr_vnodes * sizeof(*info->vnuma_memszs)); + if (dom->vmemsizes == NULL || dom->vnode_to_pnode == NULL) { + LOGE(ERROR, "%s:Failed to allocate memory for memory sizes.\n",__FUNCTION__); + goto out; + } + memcpy(dom->vmemsizes, info->vnuma_memszs, sizeof(*info->vnuma_memszs) * info->nr_vnodes); + memcpy(dom->vnode_to_pnode, info->vnode_to_pnode, sizeof(*info->vnode_to_pnode) * info->nr_vnodes); + } +#endif dom->flags = flags; dom->console_evtchn = state->console_port; dom->console_domid = state->console_domid; @@ -388,9 +485,17 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, LOGE(ERROR, "xc_dom_mem_init failed"); goto out; } - if ( (ret = xc_dom_boot_mem_init(dom)) != 0 ) { - LOGE(ERROR, "xc_dom_boot_mem_init failed"); - goto out; + if (info->nr_vnodes != 0 && info->vnuma_memszs != NULL) { + if ( (ret = xc_dom_boot_mem_init(dom)) != 0 ) { + LOGE(ERROR, "xc_dom_boot_mem_init_node failed"); + goto out; + } + } + else { + if ( (ret = xc_dom_boot_mem_init(dom)) != 0 ) { + LOGE(ERROR, "xc_dom_boot_mem_init failed"); + goto out; + } } if ( (ret = xc_dom_build_image(dom)) != 0 ) { LOGE(ERROR, "xc_dom_build_image failed"); diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h index f051d91..4a501c4 100644 --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2709,6 +2709,7 @@ static inline void libxl__ctx_unlock(libxl_ctx *ctx) { #define CTX_LOCK (libxl__ctx_lock(CTX)) #define CTX_UNLOCK (libxl__ctx_unlock(CTX)) +#define NUMA_NO_NODE 0xFF /* * Automatic NUMA placement * @@ -2832,6 +2833,8 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc, libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); } +int libxl_init_vnodemap(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info); /* * Inserts "elm_new" into the sorted list "head". 
* diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index 85341a0..c3a4d95 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -208,6 +208,7 @@ libxl_dominfo = Struct("dominfo",[ ("vcpu_max_id", uint32), ("vcpu_online", uint32), ("cpupool", uint32), + ("nr_vnodes", uint16), ], dir=DIR_OUT) libxl_cpupoolinfo = Struct("cpupoolinfo", [ @@ -279,7 +280,10 @@ libxl_domain_build_info = Struct("domain_build_info",[ ("disable_migrate", libxl_defbool), ("cpuid", libxl_cpuid_policy_list), ("blkdev_start", string), - + ("vnuma_memszs", Array(uint64, "nr_vnodes")), + ("vcpu_to_vnode", Array(integer, "nr_vnodemap")), + ("vdistance", Array(integer, "nr_vdist")), + ("vnode_to_pnode", Array(integer, "nr_vnode_to_pnode")), ("device_model_version", libxl_device_model_version), ("device_model_stubdomain", libxl_defbool), # if you set device_model you must set device_model_version too diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c index a78c91d..35da3a8 100644 --- a/tools/libxl/libxl_x86.c +++ b/tools/libxl/libxl_x86.c @@ -308,3 +308,94 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, return ret; } + +unsigned long e820_memory_hole_size(unsigned long start, unsigned long end, struct e820entry e820[], int nr) +{ +#define clamp(val, min, max) ({ \ + typeof(val) __val = (val); \ + typeof(min) __min = (min); \ + typeof(max) __max = (max); \ + (void) (&__val == &__min); \ + (void) (&__val == &__max); \ + __val = __val < __min ? __min: __val; \ + __val > __max ? __max: __val; }) + int i; + unsigned long absent, start_pfn, end_pfn; + absent = start - end; + for(i = 0; i < nr; i++) { + if(e820[i].type == E820_RAM) { + start_pfn = clamp(e820[i].addr, start, end); + end_pfn = clamp(e820[i].addr + e820[i].size, start, end); + absent -= end_pfn - start_pfn; + } + } + return absent; +} + +/* Align memory blocks for linux NUMA build image */ +int libxl_vnuma_align_mem(libxl__gc *gc, + uint32_t domid, + libxl_domain_build_info *b_info, + vnuma_memblk_t *memblks) /* linux specific memory blocks: out */ +{ +#ifndef roundup +#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y)) +#endif + /* + This function transforms mem block sizes in bytes + into aligned PV Linux guest NUMA nodes. + XEN will provide this memory layout to PV Linux guest upon boot for + PV Linux guests. 
+ */ + int i, rc; + unsigned long shift = 0, size, node_min_size = 1, limit; + unsigned long end_max; + uint32_t nr; + struct e820entry map[E820MAX]; + + libxl_ctx *ctx = libxl__gc_owner(gc); + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); + if (rc < 0) { + errno = rc; + return -EINVAL; + } + nr = rc; + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, + (b_info->max_memkb - b_info->target_memkb) + + b_info->u.pv.slack_memkb); + if (rc) + return ERROR_FAIL; + + end_max = map[nr-1].addr + map[nr-1].size; + + shift = 0; + for(i = 0; i < b_info->nr_vnodes; i++) { + printf("block [%d] start inside align = %#lx\n", i, b_info->vnuma_memszs[i]); + } + memset(memblks, 0, sizeof(*memblks)*b_info->nr_vnodes); + memblks[0].start = 0; + for(i = 0; i < b_info->nr_vnodes; i++) { + memblks[i].start += shift; + memblks[i].end += shift + b_info->vnuma_memszs[i]; + limit = size = memblks[i].end - memblks[i].start; + while (memblks[i].end - memblks[i].start - e820_memory_hole_size(memblks[i].start, memblks[i].end, map, nr) < size) { + memblks[i].end += node_min_size; + shift += node_min_size; + if (memblks[i].end - memblks[i].start >= limit) { + memblks[i].end = memblks[i].start + limit; + break; + } + if (memblks[i].end == end_max) { + memblks[i].end = end_max; + break; + } + } + shift = memblks[i].end; + memblks[i].start = roundup(memblks[i].start, 4*1024); + + printf("start = %#010lx, end = %#010lx\n", memblks[i].start, memblks[i].end); + } + if(memblks[i-1].end > end_max) + memblks[i-1].end = end_max; + return 0; +} diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 884f050..36a8275 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -539,7 +539,121 @@ vcpp_out: return rc; } +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA +static int vdistance_parse(char *vdistcfg, int *vdistance, int nr_vnodes) +{ + char *endptr, *toka, *tokb, *saveptra = NULL, *saveptrb = NULL; + int *vdist_tmp = NULL; + int rc = 0; + int i, j, dist, parsed = 0; + rc = -EINVAL; + if(vdistance == NULL) { + return rc; + } + vdist_tmp = (int *)malloc(nr_vnodes * nr_vnodes * sizeof(*vdistance)); + if (vdist_tmp == NULL) + return rc; + i =0; j = 0; + for (toka = strtok_r(vdistcfg, ",", &saveptra); toka; + toka = strtok_r(NULL, ",", &saveptra)) { + if ( i >= nr_vnodes ) + goto vdist_parse_err; + for (tokb = strtok_r(toka, " ", &saveptrb); tokb; + tokb = strtok_r(NULL, " ", &saveptrb)) { + if (j >= nr_vnodes) + goto vdist_parse_err; + dist = strtol(tokb, &endptr, 10); + if (tokb == endptr) + goto vdist_parse_err; + *(vdist_tmp + j*nr_vnodes + i) = dist; + parsed++; + j++; + } + i++; + j = 0; + } + rc = parsed; + memcpy(vdistance, vdist_tmp, nr_vnodes * nr_vnodes * sizeof(*vdistance)); +vdist_parse_err: + if (vdist_tmp !=NULL ) free(vdist_tmp); + return rc; +} +static int vcputovnode_parse(char *cfg, int *vmap, int nr_vnodes, int nr_vcpus) +{ + char *toka, *endptr, *saveptra = NULL; + int *vmap_tmp = NULL; + int rc = 0; + int i; + rc = -EINVAL; + i = 0; + if(vmap == NULL) { + return rc; + } + vmap_tmp = (int *)malloc(sizeof(*vmap) * nr_vcpus); + memset(vmap_tmp, 0, sizeof(*vmap) * nr_vcpus); + for (toka = strtok_r(cfg, " ", &saveptra); toka; + toka = strtok_r(NULL, " ", &saveptra)) { + if (i >= nr_vcpus) goto vmap_parse_out; + vmap_tmp[i] = strtoul(toka, &endptr, 10); + if( endptr == toka) + goto vmap_parse_out; + fprintf(stderr, "Parsed vcpu_to_vnode[%d] = %d.\n", i, vmap_tmp[i]); + i++; + } + memcpy(vmap, vmap_tmp, sizeof(*vmap) * nr_vcpus); + rc = i; +vmap_parse_out: + if (vmap_tmp != NULL) 
free(vmap_tmp); + return rc; +} + +static int vnumamem_parse(char *vmemsizes, uint64_t *vmemregions, int nr_vnodes) +{ + uint64_t memsize; + char *endptr, *toka, *saveptr = NULL; + int rc = 0; + int j; + rc = -EINVAL; + if(vmemregions == NULL) { + goto vmem_parse_out; + } + memsize = 0; + j = 0; + for (toka = strtok_r(vmemsizes, ",", &saveptr); toka; + toka = strtok_r(NULL, ",", &saveptr)) { + if ( j >= nr_vnodes ) + goto vmem_parse_out; + memsize = strtoul(toka, &endptr, 10); + if (endptr == toka) + goto vmem_parse_out; + switch (*endptr) { + case ''G'': + case ''g'': + memsize = memsize * 1024 * 1024 * 1024; + break; + case ''M'': + case ''m'': + memsize = memsize * 1024 * 1024; + break; + case ''K'': + case ''k'': + memsize = memsize * 1024 ; + break; + default: + continue; + break; + } + if (memsize > 0) { + vmemregions[j] = memsize; + j++; + } + } + rc = j; +vmem_parse_out: + return rc; +} +#endif static void parse_config_data(const char *config_source, const char *config_data, int config_len, @@ -871,7 +985,13 @@ static void parse_config_data(const char *config_source, { char *cmdline = NULL; const char *root = NULL, *extra = ""; - +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA + const char *vnumamemcfg = NULL; + int nr_vnuma_regions; + long unsigned int vnuma_memparsed = 0; + const char *vmapcfg = NULL; + const char *vdistcfg = NULL; +#endif xlu_cfg_replace_string (config, "kernel", &b_info->u.pv.kernel, 0); xlu_cfg_get_string (config, "root", &root, 0); @@ -888,7 +1008,82 @@ static void parse_config_data(const char *config_source, fprintf(stderr, "Failed to allocate memory for cmdline\n"); exit(1); } +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA + if (!xlu_cfg_get_long (config, "vnodes", &l, 0)) { + b_info->nr_vnodes = l; + if (b_info->nr_vnodes <= 0) + exit(1); + if(!xlu_cfg_get_string (config, "vnumamem", &vnumamemcfg, 0)) { + b_info->vnuma_memszs = calloc(b_info->nr_vnodes, + sizeof(*b_info->vnuma_memszs)); + if (b_info->vnuma_memszs == NULL) { + fprintf(stderr, "WARNING: Could not allocate vNUMA node memory sizes.\n"); + exit(1); + } + char *buf2 = strdup(vnumamemcfg); + nr_vnuma_regions = vnumamem_parse(buf2, b_info->vnuma_memszs, + b_info->nr_vnodes); + for(i = 0; i < b_info->nr_vnodes; i++) + vnuma_memparsed = vnuma_memparsed + (b_info->vnuma_memszs[i] >> 10); + + if(vnuma_memparsed != b_info->max_memkb || + nr_vnuma_regions != b_info->nr_vnodes ) + { + fprintf(stderr, "WARNING: Incorrect vNUMA config. Parsed memory = %lu, parsed nodes = %d, max = %lx\n", + vnuma_memparsed, nr_vnuma_regions, b_info->max_memkb); + if(buf2) free(buf2); + exit(1); + } + if (buf2) free(buf2); + } + else + b_info->nr_vnodes=0; + if(!xlu_cfg_get_string(config, "vnuma_distance", &vdistcfg, 0)) { + b_info->vdistance = (int *)calloc(b_info->nr_vnodes * b_info->nr_vnodes, + sizeof(*b_info->vdistance)); + if (b_info->vdistance == NULL) + exit(1); + char *buf2 = strdup(vdistcfg); + if(vdistance_parse(buf2, b_info->vdistance, b_info->nr_vnodes) != b_info->nr_vnodes * b_info->nr_vnodes) { + if (buf2) free(buf2); + free(b_info->vdistance); + exit(1); + } + if(buf2) free(buf2); + } + else + { + /* default distance */ + b_info->vdistance = (int *)calloc(b_info->nr_vnodes * b_info->nr_vnodes, sizeof(*b_info->vdistance)); + if (b_info->vdistance == NULL) + exit(1); + for(i = 0; i < b_info->nr_vnodes; i++) + for(int j = 0; j < b_info->nr_vnodes; j++) + *(b_info->vdistance + j*b_info->nr_vnodes + i) = (i == j ? 
10 : 20); + } + if(!xlu_cfg_get_string(config, "vcpu_to_vnode", &vmapcfg, 0)) + { + b_info->vcpu_to_vnode = (int *)calloc(b_info->max_vcpus, sizeof(*b_info->vcpu_to_vnode)); + if (b_info->vcpu_to_vnode == NULL) + exit(-1); + char *buf2 = strdup(vmapcfg); + if (vcputovnode_parse(buf2, b_info->vcpu_to_vnode, b_info->nr_vnodes, b_info->max_vcpus) < 0) { + if (buf2) free(buf2); + fprintf(stderr, "Error parsing vcpu to vnode mask\n"); + exit(1); + } + if(buf2) free(buf2); + } + else + { + b_info->vcpu_to_vnode = (int *)calloc(b_info->max_vcpus, sizeof(*b_info->vcpu_to_vnode)); + if (b_info->vcpu_to_vnode != NULL) + libxl_default_vcpu_to_vnuma(b_info); + } + } +#endif + xlu_cfg_replace_string (config, "bootloader", &b_info->u.pv.bootloader, 0); switch (xlu_cfg_get_list_as_string_list(config, "bootloader_args", &b_info->u.pv.bootloader_args, 1)) -- 1.7.10.4
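For reference, the defaults applied above when vnuma_distance or vcpu_to_vnode are omitted from the config (10 on the distance-matrix diagonal, 20 elsewhere, and a round-robin vcpu-to-vnode assignment) amount to the following standalone sketch:

/* Sketch of the config defaults from the hunk above: symmetric distance
 * table and round-robin vcpu placement. Arrays are caller-allocated. */
static void vnuma_apply_defaults(int *vdistance, int *vcpu_to_vnode,
                                 int nr_vnodes, int nr_vcpus)
{
    int i, j;

    for ( i = 0; i < nr_vnodes; i++ )
        for ( j = 0; j < nr_vnodes; j++ )
            vdistance[i * nr_vnodes + j] = (i == j) ? 10 : 20;

    for ( i = 0; i < nr_vcpus; i++ )
        vcpu_to_vnode[i] = i % nr_vnodes;
}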
Enables libxl vnuma ABI by LIBXL_HAVE_BUILDINFO_VNUMA.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl.h |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index a1a5e33..ad0d0d8 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -90,6 +90,14 @@
 #define LIBXL_HAVE_BUILDINFO_HVM_VENDOR_DEVICE 1
 
 /*
+ * LIBXL_HAVE_BUILDINFO_VNUMA indicates that vnuma topology will be
+ * build for the guest upon request and with VM configuration.
+ * It will try to define best allocation for vNUMA
+ * nodes on real NUMA nodes.
+ */
+#define LIBXL_HAVE_BUILDINFO_VNUMA 1
+
+/*
  * libxl ABI compatibility
  *
  * The only guarantee which libxl makes regarding ABI compatibility
-- 
1.7.10.4
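An out-of-tree libxl consumer would use the new guard roughly as sketched below; the helper name is made up:

#include <libxl.h>

/* Sketch: compile the vNUMA setup only when the library provides it. */
static void maybe_setup_vnuma(libxl_domain_build_info *b_info, int nr_vnodes)
{
#ifdef LIBXL_HAVE_BUILDINFO_VNUMA
    b_info->nr_vnodes = nr_vnodes;
    /* ... fill vnuma_memszs / vdistance / vcpu_to_vnode here ... */
#else
    /* Older libxl: no vNUMA fields in libxl_domain_build_info. */
    (void)b_info;
    (void)nr_vnodes;
#endif
}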
Elena Ufimtseva
2013-Aug-27 07:54 UTC
[PATCH RFC 6/7] libxc/vnuma: vnuma per phys NUMA allocation.
Allows for vNUMA enabled domains to allocate vnodes on physical NUMA nodes. Tries to allocate all vnodes on one NUMA node, or on next one if not all vnodes fit. If no physical numa node found, will let xen decide. TODO: take into account cpu pinning if defined in VM config; take into account automatic NUMA placement mechanism; Allows for vNUMA enabled domains to allocate vnodes on physical NUMA nodes. Adds some arch bits. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- tools/libxc/xc_dom.h | 1 + tools/libxc/xc_dom_x86.c | 79 ++++++++++++++++++++++++++++++++++++++++------ tools/libxc/xg_private.h | 4 +++ 3 files changed, 75 insertions(+), 9 deletions(-) diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h index 4375f25..7037614 100644 --- a/tools/libxc/xc_dom.h +++ b/tools/libxc/xc_dom.h @@ -371,6 +371,7 @@ static inline xen_pfn_t xc_dom_p2m_guest(struct xc_dom_image *dom, int arch_setup_meminit(struct xc_dom_image *dom); int arch_setup_bootearly(struct xc_dom_image *dom); int arch_setup_bootlate(struct xc_dom_image *dom); +int arch_boot_numa_alloc(struct xc_dom_image *dom); /* * Local variables: diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c index 126c0f8..99f7444 100644 --- a/tools/libxc/xc_dom_x86.c +++ b/tools/libxc/xc_dom_x86.c @@ -789,27 +789,47 @@ int arch_setup_meminit(struct xc_dom_image *dom) else { /* try to claim pages for early warning of insufficient memory avail */ + rc = 0; if ( dom->claim_enabled ) { rc = xc_domain_claim_pages(dom->xch, dom->guest_domid, dom->total_pages); if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed to claim mem for dom\n", + __FUNCTION__); return rc; + } } /* setup initial p2m */ for ( pfn = 0; pfn < dom->total_pages; pfn++ ) dom->p2m_host[pfn] = pfn; /* allocate guest memory */ - for ( i = rc = allocsz = 0; - (i < dom->total_pages) && !rc; - i += allocsz ) + if (dom->nr_vnodes > 0) { - allocsz = dom->total_pages - i; - if ( allocsz > 1024*1024 ) - allocsz = 1024*1024; - rc = xc_domain_populate_physmap_exact( - dom->xch, dom->guest_domid, allocsz, - 0, 0, &dom->p2m_host[i]); + rc = arch_boot_numa_alloc(dom); + if ( rc ) + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed to allocate memory on NUMA nodes\n", + __FUNCTION__); + return rc; + } + } + else + { + for ( i = rc = allocsz = 0; + (i < dom->total_pages) && !rc; + i += allocsz ) + { + allocsz = dom->total_pages - i; + if ( allocsz > 1024*1024 ) + allocsz = 1024*1024; + rc = xc_domain_populate_physmap_exact( + dom->xch, dom->guest_domid, allocsz, + 0, 0, &dom->p2m_host[i]); + } } /* Ensure no unclaimed pages are left unused. 
@@ -817,7 +837,48 @@ int arch_setup_meminit(struct xc_dom_image *dom) (void)xc_domain_claim_pages(dom->xch, dom->guest_domid, 0 /* cancels the claim */); } + return rc; +} + +int arch_boot_numa_alloc(struct xc_dom_image *dom) +{ + int rc, n; + uint64_t guest_pages; + unsigned long allocsz, i, k; + unsigned long memflags; + rc = allocsz = k = 0; + for(n = 0; n < dom->nr_vnodes; n++) + { + memflags = 0; + if ( dom->vnode_to_pnode[n] != NUMA_NO_NODE ) + { + memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[n]); + memflags |= XENMEMF_exact_node_request; + } + guest_pages = dom->vmemsizes[n] >> PAGE_SHIFT_X86; + for ( i = 0; + (i < guest_pages) && !rc; + i += allocsz ) + { + allocsz = guest_pages - i; + if ( allocsz > 1024*1024 ) + allocsz = 1024*1024; + rc = xc_domain_populate_physmap_exact( + dom->xch, dom->guest_domid, allocsz, + 0, memflags, &dom->p2m_host[i + k]); + k += allocsz; + } + if (rc == 0) printf("%s: allocated %lx pages for vnode %d on pnode %d out of %lx\n", + __FUNCTION__,i, n, dom->vnode_to_pnode[n], dom->total_pages); + else + { + xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, + "%s: Failed allocation of %lx pages for vnode %d on pnode %d out of %lx\n", + __FUNCTION__,i, n, dom->vnode_to_pnode[n], dom->total_pages); + return -EINVAL; + } + } return rc; } diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h index db02ccf..538d185 100644 --- a/tools/libxc/xg_private.h +++ b/tools/libxc/xg_private.h @@ -127,6 +127,10 @@ typedef uint64_t l4_pgentry_64_t; #define ROUNDUP(_x,_w) (((unsigned long)(_x)+(1UL<<(_w))-1) & ~((1UL<<(_w))-1)) #define NRPAGES(x) (ROUNDUP(x, PAGE_SHIFT) >> PAGE_SHIFT) +#define MAX_ORDER_X86 11 +#define NODE_MIN_SIZE_X86 1024*1024*4 +#define ZONE_ALIGN_X86 (1UL << (MAX_ORDER_X86 + PAGE_SHIFT_X86)) +#define NUMA_NO_NODE 0xFF /* XXX SMH: following skanky macros rely on variable p2m_size being set */ /* XXX TJD: also, "guest_width" should be the guest''s sizeof(unsigned long) */ -- 1.7.10.4
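Condensed, the per-vnode allocation step introduced above reduces to the sketch below; the 1024*1024-page chunking loop from the patch is omitted for brevity, and the helper name populate_vnode is made up:

/* Condensed view of the per-vnode allocation step: request pages from
 * the exact pnode when a mapping exists, otherwise let Xen choose.
 * Relies on xc_dom.h types and the NUMA_NO_NODE/XENMEMF_* definitions
 * used in the patch. */
static int populate_vnode(struct xc_dom_image *dom, int vnode,
                          xen_pfn_t first_pfn, unsigned long nr_pages)
{
    unsigned long memflags = 0;

    if ( dom->vnode_to_pnode[vnode] != NUMA_NO_NODE )
    {
        memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[vnode]);
        memflags |= XENMEMF_exact_node_request;
    }

    return xc_domain_populate_physmap_exact(dom->xch, dom->guest_domid,
                                            nr_pages, 0 /* order */,
                                            memflags,
                                            &dom->p2m_host[first_pfn]);
}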
Prints basic vnuma info per domain on 'debug-keys u'.

Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
---
 xen/arch/x86/numa.c |   16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index b141877..71bfd31 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -347,7 +347,7 @@ EXPORT_SYMBOL(node_data);
 static void dump_numa(unsigned char key)
 {
     s_time_t now = NOW();
-    int i;
+    int i, j;
     struct domain *d;
     struct page_info *page;
     unsigned int page_num_node[MAX_NUMNODES];
@@ -389,6 +389,20 @@ static void dump_numa(unsigned char key)
         for_each_online_node(i)
             printk("    Node %u: %u\n", i, page_num_node[i]);
+        if(d->vnuma.nr_vnodes > 0)
+        {
+            printk("    Domain has %d vnodes\n", d->vnuma.nr_vnodes);
+            for(j = 0; j < d->vnuma.nr_vnodes; j++) {
+                printk("        vnode %d ranges %#010lx - %#010lx pnode %d",
+                       j, d->vnuma.vnuma_memblks[j].start,
+                       d->vnuma.vnuma_memblks[j].end,
+                       d->vnuma.vnode_to_pnode[j]);
+            }
+            printk("    Domain vcpu to vnode: ");
+            for(j = 0; j < d->max_vcpus; j++)
+                printk("%d ", d->vnuma.vcpu_to_vnode[j]);
+            printk("\n");
+        }
     }
 
     rcu_read_unlock(&domlist_read_lock);
-- 
1.7.10.4
Jan Beulich
2013-Aug-27 08:53 UTC
Re: [PATCH RFC 1/7] xen/vnuma: subop hypercall and vnuma topology structures.
>>> On 27.08.13 at 09:54, Elena Ufimtseva <ufimtseva@gmail.com> wrote: > Defines XENMEM subop hypercall for PV vNUMA enabled guests and provides > vNUMA topology information from per-domain vnuma topology build info. > TODO: > subop XENMEM hypercall is subject to change to sysctl subop.That would mean it''s intended to be used by the tool stack only. I thought that the balloon driver (and perhaps other code) are also intended to be consumers.> @@ -732,7 +733,94 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg) > rcu_unlock_domain(d); > > break; > - > + case XENMEM_get_vnuma_info: > + { > + int i; > + struct vnuma_topology_info mtopology; > + struct vnuma_topology_info touser_topo; > + struct domain *d; > + unsigned int max_pages; > + vnuma_memblk_t *vblks; > + XEN_GUEST_HANDLE(int) vdistance; > + XEN_GUEST_HANDLE_PARAM(int) vdist_param; > + XEN_GUEST_HANDLE(vnuma_memblk_t) buf; > + XEN_GUEST_HANDLE_PARAM(vnuma_memblk_t) buf_param; > + XEN_GUEST_HANDLE(int) vcpu_to_vnode; > + XEN_GUEST_HANDLE_PARAM(int) vmap_param; > + > + rc = -1;You absolutely need to use proper -E... values when returnin hypercall status.> + if ( guest_handle_is_null(arg) ) > + return rc; > + if( copy_from_guest(&mtopology, arg, 1) ) > + { > + gdprintk(XENLOG_INFO, "Cannot get copy_from_guest..\n"); > + return -EFAULT; > + } > + gdprintk(XENLOG_INFO, "Domain id is %d\n",mtopology.domid);I appreciate that you need such for debugging, but this should be removed before posting patches.> + if ( (d = rcu_lock_domain_by_any_id(mtopology.domid)) == NULL ) > + { > + gdprintk(XENLOG_INFO, "Numa: Could not get domain id.\n"); > + return -ESRCH; > + } > + rcu_unlock_domain(d); > + touser_topo.nr_vnodes = d->vnuma.nr_vnodes;Mis-ordered: First you want to use d, then rcu-unlock it.> + rc = copy_to_guest(arg, &touser_topo, 1); > + if ( rc ) > + { > + gdprintk(XENLOG_INFO, "Bad news, could not copy to guest NUMA info\n"); > + return -EFAULT; > + } > + max_pages = d->max_pages; > + if ( touser_topo.nr_vnodes == 0 || touser_topo.nr_vnodes > d->max_vcpus ) > + { > + gdprintk(XENLOG_INFO, "vNUMA: Error in block creation - vnodes %d, vcpus %d \n", touser_topo.nr_vnodes, d->max_vcpus); > + return -EFAULT; > + } > + vblks = (vnuma_memblk_t *)xmalloc_array(struct vnuma_memblk, touser_topo.nr_vnodes); > + if ( vblks == NULL ) > + { > + gdprintk(XENLOG_INFO, "vNUMA: Could not get memory for memblocks\n"); > + return -1; > + } > + buf_param = guest_handle_cast(mtopology.vnuma_memblks, vnuma_memblk_t);By giving the structure field a proper type you should be able to avoid the use of guest_handle_cast() here and below.> + buf = guest_handle_from_param(buf_param, vnuma_memblk_t); > + for ( i = 0; i < touser_topo.nr_vnodes; i++ ) > + { > + gdprintk(XENLOG_INFO, "vmemblk[%d] start %#lx end %#lx\n", i, d->vnuma.vnuma_memblks[i].start, d->vnuma.vnuma_memblks[i].end);Actually, I''m going to give up here (for this file) - the not cleaned up code is obfuscating the real meat of the code too much for reasonable reviewing.> --- a/xen/include/public/memory.h > +++ b/xen/include/public/memory.h > @@ -453,6 +453,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_mem_sharing_op_t); > * Caller must be privileged or the hypercall fails. 
> */ > #define XENMEM_claim_pages 24 > +#define XENMEM_get_vnuma_info 25 > > /* > * XENMEM_claim_pages flags - the are no flags at this time.Misplaced - don''t put this in the middle of another logical section.> --- /dev/null > +++ b/xen/include/public/vnuma.h > @@ -0,0 +1,12 @@ > +#ifndef __XEN_PUBLIC_VNUMA_H > +#define __XEN_PUBLIC_VNUMA_H > + > +#include "xen.h" > + > +struct vnuma_memblk { > + uint64_t start, end; > +}; > +typedef struct vnuma_memblk vnuma_memblk_t; > +DEFINE_XEN_GUEST_HANDLE(vnuma_memblk_t); > + > +#endifUnmotivated new file. Plus the type isn''t used elsewhere in the public interface.> @@ -89,4 +90,12 @@ extern unsigned int xen_processor_pmbits; > > extern bool_t opt_dom0_vcpus_pin; > > +struct domain_vnuma_info { > + uint16_t nr_vnodes; > + int *vdistance; > + vnuma_memblk_t *vnuma_memblks; > + int *vcpu_to_vnode; > + int *vnode_to_pnode;Can any of the "int" fields reasonably be negative? If not, they ought to be "unsigned int".> --- /dev/null > +++ b/xen/include/xen/vnuma.h > @@ -0,0 +1,27 @@ > +#ifndef _VNUMA_H > +#define _VNUMA_H > +#include <public/vnuma.h> > + > +/* DEFINE_XEN_GUEST_HANDLE(vnuma_memblk_t); */ > + > +struct vnuma_topology_info { > + domid_t domid; > + uint16_t nr_vnodes; > + XEN_GUEST_HANDLE_64(vnuma_memblk_t) vnuma_memblks; > + XEN_GUEST_HANDLE_64(int) vdistance; > + XEN_GUEST_HANDLE_64(int) vcpu_to_vnode; > + XEN_GUEST_HANDLE_64(int) vnode_to_pnode; > +}; > +typedef struct vnuma_topology_info vnuma_topology_info_t; > +DEFINE_XEN_GUEST_HANDLE(vnuma_topology_info_t);At least up to here it seems like this is part of the intended public interface, and hence ought to go into public/memory.h.> +#define __vnode_distance_offset(_dom, _i, _j) \ > + ( ((_j)*((_dom)->vnuma.nr_vnodes)) + (_i) )Missing blanks around *.> + > +#define __vnode_distance(_dom, _i, _j) \ > + ( (_dom)->vnuma.vdistance[__vnode_distance_offset((_dom), (_i), (_j))] ) > + > +#define __vnode_distance_set(_dom, _i, _j, _v) \ > + do { __vnode_distance((_dom), (_i), (_j)) = (_v); } while(0)Proper parenthesization is clearly necessary, but there are clearly some that are reducing legibility of the code without having any useful purpose. Jan
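Taken together, these points suggest a reworked handler along the lines of the sketch below. This is illustrative only, not a posted revision; it keeps the surrounding do_memory_op() switch implicit and assumes the topology structure fields are given proper handle types so no guest_handle_cast() is needed.

    /* Illustrative skeleton: proper -E... return values, use the domain
     * before rcu-unlocking it, and no debug printks. */
    case XENMEM_get_vnuma_info:
    {
        struct vnuma_topology_info topo;
        struct domain *d;

        if ( copy_from_guest(&topo, arg, 1) )
            return -EFAULT;

        if ( (d = rcu_lock_domain_by_any_id(topo.domid)) == NULL )
            return -ESRCH;

        rc = -EOPNOTSUPP;
        if ( d->vnuma.nr_vnodes == 0 )
            goto vnuma_out;

        rc = -EFAULT;
        topo.nr_vnodes = d->vnuma.nr_vnodes;
        if ( copy_to_guest(arg, &topo, 1) ||
             copy_to_guest(topo.vnuma_memblks, d->vnuma.vnuma_memblks,
                           d->vnuma.nr_vnodes) ||
             copy_to_guest(topo.vdistance, d->vnuma.vdistance,
                           d->vnuma.nr_vnodes * d->vnuma.nr_vnodes) ||
             copy_to_guest(topo.vcpu_to_vnode, d->vnuma.vcpu_to_vnode,
                           d->max_vcpus) )
            goto vnuma_out;

        rc = 0;
     vnuma_out:
        rcu_unlock_domain(d);
        break;
    }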
Jan Beulich
2013-Aug-27 08:59 UTC
Re: [PATCH RFC 2/7] xen/vnuma: domctl subop for vnuma setup.
>>> On 27.08.13 at 09:54, Elena Ufimtseva <ufimtseva@gmail.com> wrote: > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -227,6 +227,11 @@ struct domain *domain_create( > spin_lock_init(&d->node_affinity_lock); > d->node_affinity = NODE_MASK_ALL; > d->auto_node_affinity = 1; > + d->vnuma.vnuma_memblks = NULL; > + d->vnuma.vnode_to_pnode = NULL; > + d->vnuma.vcpu_to_vnode = NULL; > + d->vnuma.vdistance = NULL; > + d->vnuma.nr_vnodes = 0;Pretty pointless considering that struct domain starts out from a zeroed page.> @@ -532,6 +537,7 @@ int domain_kill(struct domain *d) > tmem_destroy(d->tmem); > domain_set_outstanding_pages(d, 0); > d->tmem = NULL; > + /* TODO: vnuma_destroy(d->vnuma); */That''s intended to go away by the time the RFC tag gets dropped?> @@ -862,7 +863,76 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) > ret = set_global_virq_handler(d, virq); > } > break; > - > + case XEN_DOMCTL_setvnumainfo: > + { > + int i, j; > + int dist_size; > + int dist, vmap, vntop;unsigned, unsigned, unsigned.> + vnuma_memblk_t vmemblk; > + > + ret = -EFAULT; > + dist = i = j = 0; > + if (op->u.vnuma.nr_vnodes <= 0 || op->u.vnuma.nr_vnodes > NR_CPUS) > + break;-EFAULT seems inappropriate here.> + d->vnuma.nr_vnodes = op->u.vnuma.nr_vnodes; > + dist_size = d->vnuma.nr_vnodes * d->vnuma.nr_vnodes; > + if ( (d->vnuma.vdistance = xmalloc_bytes(sizeof(*d->vnuma.vdistance) * dist_size) ) == NULL) > + break; > + for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) > + for ( j = 0; j < d->vnuma.nr_vnodes; j++ ) > + { > + if ( unlikely(__copy_from_guest_offset(&dist, op->u.vnuma.vdistance, __vnode_distance_offset(d, i, j), 1)) )Long line.> + { > + gdprintk(XENLOG_INFO, "vNUMA: Copy distance table error\n"); > + goto err_dom; > + } > + __vnode_distance_set(d, i, j, dist); > + } > + if ( (d->vnuma.vnuma_memblks = xmalloc_bytes(sizeof(*d->vnuma.vnuma_memblks) * d->vnuma.nr_vnodes)) == NULL )Again.> + goto err_dom; > + for ( i = 0; i < d->vnuma.nr_vnodes; i++ ) > + { > + if ( unlikely(__copy_from_guest_offset(&vmemblk, op->u.vnuma.vnuma_memblks, i, 1)) ) > + { > + gdprintk(XENLOG_INFO, "vNUMA: memory size error\n");Just like for the earlier patch - the many formal problems make it quite hard to review the actual code.> @@ -852,6 +853,17 @@ struct xen_domctl_set_broken_page_p2m { > typedef struct xen_domctl_set_broken_page_p2m xen_domctl_set_broken_page_p2m_t; > DEFINE_XEN_GUEST_HANDLE(xen_domctl_set_broken_page_p2m_t); > > +struct xen_domctl_vnuma { > + uint16_t nr_vnodes; > + XEN_GUEST_HANDLE_64(int) vdistance; > + XEN_GUEST_HANDLE_64(vnuma_memblk_t) vnuma_memblks; > + XEN_GUEST_HANDLE_64(int) vcpu_to_vnode; > + XEN_GUEST_HANDLE_64(int) vnode_to_pnode;uint, uint, uint. Jan
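Following the type comments above, the interface structure would presumably end up looking something like this (padding shown explicitly; illustrative only, not a posted revision):

/* Possible revised layout: unsigned handle element types, explicit
 * padding before the 8-byte-aligned handles. */
struct xen_domctl_vnuma {
    uint16_t nr_vnodes;
    uint16_t _pad[3];
    XEN_GUEST_HANDLE_64(uint) vdistance;
    XEN_GUEST_HANDLE_64(vnuma_memblk_t) vnuma_memblks;
    XEN_GUEST_HANDLE_64(uint) vcpu_to_vnode;
    XEN_GUEST_HANDLE_64(uint) vnode_to_pnode;
};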
Ian Campbell
2013-Aug-27 09:10 UTC
Re: [PATCH RFC 3/7] libxc/vnuma: per-domain vnuma structures.
On Tue, 2013-08-27 at 03:54 -0400, Elena Ufimtseva wrote:
> Makes use of domctl vnuma subop and initializes per-domain
> vnuma topology.
>
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
>  tools/libxc/xc_dom.h    |    9 +++++++
>  tools/libxc/xc_domain.c |   63 +++++++++++++++++++++++++++++++++++++++++++++++
>  tools/libxc/xenctrl.h   |   17 +++++++++++++
>  3 files changed, 89 insertions(+)
>
> diff --git a/tools/libxc/xc_dom.h b/tools/libxc/xc_dom.h
> index 86e23ee..4375f25 100644
> --- a/tools/libxc/xc_dom.h
> +++ b/tools/libxc/xc_dom.h
> @@ -114,6 +114,15 @@ struct xc_dom_image {
>      struct xc_dom_phys *phys_pages;
>      int realmodearea_log;
>
> +    /* vNUMA topology and memory allocation structure
> +     * Defines the way to allocate XEN
> +     * memory from phys NUMA nodes by providing mask
> +     * vnuma_to_pnuma */
> +    int nr_vnodes;
> +    struct vnuma_memblk *vnumablocks;
> +    uint64_t *vmemsizes;
> +    int *vnode_to_pnode;
> +
>      /* malloc memory pool */
>      struct xc_dom_mem *memblocks;
>
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> index 3257e2a..98445e3 100644
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -24,6 +24,7 @@
>  #include "xg_save_restore.h"
>  #include <xen/memory.h>
>  #include <xen/hvm/hvm_op.h>
> +#include "xg_private.h"
>
>  int xc_domain_create(xc_interface *xch,
>                       uint32_t ssidref,
> @@ -1629,6 +1630,68 @@ int xc_domain_set_virq_handler(xc_interface *xch, uint32_t domid, int virq)
>      return do_domctl(xch, &domctl);
>  }
>
> +/* Informs XEN that domain is vNUMA aware */

"Xen" ;-)

> +int xc_domain_setvnodes(xc_interface *xch,
> +                        uint32_t domid,
> +                        uint16_t nr_vnodes,
> +                        uint16_t nr_vcpus,
> +                        vnuma_memblk_t *vmemblks,
> +                        int *vdistance,
> +                        int *vcpu_to_vnode,
> +                        int *vnode_to_pnode)

Can some of these be const?

> +{
> +    int rc;
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(int, distbuf);
> +    DECLARE_HYPERCALL_BUFFER(vnuma_memblk_t, membuf);
> +    DECLARE_HYPERCALL_BUFFER(int, vcpumapbuf);
> +    DECLARE_HYPERCALL_BUFFER(int, vntopbuf);
> +
> +    rc = -EINVAL;

After the comment below about ENOMEM I think the value set here is
unused.

> +    memset(&domctl, 0, sizeof(domctl));

DECLARE_DOMCTL will initialise domctl iff valgrind is enabled, which is
all that is required I think.

> +    if ( vdistance == NULL || vcpu_to_vnode == NULL || vmemblks == NULL )
> +    /* vnode_to_pnode can be null on non-NUMA machines */
> +    {
> +        PERROR("Parameters are wrong XEN_DOMCTL_setvnumainfo\n");
> +        return -EINVAL;
> +    }
> +    distbuf = xc_hypercall_buffer_alloc
> +              (xch, distbuf, sizeof(*vdistance) * nr_vnodes * nr_vnodes);
> +    membuf = xc_hypercall_buffer_alloc
> +             (xch, membuf, sizeof(*membuf) * nr_vnodes);
> +    vcpumapbuf = xc_hypercall_buffer_alloc
> +                 (xch, vcpumapbuf, sizeof(*vcpu_to_vnode) * nr_vcpus);
> +    vntopbuf = xc_hypercall_buffer_alloc
> +               (xch, vntopbuf, sizeof(*vnode_to_pnode) * nr_vnodes);
> +
> +    if (distbuf == NULL || membuf == NULL || vcpumapbuf == NULL || vntopbuf == NULL )
> +    {
> +        PERROR("Could not allocate memory for xc hypercall XEN_DOMCTL_setvnumainfo\n");

rc = -ENOMEM?

> +        goto fail;
> +    }
> +    memcpy(distbuf, vdistance, sizeof(*vdistance) * nr_vnodes * nr_vnodes);
> +    memcpy(vntopbuf, vnode_to_pnode, sizeof(*vnode_to_pnode) * nr_vnodes);
> +    memcpy(vcpumapbuf, vcpu_to_vnode, sizeof(*vcpu_to_vnode) * nr_vcpus);
> +    memcpy(membuf, vmemblks, sizeof(*vmemblks) * nr_vnodes);

You can use DECLARE_HYPERCALL_BOUNCE and xc__hypercall_bounce_pre/post
which takes care of the alloc and copying stuff internally.

> +
> +    set_xen_guest_handle(domctl.u.vnuma.vdistance, distbuf);
> +    set_xen_guest_handle(domctl.u.vnuma.vnuma_memblks, membuf);
> +    set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpumapbuf);
> +    set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vntopbuf);
> +
> +    domctl.cmd = XEN_DOMCTL_setvnumainfo;
> +    domctl.domain = (domid_t)domid;
> +    domctl.u.vnuma.nr_vnodes = nr_vnodes;
> +    rc = do_domctl(xch, &domctl);
> +fail:
> +    xc_hypercall_buffer_free(xch, distbuf);
> +    xc_hypercall_buffer_free(xch, membuf);
> +    xc_hypercall_buffer_free(xch, vcpumapbuf);
> +    xc_hypercall_buffer_free(xch, vntopbuf);
> +
> +    return rc;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> index f2cebaf..fb66cfa 100644
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -1083,6 +1083,23 @@ int xc_domain_set_memmap_limit(xc_interface *xch,
>                                 uint32_t domid,
>                                 unsigned long map_limitkb);
>
> +/*unsigned long xc_get_memory_hole_size(unsigned long start, unsigned long end);

What is this?

> +
> +int xc_domain_align_vnodes(xc_interface *xch,
> +                           uint32_t domid,
> +                           uint64_t *vmemareas,
> +                           vnuma_memblk_t *vnuma_memblks,
> +                           uint16_t nr_vnodes);
> +*/
> +int xc_domain_setvnodes(xc_interface *xch,
> +                        uint32_t domid,
> +                        uint16_t nr_vnodes,
> +                        uint16_t nr_vcpus,
> +                        vnuma_memblk_t *vmemareas,
> +                        int *vdistance,
> +                        int *vcpu_to_vnode,
> +                        int *vnode_to_pnode);
> +
>  #if defined(__i386__) || defined(__x86_64__)
>  /*
>   * PC BIOS standard E820 types and structure.
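
A minimal sketch of the bounce-buffer variant suggested above might look like
the following. It keeps the domctl field names from the patch
(domctl.u.vnuma.*) and, for brevity, requires all four arrays to be non-NULL,
whereas the original allows vnode_to_pnode to be NULL on non-NUMA hosts:

    int xc_domain_setvnodes(xc_interface *xch, uint32_t domid,
                            uint16_t nr_vnodes, uint16_t nr_vcpus,
                            vnuma_memblk_t *vmemblks, int *vdistance,
                            int *vcpu_to_vnode, int *vnode_to_pnode)
    {
        int rc = -1;
        DECLARE_DOMCTL;
        /* Bounce the caller's arrays into hypercall-safe memory (IN only,
         * Xen does not write back through these handles). */
        DECLARE_HYPERCALL_BOUNCE(vdistance,
                                 sizeof(*vdistance) * nr_vnodes * nr_vnodes,
                                 XC_HYPERCALL_BUFFER_BOUNCE_IN);
        DECLARE_HYPERCALL_BOUNCE(vmemblks, sizeof(*vmemblks) * nr_vnodes,
                                 XC_HYPERCALL_BUFFER_BOUNCE_IN);
        DECLARE_HYPERCALL_BOUNCE(vcpu_to_vnode,
                                 sizeof(*vcpu_to_vnode) * nr_vcpus,
                                 XC_HYPERCALL_BUFFER_BOUNCE_IN);
        DECLARE_HYPERCALL_BOUNCE(vnode_to_pnode,
                                 sizeof(*vnode_to_pnode) * nr_vnodes,
                                 XC_HYPERCALL_BUFFER_BOUNCE_IN);

        if ( !vmemblks || !vdistance || !vcpu_to_vnode || !vnode_to_pnode )
        {
            errno = EINVAL;
            return -1;
        }

        if ( xc_hypercall_bounce_pre(xch, vdistance) ||
             xc_hypercall_bounce_pre(xch, vmemblks) ||
             xc_hypercall_bounce_pre(xch, vcpu_to_vnode) ||
             xc_hypercall_bounce_pre(xch, vnode_to_pnode) )
        {
            PERROR("Could not bounce buffers for XEN_DOMCTL_setvnumainfo");
            goto out;
        }

        set_xen_guest_handle(domctl.u.vnuma.vdistance, vdistance);
        set_xen_guest_handle(domctl.u.vnuma.vnuma_memblks, vmemblks);
        set_xen_guest_handle(domctl.u.vnuma.vcpu_to_vnode, vcpu_to_vnode);
        set_xen_guest_handle(domctl.u.vnuma.vnode_to_pnode, vnode_to_pnode);

        domctl.cmd = XEN_DOMCTL_setvnumainfo;
        domctl.domain = (domid_t)domid;
        domctl.u.vnuma.nr_vnodes = nr_vnodes;

        rc = do_domctl(xch, &domctl);

    out:
        xc_hypercall_bounce_post(xch, vdistance);
        xc_hypercall_bounce_post(xch, vmemblks);
        xc_hypercall_bounce_post(xch, vcpu_to_vnode);
        xc_hypercall_bounce_post(xch, vnode_to_pnode);
        return rc;
    }

The bounce macros allocate the hypercall-safe buffer and copy the caller's
data in xc_hypercall_bounce_pre(), then free it in xc_hypercall_bounce_post(),
so the explicit xc_hypercall_buffer_alloc()/memcpy()/free sequence goes away.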
On Tue, 2013-08-27 at 03:54 -0400, Elena Ufimtseva wrote:> Defines VM config options for vNUMA PV domain creation as follows: > vnodes - number of nodes and enables vnuma > vnumamem - vnuma nodes memory sizes > vnuma_distance - vnuma distance table (may be omitted) > vcpu_to_vnode - vcpu to vnode mask (may be omitted) > > sum of all numamem should be equal to memory option. > Number of vcpus should not be less that number of vnodes. > > VM config Examples:Please patch docs/ as necessary (e.g. the manpages) at the same time.> > memory = 16384 > vcpus = 8 > name = "rc" > vnodes = 8 > vnumamem = "2g, 2g, 2g, 2g, 2g, 2g, 2g, 2g" > vcpu_to_vnode ="5 6 7 4 3 2 1 0"xl cfg supports arrays, is there any reason not to use them? Hopefully (lib)xl will also implement some sort of sane default in the case where people don''t want to spell all this out? Is it actually useful to be able to arbitrarily map vcpus to nodes? I''d have thought dividing the vcpus among the nodes evenly would be sufficient for almost everyone. What happens if the total of vnumamem does not == memory? Would it be useful to be able to specify this as ratios? e.g. "1:1:1:1" etc? Or maybe we should simply extend the memory syntax to take a list and memory becomes the total? What happens if length(vnumamem) != vnodes? Likewise vcpu_to_vnode vs vcspus. How is maxmem handled/reconciled? Is there a vnumamaxmem? Likewise maxvcpus.> memory = 2048 > vcpus = 4 > name = "rc9" > vnodes = 2 > vnumamem = "1g, 1g" > vnuma_distance = "10 20, 10 20" > vcpu_to_vnode ="1, 3, 2, 0" > > Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> > --- > tools/libxl/libxl.c | 28 ++++++ > tools/libxl/libxl.h | 15 ++++ > tools/libxl/libxl_arch.h | 6 ++ > tools/libxl/libxl_dom.c | 115 ++++++++++++++++++++++-- > tools/libxl/libxl_internal.h | 3 + > tools/libxl/libxl_types.idl | 6 +- > tools/libxl/libxl_x86.c | 91 +++++++++++++++++++ > tools/libxl/xl_cmdimpl.c | 197 +++++++++++++++++++++++++++++++++++++++++- > 8 files changed, 454 insertions(+), 7 deletions(-) > > diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c > index 81785df..cd25474 100644 > --- a/tools/libxl/libxl.c > +++ b/tools/libxl/libxl.c > @@ -4293,6 +4293,34 @@ static int libxl__set_vcpuonline_qmp(libxl__gc *gc, uint32_t domid, > } > return 0; > } > +#ifdef LIBXL_HAVE_BUILDINFO_VNUMAlibxl itself doesn''t need to use the ifdef, just provide it for external callers.> +int libxl_domain_setvnodes(libxl_ctx *ctx, > + uint32_t domid, > + uint16_t nr_vnodes, > + uint16_t nr_vcpus, > + vnuma_memblk_t *vnuma_memblks, > + int *vdistance, > + int *vcpu_to_vnode, > + int *vnode_to_pnode) > +{ > + GC_INIT(ctx); > + int ret; > + ret = xc_domain_setvnodes(ctx->xch, domid, nr_vnodes, > + nr_vcpus, vnuma_memblks, > + vdistance, vcpu_to_vnode, > + vnode_to_pnode); > + GC_FREE; > + return ret; > +} > + > +int libxl_default_vcpu_to_vnuma(libxl_domain_build_info *info) > +{ > + int i; > + for(i = 0; i < info->max_vcpus; i++) > + info->vcpu_to_vnode[i] = i % info->nr_vnodes; > + return 0; > +} > +#endif > > int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap) > { > diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h > index be19bf5..a1a5e33 100644 > --- a/tools/libxl/libxl.h > +++ b/tools/libxl/libxl.h > @@ -706,6 +706,21 @@ void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr_vcpus); > void libxl_device_vtpm_list_free(libxl_device_vtpm*, int nr_vtpms); > void libxl_vtpminfo_list_free(libxl_vtpminfo *, int nr_vtpms); > > +/* vNUMA topology */ > + > +#ifdef 
LIBXL_HAVE_BUILDINFO_VNUMAUnneeded, but you do need to add the #define (which seems missing, how does this stuff get built?)> +#include <xen/vnuma.h>Includes should go at the top unless there is a good reason otherwise. However we try and avoid exposing Xen interfaces in the libxl interface. This means you need to define a libxl equivalent, which should be done via the libxl IDL.> +int libxl_domain_setvnodes(libxl_ctx *ctx, > + uint32_t domid, > + uint16_t nr_vnodes, > + uint16_t nr_vcpus, > + vnuma_memblk_t *vnuma_memblks, > + int *vdistance, > + int *vcpu_to_vnode, > + int *vnode_to_pnode); > + > +int libxl_default_vcpu_to_vnuma(libxl_domain_build_info *info); > +#endif > /* > * Devices > * ======> diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h > index abe6685..76c1975 100644 > --- a/tools/libxl/libxl_arch.h > +++ b/tools/libxl/libxl_arch.h > @@ -18,5 +18,11 @@ > /* arch specific internal domain creation function */ > int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, > uint32_t domid); > +int libxl_vnuma_align_mem(libxl__gc *gc,libxl__foo (double underscores) for internal function please.> + uint32_t domid, > + struct libxl_domain_build_info *b_info, > + vnuma_memblk_t *memblks); /* linux specific memory blocks: out */Why/how is this Linux specific? This is a hypercall parameter, isn''t it?> > + > +unsigned long e820_memory_hole_size(unsigned long start, unsigned long end, struct e820entry e820[], int nr); > #endif > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > index 6e2252a..8bbbd18 100644 > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -200,6 +200,63 @@ static int numa_place_domain(libxl__gc *gc, uint32_t domid, > libxl_cpupoolinfo_dispose(&cpupool_info); > return rc; > } > +#define set_all_vnodes(n) for(i=0; i< info->nr_vnodes; i++) \ > + info->vnode_to_pnode[i] = n > + > +int libxl_init_vnodemap(libxl__gc *gc, uint32_t domid,Double underscore please.> + libxl_domain_build_info *info) > +{ > + int i, n, start, nr_nodes; > + uint64_t *mems; > + unsigned long long claim[16];Where does 16 come from?> + libxl_numainfo *ninfo = NULL; > + > + if (info->vnode_to_pnode == NULL) > + info->vnode_to_pnode = calloc(info->nr_vnodes, sizeof(*info->vnode_to_pnode)); > + > + set_all_vnodes(NUMA_NO_NODE); > + mems = info->vnuma_memszs; > + ninfo = libxl_get_numainfo(CTX, &nr_nodes); > + if (ninfo == NULL) { > + LOG(INFO, "No HW NUMA found\n"); > + return -EINVAL; > + } > + /* lets check if all vnodes will fit in one node */ > + for(n = 0; n < nr_nodes; n++) { > + if(ninfo[n].free/1024 >= info->max_memkb) { > + /* all fit on one node, fill the mask */ > + set_all_vnodes(n); > + LOG(INFO, "Setting all vnodes to node %d, free = %lu, need =%lu Kb\n", n, ninfo[n].free/1024, info->max_memkb); > + return 0; > + } > + } > + /* TODO: change algorithm. The current just fits the nodes > + * Will be nice to have them also sorted by size */ > + /* If no p-node found, will be set to NUMA_NO_NODE and allocation will fail */ > + LOG(INFO, "Found %d physical NUMA nodes\n", nr_nodes); > + memset(claim, 0, sizeof(*claim) * 16); > + start = 0; > + for ( n = 0; n < nr_nodes; n++ ) > + {If nr_nodes > 16 this will overflow claim[n].> + for ( i = start; i < info->nr_vnodes; i++ ) > + { > + LOG(INFO, "Compare %Lx for vnode[%d] size %lx with free space on pnode[%d], free %lx\n", > + claim[n] + mems[i], i, mems[i], n, ninfo[n].free);These should be at best LOG(DEBUG, ...). Perhaps a LOG_(INFO, ...) 
summary at the end would be suitable?> + if ( ((claim[n] + mems[i]) <= ninfo[n].free) && (info->vnode_to_pnode[i] == NUMA_NO_NODE) ) > + { > + info->vnode_to_pnode[i] = n; > + LOG(INFO, "Set vnode[%d] to pnode [%d]\n", i, n); > + claim[n] += mems[i]; > + } > + else { > + /* Will have another chance at other pnode */ > + start = i; > + continue; > + } > + } > + } > + return 0; > +} > > int libxl__build_pre(libxl__gc *gc, uint32_t domid, > libxl_domain_config *d_config, libxl__domain_build_state *state) > @@ -232,9 +289,36 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, > if (rc) > return rc; > } > +#ifdef LIBXL_HAVE_BUILDINFO_VNUMANot needed.> + if (info->nr_vnodes <= info->max_vcpus && info->nr_vnodes != 0) { > + vnuma_memblk_t *memblks = libxl__calloc(gc, info->nr_vnodes, sizeof(*memblks)); > + libxl_vnuma_align_mem(gc, domid, info, memblks); > + if (libxl_init_vnodemap(gc, domid, info) != 0) { > + LOG(INFO, "Failed to call init_vnodemap\n"); > + rc = libxl_domain_setvnodes(ctx, domid, info->nr_vnodes, > + info->max_vcpus, memblks, > + info->vdistance, info->vcpu_to_vnode, > + NULL); > + } > + else > + rc = libxl_domain_setvnodes(ctx, domid, info->nr_vnodes, > + info->max_vcpus, memblks, > + info->vdistance, info->vcpu_to_vnode, > + info->vnode_to_pnode); > + if (rc < 0 ) LOG(INFO, "Failed to call xc_domain_setvnodes\n"); > + for(int i=0; i<info->nr_vnodes; i++) > + LOG(INFO, "Mapping vnode %d to pnode %d\n", i, info->vnode_to_pnode[i]); > + libxl_bitmap_set_none(&info->nodemap); > + libxl_bitmap_set(&info->nodemap, 0); > + } > + else { > + LOG(INFO, "NOT Calling vNUMA construct with nr_nodes = %d\n", info->nr_vnodes); > + info->nr_vnodes = 0; > + } > +#endif > libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); > libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); > - > + > xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); > xs_domid = xs_read(ctx->xsh, XBT_NULL, "/tool/xenstored/domid", NULL); > state->store_domid = xs_domid ? 
atoi(xs_domid) : 0; > @@ -368,7 +452,20 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, > } > } > } > - > +#ifdef LIBXL_HAVE_BUILDINFO_VNUMAand again.> + if (info->nr_vnodes != 0 && info->vnuma_memszs != NULL && info->vnode_to_pnode != NULL) { > + dom->nr_vnodes = info->nr_vnodes; > + dom->vnumablocks = malloc(info->nr_vnodes * sizeof(*dom->vnumablocks)); > + dom->vnode_to_pnode = (int *)malloc(info->nr_vnodes * sizeof(*info->vnode_to_pnode)); > + dom->vmemsizes = malloc(info->nr_vnodes * sizeof(*info->vnuma_memszs)); > + if (dom->vmemsizes == NULL || dom->vnode_to_pnode == NULL) { > + LOGE(ERROR, "%s:Failed to allocate memory for memory sizes.\n",__FUNCTION__);I thought LOG* already included file/function stuff.> + goto out; > + } > + memcpy(dom->vmemsizes, info->vnuma_memszs, sizeof(*info->vnuma_memszs) * info->nr_vnodes); > + memcpy(dom->vnode_to_pnode, info->vnode_to_pnode, sizeof(*info->vnode_to_pnode) * info->nr_vnodes); > + } > +#endif > dom->flags = flags; > dom->console_evtchn = state->console_port; > dom->console_domid = state->console_domid; > @@ -388,9 +485,17 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid, > LOGE(ERROR, "xc_dom_mem_init failed"); > goto out; > } > - if ( (ret = xc_dom_boot_mem_init(dom)) != 0 ) { > - LOGE(ERROR, "xc_dom_boot_mem_init failed"); > - goto out; > + if (info->nr_vnodes != 0 && info->vnuma_memszs != NULL) { > + if ( (ret = xc_dom_boot_mem_init(dom)) != 0 ) { > + LOGE(ERROR, "xc_dom_boot_mem_init_node failed");No _node on the actual call here, I can''t see how it differes from the following call in fact.> + goto out; > + } > + } > + else { > + if ( (ret = xc_dom_boot_mem_init(dom)) != 0 ) { > + LOGE(ERROR, "xc_dom_boot_mem_init failed"); > + goto out; > + } > } > if ( (ret = xc_dom_build_image(dom)) != 0 ) { > LOGE(ERROR, "xc_dom_build_image failed"); > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > index f051d91..4a501c4 100644 > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -2709,6 +2709,7 @@ static inline void libxl__ctx_unlock(libxl_ctx *ctx) { > #define CTX_LOCK (libxl__ctx_lock(CTX)) > #define CTX_UNLOCK (libxl__ctx_unlock(CTX)) > > +#define NUMA_NO_NODE 0xFF256 nodes isn''t completely implausible. Looks like nr_vnodes is a uint16_t so 0xffff or ~((uint16_t)0) would be better I think.> /* > * Automatic NUMA placement > * > @@ -2832,6 +2833,8 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc, > libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); > } > > +int libxl_init_vnodemap(libxl__gc *gc, uint32_t domid, > + libxl_domain_build_info *info); > /* > * Inserts "elm_new" into the sorted list "head". 
> * > diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl > index 85341a0..c3a4d95 100644 > --- a/tools/libxl/libxl_types.idl > +++ b/tools/libxl/libxl_types.idl > @@ -208,6 +208,7 @@ libxl_dominfo = Struct("dominfo",[ > ("vcpu_max_id", uint32), > ("vcpu_online", uint32), > ("cpupool", uint32), > + ("nr_vnodes", uint16), > ], dir=DIR_OUT) > > libxl_cpupoolinfo = Struct("cpupoolinfo", [ > @@ -279,7 +280,10 @@ libxl_domain_build_info = Struct("domain_build_info",[ > ("disable_migrate", libxl_defbool), > ("cpuid", libxl_cpuid_policy_list), > ("blkdev_start", string), > - > + ("vnuma_memszs", Array(uint64, "nr_vnodes")), > + ("vcpu_to_vnode", Array(integer, "nr_vnodemap")), > + ("vdistance", Array(integer, "nr_vdist")), > + ("vnode_to_pnode", Array(integer, "nr_vnode_to_pnode")), > ("device_model_version", libxl_device_model_version), > ("device_model_stubdomain", libxl_defbool), > # if you set device_model you must set device_model_version too > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c > index a78c91d..35da3a8 100644 > --- a/tools/libxl/libxl_x86.c > +++ b/tools/libxl/libxl_x86.c > @@ -308,3 +308,94 @@ int libxl__arch_domain_create(libxl__gc *gc, libxl_domain_config *d_config, > > return ret; > } > + > +unsigned long e820_memory_hole_size(unsigned long start, unsigned long end, struct e820entry e820[], int nr) > +{ > +#define clamp(val, min, max) ({ \ > + typeof(val) __val = (val); \ > + typeof(min) __min = (min); \ > + typeof(max) __max = (max); \ > + (void) (&__val == &__min); \ > + (void) (&__val == &__max); \ > + __val = __val < __min ? __min: __val; \ > + __val > __max ? __max: __val; }) > + int i; > + unsigned long absent, start_pfn, end_pfn; > + absent = start - end; > + for(i = 0; i < nr; i++) { > + if(e820[i].type == E820_RAM) { > + start_pfn = clamp(e820[i].addr, start, end); > + end_pfn = clamp(e820[i].addr + e820[i].size, start, end); > + absent -= end_pfn - start_pfn; > + } > + } > + return absent; > +} > + > +/* Align memory blocks for linux NUMA build image */ > +int libxl_vnuma_align_mem(libxl__gc *gc, > + uint32_t domid, > + libxl_domain_build_info *b_info, > + vnuma_memblk_t *memblks) /* linux specific memory blocks: out */ > +{ > +#ifndef roundup > +#define roundup(x, y) ((((x) + ((y) - 1)) / (y)) * (y)) > +#endif > + /* > + This function transforms mem block sizes in bytes > + into aligned PV Linux guest NUMA nodes. 
> + XEN will provide this memory layout to PV Linux guest upon boot for > + PV Linux guests.You say PV Linux guest three times here but I don''t think any of this is specific to PV Linux as opposed to PV guests generally (whether or not Linux is the only current implementation of this interface doesn''t really matter)> + */ > + int i, rc; > + unsigned long shift = 0, size, node_min_size = 1, limit; > + unsigned long end_max; > + uint32_t nr; > + struct e820entry map[E820MAX]; > + > + libxl_ctx *ctx = libxl__gc_owner(gc); > + rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX); > + if (rc < 0) { > + errno = rc; > + return -EINVAL; > + } > + nr = rc; > + rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb, > + (b_info->max_memkb - b_info->target_memkb) + > + b_info->u.pv.slack_memkb); > + if (rc) > + return ERROR_FAIL; > + > + end_max = map[nr-1].addr + map[nr-1].size; > + > + shift = 0; > + for(i = 0; i < b_info->nr_vnodes; i++) { > + printf("block [%d] start inside align = %#lx\n", i, b_info->vnuma_memszs[i]);No printf in libxl please.> + } > + memset(memblks, 0, sizeof(*memblks)*b_info->nr_vnodes); > + memblks[0].start = 0; > + for(i = 0; i < b_info->nr_vnodes; i++) { > + memblks[i].start += shift; > + memblks[i].end += shift + b_info->vnuma_memszs[i]; > + limit = size = memblks[i].end - memblks[i].start; > + while (memblks[i].end - memblks[i].start - e820_memory_hole_size(memblks[i].start, memblks[i].end, map, nr) < size) {Please see if you can shorten this line.> + memblks[i].end += node_min_size; > + shift += node_min_size; > + if (memblks[i].end - memblks[i].start >= limit) { > + memblks[i].end = memblks[i].start + limit; > + break; > + } > + if (memblks[i].end == end_max) { > + memblks[i].end = end_max; > + break; > + } > + } > + shift = memblks[i].end; > + memblks[i].start = roundup(memblks[i].start, 4*1024); > + > + printf("start = %#010lx, end = %#010lx\n", memblks[i].start, memblks[i].end); > + } > + if(memblks[i-1].end > end_max) > + memblks[i-1].end = end_max; > + return 0; > +} > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > index 884f050..36a8275 100644 > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -539,7 +539,121 @@ vcpp_out: > > return rc; > } > +#ifdef LIBXL_HAVE_BUILDINFO_VNUMAThis isn''t strictly needed in xl either. Although some people are keen to have xl build against newer and older libxl in order to test the compatibility guarentees made by the library.> +static int vdistance_parse(char *vdistcfg, int *vdistance, int nr_vnodes) > +{Please can you use some line breaks to separate logical paragraphs and make things more readable. e.g. 
after the local variable declaration and between related blocks of code.> + char *endptr, *toka, *tokb, *saveptra = NULL, *saveptrb = NULL; > + int *vdist_tmp = NULL; > + int rc = 0; > + int i, j, dist, parsed = 0; > + rc = -EINVAL;Here you have: int rc = 0; rc = -EINVAL One of them is redundant.> + if(vdistance == NULL) { > + return rc; > + } > + vdist_tmp = (int *)malloc(nr_vnodes * nr_vnodes * sizeof(*vdistance)); > + if (vdist_tmp == NULL) > + return rc; > + i =0; j = 0; > + for (toka = strtok_r(vdistcfg, ",", &saveptra); toka; > + toka = strtok_r(NULL, ",", &saveptra)) { > + if ( i >= nr_vnodes ) > + goto vdist_parse_err; > + for (tokb = strtok_r(toka, " ", &saveptrb); tokb; > + tokb = strtok_r(NULL, " ", &saveptrb)) { > + if (j >= nr_vnodes) > + goto vdist_parse_err; > + dist = strtol(tokb, &endptr, 10); > + if (tokb == endptr) > + goto vdist_parse_err; > + *(vdist_tmp + j*nr_vnodes + i) = dist; > + parsed++; > + j++; > + } > + i++; > + j = 0;This would all be easier if it was an xlcfg list.> + } > + rc = parsed; > + memcpy(vdistance, vdist_tmp, nr_vnodes * nr_vnodes * sizeof(*vdistance)); > +vdist_parse_err: > + if (vdist_tmp !=NULL ) free(vdist_tmp); > + return rc; > +} > > +static int vcputovnode_parse(char *cfg, int *vmap, int nr_vnodes, int nr_vcpus) > +{ > + char *toka, *endptr, *saveptra = NULL; > + int *vmap_tmp = NULL; > + int rc = 0; > + int i; > + rc = -EINVAL; > + i = 0; > + if(vmap == NULL) { > + return rc; > + } > + vmap_tmp = (int *)malloc(sizeof(*vmap) * nr_vcpus); > + memset(vmap_tmp, 0, sizeof(*vmap) * nr_vcpus); > + for (toka = strtok_r(cfg, " ", &saveptra); toka; > + toka = strtok_r(NULL, " ", &saveptra)) { > + if (i >= nr_vcpus) goto vmap_parse_out; > + vmap_tmp[i] = strtoul(toka, &endptr, 10); > + if( endptr == toka) > + goto vmap_parse_out; > + fprintf(stderr, "Parsed vcpu_to_vnode[%d] = %d.\n", i, vmap_tmp[i]); > + i++; > + } > + memcpy(vmap, vmap_tmp, sizeof(*vmap) * nr_vcpus); > + rc = i; > +vmap_parse_out: > + if (vmap_tmp != NULL) free(vmap_tmp); > + return rc; > +} > + > +static int vnumamem_parse(char *vmemsizes, uint64_t *vmemregions, int nr_vnodes) > +{ > + uint64_t memsize; > + char *endptr, *toka, *saveptr = NULL; > + int rc = 0; > + int j; > + rc = -EINVAL; > + if(vmemregions == NULL) { > + goto vmem_parse_out; > + } > + memsize = 0; > + j = 0; > + for (toka = strtok_r(vmemsizes, ",", &saveptr); toka; > + toka = strtok_r(NULL, ",", &saveptr)) { > + if ( j >= nr_vnodes ) > + goto vmem_parse_out; > + memsize = strtoul(toka, &endptr, 10); > + if (endptr == toka) > + goto vmem_parse_out; > + switch (*endptr) { > + case ''G'': > + case ''g'': > + memsize = memsize * 1024 * 1024 * 1024; > + break; > + case ''M'': > + case ''m'': > + memsize = memsize * 1024 * 1024; > + break; > + case ''K'': > + case ''k'': > + memsize = memsize * 1024 ; > + break; > + default: > + continue; > + break; > + } > + if (memsize > 0) { > + vmemregions[j] = memsize; > + j++; > + } > + } > + rc = j; > +vmem_parse_out: > + return rc; > +} > +#endif > static void parse_config_data(const char *config_source, > const char *config_data, > int config_len, > @@ -871,7 +985,13 @@ static void parse_config_data(const char *config_source, > { > char *cmdline = NULL; > const char *root = NULL, *extra = ""; > - > +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA > + const char *vnumamemcfg = NULL; > + int nr_vnuma_regions; > + long unsigned int vnuma_memparsed = 0; > + const char *vmapcfg = NULL; > + const char *vdistcfg = NULL; > +#endif > xlu_cfg_replace_string (config, "kernel", &b_info->u.pv.kernel, 
0); > > xlu_cfg_get_string (config, "root", &root, 0); > @@ -888,7 +1008,82 @@ static void parse_config_data(const char *config_source, > fprintf(stderr, "Failed to allocate memory for cmdline\n"); > exit(1); > } > +#ifdef LIBXL_HAVE_BUILDINFO_VNUMA > + if (!xlu_cfg_get_long (config, "vnodes", &l, 0)) { > + b_info->nr_vnodes = l; > + if (b_info->nr_vnodes <= 0) > + exit(1); > + if(!xlu_cfg_get_string (config, "vnumamem", &vnumamemcfg, 0)) { > + b_info->vnuma_memszs = calloc(b_info->nr_vnodes, > + sizeof(*b_info->vnuma_memszs)); > + if (b_info->vnuma_memszs == NULL) { > + fprintf(stderr, "WARNING: Could not allocate vNUMA node memory sizes.\n"); > + exit(1); > + } > + char *buf2 = strdup(vnumamemcfg); > + nr_vnuma_regions = vnumamem_parse(buf2, b_info->vnuma_memszs, > + b_info->nr_vnodes); > + for(i = 0; i < b_info->nr_vnodes; i++) > + vnuma_memparsed = vnuma_memparsed + (b_info->vnuma_memszs[i] >> 10); > + > + if(vnuma_memparsed != b_info->max_memkb || > + nr_vnuma_regions != b_info->nr_vnodes ) > + { > + fprintf(stderr, "WARNING: Incorrect vNUMA config. Parsed memory = %lu, parsed nodes = %d, max = %lx\n", > + vnuma_memparsed, nr_vnuma_regions, b_info->max_memkb); > + if(buf2) free(buf2); > + exit(1); > + } > + if (buf2) free(buf2); > + } > + else > + b_info->nr_vnodes=0; > + if(!xlu_cfg_get_string(config, "vnuma_distance", &vdistcfg, 0)) { > + b_info->vdistance = (int *)calloc(b_info->nr_vnodes * b_info->nr_vnodes, > + sizeof(*b_info->vdistance)); > + if (b_info->vdistance == NULL) > + exit(1); > + char *buf2 = strdup(vdistcfg); > + if(vdistance_parse(buf2, b_info->vdistance, b_info->nr_vnodes) != b_info->nr_vnodes * b_info->nr_vnodes) { > + if (buf2) free(buf2); > + free(b_info->vdistance); > + exit(1); > + } > + if(buf2) free(buf2); > + } > + else > + { > + /* default distance */ > + b_info->vdistance = (int *)calloc(b_info->nr_vnodes * b_info->nr_vnodes, sizeof(*b_info->vdistance)); > + if (b_info->vdistance == NULL) > + exit(1); > + for(i = 0; i < b_info->nr_vnodes; i++) > + for(int j = 0; j < b_info->nr_vnodes; j++) > + *(b_info->vdistance + j*b_info->nr_vnodes + i) = (i == j ? 10 : 20); > > + } > + if(!xlu_cfg_get_string(config, "vcpu_to_vnode", &vmapcfg, 0)) > + { > + b_info->vcpu_to_vnode = (int *)calloc(b_info->max_vcpus, sizeof(*b_info->vcpu_to_vnode)); > + if (b_info->vcpu_to_vnode == NULL) > + exit(-1); > + char *buf2 = strdup(vmapcfg); > + if (vcputovnode_parse(buf2, b_info->vcpu_to_vnode, b_info->nr_vnodes, b_info->max_vcpus) < 0) { > + if (buf2) free(buf2); > + fprintf(stderr, "Error parsing vcpu to vnode mask\n"); > + exit(1); > + } > + if(buf2) free(buf2); > + } > + else > + { > + b_info->vcpu_to_vnode = (int *)calloc(b_info->max_vcpus, sizeof(*b_info->vcpu_to_vnode)); > + if (b_info->vcpu_to_vnode != NULL) > + libxl_default_vcpu_to_vnuma(b_info); > + } > + } > +#endif > + > xlu_cfg_replace_string (config, "bootloader", &b_info->u.pv.bootloader, 0); > switch (xlu_cfg_get_list_as_string_list(config, "bootloader_args", > &b_info->u.pv.bootloader_args, 1))
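
On the parsing side, using xl config lists as suggested would simplify both
the config syntax and the code. A possible fragment (the vnumamem and
vcpu_to_vnode names and b_info fields are the ones from this patch, not an
agreed-upon syntax):

    vnodes        = 2
    vnumamem      = [ "1g", "1g" ]
    vcpu_to_vnode = [ "1", "3", "2", "0" ]

and a rough sketch of the corresponding fragment in parse_config_data(),
using the existing libxlu list accessors, with error handling abbreviated:

    XLU_ConfigList *vnumamem_list;
    int nr_regions, i;

    if (!xlu_cfg_get_list(config, "vnumamem", &vnumamem_list, &nr_regions, 0)) {
        if (nr_regions != b_info->nr_vnodes) {
            fprintf(stderr, "vnumamem has %d entries, expected %d vnodes\n",
                    nr_regions, b_info->nr_vnodes);
            exit(1);
        }
        b_info->vnuma_memszs = calloc(nr_regions, sizeof(*b_info->vnuma_memszs));
        for (i = 0; i < nr_regions; i++) {
            const char *item = xlu_cfg_get_listitem(vnumamem_list, i);
            char *endptr;
            uint64_t sz = strtoull(item, &endptr, 10);

            /* Accept the same g/m/k suffixes as the string-based parser. */
            switch (*endptr) {
            case 'G': case 'g': sz <<= 30; break;
            case 'M': case 'm': sz <<= 20; break;
            case 'K': case 'k': sz <<= 10; break;
            case '\0': break;               /* plain bytes */
            default:
                fprintf(stderr, "bad size suffix in vnumamem[%d]: %s\n", i, item);
                exit(1);
            }
            b_info->vnuma_memszs[i] = sz;
        }
    }

This removes the strtok_r()-based splitting entirely and gives length checking
against nr_vnodes for free; the same pattern would work for vcpu_to_vnode and
vnuma_distance.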
On Tue, 2013-08-27 at 03:54 -0400, Elena Ufimtseva wrote:
> Enables libxl vnuma ABI by LIBXL_HAVE_BUILDINFO_VNUMA.

This should be in the patch which introduces the libxl interface and,
as I mentioned, it doesn't need to be ifdef'd in the library.

A more natural way to structure this series would be to add the libxl
stuff in one patch and the xl stuff in a second.

>
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
>  tools/libxl/libxl.h |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> index a1a5e33..ad0d0d8 100644
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -90,6 +90,14 @@
>  #define LIBXL_HAVE_BUILDINFO_HVM_VENDOR_DEVICE 1
>
>  /*
> + * LIBXL_HAVE_BUILDINFO_VNUMA indicates that vnuma topology will be
> + * build for the guest upon request and with VM configuration.
> + * It will try to define best allocation for vNUMA
> + * nodes on real NUMA nodes.
> + */
> +#define LIBXL_HAVE_BUILDINFO_VNUMA 1
> +
> +/*
>  * libxl ABI compatibility
>  *
>  * The only guarantee which libxl makes regarding ABI compatibility
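
For an out-of-tree libxl consumer the guard then works the usual way, while
libxl and xl themselves can rely on the declarations unconditionally. A
hypothetical caller would feature-test roughly like this:

    #include <libxl.h>

    #ifdef LIBXL_HAVE_BUILDINFO_VNUMA
        /* new enough libxl: populate b_info->vnuma_memszs, vdistance,
         * vcpu_to_vnode and vnode_to_pnode before creating the domain */
    #else
        /* older libxl: no vNUMA fields in libxl_domain_build_info */
    #endif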
On Tue, Aug 27, 2013 at 8:54 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
> This series of patches introduces vNUMA topology implementation and
> provides interfaces and data structures, exposing to PV guest virtual topology
> and enabling guest OS to use its own NUMA placement mechanisms.
>
> vNUMA topology support for Linux PV guest comes in a separate patch.
>
> Please review and send your comments.

Elena, do you have a public git tree to pull from, for the lazy?

 -George
George Dunlap
2013-Aug-27 14:06 UTC
Re: [PATCH RFC 1/7] xen/vnuma: subop hypercall and vnuma topology structures.
On Tue, Aug 27, 2013 at 9:53 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 27.08.13 at 09:54, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>> Defines XENMEM subop hypercall for PV vNUMA enabled guests and provides
>> vNUMA topology information from per-domain vnuma topology build info.
>> TODO:
>> subop XENMEM hypercall is subject to change to sysctl subop.
>
> That would mean it's intended to be used by the tool stack only. I
> thought that the balloon driver (and perhaps other code) are also
> intended to be consumers.

Can Elena take it from your detailed review that you're OK with the
general approach here?

 -George
Jan Beulich
2013-Aug-27 14:14 UTC
Re: [PATCH RFC 1/7] xen/vnuma: subop hypercall and vnuma topology structures.
>>> On 27.08.13 at 16:06, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Tue, Aug 27, 2013 at 9:53 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 27.08.13 at 09:54, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>>> Defines XENMEM subop hypercall for PV vNUMA enabled guests and provides
>>> vNUMA topology information from per-domain vnuma topology build info.
>>> TODO:
>>> subop XENMEM hypercall is subject to change to sysctl subop.
>>
>> That would mean it's intended to be used by the tool stack only. I
>> thought that the balloon driver (and perhaps other code) are also
>> intended to be consumers.
>
> Can Elena take it from your detailed review that you're OK with the
> general approach here?

Not yet - as said in the middle of both review replies, the enormous
amount of formal issues in the patches makes it very hard to review
them, and hence I gave up at those points. Thus I can only say that it
looks okay at a first glance.

Jan
Matt Wilson
2013-Aug-28 16:42 UTC
Re: [PATCH RFC 1/7] xen/vnuma: subop hypercall and vnuma topology structures.
On Tue, Aug 27, 2013 at 03:54:20AM -0400, Elena Ufimtseva wrote:
> Defines XENMEM subop hypercall for PV vNUMA enabled guests and provides
> vNUMA topology information from per-domain vnuma topology build info.
> TODO:
> subop XENMEM hypercall is subject to change to sysctl subop.
>
> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>

[...]

> diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
> index a057069..3d39218 100644
> --- a/xen/include/xen/domain.h
> +++ b/xen/include/xen/domain.h
> @@ -4,6 +4,7 @@
>
>  #include <public/xen.h>
>  #include <asm/domain.h>
> +#include <public/vnuma.h>
>
>  typedef union {
>      struct vcpu_guest_context *nat;
> @@ -89,4 +90,12 @@ extern unsigned int xen_processor_pmbits;
>
>  extern bool_t opt_dom0_vcpus_pin;
>
> +struct domain_vnuma_info {
> +    uint16_t nr_vnodes;
> +    int *vdistance;
> +    vnuma_memblk_t *vnuma_memblks;
> +    int *vcpu_to_vnode;
> +    int *vnode_to_pnode;

What's the purpose of providing a vNode to pNode mapping to the guest?
Or am I misunderstanding that this wouldn't be provided to the guest?

> +};
> +
>  #endif /* __XEN_DOMAIN_H__ */

--msw
Elena Ufimtseva
2013-Aug-28 17:01 UTC
Re: [PATCH RFC 1/7] xen/vnuma: subop hypercall and vnuma topology structures.
Matt

This is for Xen only, to set the vNUMA topology. The guest will not have
this, but will have some other interface to retrieve this info (as we
were discussing in NUMA-aware ballooning).

Elena

On Wed, Aug 28, 2013 at 12:42 PM, Matt Wilson <msw@amazon.com> wrote:
> On Tue, Aug 27, 2013 at 03:54:20AM -0400, Elena Ufimtseva wrote:
>> Defines XENMEM subop hypercall for PV vNUMA enabled guests and provides
>> vNUMA topology information from per-domain vnuma topology build info.
>> TODO:
>> subop XENMEM hypercall is subject to change to sysctl subop.
>>
>> Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com>
>
> [...]
>
>> diff --git a/xen/include/xen/domain.h b/xen/include/xen/domain.h
>> index a057069..3d39218 100644
>> --- a/xen/include/xen/domain.h
>> +++ b/xen/include/xen/domain.h
>> @@ -4,6 +4,7 @@
>>
>>  #include <public/xen.h>
>>  #include <asm/domain.h>
>> +#include <public/vnuma.h>
>>
>>  typedef union {
>>      struct vcpu_guest_context *nat;
>> @@ -89,4 +90,12 @@ extern unsigned int xen_processor_pmbits;
>>
>>  extern bool_t opt_dom0_vcpus_pin;
>>
>> +struct domain_vnuma_info {
>> +    uint16_t nr_vnodes;
>> +    int *vdistance;
>> +    vnuma_memblk_t *vnuma_memblks;
>> +    int *vcpu_to_vnode;
>> +    int *vnode_to_pnode;
>
> What's the purpose of providing a vNode to pNode mapping to the guest?
> Or am I misunderstanding that this wouldn't be provided to the guest?
>
>> +};
>> +
>>  #endif /* __XEN_DOMAIN_H__ */
>
> --msw

--
Elena
Hi George, all

I will have it ready along with the next version of the patches.

Elena

On Tue, Aug 27, 2013 at 9:44 AM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> On Tue, Aug 27, 2013 at 8:54 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>> This series of patches introduces vNUMA topology implementation and
>> provides interfaces and data structures, exposing to PV guest virtual topology
>> and enabling guest OS to use its own NUMA placement mechanisms.
>>
>> vNUMA topology support for Linux PV guest comes in a separate patch.
>>
>> Please review and send your comments.
>
> Elena, do you have a public git tree to pull from, for the lazy?
>
> -George

--
Elena