Hello Everyone,

This is the second take of automatic NUMA placement for Xen 4.2, for the sake of feature parity with xm/xend. All the comments v1 received have been addressed; in fact, some bits of the patchset have been deeply restructured to achieve this. Details are in the individual changelogs.

Here is what the series ships:

 1/10 libxl: fix a typo in the GCREALLOC_ARRAY macro

is just a build fix (already posted separately).

 2/10 libxl: add a new Array type to the IDL
 3/10 libxl,libxc: introduce libxl_get_numainfo()
 4/10 xl: add more NUMA information to `xl info -n'
 5/10 libxl: rename libxl_cpumap to libxl_bitmap
 6/10 libxl: expand the libxl_bitmap API a bit
 7/10 libxl: introduce some node map helpers

introduce the data structures, calls and infrastructure needed for retrieving all the information about the NUMA-ness of the system, and for dealing with it at the toolstack level.

 8/10 libxl: enable automatic placement of guests on NUMA nodes
 9/10 libxl: have NUMA placement deal with cpupools

are the actual, well, "food" (no, I'm not calling it "the meat", as I'm vegetarian! :-D).

10/10 Some automatic NUMA placement documentation

puts some more details about the implementation and the usage of the new feature directly in the tree.

Thanks a lot and Regards,
Dario
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro
The typo was causing a build failure when trying to use the macro:

xxx: error: expected ';' before ')' token
xxx: error: expected statement before ')' token

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1972,7 +1972,7 @@ struct libxl__domain_create_state {
 #define GCREALLOC_ARRAY(var, nmemb) \
     (assert(nmemb > 0), \
      assert(ARRAY_SIZE_OK((var), (nmemb))), \
-     (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var)))))
+     (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var))))
 
 /*
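For completeness, here is a minimal sketch (not part of the patch, with a made-up function name) of how the macro is meant to be used once it builds again; as with the other GC* helpers, a libxl__gc called `gc' must be in scope:

/* libxl-internal style sketch; grow_example() is purely illustrative */
static void grow_example(libxl__gc *gc)
{
    int *vals;
    int n = 4;

    GCNEW_ARRAY(vals, n);          /* gc-tracked allocation of 4 ints */

    /* later on, make room for 4 more elements: with the typo fixed,
     * this expands to a well-formed comma expression */
    GCREALLOC_ARRAY(vals, n + 4);
    n += 4;
    vals[n - 1] = 0;
}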
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 02 of 10 v2] libxl: add a new Array type to the IDL
And make all the required infrastructure updates to enable this. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxl/gentest.py b/tools/libxl/gentest.py --- a/tools/libxl/gentest.py +++ b/tools/libxl/gentest.py @@ -27,6 +27,18 @@ def gen_rand_init(ty, v, indent = " " s = "" if isinstance(ty, idl.Enumeration): s += "%s = %s;\n" % (ty.pass_arg(v, parent is None), randomize_enum(ty)) + elif isinstance(ty, idl.Array): + if parent is None: + raise Exception("Array type must have a parent") + s += "%s = rand()%%8;\n" % (parent + ty.lenvar.name) + s += "%s = calloc(%s, sizeof(*%s));\n" % \ + (v, parent + ty.lenvar.name, v) + s += "{\n" + s += " int i;\n" + s += " for (i=0; i<%s; i++)\n" % (parent + ty.lenvar.name) + s += gen_rand_init(ty.elem_type, v+"[i]", + indent + " ", parent) + s += "}\n" elif isinstance(ty, idl.KeyedUnion): if parent is None: raise Exception("KeyedUnion type must have a parent") diff --git a/tools/libxl/gentypes.py b/tools/libxl/gentypes.py --- a/tools/libxl/gentypes.py +++ b/tools/libxl/gentypes.py @@ -11,8 +11,12 @@ def libxl_C_instance_of(ty, instancename return libxl_C_type_define(ty) else: return libxl_C_type_define(ty) + " " + instancename - else: - return ty.typename + " " + instancename + + s = "" + if isinstance(ty, idl.Array): + s += libxl_C_instance_of(ty.lenvar.type, ty.lenvar.name) + ";\n" + + return s + ty.typename + " " + instancename def libxl_C_type_define(ty, indent = ""): s = "" @@ -66,6 +70,21 @@ def libxl_C_type_dispose(ty, v, indent s += libxl_C_type_dispose(f.type, fexpr, indent + " ", nparent) s += " break;\n" s += "}\n" + elif isinstance(ty, idl.Array): + if parent is None: + raise Exception("Array type must have a parent") + if ty.elem_type.dispose_fn is not None: + s += "{\n" + s += " int i;\n" + s += " for (i=0; i<%s; i++)\n" % (parent + ty.lenvar.name) + s += libxl_C_type_dispose(ty.elem_type, v+"[i]", + indent + " ", parent) + if ty.dispose_fn is not None: + if ty.elem_type.dispose_fn is not None: + s += " " + s += "%s(%s);\n" % (ty.dispose_fn, ty.pass_arg(v, parent is None)) + if ty.elem_type.dispose_fn is not None: + s += "}\n" elif isinstance(ty, idl.Struct) and (parent is None or ty.dispose_fn is None): for f in [f for f in ty.fields if not f.const]: (nparent,fexpr) = ty.member(v, f, parent is None) @@ -164,7 +183,24 @@ def libxl_C_type_gen_json(ty, v, indent s = "" if parent is None: s += "yajl_gen_status s;\n" - if isinstance(ty, idl.Enumeration): + + if isinstance(ty, idl.Array): + if parent is None: + raise Exception("Array type must have a parent") + s += "{\n" + s += " int i;\n" + s += " s = yajl_gen_array_open(hand);\n" + s += " if (s != yajl_gen_status_ok)\n" + s += " goto out;\n" + s += " for (i=0; i<%s; i++) {\n" % (parent + ty.lenvar.name) + s += libxl_C_type_gen_json(ty.elem_type, v+"[i]", + indent + " ", parent) + s += " }\n" + s += " s = yajl_gen_array_close(hand);\n" + s += " if (s != yajl_gen_status_ok)\n" + s += " goto out;\n" + s += "}\n" + elif isinstance(ty, idl.Enumeration): s += "s = libxl__yajl_gen_enum(hand, %s_to_string(%s));\n" % (ty.typename, ty.pass_arg(v, parent is None)) s += "if (s != yajl_gen_status_ok)\n" s += " goto out;\n" diff --git a/tools/libxl/idl.py b/tools/libxl/idl.py --- a/tools/libxl/idl.py +++ b/tools/libxl/idl.py @@ -266,6 +266,17 @@ string = Builtin("char *", namespace = N json_fn = "libxl__string_gen_json", autogenerate_json = False) +class Array(Type): + """An array of the same type""" + def __init__(self, 
elem_type, lenvar_name, **kwargs): + kwargs.setdefault(''dispose_fn'', ''free'') + Type.__init__(self, namespace=elem_type.namespace, typename=elem_type.rawname + " *", **kwargs) + + lv_kwargs = dict([(x.lstrip(''lenvar_''),y) for (x,y) in kwargs.items() if x.startswith(''lenvar_'')]) + + self.lenvar = Field(integer, lenvar_name, **lv_kwargs) + self.elem_type = elem_type + class OrderedDict(dict): """A dictionary which remembers insertion order. diff --git a/tools/libxl/idl.txt b/tools/libxl/idl.txt --- a/tools/libxl/idl.txt +++ b/tools/libxl/idl.txt @@ -145,12 +145,24 @@ idl.KeyedUnion A subclass of idl.Aggregate which represents the C union type where the currently valid member of the union can be determined based - upon another member in the containing type. + upon another member in the containing type. An idl.KeyedUnion must + always be a member of a containing idl.Aggregate type. - The KeyedUnion.keyvar contains an idl.type the member of the + The KeyedUnion.keyvar contains an idl.Type the member of the containing type which determines the valid member of the union. The must be an instance of the Enumeration type. +idl.Array + + A class representing an array or similar elements. An idl.Array must + always be an idl.Field of a containing idl.Aggregate. + + idl.Array.elem_type contains an idl.Type which is the type of all + elements of the array. + + idl.Array.len_var contains an idl.Field which is added to the parent + idl.Aggregate and will contain the length of the array. + Standard Types -------------- diff --git a/tools/ocaml/libs/xl/genwrap.py b/tools/ocaml/libs/xl/genwrap.py --- a/tools/ocaml/libs/xl/genwrap.py +++ b/tools/ocaml/libs/xl/genwrap.py @@ -55,7 +55,8 @@ def ocaml_type_of(ty): return "int%d" % ty.width else: return "int" - + elif isinstance(ty,idl.Array): + return "%s list" % ocaml_type_of(ty.elem_type) elif isinstance(ty,idl.Builtin): if not builtins.has_key(ty.typename): raise NotImplementedError("Unknown Builtin %s (%s)" % (ty.typename, type(ty))) diff --git a/tools/python/genwrap.py b/tools/python/genwrap.py --- a/tools/python/genwrap.py +++ b/tools/python/genwrap.py @@ -4,7 +4,7 @@ import sys,os import idl -(TYPE_DEFBOOL, TYPE_BOOL, TYPE_INT, TYPE_UINT, TYPE_STRING, TYPE_AGGREGATE) = range(6) +(TYPE_DEFBOOL, TYPE_BOOL, TYPE_INT, TYPE_UINT, TYPE_STRING, TYPE_ARRAY, TYPE_AGGREGATE) = range(7) def py_type(ty): if ty == idl.bool: @@ -18,6 +18,8 @@ def py_type(ty): return TYPE_INT else: return TYPE_UINT + if isinstance(ty, idl.Array): + return TYPE_ARRAY if isinstance(ty, idl.Aggregate): return TYPE_AGGREGATE if ty == idl.string: @@ -74,7 +76,7 @@ def py_attrib_get(ty, f): l.append('' return genwrap__ull_get(self->obj.%s);''%f.name) elif t == TYPE_STRING: l.append('' return genwrap__string_get(&self->obj.%s);''%f.name) - elif t == TYPE_AGGREGATE: + elif t == TYPE_AGGREGATE or t == TYPE_ARRAY: l.append('' PyErr_SetString(PyExc_NotImplementedError, "Getting %s");''%ty.typename) l.append('' return NULL;'') else: @@ -105,7 +107,7 @@ def py_attrib_set(ty, f): l.append('' return ret;'') elif t == TYPE_STRING: l.append('' return genwrap__string_set(v, &self->obj.%s);''%f.name) - elif t == TYPE_AGGREGATE: + elif t == TYPE_AGGREGATE or t == TYPE_ARRAY: l.append('' PyErr_SetString(PyExc_NotImplementedError, "Setting %s");''%ty.typename) l.append('' return -1;'') else:
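To give an idea of what the new IDL type produces (a sketch of the expected generator output, based on the gentypes.py hunk above rather than copied from a generated header), an Array field such as the one the next patch adds to libxl_numainfo ends up as a pointer member plus the length field that the lenvar contributes to the containing struct, roughly like this:

/* roughly what ("dists", Array(uint32, "num_dists")) generates in libxl.h */
typedef struct libxl_numainfo {
    uint64_t size;
    uint64_t free;
    int num_dists;      /* the lenvar, added to the parent aggregate */
    uint32_t * dists;   /* the array itself, disposed of with free() */
} libxl_numainfo;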
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()
Make some NUMA node information available to the toolstack. Achieve this by means of xc_numainfo(), which exposes memory size and amount of free memory of each node, as well as the relative distances of each node to all the others. For properly exposing distances we need the IDL to support arrays. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Changes from v1: * malloc converted to libxl__zalloc(NOGC, ...). * The patch also accommodates some bits of what was in "libxc, libxl: introduce xc_nodemap_t and libxl_nodemap" which was removed as well, as full support for node maps at libxc level is not needed (yet!). diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c --- a/tools/libxc/xc_misc.c +++ b/tools/libxc/xc_misc.c @@ -35,6 +35,20 @@ int xc_get_max_cpus(xc_interface *xch) return max_cpus; } +int xc_get_max_nodes(xc_interface *xch) +{ + static int max_nodes = 0; + xc_physinfo_t physinfo; + + if ( max_nodes ) + return max_nodes; + + if ( !xc_physinfo(xch, &physinfo) ) + max_nodes = physinfo.max_node_id + 1; + + return max_nodes; +} + int xc_get_cpumap_size(xc_interface *xch) { return (xc_get_max_cpus(xch) + 7) / 8; diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -329,6 +329,12 @@ int xc_get_cpumap_size(xc_interface *xch /* allocate a cpumap */ xc_cpumap_t xc_cpumap_alloc(xc_interface *xch); + /* + * NODEMAP handling + */ +/* return maximum number of NUMA nodes the hypervisor supports */ +int xc_get_max_nodes(xc_interface *xch); + /* * DOMAIN DEBUGGING FUNCTIONS */ diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -3223,6 +3223,71 @@ fail: return ret; } +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr) +{ + xc_numainfo_t ninfo; + DECLARE_HYPERCALL_BUFFER(xc_node_to_memsize_t, memsize); + DECLARE_HYPERCALL_BUFFER(xc_node_to_memfree_t, memfree); + DECLARE_HYPERCALL_BUFFER(uint32_t, node_dists); + libxl_numainfo *ret = NULL; + int i, j, max_nodes; + + max_nodes = libxl_get_max_nodes(ctx); + if (max_nodes == 0) + { + LIBXL__LOG(ctx, XTL_ERROR, "Unable to determine number of NODES"); + return NULL; + } + + memsize = xc_hypercall_buffer_alloc + (ctx->xch, memsize, sizeof(*memsize) * max_nodes); + memfree = xc_hypercall_buffer_alloc + (ctx->xch, memfree, sizeof(*memfree) * max_nodes); + node_dists = xc_hypercall_buffer_alloc + (ctx->xch, node_dists, sizeof(*node_dists) * max_nodes * max_nodes); + if ((memsize == NULL) || (memfree == NULL) || (node_dists == NULL)) { + LIBXL__LOG_ERRNOVAL(ctx, XTL_ERROR, ENOMEM, + "Unable to allocate hypercall arguments"); + goto fail; + } + + set_xen_guest_handle(ninfo.node_to_memsize, memsize); + set_xen_guest_handle(ninfo.node_to_memfree, memfree); + set_xen_guest_handle(ninfo.node_to_node_distance, node_dists); + ninfo.max_node_index = max_nodes - 1; + if (xc_numainfo(ctx->xch, &ninfo) != 0) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting numainfo"); + goto fail; + } + + if (ninfo.max_node_index < max_nodes - 1) + max_nodes = ninfo.max_node_index + 1; + + ret = libxl__zalloc(NOGC, sizeof(libxl_numainfo) * max_nodes); + for (i = 0; i < max_nodes; i++) + ret[i].dists = libxl__zalloc(NULL, sizeof(*node_dists) * max_nodes); + + for (i = 0; i < max_nodes; i++) { +#define V(mem, i) (mem[i] == INVALID_NUMAINFO_ID) ? 
\ + LIBXL_NUMAINFO_INVALID_ENTRY : mem[i] + ret[i].size = V(memsize, i); + ret[i].free = V(memfree, i); + ret[i].num_dists = max_nodes; + for (j = 0; j < ret[i].num_dists; j++) + ret[i].dists[j] = V(node_dists, i * max_nodes + j); +#undef V + } + +fail: + xc_hypercall_buffer_free(ctx->xch, memsize); + xc_hypercall_buffer_free(ctx->xch, memfree); + xc_hypercall_buffer_free(ctx->xch, node_dists); + + if (ret) + *nr = max_nodes; + return ret; +} + const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx) { union { diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -532,6 +532,9 @@ int libxl_domain_preserve(libxl_ctx *ctx /* get max. number of cpus supported by hypervisor */ int libxl_get_max_cpus(libxl_ctx *ctx); +/* get max. number of NUMA nodes supported by hypervisor */ +int libxl_get_max_nodes(libxl_ctx *ctx); + int libxl_domain_rename(libxl_ctx *ctx, uint32_t domid, const char *old_name, const char *new_name); @@ -769,6 +772,13 @@ int libxl_get_physinfo(libxl_ctx *ctx, l #define LIBXL_CPUTOPOLOGY_INVALID_ENTRY (~(uint32_t)0) libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nr); void libxl_cputopology_list_free(libxl_cputopology *, int nr); +#define LIBXL_NUMAINFO_INVALID_ENTRY (~(uint32_t)0) +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr); + /* On success, a list of nr libxl_numainfo elements is returned. + * That is from malloc, thus it is up to the caller to invoke + * libxl_cpupoolinfo_list_free() on it. + */ +void libxl_numainfo_list_free(libxl_numainfo *, int nr); libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, int *nb_vcpu, int *nrcpus); void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr); diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -423,6 +423,12 @@ libxl_physinfo = Struct("physinfo", [ ("cap_hvm_directio", bool), ], dir=DIR_OUT) +libxl_numainfo = Struct("numainfo", [ + ("size", uint64), + ("free", uint64), + ("dists", Array(uint32, "num_dists")), + ], dir=DIR_OUT) + libxl_cputopology = Struct("cputopology", [ ("core", uint32), ("socket", uint32), diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c --- a/tools/libxl/libxl_utils.c +++ b/tools/libxl/libxl_utils.c @@ -537,6 +537,11 @@ int libxl_get_max_cpus(libxl_ctx *ctx) return xc_get_max_cpus(ctx->xch); } +int libxl_get_max_nodes(libxl_ctx *ctx) +{ + return xc_get_max_nodes(ctx->xch); +} + int libxl__enum_from_string(const libxl_enum_string_table *t, const char *s, int *e) { @@ -551,6 +556,14 @@ int libxl__enum_from_string(const libxl_ return ERROR_FAIL; } +void libxl_numainfo_list_free(libxl_numainfo *list, int nr) +{ + int i; + for (i = 0; i < nr; i++) + libxl_numainfo_dispose(&list[i]); + free(list); +} + void libxl_cputopology_list_free(libxl_cputopology *list, int nr) { int i; diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h --- a/xen/include/public/sysctl.h +++ b/xen/include/public/sysctl.h @@ -484,6 +484,7 @@ typedef struct xen_sysctl_topologyinfo x DEFINE_XEN_GUEST_HANDLE(xen_sysctl_topologyinfo_t); /* XEN_SYSCTL_numainfo */ +#define INVALID_NUMAINFO_ID (~0U) struct xen_sysctl_numainfo { /* * IN: maximum addressable entry in the caller-provided arrays.
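For the record, here is a minimal sketch of how an application linked against libxl could consume the new call (dump_numainfo() is just an illustrative name, and an already initialized libxl_ctx is assumed):

/* Sketch: dump the per-node memory and distance information exposed by
 * libxl_get_numainfo(). */
#include <stdio.h>
#include <inttypes.h>
#include <libxl.h>

static void dump_numainfo(libxl_ctx *ctx)
{
    libxl_numainfo *info;
    int nr, i, j;

    info = libxl_get_numainfo(ctx, &nr);
    if (!info)
        return;

    for (i = 0; i < nr; i++) {
        if (info[i].size == LIBXL_NUMAINFO_INVALID_ENTRY)
            continue;  /* node not present / no information */
        printf("node %d: size=%"PRIu64" free=%"PRIu64" dists=", i,
               info[i].size, info[i].free);
        for (j = 0; j < info[i].num_dists; j++)
            printf("%s%"PRIu32, j ? "," : "", info[i].dists[j]);
        printf("\n");
    }

    /* the returned array is malloc()ed, so the caller must free it */
    libxl_numainfo_list_free(info, nr);
}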
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 04 of 10 v2] xl: add more NUMA information to `xl info -n'
So that the user knows how much memory there is on each node and how
far they are from each other.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * integer division replaced by right shift.

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4254,6 +4254,36 @@ static void output_physinfo(void)
     return;
 }
 
+static void output_numainfo(void)
+{
+    libxl_numainfo *info;
+    int i, j, nr;
+
+    info = libxl_get_numainfo(ctx, &nr);
+    if (info == NULL) {
+        fprintf(stderr, "libxl_get_numainfo failed.\n");
+        return;
+    }
+
+    printf("numa_info :\n");
+    printf("node: memsize memfree distances\n");
+
+    for (i = 0; i < nr; i++) {
+        if (info[i].size != LIBXL_NUMAINFO_INVALID_ENTRY) {
+            printf("%4d: %6"PRIu64" %6"PRIu64" %d", i,
+                   info[i].size >> 20, info[i].free >> 20,
+                   info[i].dists[0]);
+            for (j = 1; j < info[i].num_dists; j++)
+                printf(",%d", info[i].dists[j]);
+            printf("\n");
+        }
+    }
+
+    libxl_numainfo_list_free(info, nr);
+
+    return;
+}
+
 static void output_topologyinfo(void)
 {
     libxl_cputopology *info;
@@ -4276,8 +4306,6 @@ static void output_topologyinfo(void)
 
     libxl_cputopology_list_free(info, nr);
 
-    printf("numa_info : none\n");
-
     return;
 }
 
@@ -4287,8 +4315,10 @@ static void info(int numa)
 
     output_physinfo();
 
-    if (numa)
+    if (numa) {
         output_topologyinfo();
+        output_numainfo();
+    }
 
     output_xeninfo();
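Just to illustrate the intended effect (with purely made-up numbers for a hypothetical two-node host with 4 GiB per node; the actual column spacing may differ), `xl info -n' would now print something along the lines of:

numa_info              :
node:    memsize    memfree    distances
   0:      4096       3072      10,20
   1:      4096       3768      20,10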
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap
And leave to the caller the burden of knowing and remembering what kind of bitmap each instance of libxl_bitmap is. This is basically just some s/libxl_cpumap/libxl_bitmap/ (and some other related interface name substitution, e.g., libxl_for_each_cpu) in a bunch of files, with no real functional change involved. A specific allocation helper is introduced, besides libxl_bitmap_alloc(). It is called libxl_cpu_bitmap_alloc() and is meant at substituting the old libxl_cpumap_alloc(). It is just something easier to use in cases where one wants to allocate a libxl_bitmap that is going to serve as a cpu map. This is because we want to be able to deal with both cpu and NUMA node maps, but we don''t want to duplicate all the various helpers and wrappers. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.eu.com> Changes from v1: * this patch replaces "libxl: abstract libxl_cpumap to just libxl_map" as it directly change the name of the old type instead of adding one more abstraction layer. diff --git a/tools/libxl/gentest.py b/tools/libxl/gentest.py --- a/tools/libxl/gentest.py +++ b/tools/libxl/gentest.py @@ -20,7 +20,7 @@ def randomize_case(s): def randomize_enum(e): return random.choice([v.name for v in e.values]) -handcoded = ["libxl_cpumap", "libxl_key_value_list", +handcoded = ["libxl_bitmap", "libxl_key_value_list", "libxl_cpuid_policy_list", "libxl_string_list"] def gen_rand_init(ty, v, indent = " ", parent = None): @@ -117,16 +117,16 @@ static void rand_bytes(uint8_t *p, size_ p[i] = rand() % 256; } -static void libxl_cpumap_rand_init(libxl_cpumap *cpumap) +static void libxl_bitmap_rand_init(libxl_bitmap *bitmap) { int i; - cpumap->size = rand() % 16; - cpumap->map = calloc(cpumap->size, sizeof(*cpumap->map)); - libxl_for_each_cpu(i, *cpumap) { + bitmap->size = rand() % 16; + bitmap->map = calloc(bitmap->size, sizeof(*bitmap->map)); + libxl_for_each_bit(i, *bitmap) { if (rand() % 2) - libxl_cpumap_set(cpumap, i); + libxl_bitmap_set(bitmap, i); else - libxl_cpumap_reset(cpumap, i); + libxl_bitmap_reset(bitmap, i); } } diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -570,7 +570,7 @@ static int cpupool_info(libxl__gc *gc, info->poolid = xcinfo->cpupool_id; info->sched = xcinfo->sched_id; info->n_dom = xcinfo->n_dom; - if (libxl_cpumap_alloc(CTX, &info->cpumap)) + if (libxl_cpu_bitmap_alloc(CTX, &info->cpumap)) goto out; memcpy(info->cpumap.map, xcinfo->cpumap, info->cpumap.size); @@ -3352,7 +3352,7 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ct } for (*nb_vcpu = 0; *nb_vcpu <= domaininfo.max_vcpu_id; ++*nb_vcpu, ++ptr) { - if (libxl_cpumap_alloc(ctx, &ptr->cpumap)) { + if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap)) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap"); return NULL; } @@ -3375,7 +3375,7 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ct } int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, - libxl_cpumap *cpumap) + libxl_bitmap *cpumap) { if (xc_vcpu_setaffinity(ctx->xch, domid, vcpuid, cpumap->map)) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting vcpu affinity"); @@ -3385,7 +3385,7 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct } int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, - unsigned int max_vcpus, libxl_cpumap *cpumap) + unsigned int max_vcpus, libxl_bitmap *cpumap) { int i, rc = 0; @@ -3399,7 +3399,7 @@ int libxl_set_vcpuaffinity_all(libxl_ctx return rc; } -int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap) +int libxl_set_vcpuonline(libxl_ctx 
*ctx, uint32_t domid, libxl_bitmap *cpumap) { GC_INIT(ctx); libxl_dominfo info; @@ -3419,7 +3419,7 @@ retry_transaction: for (i = 0; i <= info.vcpu_max_id; i++) libxl__xs_write(gc, t, libxl__sprintf(gc, "%s/cpu/%u/availability", dompath, i), - "%s", libxl_cpumap_test(cpumap, i) ? "online" : "offline"); + "%s", libxl_bitmap_test(cpumap, i) ? "online" : "offline"); if (!xs_transaction_end(ctx->xsh, t, 0)) { if (errno == EAGAIN) goto retry_transaction; @@ -4015,7 +4015,7 @@ int libxl_tmem_freeable(libxl_ctx *ctx) return rc; } -int libxl_get_freecpus(libxl_ctx *ctx, libxl_cpumap *cpumap) +int libxl_get_freecpus(libxl_ctx *ctx, libxl_bitmap *cpumap) { int ncpus; @@ -4034,7 +4034,7 @@ int libxl_get_freecpus(libxl_ctx *ctx, l int libxl_cpupool_create(libxl_ctx *ctx, const char *name, libxl_scheduler sched, - libxl_cpumap cpumap, libxl_uuid *uuid, + libxl_bitmap cpumap, libxl_uuid *uuid, uint32_t *poolid) { GC_INIT(ctx); @@ -4057,8 +4057,8 @@ int libxl_cpupool_create(libxl_ctx *ctx, return ERROR_FAIL; } - libxl_for_each_cpu(i, cpumap) - if (libxl_cpumap_test(&cpumap, i)) { + libxl_for_each_bit(i, cpumap) + if (libxl_bitmap_test(&cpumap, i)) { rc = xc_cpupool_addcpu(ctx->xch, *poolid, i); if (rc) { LIBXL__LOG_ERRNOVAL(ctx, LIBXL__LOG_ERROR, rc, @@ -4093,7 +4093,7 @@ int libxl_cpupool_destroy(libxl_ctx *ctx int rc, i; xc_cpupoolinfo_t *info; xs_transaction_t t; - libxl_cpumap cpumap; + libxl_bitmap cpumap; info = xc_cpupool_getinfo(ctx->xch, poolid); if (info == NULL) { @@ -4106,12 +4106,12 @@ int libxl_cpupool_destroy(libxl_ctx *ctx goto out; rc = ERROR_NOMEM; - if (libxl_cpumap_alloc(ctx, &cpumap)) + if (libxl_cpu_bitmap_alloc(ctx, &cpumap)) goto out; memcpy(cpumap.map, info->cpumap, cpumap.size); - libxl_for_each_cpu(i, cpumap) - if (libxl_cpumap_test(&cpumap, i)) { + libxl_for_each_bit(i, cpumap) + if (libxl_bitmap_test(&cpumap, i)) { rc = xc_cpupool_removecpu(ctx->xch, poolid, i); if (rc) { LIBXL__LOG_ERRNOVAL(ctx, LIBXL__LOG_ERROR, rc, @@ -4140,7 +4140,7 @@ int libxl_cpupool_destroy(libxl_ctx *ctx rc = 0; out1: - libxl_cpumap_dispose(&cpumap); + libxl_bitmap_dispose(&cpumap); out: xc_cpupool_infofree(ctx->xch, info); GC_FREE; @@ -4208,7 +4208,7 @@ int libxl_cpupool_cpuadd_node(libxl_ctx { int rc = 0; int cpu, nr; - libxl_cpumap freemap; + libxl_bitmap freemap; libxl_cputopology *topology; if (libxl_get_freecpus(ctx, &freemap)) { @@ -4223,7 +4223,7 @@ int libxl_cpupool_cpuadd_node(libxl_ctx *cpus = 0; for (cpu = 0; cpu < nr; cpu++) { - if (libxl_cpumap_test(&freemap, cpu) && (topology[cpu].node == node) && + if (libxl_bitmap_test(&freemap, cpu) && (topology[cpu].node == node) && !libxl_cpupool_cpuadd(ctx, poolid, cpu)) { (*cpus)++; } @@ -4232,7 +4232,7 @@ int libxl_cpupool_cpuadd_node(libxl_ctx free(topology); out: - libxl_cpumap_dispose(&freemap); + libxl_bitmap_dispose(&freemap); return rc; } @@ -4274,7 +4274,7 @@ int libxl_cpupool_cpuremove_node(libxl_c if (poolinfo[p].poolid == poolid) { for (cpu = 0; cpu < nr_cpus; cpu++) { if ((topology[cpu].node == node) && - libxl_cpumap_test(&poolinfo[p].cpumap, cpu) && + libxl_bitmap_test(&poolinfo[p].cpumap, cpu) && !libxl_cpupool_cpuremove(ctx, poolid, cpu)) { (*cpus)++; } diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -285,8 +285,8 @@ typedef uint64_t libxl_ev_user; typedef struct { uint32_t size; /* number of bytes in map */ uint8_t *map; -} libxl_cpumap; -void libxl_cpumap_dispose(libxl_cpumap *map); +} libxl_bitmap; +void libxl_bitmap_dispose(libxl_bitmap *map); /* 
libxl_cpuid_policy_list is a dynamic array storing CPUID policies * for multiple leafs. It is terminated with an entry holding @@ -783,10 +783,10 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ct int *nb_vcpu, int *nrcpus); void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr); int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, - libxl_cpumap *cpumap); + libxl_bitmap *cpumap); int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, - unsigned int max_vcpus, libxl_cpumap *cpumap); -int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap); + unsigned int max_vcpus, libxl_bitmap *cpumap); +int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap); libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx); @@ -836,10 +836,10 @@ int libxl_tmem_shared_auth(libxl_ctx *ct int auth); int libxl_tmem_freeable(libxl_ctx *ctx); -int libxl_get_freecpus(libxl_ctx *ctx, libxl_cpumap *cpumap); +int libxl_get_freecpus(libxl_ctx *ctx, libxl_bitmap *cpumap); int libxl_cpupool_create(libxl_ctx *ctx, const char *name, libxl_scheduler sched, - libxl_cpumap cpumap, libxl_uuid *uuid, + libxl_bitmap cpumap, libxl_uuid *uuid, uint32_t *poolid); int libxl_cpupool_destroy(libxl_ctx *ctx, uint32_t poolid); int libxl_cpupool_rename(libxl_ctx *ctx, const char *name, uint32_t poolid); diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -150,9 +150,9 @@ int libxl__domain_build_info_setdefault( b_info->cur_vcpus = 1; if (!b_info->cpumap.size) { - if (libxl_cpumap_alloc(CTX, &b_info->cpumap)) + if (libxl_cpu_bitmap_alloc(CTX, &b_info->cpumap)) return ERROR_NOMEM; - libxl_cpumap_set_any(&b_info->cpumap); + libxl_bitmap_set_any(&b_info->cpumap); } if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT) diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c --- a/tools/libxl/libxl_json.c +++ b/tools/libxl/libxl_json.c @@ -99,8 +99,8 @@ yajl_gen_status libxl_uuid_gen_json(yajl return yajl_gen_string(hand, (const unsigned char *)buf, LIBXL_UUID_FMTLEN); } -yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, - libxl_cpumap *cpumap) +yajl_gen_status libxl_bitmap_gen_json(yajl_gen hand, + libxl_bitmap *cpumap) { yajl_gen_status s; int i; @@ -108,8 +108,8 @@ yajl_gen_status libxl_cpumap_gen_json(ya s = yajl_gen_array_open(hand); if (s != yajl_gen_status_ok) goto out; - libxl_for_each_cpu(i, *cpumap) { - if (libxl_cpumap_test(cpumap, i)) { + libxl_for_each_bit(i, *cpumap) { + if (libxl_bitmap_test(cpumap, i)) { s = yajl_gen_integer(hand, i); if (s != yajl_gen_status_ok) goto out; } diff --git a/tools/libxl/libxl_json.h b/tools/libxl/libxl_json.h --- a/tools/libxl/libxl_json.h +++ b/tools/libxl/libxl_json.h @@ -26,7 +26,7 @@ yajl_gen_status libxl_defbool_gen_json(y yajl_gen_status libxl_domid_gen_json(yajl_gen hand, libxl_domid *p); yajl_gen_status libxl_uuid_gen_json(yajl_gen hand, libxl_uuid *p); yajl_gen_status libxl_mac_gen_json(yajl_gen hand, libxl_mac *p); -yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, libxl_cpumap *p); +yajl_gen_status libxl_bitmap_gen_json(yajl_gen hand, libxl_bitmap *p); yajl_gen_status libxl_cpuid_policy_list_gen_json(yajl_gen hand, libxl_cpuid_policy_list *p); yajl_gen_status libxl_string_list_gen_json(yajl_gen hand, libxl_string_list *p); diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -10,7 +10,7 @@ libxl_defbool = Builtin("defbool", passb libxl_domid = 
Builtin("domid", json_fn = "yajl_gen_integer", autogenerate_json = False) libxl_uuid = Builtin("uuid", passby=PASS_BY_REFERENCE) libxl_mac = Builtin("mac", passby=PASS_BY_REFERENCE) -libxl_cpumap = Builtin("cpumap", dispose_fn="libxl_cpumap_dispose", passby=PASS_BY_REFERENCE) +libxl_bitmap = Builtin("bitmap", dispose_fn="libxl_bitmap_dispose", passby=PASS_BY_REFERENCE) libxl_cpuid_policy_list = Builtin("cpuid_policy_list", dispose_fn="libxl_cpuid_dispose", passby=PASS_BY_REFERENCE) libxl_string_list = Builtin("string_list", dispose_fn="libxl_string_list_dispose", passby=PASS_BY_REFERENCE) @@ -188,7 +188,7 @@ libxl_cpupoolinfo = Struct("cpupoolinfo" ("poolid", uint32), ("sched", libxl_scheduler), ("n_dom", uint32), - ("cpumap", libxl_cpumap) + ("cpumap", libxl_bitmap) ], dir=DIR_OUT) libxl_vminfo = Struct("vminfo", [ @@ -238,7 +238,7 @@ libxl_domain_sched_params = Struct("doma libxl_domain_build_info = Struct("domain_build_info",[ ("max_vcpus", integer), ("cur_vcpus", integer), - ("cpumap", libxl_cpumap), + ("cpumap", libxl_bitmap), ("tsc_mode", libxl_tsc_mode), ("max_memkb", MemKB), ("target_memkb", MemKB), @@ -399,7 +399,7 @@ libxl_vcpuinfo = Struct("vcpuinfo", [ ("blocked", bool), ("running", bool), ("vcpu_time", uint64), # total vcpu time ran (ns) - ("cpumap", libxl_cpumap), # current cpu''s affinities + ("cpumap", libxl_bitmap), # current cpu''s affinities ], dir=DIR_OUT) libxl_physinfo = Struct("physinfo", [ diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c --- a/tools/libxl/libxl_utils.c +++ b/tools/libxl/libxl_utils.c @@ -489,47 +489,42 @@ int libxl_mac_to_device_nic(libxl_ctx *c return rc; } -int libxl_cpumap_alloc(libxl_ctx *ctx, libxl_cpumap *cpumap) +int libxl_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *bitmap, int n_bits) { - int max_cpus; int sz; - max_cpus = libxl_get_max_cpus(ctx); - if (max_cpus == 0) - return ERROR_FAIL; - - sz = (max_cpus + 7) / 8; - cpumap->map = calloc(sz, sizeof(*cpumap->map)); - if (!cpumap->map) + sz = (n_bits + 7) / 8; + bitmap->map = calloc(sz, sizeof(*bitmap->map)); + if (!bitmap->map) return ERROR_NOMEM; - cpumap->size = sz; + bitmap->size = sz; return 0; } -void libxl_cpumap_dispose(libxl_cpumap *map) +void libxl_bitmap_dispose(libxl_bitmap *map) { free(map->map); } -int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu) +int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit) { - if (cpu >= cpumap->size * 8) + if (bit >= bitmap->size * 8) return 0; - return (cpumap->map[cpu / 8] & (1 << (cpu & 7))) ? 1 : 0; + return (bitmap->map[bit / 8] & (1 << (bit & 7))) ? 
1 : 0; } -void libxl_cpumap_set(libxl_cpumap *cpumap, int cpu) +void libxl_bitmap_set(libxl_bitmap *bitmap, int bit) { - if (cpu >= cpumap->size * 8) + if (bit >= bitmap->size * 8) return; - cpumap->map[cpu / 8] |= 1 << (cpu & 7); + bitmap->map[bit / 8] |= 1 << (bit & 7); } -void libxl_cpumap_reset(libxl_cpumap *cpumap, int cpu) +void libxl_bitmap_reset(libxl_bitmap *bitmap, int bit) { - if (cpu >= cpumap->size * 8) + if (bit >= bitmap->size * 8) return; - cpumap->map[cpu / 8] &= ~(1 << (cpu & 7)); + bitmap->map[bit / 8] &= ~(1 << (bit & 7)); } int libxl_get_max_cpus(libxl_ctx *ctx) diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h --- a/tools/libxl/libxl_utils.h +++ b/tools/libxl/libxl_utils.h @@ -63,25 +63,38 @@ int libxl_devid_to_device_nic(libxl_ctx int libxl_vdev_to_device_disk(libxl_ctx *ctx, uint32_t domid, const char *vdev, libxl_device_disk *disk); -int libxl_cpumap_alloc(libxl_ctx *ctx, libxl_cpumap *cpumap); -int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu); -void libxl_cpumap_set(libxl_cpumap *cpumap, int cpu); -void libxl_cpumap_reset(libxl_cpumap *cpumap, int cpu); -static inline void libxl_cpumap_set_any(libxl_cpumap *cpumap) +int libxl_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *bitmap, int n_bits); + /* Allocated bimap is from malloc, libxl_bitmap_dispose() to be + * called by the application when done. */ +int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit); +void libxl_bitmap_set(libxl_bitmap *bitmap, int bit); +void libxl_bitmap_reset(libxl_bitmap *bitmap, int bit); +static inline void libxl_bitmap_set_any(libxl_bitmap *bitmap) { - memset(cpumap->map, -1, cpumap->size); + memset(bitmap->map, -1, bitmap->size); } -static inline void libxl_cpumap_set_none(libxl_cpumap *cpumap) +static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap) { - memset(cpumap->map, 0, cpumap->size); + memset(bitmap->map, 0, bitmap->size); } -static inline int libxl_cpumap_cpu_valid(libxl_cpumap *cpumap, int cpu) +static inline int libxl_bitmap_cpu_valid(const libxl_bitmap *bitmap, int bit) { - return cpu >= 0 && cpu < (cpumap->size * 8); + return bit >= 0 && bit < (bitmap->size * 8); } -#define libxl_for_each_cpu(var, map) for (var = 0; var < (map).size * 8; var++) -#define libxl_for_each_set_cpu(v, m) for (v = 0; v < (m).size * 8; v++) \ - if (libxl_cpumap_test(&(m), v)) +#define libxl_for_each_bit(var, map) for (var = 0; var < (map).size * 8; var++) +#define libxl_for_each_set_bit(v, m) for (v = 0; v < (m).size * 8; v++) \ + if (libxl_bitmap_test(&(m), v)) + +static inline int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *cpumap) +{ + int max_cpus; + + max_cpus = libxl_get_max_cpus(ctx); + if (max_cpus == 0) + return ERROR_FAIL; + + return libxl_bitmap_alloc(ctx, cpumap, max_cpus); +} static inline uint32_t libxl__sizekb_to_mb(uint32_t s) { return (s + 1023) / 1024; diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -491,19 +491,19 @@ static void split_string_into_string_lis free(s); } -static int vcpupin_parse(char *cpu, libxl_cpumap *cpumap) -{ - libxl_cpumap exclude_cpumap; +static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) +{ + libxl_bitmap exclude_cpumap; uint32_t cpuida, cpuidb; char *endptr, *toka, *tokb, *saveptr = NULL; int i, rc = 0, rmcpu; if (!strcmp(cpu, "all")) { - libxl_cpumap_set_any(cpumap); + libxl_bitmap_set_any(cpumap); return 0; } - if (libxl_cpumap_alloc(ctx, &exclude_cpumap)) { + if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap)) { fprintf(stderr, 
"Error: Failed to allocate cpumap.\n"); return ENOMEM; } @@ -533,19 +533,19 @@ static int vcpupin_parse(char *cpu, libx } } while (cpuida <= cpuidb) { - rmcpu == 0 ? libxl_cpumap_set(cpumap, cpuida) : - libxl_cpumap_set(&exclude_cpumap, cpuida); + rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) : + libxl_bitmap_set(&exclude_cpumap, cpuida); cpuida++; } } /* Clear all the cpus from the removal list */ - libxl_for_each_set_cpu(i, exclude_cpumap) { - libxl_cpumap_reset(cpumap, i); + libxl_for_each_set_bit(i, exclude_cpumap) { + libxl_bitmap_reset(cpumap, i); } vcpp_out: - libxl_cpumap_dispose(&exclude_cpumap); + libxl_bitmap_dispose(&exclude_cpumap); return rc; } @@ -685,7 +685,7 @@ static void parse_config_data(const char if (!xlu_cfg_get_list (config, "cpus", &cpus, 0, 1)) { int i, n_cpus = 0; - if (libxl_cpumap_alloc(ctx, &b_info->cpumap)) { + if (libxl_cpu_bitmap_alloc(ctx, &b_info->cpumap)) { fprintf(stderr, "Unable to allocate cpumap\n"); exit(1); } @@ -705,14 +705,14 @@ static void parse_config_data(const char * the cpumap derived from the list ensures memory is being * allocated on the proper nodes anyway. */ - libxl_cpumap_set_none(&b_info->cpumap); + libxl_bitmap_set_none(&b_info->cpumap); while ((buf = xlu_cfg_get_listitem(cpus, n_cpus)) != NULL) { i = atoi(buf); - if (!libxl_cpumap_cpu_valid(&b_info->cpumap, i)) { + if (!libxl_bitmap_cpu_valid(&b_info->cpumap, i)) { fprintf(stderr, "cpu %d illegal\n", i); exit(1); } - libxl_cpumap_set(&b_info->cpumap, i); + libxl_bitmap_set(&b_info->cpumap, i); if (n_cpus < b_info->max_vcpus) vcpu_to_pcpu[n_cpus] = i; n_cpus++; @@ -721,12 +721,12 @@ static void parse_config_data(const char else if (!xlu_cfg_get_string (config, "cpus", &buf, 0)) { char *buf2 = strdup(buf); - if (libxl_cpumap_alloc(ctx, &b_info->cpumap)) { + if (libxl_cpu_bitmap_alloc(ctx, &b_info->cpumap)) { fprintf(stderr, "Unable to allocate cpumap\n"); exit(1); } - libxl_cpumap_set_none(&b_info->cpumap); + libxl_bitmap_set_none(&b_info->cpumap); if (vcpupin_parse(buf2, &b_info->cpumap)) exit(1); free(buf2); @@ -1806,26 +1806,26 @@ start: /* If single vcpu to pcpu mapping was requested, honour it */ if (vcpu_to_pcpu) { - libxl_cpumap vcpu_cpumap; - - libxl_cpumap_alloc(ctx, &vcpu_cpumap); + libxl_bitmap vcpu_cpumap; + + libxl_cpu_bitmap_alloc(ctx, &vcpu_cpumap); for (i = 0; i < d_config.b_info.max_vcpus; i++) { if (vcpu_to_pcpu[i] != -1) { - libxl_cpumap_set_none(&vcpu_cpumap); - libxl_cpumap_set(&vcpu_cpumap, vcpu_to_pcpu[i]); + libxl_bitmap_set_none(&vcpu_cpumap); + libxl_bitmap_set(&vcpu_cpumap, vcpu_to_pcpu[i]); } else { - libxl_cpumap_set_any(&vcpu_cpumap); + libxl_bitmap_set_any(&vcpu_cpumap); } if (libxl_set_vcpuaffinity(ctx, domid, i, &vcpu_cpumap)) { fprintf(stderr, "setting affinity failed on vcpu `%d''.\n", i); - libxl_cpumap_dispose(&vcpu_cpumap); + libxl_bitmap_dispose(&vcpu_cpumap); free(vcpu_to_pcpu); ret = ERROR_FAIL; goto error_out; } } - libxl_cpumap_dispose(&vcpu_cpumap); + libxl_bitmap_dispose(&vcpu_cpumap); free(vcpu_to_pcpu); vcpu_to_pcpu = NULL; } @@ -4063,7 +4063,7 @@ int main_vcpulist(int argc, char **argv) static void vcpupin(const char *d, const char *vcpu, char *cpu) { libxl_vcpuinfo *vcpuinfo; - libxl_cpumap cpumap; + libxl_bitmap cpumap; uint32_t vcpuid; char *endptr; @@ -4080,7 +4080,7 @@ static void vcpupin(const char *d, const find_domain(d); - if (libxl_cpumap_alloc(ctx, &cpumap)) { + if (libxl_cpu_bitmap_alloc(ctx, &cpumap)) { goto vcpupin_out; } @@ -4107,7 +4107,7 @@ static void vcpupin(const char *d, const libxl_vcpuinfo_list_free(vcpuinfo, 
nb_vcpu); } vcpupin_out1: - libxl_cpumap_dispose(&cpumap); + libxl_bitmap_dispose(&cpumap); vcpupin_out: ; } @@ -4127,7 +4127,7 @@ static void vcpuset(const char *d, const { char *endptr; unsigned int max_vcpus, i; - libxl_cpumap cpumap; + libxl_bitmap cpumap; max_vcpus = strtoul(nr_vcpus, &endptr, 10); if (nr_vcpus == endptr) { @@ -4137,17 +4137,17 @@ static void vcpuset(const char *d, const find_domain(d); - if (libxl_cpumap_alloc(ctx, &cpumap)) { - fprintf(stderr, "libxl_cpumap_alloc failed\n"); + if (libxl_cpu_bitmap_alloc(ctx, &cpumap)) { + fprintf(stderr, "libxl_cpu_bitmap_alloc failed\n"); return; } for (i = 0; i < max_vcpus; i++) - libxl_cpumap_set(&cpumap, i); + libxl_bitmap_set(&cpumap, i); if (libxl_set_vcpuonline(ctx, domid, &cpumap) < 0) fprintf(stderr, "libxl_set_vcpuonline failed domid=%d max_vcpus=%d\n", domid, max_vcpus); - libxl_cpumap_dispose(&cpumap); + libxl_bitmap_dispose(&cpumap); } int main_vcpuset(int argc, char **argv) @@ -4211,7 +4211,7 @@ static void output_physinfo(void) libxl_physinfo info; const libxl_version_info *vinfo; unsigned int i; - libxl_cpumap cpumap; + libxl_bitmap cpumap; int n = 0; if (libxl_get_physinfo(ctx, &info) != 0) { @@ -4243,8 +4243,8 @@ static void output_physinfo(void) printf("sharing_used_memory : %"PRIu64"\n", info.sharing_used_frames / i); } if (!libxl_get_freecpus(ctx, &cpumap)) { - libxl_for_each_cpu(i, cpumap) - if (libxl_cpumap_test(&cpumap, i)) + libxl_for_each_bit(i, cpumap) + if (libxl_bitmap_test(&cpumap, i)) n++; printf("free_cpus : %d\n", n); free(cpumap.map); @@ -5866,8 +5866,8 @@ int main_cpupoolcreate(int argc, char ** XLU_ConfigList *cpus; XLU_ConfigList *nodes; int n_cpus, n_nodes, i, n; - libxl_cpumap freemap; - libxl_cpumap cpumap; + libxl_bitmap freemap; + libxl_bitmap cpumap; libxl_uuid uuid; libxl_cputopology *topology; int rc = -ERROR_FAIL; @@ -5980,7 +5980,7 @@ int main_cpupoolcreate(int argc, char ** fprintf(stderr, "libxl_get_freecpus failed\n"); goto out_cfg; } - if (libxl_cpumap_alloc(ctx, &cpumap)) { + if (libxl_cpu_bitmap_alloc(ctx, &cpumap)) { fprintf(stderr, "Failed to allocate cpumap\n"); goto out_cfg; } @@ -5997,8 +5997,8 @@ int main_cpupoolcreate(int argc, char ** n = atoi(buf); for (i = 0; i < nr; i++) { if ((topology[i].node == n) && - libxl_cpumap_test(&freemap, i)) { - libxl_cpumap_set(&cpumap, i); + libxl_bitmap_test(&freemap, i)) { + libxl_bitmap_set(&cpumap, i); n_cpus++; } } @@ -6016,11 +6016,11 @@ int main_cpupoolcreate(int argc, char ** while ((buf = xlu_cfg_get_listitem(cpus, n_cpus)) != NULL) { i = atoi(buf); if ((i < 0) || (i >= freemap.size * 8) || - !libxl_cpumap_test(&freemap, i)) { + !libxl_bitmap_test(&freemap, i)) { fprintf(stderr, "cpu %d illegal or not free\n", i); goto out_cfg; } - libxl_cpumap_set(&cpumap, i); + libxl_bitmap_set(&cpumap, i); n_cpus++; } } else @@ -6118,8 +6118,8 @@ int main_cpupoollist(int argc, char **ar printf("%-19s", name); free(name); n = 0; - libxl_for_each_cpu(c, poolinfo[p].cpumap) - if (libxl_cpumap_test(&poolinfo[p].cpumap, c)) { + libxl_for_each_bit(c, poolinfo[p].cpumap) + if (libxl_bitmap_test(&poolinfo[p].cpumap, c)) { if (n && opt_cpus) printf(","); if (opt_cpus) printf("%d", c); n++; @@ -6318,7 +6318,7 @@ int main_cpupoolnumasplit(int argc, char int n_cpus; char name[16]; libxl_uuid uuid; - libxl_cpumap cpumap; + libxl_bitmap cpumap; libxl_cpupoolinfo *poolinfo; libxl_cputopology *topology; libxl_dominfo info; @@ -6348,7 +6348,7 @@ int main_cpupoolnumasplit(int argc, char return -ERROR_FAIL; } - if (libxl_cpumap_alloc(ctx, &cpumap)) { + if 
(libxl_cpu_bitmap_alloc(ctx, &cpumap)) { fprintf(stderr, "Failed to allocate cpumap\n"); libxl_cputopology_list_free(topology, n_cpus); return -ERROR_FAIL; @@ -6374,7 +6374,7 @@ int main_cpupoolnumasplit(int argc, char for (c = 0; c < n_cpus; c++) { if (topology[c].node == node) { topology[c].node = LIBXL_CPUTOPOLOGY_INVALID_ENTRY; - libxl_cpumap_set(&cpumap, n); + libxl_bitmap_set(&cpumap, n); n++; } } @@ -6396,7 +6396,7 @@ int main_cpupoolnumasplit(int argc, char fprintf(stderr, "failed to offline vcpus\n"); goto out; } - libxl_cpumap_set_none(&cpumap); + libxl_bitmap_set_none(&cpumap); for (c = 0; c < n_cpus; c++) { if (topology[c].node == LIBXL_CPUTOPOLOGY_INVALID_ENTRY) { @@ -6434,7 +6434,7 @@ int main_cpupoolnumasplit(int argc, char out: libxl_cputopology_list_free(topology, n_cpus); - libxl_cpumap_dispose(&cpumap); + libxl_bitmap_dispose(&cpumap); return ret; } diff --git a/tools/python/xen/lowlevel/xl/xl.c b/tools/python/xen/lowlevel/xl/xl.c --- a/tools/python/xen/lowlevel/xl/xl.c +++ b/tools/python/xen/lowlevel/xl/xl.c @@ -231,14 +231,14 @@ int attrib__libxl_cpuid_policy_list_set( return -1; } -int attrib__libxl_cpumap_set(PyObject *v, libxl_cpumap *pptr) +int attrib__libxl_bitmap_set(PyObject *v, libxl_bitmap *pptr) { int i; long cpu; for (i = 0; i < PyList_Size(v); i++) { cpu = PyInt_AsLong(PyList_GetItem(v, i)); - libxl_cpumap_set(pptr, cpu); + libxl_bitmap_set(pptr, cpu); } return 0; } @@ -293,14 +293,14 @@ PyObject *attrib__libxl_cpuid_policy_lis return NULL; } -PyObject *attrib__libxl_cpumap_get(libxl_cpumap *pptr) +PyObject *attrib__libxl_bitmap_get(libxl_bitmap *pptr) { PyObject *cpulist = NULL; int i; cpulist = PyList_New(0); - libxl_for_each_cpu(i, *pptr) { - if ( libxl_cpumap_test(pptr, i) ) { + libxl_for_each_bit(i, *pptr) { + if ( libxl_bitmap_test(pptr, i) ) { PyObject* pyint = PyInt_FromLong(i); PyList_Append(cpulist, pyint);
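As a usage note (a sketch with an illustrative function name, not code from the series), this is what the renamed interface looks like from a caller's perspective; the bitmap itself no longer knows what it represents, so the caller picks the appropriate allocation helper:

#include <stdio.h>
#include <libxl.h>
#include <libxl_utils.h>

static int pin_to_first_two_cpus(libxl_ctx *ctx, uint32_t domid,
                                 uint32_t vcpuid)
{
    libxl_bitmap cpumap;
    int i, rc;

    /* sized after the host's number of cpus, as libxl_cpumap_alloc() was */
    rc = libxl_cpu_bitmap_alloc(ctx, &cpumap);
    if (rc)
        return rc;

    libxl_bitmap_set_none(&cpumap);
    libxl_bitmap_set(&cpumap, 0);
    libxl_bitmap_set(&cpumap, 1);

    libxl_for_each_set_bit(i, cpumap)
        printf("bit %d is set\n", i);

    rc = libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap);

    libxl_bitmap_dispose(&cpumap);
    return rc;
}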
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit
By adding copying and *_is_full/*_is_empty facilities. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Changes from v1: * now libxl_is_full/empty return 1 if true and 0 if false, as logic (and as requested during review). diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c --- a/tools/libxl/libxl_utils.c +++ b/tools/libxl/libxl_utils.c @@ -506,6 +506,35 @@ void libxl_bitmap_dispose(libxl_bitmap * free(map->map); } +void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr, + const libxl_bitmap *sptr) +{ + int sz; + + sz = dptr->size = sptr->size; + memcpy(dptr->map, sptr->map, sz * sizeof(*dptr->map)); +} + +int libxl_bitmap_is_full(const libxl_bitmap *bitmap) +{ + int i; + + for (i = 0; i < bitmap->size; i++) + if (bitmap->map[i] != (uint8_t)-1) + return 0; + return 1; +} + +int libxl_bitmap_is_empty(const libxl_bitmap *bitmap) +{ + int i; + + for (i = 0; i < bitmap->size; i++) + if (bitmap->map[i]) + return 0; + return 1; +} + int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit) { if (bit >= bitmap->size * 8) diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h --- a/tools/libxl/libxl_utils.h +++ b/tools/libxl/libxl_utils.h @@ -66,6 +66,10 @@ int libxl_vdev_to_device_disk(libxl_ctx int libxl_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *bitmap, int n_bits); /* Allocated bimap is from malloc, libxl_bitmap_dispose() to be * called by the application when done. */ +void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr, + const libxl_bitmap *sptr); +int libxl_bitmap_is_full(const libxl_bitmap *bitmap); +int libxl_bitmap_is_empty(const libxl_bitmap *bitmap); int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit); void libxl_bitmap_set(libxl_bitmap *bitmap, int bit); void libxl_bitmap_reset(libxl_bitmap *bitmap, int bit);
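A small sketch of the new helpers in use (illustrative only); note that, as implemented above, libxl_bitmap_copy() does not allocate anything, so the destination must already have a map of the same size as the source:

#include <assert.h>
#include <libxl.h>
#include <libxl_utils.h>

static void copy_check_example(libxl_ctx *ctx)
{
    libxl_bitmap a, b;

    if (libxl_cpu_bitmap_alloc(ctx, &a))
        return;
    if (libxl_cpu_bitmap_alloc(ctx, &b)) {
        libxl_bitmap_dispose(&a);
        return;
    }

    libxl_bitmap_set_none(&a);
    assert(libxl_bitmap_is_empty(&a));

    libxl_bitmap_set_any(&a);
    assert(libxl_bitmap_is_full(&a));

    libxl_bitmap_copy(ctx, &b, &a);   /* b now mirrors a */
    assert(libxl_bitmap_is_full(&b));

    libxl_bitmap_dispose(&a);
    libxl_bitmap_dispose(&b);
}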
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 07 of 10 v2] libxl: introduce some node map helpers
To allow for allocating a node specific libxl_bitmap (as it is for cpu
number and maps). Helper functions to convert a node map to its
corresponding cpu map and vice versa are also implemented.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * This patch replaces "libxl: introduce libxl_nodemap", as the
   libxl_bitmap type introduced by the previous patch is now used for
   both cpu and node maps, as requested during v1 review.

diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -556,6 +556,50 @@ void libxl_bitmap_reset(libxl_bitmap *bi
     bitmap->map[bit / 8] &= ~(1 << (bit & 7));
 }
 
+int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
+                            const libxl_bitmap *nodemap,
+                            libxl_bitmap *cpumap)
+{
+    libxl_cputopology *tinfo = NULL;
+    int nr_cpus, i, rc = 0;
+
+    tinfo = libxl_get_cpu_topology(ctx, &nr_cpus);
+    if (tinfo == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    libxl_bitmap_set_none(cpumap);
+    for (i = 0; i < nr_cpus; i++) {
+        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+            libxl_bitmap_set(cpumap, i);
+    }
+ out:
+    libxl_cputopology_list_free(tinfo, nr_cpus);
+    return rc;
+}
+
+int libxl_cpumap_to_nodemap(libxl_ctx *ctx,
+                            const libxl_bitmap *cpumap,
+                            libxl_bitmap *nodemap)
+{
+    libxl_cputopology *tinfo = NULL;
+    int nr_cpus, i, rc = 0;
+
+    tinfo = libxl_get_cpu_topology(ctx, &nr_cpus);
+    if (tinfo == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    libxl_bitmap_set_none(nodemap);
+    libxl_for_each_set_bit(i, *cpumap)
+        libxl_bitmap_set(nodemap, tinfo[i].node);
+ out:
+    libxl_cputopology_list_free(tinfo, nr_cpus);
+    return rc;
+}
+
 int libxl_get_max_cpus(libxl_ctx *ctx)
 {
     return xc_get_max_cpus(ctx->xch);
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -100,6 +100,27 @@ static inline int libxl_cpu_bitmap_alloc
     return libxl_bitmap_alloc(ctx, cpumap, max_cpus);
 }
 
+static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx,
+                                          libxl_bitmap *nodemap)
+{
+    int max_nodes;
+
+    max_nodes = libxl_get_max_nodes(ctx);
+    if (max_nodes == 0)
+        return ERROR_FAIL;
+
+    return libxl_bitmap_alloc(ctx, nodemap, max_nodes);
+}
+
+int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
+                            const libxl_bitmap *nodemap,
+                            libxl_bitmap *cpumap);
+    /* populate cpumap with the cpus spanned by the nodes in nodemap */
+int libxl_cpumap_to_nodemap(libxl_ctx *ctx,
+                            const libxl_bitmap *cpumap,
+                            libxl_bitmap *nodemap);
+    /* populate nodemap with the nodes of the cpus in cpumap */
+
 static inline uint32_t libxl__sizekb_to_mb(uint32_t s) {
     return (s + 1023) / 1024;
 }
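And here is a sketch (with an illustrative name) of how the new helpers combine with the cpu/node allocation wrappers, e.g., to build the cpu map spanned by node 0:

#include <libxl.h>
#include <libxl_utils.h>

static int node0_cpumap_example(libxl_ctx *ctx, libxl_bitmap *cpumap)
{
    libxl_bitmap nodemap;
    int rc;

    rc = libxl_node_bitmap_alloc(ctx, &nodemap);
    if (rc)
        return rc;
    rc = libxl_cpu_bitmap_alloc(ctx, cpumap);
    if (rc)
        goto out;

    libxl_bitmap_set_none(&nodemap);
    libxl_bitmap_set(&nodemap, 0);

    /* cpumap ends up with a bit set for each cpu belonging to node 0 */
    rc = libxl_nodemap_to_cpumap(ctx, &nodemap, cpumap);

 out:
    libxl_bitmap_dispose(&nodemap);
    return rc;
}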
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
If a domain does not have a VCPU affinity, try to pin it automatically to some PCPUs. This is done taking into account the NUMA characteristics of the host. In fact, we look for a combination of host''s NUMA nodes with enough free memory and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes. Once we know which ones, among all the possible combinations, represents valid placement candidates for a domain, use some heuistics for deciding which is the best. For instance, smaller candidates are considered to be better, both from the domain''s point of view (fewer memory spreading among nodes) and from the system as a whole point of view (fewer memoy fragmentation). In case of candidates of equal sizes (i.e., with the same number of nodes), the one with the greater amount of memory wins, as this is also good for keeping memory fragmentation under control. This all happens internally to libxl, and no API for driving the mechanism is provided for now. This matches what xend already does. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Changes from v1: * This patches incorporates the changes from both "libxl, xl: enable automatic placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering NUMA placement candidates" from v1. * The logic of the algorithm is basically the same as in v1, but the splitting of it in the various functions has been completely redesigned from scratch. * No public API for placement or candidate generation is now exposed, everything happens within libxl, as agreed during v1 review. * The relevant documentation have been moved near the actual functions and features. Also, the amount and (hopefully!) the quality of the documentation has been improved a lot, as requested. * All the comments about using the proper libxl facilities and helpers for allocations, etc., have been considered and applied. * This patch still bails out from NUMA optimizations if it find out cpupools are being utilized. It is next patch that makes the two things interact properly, as suggested during review. diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -111,8 +111,8 @@ created online and the remainder will be =item B<cpus="CPU-LIST"> -List of which cpus the guest is allowed to use. Default behavior is -`all cpus`. A C<CPU-LIST> may be specified as follows: +List of which cpus the guest is allowed to use. By default xl will (via +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows: =over 4 @@ -132,6 +132,12 @@ run on cpu #3 of the host. =back +If this option is not specified, libxl automatically tries to place the new +domain on the host''s NUMA nodes (provided the host has more than one NUMA +node) by pinning it to the cpus of those nodes. An heuristical approach is +utilized with the goals of maximizing performance for the domain and, at +the same time, achieving efficient utilization of the host''s CPUs and RAM. + =item B<cpu_weight=WEIGHT> A domain with a weight of 512 will get twice as much CPU as a domain diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -95,6 +95,459 @@ out: return sched; } +/* + * What follows are helpers for generating all the k-combinations + * without repetitions of a set S with n elements in it. 
Formally + * speaking, they are subsets of k distinct elements of S and, if + * S is n elements big, the number of k-combinations is equal to + * the binomial coefficient (n k), which means we get exactly + * n!/(k! * (n - k)!) subsets, all of them with k elements. + * + * The various subset are geneated one after the other by calling + * comb_init() first, and, after that, comb_next() + * (n k)-1 times. An iterator is used to store the curent status + * of the whole generation operation (i.e., basically, the last + * combination that has been generated). As soon as all + * combinations have been generated, comb_next() will + * start returning 0 instead of 1. It is of course impotant that + * the same instance of the iterator and the same values for + * n and k are used for each call. If that doesn''t happen, the + * result is unspecified. + * + * The algorithm is a well known one, and it produces the + * combinations in such a way that they (well, more precisely, + * their indexes it the array/map representing the set) come in + * lexicogaphic order. + * + * For example, with n = 5 and k = 3, calling comb_init() + * will generate { 0, 1, 2 }, while subsequent valid calls to + * comb_next() will produce the following: + * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 }, + * { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 }, + * { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 }, + * { 2, 3, 4 } } + * + * This is used by the automatic NUMA placement logic below. + */ +typedef int* comb_iter_t; + +static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k) +{ + comb_iter_t new_iter; + int i; + + if (n < k) + return 0; + + /* First set is always { 0, 1, 2, ..., k-1 } */ + GCNEW_ARRAY(new_iter, k); + for (i = 0; i < k; i++) + new_iter[i] = i; + + *it = new_iter; + return 1; +} + +static int comb_next(comb_iter_t it, int n, int k) +{ + int i; + + /* + * The idea here is to find the leftmost element from where + * we should start incrementing the indexes of the iterator. + * This means looking for the highest index that can be increased + * while still producing value smaller than n-1. In the example + * above, when dealing with { 0, 1, 4 }, such an element is the + * second one, as the third is already equal to 4 (which actually + * is n-1). + * Once we found from where to start, we increment that element + * and overide the right-hand rest of the iterator with its + * successors, thus achieving lexicographic ordering. + * + * Regarding the termination of the generation process, when we + * manage in bringing n-k at the very first position of the iterator, + * we know that is the last valid combination ( { 2, 3, 4 }, with + * n - k = 5 - 2 = 2, in the example above), and thus we start + * returning 0 as soon as we cross that border. + */ + for (i = k - 1; it[i] == n - k + i; i--) { + if (i <= 0) + return 0; + } + for (it[i]++, i++; i < k; i++) + it[i] = it[i - 1] + 1; + return 1; +} + +/* NUMA automatic placement (see libxl_internal.h for details) */ + +/* + * This function turns a k-combination iterator into a node map. + * This means the bits in the node map corresponding to the indexes + * of the given combination are the ones that will be set. + * For example, if the iterator represents the combination { 0, 2, 4}, + * the node map will have bits #0, #2 and #4 set to one and i thus + * will point to node 0, node 2 and and node 4. 
+ */ +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k) +{ + int i; + + libxl_bitmap_set_none(nodemap); + for (i = 0; i < k; i++) + libxl_bitmap_set(nodemap, it[i]); +} + +/* Retrieve the number of cpus that the nodes that are part of the nodemap + * span. */ +static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus, + const libxl_bitmap *nodemap) +{ + int i, nodes_cpus = 0; + + for (i = 0; i < nr_cpus; i++) { + if (libxl_bitmap_test(nodemap, tinfo[i].node)) + nodes_cpus++; + } + return nodes_cpus; +} + +/* Retrieve the amount of free memory within the nodemap */ +static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo, + libxl_bitmap *nodemap) +{ + uint32_t free_memkb = 0; + int i; + + libxl_for_each_set_bit(i, *nodemap) + free_memkb += ninfo[i].free / 1024; + + return free_memkb; +} + +/* Retrieve the number of domains that can potentially run on the cpus + * the nodes that are part of the nodemap. */ +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo, + const libxl_bitmap *nodemap) +{ + libxl_dominfo *dinfo = NULL; + libxl_bitmap dom_nodemap; + int nr_doms, nr_cpus; + int nr_domains = 0; + int i, j, k; + + dinfo = libxl_list_domain(CTX, &nr_doms); + if (dinfo == NULL) + return ERROR_FAIL; + + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) { + nr_domains = ERROR_FAIL; + goto out; + } + + for (i = 0; i < nr_doms; i++) { + libxl_vcpuinfo *vinfo; + int nr_vcpus; + + vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus); + if (vinfo == NULL) + continue; + + libxl_bitmap_set_none(&dom_nodemap); + for (j = 0; j < nr_vcpus; j++) { + libxl_for_each_set_bit(k, vinfo[j].cpumap) + libxl_bitmap_set(&dom_nodemap, tinfo[k].node); + } + + libxl_for_each_set_bit(j, dom_nodemap) { + if (libxl_bitmap_test(nodemap, j)) { + nr_domains++; + goto found; + } + } + found: + libxl_vcpuinfo_list_free(vinfo, nr_vcpus); + } + + out: + libxl_bitmap_dispose(&dom_nodemap); + libxl_dominfo_list_free(dinfo, nr_doms); + return nr_domains; +} + +/* + * This function tries to figure out if the host has a consistent number + * of cpus along all its NUMA nodes. In fact, if that is the case, we can + * calculate the minimum number of nodes needed for a domain by just + * dividing its total number of vcpus by this value computed here. + * However, we are not allowed to assume that all the nodes have the + * same number of cpus. Therefore, in case discrepancies among different + * nodes are found, this function just returns 0 and the caller needs + * to sort out things in some other way. + */ +static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus, + libxl_numainfo *ninfo, int nr_nodes) +{ + int cpus_per_node = 0; + int j, i; + + /* This makes sense iff # of PCPUs is the same for all nodes */ + for (j = 0; j < nr_nodes; j++) { + int curr_cpus = 0; + + for (i = 0; i < nr_cpus; i++) { + if (tinfo[i].node == j) + curr_cpus++; + } + /* So, if the above does not hold, turn the whole thing off! */ + cpus_per_node = cpus_per_node == 0 ? 
curr_cpus : cpus_per_node; + if (cpus_per_node != curr_cpus) + return 0; + } + return cpus_per_node; +} + +/* Get all the placement candidates satisfying some specific conditions */ +int libxl__get_numa_candidates(libxl__gc *gc, + uint32_t min_free_memkb, int min_cpus, + int min_nodes, int max_nodes, + libxl__numa_candidate *cndts[], int *nr_cndts) +{ + libxl__numa_candidate *new_cndts = NULL; + libxl_cputopology *tinfo = NULL; + libxl_numainfo *ninfo = NULL; + libxl_bitmap nodemap; + int nr_nodes, nr_cpus; + int array_size, rc; + + /* Get platform info and prepare the map for testing the combinations */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL) + return ERROR_FAIL; + /* If we don''t have at least 2 nodes, it is useless to proceed */ + if (nr_nodes < 2) { + rc = 0; + goto out; + } + + tinfo = libxl_get_cpu_topology(CTX, &nr_cpus); + if (tinfo == NULL) { + rc = ERROR_FAIL; + goto out; + } + + rc = libxl_node_bitmap_alloc(CTX, &nodemap); + if (rc) + goto out; + + /* + * Round up and down some of the constraints. For instance, the minimum + * number of cpus a candidate should have must at least be non-negative. + * Regarding the minimum number of NUMA nodes, if not explicitly specified + * (i.e., min_nodes <= 0), we try to figure out a sensible number of nodes + * from where to start generating candidates, if possible (or just start + * from 1 otherwise). The maximum number of nodes should not exceed the + * number of existent NUMA nodes on the host, or the candidate genaration + * won''t work properly. + */ + min_cpus = min_cpus <= 0 ? 0 : min_cpus; + if (min_nodes <= 0) { + int cpus_per_node; + + cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes); + if (cpus_per_node == 0) + min_nodes = 1; + else + min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node; + } + min_nodes = min_nodes > nr_nodes ? nr_nodes : min_nodes; + if (max_nodes <= 0) + max_nodes = nr_nodes; + else + max_nodes = max_nodes > nr_nodes ? nr_nodes : max_nodes; + if (min_nodes > max_nodes) { + rc = ERROR_INVAL; + goto out; + } + + /* Initialize the local storage for the combinations */ + *nr_cndts = 0; + array_size = nr_nodes; + GCNEW_ARRAY(new_cndts, array_size); + + /* Generate all the combinations of any size from min_nodes to + * max_nodes (see comb_init() and comb_next()). */ + while (min_nodes <= max_nodes) { + comb_iter_t comb_iter; + int comb_ok; + + /* + * And here it is. Each step of this cycle generates a combination of + * nodes as big as min_nodes mandates. Each of these combinations is + * checked against the constraints provided by the caller (namely, + * amount of free memory and number of cpus) and it becomes an actual + * placement candidate iff it passes the check. + */ + for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok; + comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) { + uint32_t nodes_free_memkb; + int nodes_cpus; + + comb_get_nodemap(comb_iter, &nodemap, min_nodes); + + /* If there is not enough memoy in this combination, skip it + * and go generating the next one... */ + nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap); + if (min_free_memkb > 0 && nodes_free_memkb < min_free_memkb) + continue; + + /* And the same applies if this combination is short in cpus */ + nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap); + if (min_cpus > 0 && nodes_cpus < min_cpus) + continue; + + /* + * Conditions are met, we can add this combination to the + * NUMA placement candidates list. 
We first make sure there
+ * is enough space in there, and then we initialize the new
+ * candidate element with the node map corresponding to the
+ * combination we are dealing with. Memory allocation for
+ * expanding the array that hosts the list happens in chunks
+ * equal to the number of NUMA nodes in the system (to
+ * avoid allocating memory each and every time we find a
+ * new candidate).
+ */
+ if (*nr_cndts == array_size)
+ array_size += nr_nodes;
+ GCREALLOC_ARRAY(new_cndts, array_size);
+
+ libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
+ libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts],
+ &nodemap);
+ new_cndts[*nr_cndts].nr_domains =
+ nodemap_to_nr_domains(gc, tinfo, &nodemap);
+ new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
+ new_cndts[*nr_cndts].nr_nodes = min_nodes;
+ new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
+
+ LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, "
+ "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts,
+ min_nodes, new_cndts[*nr_cndts].nr_cpus,
+ new_cndts[*nr_cndts].free_memkb / 1024);
+
+ (*nr_cndts)++;
+ }
+ min_nodes++;
+ }
+
+ *cndts = new_cndts;
+ out:
+ libxl_bitmap_dispose(&nodemap);
+ libxl_cputopology_list_free(tinfo, nr_cpus);
+ libxl_numainfo_list_free(ninfo, nr_nodes);
+ return rc;
+}
+
+void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts,
+ libxl__numa_candidate_cmpf cmpf)
+{
+ /* Reorder candidates (see the comparison function for
+ * the details on the heuristics) */
+ qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf);
+}
+
+/*
+ * The NUMA placement candidates are reordered according to the following
+ * heuristics:
+ * - candidates involving fewer nodes come first. In case two (or
+ * more) candidates span the same number of nodes,
+ * - candidates with greater amount of free memory come first. In
+ * case two (or more) candidates differ in their amount of free
+ * memory by less than 10%,
+ * - candidates with fewer domains insisting on them at the time of
+ * this call come first.
+ */
+static int numa_cmpf(const void *v1, const void *v2)
+{
+ const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1;
+ const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2;
+ double mem_diff = labs(c1->free_memkb - c2->free_memkb);
+ double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
+
+ if (c1->nr_nodes != c2->nr_nodes)
+ return c1->nr_nodes - c2->nr_nodes;
+
+ if ((mem_diff / mem_avg) * 100.0 < 10.0 &&
+ c1->nr_domains != c2->nr_domains)
+ return c1->nr_domains - c2->nr_domains;
+
+ return c2->free_memkb - c1->free_memkb;
+}
+
+/* The actual automatic NUMA placement routine */
+static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
+{
+ int nr_candidates = 0;
+ libxl__numa_candidate *candidates = NULL;
+ libxl_bitmap candidate_nodemap;
+ libxl_cpupoolinfo *pinfo;
+ int nr_pools, rc = 0;
+ uint32_t memkb;
+
+ /* First of all, if cpupools are in use, better not to mess with them */
+ pinfo = libxl_list_cpupool(CTX, &nr_pools);
+ if (!pinfo)
+ return ERROR_FAIL;
+ if (nr_pools > 1) {
+ LOG(NOTICE, "skipping NUMA placement as cpupools are in use");
+ goto out;
+ }
+
+ rc = libxl_domain_need_memory(CTX, info, &memkb);
+ if (rc)
+ goto out;
+ if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) {
+ rc = ERROR_FAIL;
+ goto out;
+ }
+
+ /* Find all the candidates with enough free memory and at least
+ * as much pcpus as the domain has vcpus.
*/ + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, + &candidates, &nr_candidates); + if (rc) + goto out; + + LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates); + + /* No suitable placement candidates. We just return without touching the + * domain''s info->cpumap. It will have affinity with all nodes/cpus. */ + if (nr_candidates == 0) + goto out; + + /* Bring the best candidate in front of the list --> candidates[0] */ + if (nr_candidates > 1) + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf); + + /* + * At this point, the first candidate in the array is the one we want. + * Go for it by mapping its node map to the domain''s info->cpumap. + */ + libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap); + rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap); + if (rc) + goto out; + + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " + "%"PRIu32" KB free selected", candidates[0].nr_nodes, + candidates[0].nr_cpus, candidates[0].free_memkb / 1024); + + out: + libxl_bitmap_dispose(&candidate_nodemap); + libxl_cpupoolinfo_list_free(pinfo, nr_pools); + return rc; +} + int libxl__build_pre(libxl__gc *gc, uint32_t domid, libxl_domain_build_info *info, libxl__domain_build_state *state) { @@ -104,7 +557,22 @@ int libxl__build_pre(libxl__gc *gc, uint uint32_t rtc_timeoffset; xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); + + /* + * Check if the domain has any CPU affinity. If not, try to build up one. + * In case numa_place_domain() find at least a suitable candidate, it will + * affect info->cpumap accordingly; if it does not, it just leaves it + * as it is. This means (unless some weird error manifests) the subsequent + * call to libxl_set_vcpuaffinity_all() will do the actual placement, + * whatever that turns out to be. + */ + if (libxl_bitmap_is_full(&info->cpumap)) { + int rc = numa_place_domain(gc, info); + if (rc) + return rc; + } libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); + xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); if (info->type == LIBXL_DOMAIN_TYPE_PV) xc_domain_set_memmap_limit(ctx->xch, domid, diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib #define CTX_LOCK (libxl__ctx_lock(CTX)) #define CTX_UNLOCK (libxl__ctx_unlock(CTX)) +/* + * Automatic NUMA placement + * + * These functions and data structures deal with the initial placement of a + * domain onto the host NUMA nodes. + * + * The key concept here is the one of "numa placement candidate" (or jusr "numa + * candidate", "placement candidate" or even "candidate"), which basically is a + * set of nodes whose characteristics have been successfully checked against + * some specific requirements. More pecisely, a candidate is the nodemap + * associated with one of the possible subset of the host NUMA nodes providing + * a certain amount of free memory, or a given number of cpus, or even both + * (depending in what the caller wants). For convenience of use, some of this + * information are stored within the candidate itselfs, instead of always being + * dynamically computed. A candidate can consist of just one, up to all the + * available nodes on the host (the latter being, of course, not ideal). 
For
+ * instance, looking for numa candidates with 2GB of free memory means we want
+ * all the possible subsets of the host NUMA nodes with, cumulatively, at least
+ * 2GB of free memory. That could be possible by just using one particular
+ * node, or may require more nodes, depending on the characteristics of the
+ * host, on how many domains have been created already, on how big they are,
+ * etc.
+ *
+ * The intended usage is as follows:
+ * 1. by, first of all, calling libxl__get_numa_candidates(), and specifying
+ * the proper constraints to it (e.g., the amount of memory a domain needs
+ * as the minimum amount of free memory for the candidates) one can build
+ * up a whole set of suitable placement alternatives for a domain;
+ * 2. after that, one specific candidate should be chosen. That can happen
+ * by looking at their various characteristics;
+ * 3. the chosen candidate''s nodemap should be utilized for computing the
+ * actual affinity of the domain which, given the current NUMA support
+ * in the hypervisor, is what determines the placement of the domain''s
+ * vcpus and memory.
+ *
+ * To make phase 2 even easier, a sorting helper function for the list of
+ * candidates is provided in the form of libxl__sort_numa_candidates(). The only
+ * thing that is needed is defining a comparison function, containing the criteria
+ * for deciding, given two candidates, which one is ''better''. Depending on how
+ * the comparison function is defined, the best candidate (where, of course,
+ * best is defined with respect to the heuristics implemented in the comparison
+ * function itself, libxl__numa_candidate_cmpf()) could become the first or the
+ * last element of the list.
+ *
+ * Summarizing, achieving automatic NUMA placement is just a matter of
+ * obtaining the list of suitable placement candidates, perhaps asking for each
+ * of them to provide at least the amount of memory the domain needs. After
+ * that just implement a comparison function by means of the various helpers
+ * retrieving the relevant information about the candidates themselves.
+ * Finally, call the sorting helper function and use the candidate that became
+ * (typically) the first element of the list for determining the domain''s
+ * affinity.
+ */
+
+typedef struct {
+ int nr_cpus, nr_nodes;
+ int nr_domains;
+ uint32_t free_memkb;
+ libxl_bitmap nodemap;
+} libxl__numa_candidate;
+
+/*
+ * This generates the list of NUMA placement candidates satisfying some
+ * specific conditions. If min_nodes and/or max_nodes are not 0, their value is
+ * used to determine the minimum and maximum number of nodes that are allowed to
+ * be present in each candidate. If min_nodes and/or max_nodes are 0, the
+ * minimum and maximum number of nodes to be used are automatically selected by
+ * the implementation (and that will likely be just 1 node for the minimum and
+ * the total number of existent nodes for the maximum). Regarding min_free_memkb and
+ * min_cpus, if not 0, they specify the caller only wants candidates with at
+ * least that amount of free memory and that number of cpus, respectively. If
+ * min_free_memkb and/or min_cpus are 0, the candidates'' free memory and number
+ * of cpus won''t be checked at all, which means a candidate will always be
+ * considered suitable wrt the specific constraint. cndts is where the list of
+ * exactly nr_cndts candidates is returned. Note that, in case no candidates
+ * are found at all, the function returns successfully, but with nr_cndts equal
+ * to zero.
+ */ +_hidden int libxl__get_numa_candidates(libxl__gc *gc, + uint32_t min_free_memkb, int min_cpus, + int min_nodes, int max_nodes, + libxl__numa_candidate *cndts[], int *nr_cndts); + +/* allocation and deallocation for placement candidates */ +static inline int libxl__numa_candidate_alloc(libxl__gc *gc, + libxl__numa_candidate *cndt) +{ + return libxl_node_bitmap_alloc(CTX, &cndt->nodemap); +} +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt) +{ + libxl_bitmap_dispose(&cndt->nodemap); +} +static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts, + int nr_cndts) +{ + int i; + + for (i = 0; i < nr_cndts; i++) + libxl__numa_candidate_dispose(&cndts[i]); + free(cndts); +} + +/* retrieve (in nodemap) the node map associated to placement candidate cndt */ +static inline +void libxl__numa_candidate_get_nodemap(libxl__gc *gc, + const libxl__numa_candidate *cndt, + libxl_bitmap *nodemap) +{ + libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap); +} +/* set the node map of placement candidate cndt to match nodemap */ +static inline +void libxl__numa_candidate_put_nodemap(libxl__gc *gc, + libxl__numa_candidate *cndt, + const libxl_bitmap *nodemap) +{ + libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); +} + +/* signature for the comparison function between two candidates c1 and c2 + * (the thid parameter is provided to enable thread safety). */ +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2); +/* sort the list of candidates in cndts (an array with nr_cndts elements in + * it) using cmpf for comparing two candidates. Uses libc''s qsort(). */ +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], + int nr_cndts, + libxl__numa_candidate_cmpf cmpf); /* * Inserts "elm_new" into the sorted list "head".
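For readers who want to see the generator above in isolation, here is a minimal standalone sketch of the same comb_init()/comb_next() pair, with libxl''s GCNEW_ARRAY() swapped for plain malloc() so it builds outside the tree. It prints the ten 3-combinations of { 0, ..., 4 } in the lexicographic order described in the patch.

/* Standalone demo of the iterator above; GCNEW_ARRAY() is replaced by
 * plain malloc() so this compiles on its own. */
#include <stdio.h>
#include <stdlib.h>

typedef int *comb_iter_t;

static int comb_init(comb_iter_t *it, int n, int k)
{
    int i;

    if (n < k)
        return 0;
    *it = malloc(k * sizeof(**it));
    if (*it == NULL)
        return 0;
    /* first combination is always { 0, 1, ..., k-1 } */
    for (i = 0; i < k; i++)
        (*it)[i] = i;
    return 1;
}

static int comb_next(comb_iter_t it, int n, int k)
{
    int i;

    /* rightmost index that can still grow without exceeding n-1 */
    for (i = k - 1; it[i] == n - k + i; i--)
        if (i <= 0)
            return 0;
    /* grow it and rewrite everything to its right with its successors */
    for (it[i]++, i++; i < k; i++)
        it[i] = it[i - 1] + 1;
    return 1;
}

int main(void)
{
    comb_iter_t it = NULL;
    int n = 5, k = 3, i, ok;

    /* prints the ten 3-combinations of { 0, ..., 4 } in lexicographic order */
    for (ok = comb_init(&it, n, k); ok; ok = comb_next(it, n, k)) {
        for (i = 0; i < k; i++)
            printf("%d ", it[i]);
        printf("\n");
    }
    free(it);
    return 0;
}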
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools
In such a way that only the cpus belonging to the cpupool of the domain being placed are considered for the placement itself. This happens by filtering out all the nodes in which the cpupool has not any cpu from the placement candidates. After that -- as a cpu pooling not necessarily happens at NUMA nodes boundaries -- we also make sure only the actual cpus that are part of the pool are considered when counting how much processors a placement candidate is able to provide. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -198,15 +198,27 @@ static void comb_get_nodemap(comb_iter_t libxl_bitmap_set(nodemap, it[i]); } +/* Retrieve how many nodes a nodemap spans. */ +static int nodemap_to_nr_nodes(const libxl_bitmap *nodemap) +{ + int i, nr_nodes = 0; + + libxl_for_each_set_bit(i, *nodemap) + nr_nodes++; + return nr_nodes; +} + /* Retrieve the number of cpus that the nodes that are part of the nodemap - * span. */ + * span and that are also set in suitable_cpumap. */ static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus, + const libxl_bitmap *suitable_cpumap, const libxl_bitmap *nodemap) { int i, nodes_cpus = 0; for (i = 0; i < nr_cpus; i++) { - if (libxl_bitmap_test(nodemap, tinfo[i].node)) + if (libxl_bitmap_test(suitable_cpumap, i) && + libxl_bitmap_test(nodemap, tinfo[i].node)) nodes_cpus++; } return nodes_cpus; @@ -311,12 +323,13 @@ static int cpus_per_node_count(libxl_cpu int libxl__get_numa_candidates(libxl__gc *gc, uint32_t min_free_memkb, int min_cpus, int min_nodes, int max_nodes, + const libxl_bitmap *suitable_cpumap, libxl__numa_candidate *cndts[], int *nr_cndts) { libxl__numa_candidate *new_cndts = NULL; libxl_cputopology *tinfo = NULL; libxl_numainfo *ninfo = NULL; - libxl_bitmap nodemap; + libxl_bitmap suitable_nodemap, nodemap; int nr_nodes, nr_cpus; int array_size, rc; @@ -340,6 +353,15 @@ int libxl__get_numa_candidates(libxl__gc if (rc) goto out; + /* Allocate and prepare the map of the node that can be utilized for + * placement, basing on the map of suitable cpus. */ + rc = libxl_node_bitmap_alloc(CTX, &suitable_nodemap); + if (rc) + goto out; + rc = libxl_cpumap_to_nodemap(CTX, suitable_cpumap, &suitable_nodemap); + if (rc) + goto out; + /* * Round up and down some of the constraints. For instance, the minimum * number of cpus a candidate should have must at least be non-negative. @@ -391,9 +413,14 @@ int libxl__get_numa_candidates(libxl__gc for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok; comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) { uint32_t nodes_free_memkb; - int nodes_cpus; + int i, nodes_cpus; + /* Get the nodemap for the combination and filter unwnted nodes */ comb_get_nodemap(comb_iter, &nodemap, min_nodes); + libxl_for_each_set_bit(i, nodemap) { + if (!libxl_bitmap_test(&suitable_nodemap, i)) + libxl_bitmap_reset(&nodemap, i); + } /* If there is not enough memoy in this combination, skip it * and go generating the next one... 
*/ @@ -402,7 +429,8 @@ int libxl__get_numa_candidates(libxl__gc continue; /* And the same applies if this combination is short in cpus */ - nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap); + nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, suitable_cpumap, + &nodemap); if (min_cpus > 0 && nodes_cpus < min_cpus) continue; @@ -427,12 +455,13 @@ int libxl__get_numa_candidates(libxl__gc new_cndts[*nr_cndts].nr_domains nodemap_to_nr_domains(gc, tinfo, &nodemap); new_cndts[*nr_cndts].free_memkb = nodes_free_memkb; - new_cndts[*nr_cndts].nr_nodes = min_nodes; + new_cndts[*nr_cndts].nr_nodes = nodemap_to_nr_nodes(&nodemap); new_cndts[*nr_cndts].nr_cpus = nodes_cpus; LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, " "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts, - min_nodes, new_cndts[*nr_cndts].nr_cpus, + new_cndts[*nr_cndts].nr_nodes, + new_cndts[*nr_cndts].nr_cpus, new_cndts[*nr_cndts].free_memkb / 1024); (*nr_cndts)++; @@ -442,6 +471,7 @@ int libxl__get_numa_candidates(libxl__gc *cndts = new_cndts; out: + libxl_bitmap_dispose(&suitable_nodemap); libxl_bitmap_dispose(&nodemap); libxl_cputopology_list_free(tinfo, nr_cpus); libxl_numainfo_list_free(ninfo, nr_nodes); @@ -485,23 +515,27 @@ static int numa_cmpf(const void *v1, con } /* The actual automatic NUMA placement routine */ -static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) +static int numa_place_domain(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info) { int nr_candidates = 0; libxl__numa_candidate *candidates = NULL; libxl_bitmap candidate_nodemap; - libxl_cpupoolinfo *pinfo; - int nr_pools, rc = 0; + libxl_cpupoolinfo cpupool_info; + int i, cpupool, rc = 0; uint32_t memkb; - /* First of all, if cpupools are in use, better not to mess with them */ - pinfo = libxl_list_cpupool(CTX, &nr_pools); - if (!pinfo) - return ERROR_FAIL; - if (nr_pools > 1) { - LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); - goto out; - } + /* + * Extract the cpumap from the cpupool the domain belong to. In fact, + * it only makes sense to consider the cpus/nodes that are in there + * for placement. + */ + rc = cpupool = libxl__domain_cpupool(gc, domid); + if (rc < 0) + return rc; + rc = libxl_cpupool_info(CTX, &cpupool_info, cpupool); + if (rc) + return rc; rc = libxl_domain_need_memory(CTX, info, &memkb); if (rc) @@ -513,7 +547,8 @@ static int numa_place_domain(libxl__gc * /* Find all the candidates with enough free memory and at least * as much pcpus as the domain has vcpus. */ - rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, + 0, 0, &cpupool_info.cpumap, &candidates, &nr_candidates); if (rc) goto out; @@ -538,13 +573,20 @@ static int numa_place_domain(libxl__gc * if (rc) goto out; + /* Avoid trying to set the affinity to cpus that might be in the + * nodemap but not in our cpupool. */ + libxl_for_each_set_bit(i, info->cpumap) { + if (!libxl_bitmap_test(&cpupool_info.cpumap, i)) + libxl_bitmap_reset(&info->cpumap, i); + } + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " "%"PRIu32" KB free selected", candidates[0].nr_nodes, candidates[0].nr_cpus, candidates[0].free_memkb / 1024); out: libxl_bitmap_dispose(&candidate_nodemap); - libxl_cpupoolinfo_list_free(pinfo, nr_pools); + libxl_cpupoolinfo_dispose(&cpupool_info); return rc; } @@ -567,7 +609,7 @@ int libxl__build_pre(libxl__gc *gc, uint * whatever that turns out to be. 
*/ if (libxl_bitmap_is_full(&info->cpumap)) { - int rc = numa_place_domain(gc, info); + int rc = numa_place_domain(gc, domid, info); if (rc) return rc; } diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2094,14 +2094,17 @@ typedef struct { * least that amount of free memory and that number of cpus, respectively. If * min_free_memkb and/or min_cpus are 0, the candidates'' free memory and number * of cpus won''t be checked at all, which means a candidate will always be - * considered suitable wrt the specific constraint. cndts is where the list of - * exactly nr_cndts candidates is returned. Note that, in case no candidates - * are found at all, the function returns successfully, but with nr_cndts equal - * to zero. + * considered suitable wrt the specific constraint. suitable_cpumap is useful + * for specifyin we want only the cpus in that mask to be considered while + * generating placement candidates (for example because of cpupools). cndts is + * where the list of exactly nr_cndts candidates is returned. Note that, in + * case no candidates are found at all, the function returns successfully, but + * with nr_cndts equal to zero. */ _hidden int libxl__get_numa_candidates(libxl__gc *gc, uint32_t min_free_memkb, int min_cpus, int min_nodes, int max_nodes, + const libxl_bitmap *suitable_cpumap, libxl__numa_candidate *cndts[], int *nr_cndts); /* allocation and deallocation for placement candidates */
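The two filtering steps this patch adds (turning the cpupool cpu mask into a mask of usable nodes, and counting only pool cpus when sizing a candidate) can be illustrated with a small standalone sketch. The topology and pool layout here are made up (2 nodes, 8 cpus, the pool owning all of node 0 but only half of node 1), and plain int arrays stand in for libxl_bitmap.

/* Standalone sketch with a made-up host: 2 nodes, 8 cpus, the pool owns
 * cpus 0-5, i.e. all of node 0 but only half of node 1. */
#include <stdio.h>

#define NR_CPUS  8
#define NR_NODES 2

static const int cpu_to_node[NR_CPUS]     = { 0, 0, 0, 0, 1, 1, 1, 1 };
static const int suitable_cpumap[NR_CPUS] = { 1, 1, 1, 1, 1, 1, 0, 0 };

/* which nodes have at least one cpu in the pool */
static void cpumap_to_nodemap(const int *cpumap, int *nodemap)
{
    int i;

    for (i = 0; i < NR_NODES; i++)
        nodemap[i] = 0;
    for (i = 0; i < NR_CPUS; i++)
        if (cpumap[i])
            nodemap[cpu_to_node[i]] = 1;
}

/* cpus of the candidate that are also in the pool */
static int candidate_cpus(const int *nodemap, const int *cpumap)
{
    int i, cpus = 0;

    for (i = 0; i < NR_CPUS; i++)
        if (cpumap[i] && nodemap[cpu_to_node[i]])
            cpus++;
    return cpus;
}

int main(void)
{
    int suitable_nodemap[NR_NODES];
    int candidate[NR_NODES] = { 0, 1 };   /* candidate spanning node 1 only */
    int i;

    cpumap_to_nodemap(suitable_cpumap, suitable_nodemap);

    /* step 1: drop candidate nodes in which the pool has no cpus at all */
    for (i = 0; i < NR_NODES; i++)
        candidate[i] &= suitable_nodemap[i];

    /* step 2: only 2 of the 4 cpus of node 1 count towards min_cpus */
    printf("usable cpus in candidate: %d\n",
           candidate_cpus(candidate, suitable_cpumap));
    return 0;
}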
Dario Faggioli
2012-Jun-15 17:04 UTC
[PATCH 10 of 10 v2] Some automatic NUMA placement documentation
About rationale, usage and (some small bits of) API.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * API documentation moved close to the actual functions.

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdown
@@ -0,0 +1,74 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+The Xen hypervisor deals with Non-Uniform Memory Access (NUMA)
+machines by assigning to each domain a "node affinity", i.e., a set of NUMA
+nodes of the host from which it gets its memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start running
+memory-intensive workloads on a shared host. In fact, the cost of accessing
+non node-local memory locations is very high, and the performance degradation
+is likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask
+for either manual or automatic placement of them across the host''s NUMA
+nodes.
+
+Note that xm/xend does the very same thing, the only differences residing
+in the details of the heuristics adopted for the placement (see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify, directly in the
+config file, which cpus a domain should be created and scheduled on. This
+affects NUMA placement and memory accesses, as the hypervisor constructs
+the node affinity of a VM based directly on its CPU affinity when it is
+created.
+
+This is very simple and effective, but requires the user/system
+administrator to explicitly specify affinities for each and every domain,
+or Xen won''t be able to guarantee the locality of their memory
+accesses.
+
+### Automatic Guest Placement with xl ###
+
+In case no "cpus=" option is specified in the config file, xl tries
+to figure out on its own on which node(s) the domain could fit best.
+
+First of all, it needs to find a node (or a set of nodes) that have
+enough free memory for accommodating the domain. After that, the actual
+decision on where to put the new guest happens by generating all the
+possible combinations of nodes that satisfies the above and chose among
+them according to the following heuristics:
+
+ * candidates involving fewer nodes come first. In case two (or more)
+ candidates span the same number of nodes,
+ * candidates with greater amount of free memory come first. In case
+ two (or more) candidates differ in their amount of free memory by
+ less than 10%,
+ * candidates with fewer domains already placed on them come first.
+
+Giving preference to small candidates ensures better performance for
+the guest, as it avoid spreading its memory among different nodes.
+Using the nodes that have the biggest amounts of free memory helps
+keeping the memory fragmentation small, from a system wide perspective.
+Finally, in case more candidates fulfil these criteria by the same
+extent, choosing the candidate that is hosting fewer domain helps
+balancing the load on the various nodes.
+
+The last step is figuring out whether the selected candidate contains
+at least as much CPUs as the number of VCPUs of the VM. The current
+solution for the case when this is not verified is just to add some
+more nodes, until the condition turns into being true. When doing
+this, the nodes with the least possible distance from the ones
+already in the nodemap are considered.
+ +## Guest Placement within libxl ## + +xl achieves automatic NUMA placement by means of the following API +calls, provided by libxl. + +
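The list of API calls is cut off above; for orientation, the flow those calls implement, as described by the libxl_internal.h comments added in patch 8, is sketched below. This is not a function from the series: the helper name is made up, error paths and logging are trimmed, and the patch 8 signature of libxl__get_numa_candidates() is used (patch 9 adds a suitable_cpumap argument for cpupools).

/* Sketch only: the three placement steps stitched together from the
 * internal helpers introduced by patch 8; cmpf is whatever ordering
 * policy the caller wants. */
static int numa_place_sketch(libxl__gc *gc, uint32_t min_free_memkb,
                             int min_cpus, libxl__numa_candidate_cmpf cmpf,
                             libxl_bitmap *cpumap)
{
    libxl__numa_candidate *cndts;
    libxl_bitmap nodemap;
    int nr_cndts, rc;

    /* 1. generate every candidate with enough memory and cpus */
    rc = libxl__get_numa_candidates(gc, min_free_memkb, min_cpus, 0, 0,
                                    &cndts, &nr_cndts);
    if (rc || nr_cndts == 0)
        return rc;

    /* 2. sort, so that the preferred candidate ends up in cndts[0] */
    libxl__sort_numa_candidates(cndts, nr_cndts, cmpf);

    /* 3. map the winning nodemap onto a cpumap usable for vcpu pinning */
    rc = libxl_node_bitmap_alloc(CTX, &nodemap);
    if (rc)
        return rc;
    libxl__numa_candidate_get_nodemap(gc, &cndts[0], &nodemap);
    rc = libxl_nodemap_to_cpumap(CTX, &nodemap, cpumap);
    libxl_bitmap_dispose(&nodemap);
    return rc;
}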
Dario Faggioli
2012-Jun-18 15:54 UTC
Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation
On Fri, 2012-06-15 at 19:04 +0200, Dario Faggioli wrote:> +### Automatic Guest Placement with xl ### > + > +In case no "cpus=" option is specified in the config file, xl tries > +to figure out on its own on which node(s) the domain could fit best. > + > +First of all, it needs to find a node (or a set of nodes) that have > +enough free memory for accommodating the domain. After that, the actual > +decision on where to put the new guest happens by generating all the > +possible combinations of nodes that satisfies the above and chose among > +them according to the following heuristics: > + > + * candidates involving fewer nodes come first. In case two (or more) > + candidates span the same number of nodes, > + * candidates with greater amount of free memory come first. In case > + two (or more) candidates differ in their amount of free memory by > + less than 10%, > + * candidates with fewer domains already placed on them come first. > + > +Giving preference to small candidates ensures better performance for > +the guest, as it avoid spreading its memory among different nodes. > +Using the nodes that have the biggest amounts of free memory helps > +keeping the memory fragmentation small, from a system wide perspective. > +Finally, in case more candidates fulfil these criteria by the same > +extent, choosing the candidate that is hosting fewer domain helps > +balancing the load on the various nodes. > +This part below is basically a leftover from previous version. Things does not work like that any longer. Basically, what we do is looking for all the nodes, or a sets of nodes, that have enough free memory and at least as much pCPUs as the domain has vCPUs and then apply the heuristics outlined above.> +The last step is figuring out whether the selected candidate contains > +at least as much CPUs as the number of VCPUs of the VM. The current > +solution for the case when this is not verified is just to add some > +more nodes, until the condition turns into being true. When doing > +this, the nodes with the least possible distance from the ones > +already in the nodemap are considered. > +Sorry, (already fixed in my local patchqueue) Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
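A worked example of the starting point described above, with made-up numbers: on a host with 4 nodes and 8 pcpus per node, a 12-vcpu guest makes the generator begin from 2-node combinations, per the rounding done in libxl__get_numa_candidates().

/* Worked example, made-up host: 4 nodes, 8 pcpus each, 12-vcpu guest. */
#include <stdio.h>

int main(void)
{
    int cpus_per_node = 8, nr_vcpus = 12;
    int min_nodes = (nr_vcpus + cpus_per_node - 1) / cpus_per_node;

    /* prints 2: candidate generation starts from 2-node combinations and
     * only grows to 3 and 4 nodes if the memory/cpu checks keep failing */
    printf("start from %d node(s)\n", min_nodes);
    return 0;
}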
Ian Campbell
2012-Jun-21 08:53 UTC
Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> Causing a build failure when trying to use it: > > xxx: error: expected '';'' before '')'' token > xxx: error: expected statement before '')'' token > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>Acked-by: Ian Campbell <ian.campbell@citrix.com>> > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -1972,7 +1972,7 @@ struct libxl__domain_create_state { > #define GCREALLOC_ARRAY(var, nmemb) \ > (assert(nmemb > 0), \ > assert(ARRAY_SIZE_OK((var), (nmemb))), \ > - (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var))))) > + (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var)))) > > > /*
Ian Campbell
2012-Jun-21 09:02 UTC
Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> Make some NUMA node information available to the toolstack. Achieve > this by means of xc_numainfo(), which exposes memory size and amount > of free memory of each node, as well as the relative distances of > each node to all the others. > > For properly exposing distances we need the IDL to support arrays. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > Changes from v1: > * malloc converted to libxl__zalloc(NOGC, ...). > * The patch also accommodates some bits of what was in "libxc, > libxl: introduce xc_nodemap_t and libxl_nodemap" which was > removed as well, as full support for node maps at libxc > level is not needed (yet!). > > diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c > --- a/tools/libxc/xc_misc.c > +++ b/tools/libxc/xc_misc.c > @@ -35,6 +35,20 @@ int xc_get_max_cpus(xc_interface *xch) > return max_cpus; > } > > +int xc_get_max_nodes(xc_interface *xch) > +{ > + static int max_nodes = 0; > + xc_physinfo_t physinfo; > + > + if ( max_nodes ) > + return max_nodes; > + > + if ( !xc_physinfo(xch, &physinfo) ) > + max_nodes = physinfo.max_node_id + 1; > + > + return max_nodes; > +} > + > int xc_get_cpumap_size(xc_interface *xch) > { > return (xc_get_max_cpus(xch) + 7) / 8; > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h > --- a/tools/libxc/xenctrl.h > +++ b/tools/libxc/xenctrl.h > @@ -329,6 +329,12 @@ int xc_get_cpumap_size(xc_interface *xch > /* allocate a cpumap */ > xc_cpumap_t xc_cpumap_alloc(xc_interface *xch); > > + /* > + * NODEMAP handling > + */ > +/* return maximum number of NUMA nodes the hypervisor supports */ > +int xc_get_max_nodes(xc_interface *xch); > + > /* > * DOMAIN DEBUGGING FUNCTIONS > */ > diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c > --- a/tools/libxl/libxl.c > +++ b/tools/libxl/libxl.c > @@ -3223,6 +3223,71 @@ fail: > return ret; > } > > +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr) > +{[...] The hypercall buffer stuff all looks good.> + if (ret) > + *nr = max_nodes;You could put this before the fail: label. Not that it matters.> + return ret; > +} > + > const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx) > { > union { > diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h > --- a/tools/libxl/libxl.h > +++ b/tools/libxl/libxl.h > @@ -532,6 +532,9 @@ int libxl_domain_preserve(libxl_ctx *ctx > /* get max. number of cpus supported by hypervisor */ > int libxl_get_max_cpus(libxl_ctx *ctx); > > +/* get max. number of NUMA nodes supported by hypervisor */ > +int libxl_get_max_nodes(libxl_ctx *ctx); > + > int libxl_domain_rename(libxl_ctx *ctx, uint32_t domid, > const char *old_name, const char *new_name); > > @@ -769,6 +772,13 @@ int libxl_get_physinfo(libxl_ctx *ctx, l > #define LIBXL_CPUTOPOLOGY_INVALID_ENTRY (~(uint32_t)0) > libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nr); > void libxl_cputopology_list_free(libxl_cputopology *, int nr); > +#define LIBXL_NUMAINFO_INVALID_ENTRY (~(uint32_t)0) > +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr); > + /* On success, a list of nr libxl_numainfo elements is returned. > + * That is from malloc, thus it is up to the caller to invoke > + * libxl_cpupoolinfo_list_free() on it.Don''t you mean libxl_numinfo_list_free() ? 
Also normally we put the comment before the prototype.> + */ > +void libxl_numainfo_list_free(libxl_numainfo *, int nr); > libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, > int *nb_vcpu, int *nrcpus); > void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr); > diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl > --- a/tools/libxl/libxl_types.idl > +++ b/tools/libxl/libxl_types.idl > @@ -423,6 +423,12 @@ libxl_physinfo = Struct("physinfo", [ > ("cap_hvm_directio", bool), > ], dir=DIR_OUT) > > +libxl_numainfo = Struct("numainfo", [ > + ("size", uint64), > + ("free", uint64), > + ("dists", Array(uint32, "num_dists")),This is an array of distances from this node to each other node? That could be worth a comment.> + ], dir=DIR_OUT) > + > libxl_cputopology = Struct("cputopology", [ > ("core", uint32), > ("socket", uint32), > diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c > --- a/tools/libxl/libxl_utils.c > +++ b/tools/libxl/libxl_utils.c > @@ -537,6 +537,11 @@ int libxl_get_max_cpus(libxl_ctx *ctx) > return xc_get_max_cpus(ctx->xch); > } > > +int libxl_get_max_nodes(libxl_ctx *ctx) > +{ > + return xc_get_max_nodes(ctx->xch); > +}Is this needed externally to libxl or do we expect all callers to use libxl_get_numainfo? I suppose there is no harm in exporting this either way.> + > int libxl__enum_from_string(const libxl_enum_string_table *t, > const char *s, int *e) > { > @@ -551,6 +556,14 @@ int libxl__enum_from_string(const libxl_ > return ERROR_FAIL; > } > > +void libxl_numainfo_list_free(libxl_numainfo *list, int nr) > +{ > + int i; > + for (i = 0; i < nr; i++) > + libxl_numainfo_dispose(&list[i]); > + free(list); > +} > + > void libxl_cputopology_list_free(libxl_cputopology *list, int nr) > { > int i; > diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h > --- a/xen/include/public/sysctl.h > +++ b/xen/include/public/sysctl.h > @@ -484,6 +484,7 @@ typedef struct xen_sysctl_topologyinfo x > DEFINE_XEN_GUEST_HANDLE(xen_sysctl_topologyinfo_t); > > /* XEN_SYSCTL_numainfo */ > +#define INVALID_NUMAINFO_ID (~0U)It feels like there ought to be hunks in the hypervisor which either use this symbol instead of a hardcoded ~0U or which remove the internal definition in favour of this one?
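Assuming the answer to the question above is yes (dists[j] being the distance from node i to node j, with num_dists entries per node), a consumer of libxl_get_numainfo() could read it as in the sketch below, skipping nodes marked LIBXL_NUMAINFO_INVALID_ENTRY. This is essentially what the xl patch reviewed next does; it assumes libxl.h, stdio.h and inttypes.h are already included, as they are in xl.

/* Hypothetical libxl_get_numainfo() consumer, for illustration only. */
static void dump_numa_distances(libxl_ctx *ctx)
{
    libxl_numainfo *info;
    int i, j, nr;

    info = libxl_get_numainfo(ctx, &nr);
    if (info == NULL)
        return;

    for (i = 0; i < nr; i++) {
        /* offline/unknown nodes are reported as invalid, skip them */
        if (info[i].size == LIBXL_NUMAINFO_INVALID_ENTRY)
            continue;
        printf("node %d: free %"PRIu64" MB, dists:", i, info[i].free >> 20);
        for (j = 0; j < info[i].num_dists; j++)
            printf(" %"PRIu32, info[i].dists[j]);
        printf("\n");
    }
    libxl_numainfo_list_free(info, nr);
}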
Ian Campbell
2012-Jun-21 09:04 UTC
Re: [PATCH 04 of 10 v2] xl: add more NUMA information to `xl info -n''
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> So that the user knows how much memory there is on each node and > how far they are from each others. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>Acked-by: Ian Campbell <ian.campbell@citrix.com>> > Changes from v1: > * integer division replaced by right shift. > > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -4254,6 +4254,36 @@ static void output_physinfo(void) > return; > } > > +static void output_numainfo(void) > +{ > + libxl_numainfo *info; > + int i, j, nr; > + > + info = libxl_get_numainfo(ctx, &nr); > + if (info == NULL) { > + fprintf(stderr, "libxl_get_numainfo failed.\n"); > + return; > + } > + > + printf("numa_info :\n"); > + printf("node: memsize memfree distances\n"); > + > + for (i = 0; i < nr; i++) { > + if (info[i].size != LIBXL_NUMAINFO_INVALID_ENTRY) { > + printf("%4d: %6"PRIu64" %6"PRIu64" %d", i, > + info[i].size >> 20, info[i].free >> 20, > + info[i].dists[0]); > + for (j = 1; j < info[i].num_dists; j++) > + printf(",%d", info[i].dists[j]); > + printf("\n"); > + } > + } > + > + libxl_numainfo_list_free(info, nr); > + > + return; > +} > + > static void output_topologyinfo(void) > { > libxl_cputopology *info; > @@ -4276,8 +4306,6 @@ static void output_topologyinfo(void) > > libxl_cputopology_list_free(info, nr); > > - printf("numa_info : none\n"); > - > return; > } > > @@ -4287,8 +4315,10 @@ static void info(int numa) > > output_physinfo(); > > - if (numa) > + if (numa) { > output_topologyinfo(); > + output_numainfo(); > + } > > output_xeninfo(); >
Ian Campbell
2012-Jun-21 09:12 UTC
Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> And leave to the caller the burden of knowing and remembering what kind > of bitmap each instance of libxl_bitmap is. > > This is basically just some s/libxl_cpumap/libxl_bitmap/ (and some other > related interface name substitution, e.g., libxl_for_each_cpu) in a bunch > of files, with no real functional change involved. > > A specific allocation helper is introduced, besides libxl_bitmap_alloc(). > It is called libxl_cpu_bitmap_alloc() and is meant at substituting the old > libxl_cpumap_alloc(). It is just something easier to use in cases where one > wants to allocate a libxl_bitmap that is going to serve as a cpu map. > > This is because we want to be able to deal with both cpu and NUMA node > maps, but we don''t want to duplicate all the various helpers and wrappers.FWIW I''d have been perfectly happy with a bunch of #define for_each_cpu(mumble) for_each_bit(mumble) type stuff, but I think Ian J and yourself didn''t like those which I''m also fine with.> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.eu.com> > diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c > --- a/tools/libxl/libxl_json.c > +++ b/tools/libxl/libxl_json.c > @@ -99,8 +99,8 @@ yajl_gen_status libxl_uuid_gen_json(yajl > return yajl_gen_string(hand, (const unsigned char *)buf, LIBXL_UUID_FMTLEN); > } > > -yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, > - libxl_cpumap *cpumap) > +yajl_gen_status libxl_bitmap_gen_json(yajl_gen hand, > + libxl_bitmap *cpumap)Minor nit: You likely meant to rename cpumap in the argument list too.> { > yajl_gen_status s; > int i; > @@ -108,8 +108,8 @@ yajl_gen_status libxl_cpumap_gen_json(ya > s = yajl_gen_array_open(hand); > if (s != yajl_gen_status_ok) goto out; > > - libxl_for_each_cpu(i, *cpumap) { > - if (libxl_cpumap_test(cpumap, i)) { > + libxl_for_each_bit(i, *cpumap) { > + if (libxl_bitmap_test(cpumap, i)) { > s = yajl_gen_integer(hand, i); > if (s != yajl_gen_status_ok) goto out; > }Even with that minor nit: Acked-by: Ian Campbell <ian.campbell@citrix.com>
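For reference, the type-specific allocation helpers being discussed cost very little: one generic allocator plus one-line wrappers that only fix the size. The sketch below shows the idea; the helper names are the ones mentioned in the changelog and used later in the series, but the in-tree bodies (not quoted here) may differ in detail.

/* Sketch: thin wrappers around the generic libxl_bitmap_alloc(). */
static inline int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *map)
{
    return libxl_bitmap_alloc(ctx, map, libxl_get_max_cpus(ctx));
}

static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *map)
{
    return libxl_bitmap_alloc(ctx, map, libxl_get_max_nodes(ctx));
}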
Ian Campbell
2012-Jun-21 09:30 UTC
Re: [PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> By adding copying and *_is_full/*_is_empty facilities. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > Changes from v1: > * now libxl_is_full/empty return 1 if true and 0 if false, > as logic (and as requested during review). > > diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c > --- a/tools/libxl/libxl_utils.c > +++ b/tools/libxl/libxl_utils.c > @@ -506,6 +506,35 @@ void libxl_bitmap_dispose(libxl_bitmap * > free(map->map); > } > > +void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr, > + const libxl_bitmap *sptr) > +{ > + int sz; > + > + sz = dptr->size = sptr->size; > + memcpy(dptr->map, sptr->map, sz * sizeof(*dptr->map));I think you need either some checks/asserts on the two relative sizes or a reallocation of the destination map. Otherwise you risk an overflow. Perhaps a simple assert (i.e. make it the callers problem) would be ok?> +} > + > +int libxl_bitmap_is_full(const libxl_bitmap *bitmap) > +{ > + int i; > + > + for (i = 0; i < bitmap->size; i++) > + if (bitmap->map[i] != (uint8_t)-1) > + return 0; > + return 1; > +} > + > +int libxl_bitmap_is_empty(const libxl_bitmap *bitmap) > +{ > + int i; > + > + for (i = 0; i < bitmap->size; i++) > + if (bitmap->map[i]) > + return 0; > + return 1; > +} > + > int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit) > { > if (bit >= bitmap->size * 8) > diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h > --- a/tools/libxl/libxl_utils.h > +++ b/tools/libxl/libxl_utils.h > @@ -66,6 +66,10 @@ int libxl_vdev_to_device_disk(libxl_ctx > int libxl_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *bitmap, int n_bits); > /* Allocated bimap is from malloc, libxl_bitmap_dispose() to be > * called by the application when done. */ > +void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr, > + const libxl_bitmap *sptr); > +int libxl_bitmap_is_full(const libxl_bitmap *bitmap); > +int libxl_bitmap_is_empty(const libxl_bitmap *bitmap); > int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit); > void libxl_bitmap_set(libxl_bitmap *bitmap, int bit); > void libxl_bitmap_reset(libxl_bitmap *bitmap, int bit);
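A minimal sketch of the assert-based option suggested above (keep the copy dumb and make mismatched sizes the caller''s problem); this is not the follow-up patch itself, just what the suggestion amounts to:

#include <assert.h>
#include <string.h>

void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr,
                       const libxl_bitmap *sptr)
{
    /* the caller is responsible for passing two equally sized bitmaps */
    assert(dptr->size == sptr->size);
    memcpy(dptr->map, sptr->map, dptr->size * sizeof(*dptr->map));
}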
Ian Campbell
2012-Jun-21 09:35 UTC
Re: [PATCH 07 of 10 v2] libxl: introduce some node map helpers
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> +int libxl_nodemap_to_cpumap(libxl_ctx *ctx, > + const libxl_bitmap *nodemap, > + libxl_bitmap *cpumap); > + /* populate cpumap with the cpus spanned by the nodes in nodemap */Comments before proto please (I know libxl is a bit of a mish-mash, but we''ll try to be consistent going forward...) Otherwise: Acked-by: Ian Campbell <ian.campbell@citrix.com>> +int libxl_cpumap_to_nodemap(libxl_ctx *ctx, > + const libxl_bitmap *cpuemap, > + libxl_bitmap *nodemap); > + /* populate nodemap with the nodes of the cpus in cpumap */ > + > static inline uint32_t libxl__sizekb_to_mb(uint32_t s) { > return (s + 1023) / 1024; > }
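The bodies of the two helpers are not quoted in this reply. As a sketch of what the first prototype implies, such a helper can be built on top of libxl_get_cpu_topology(), which already reports the node of each cpu; this is an illustration, not necessarily the implementation in the patch.

/* Illustration only: fill cpumap with the cpus of the nodes in nodemap.
 * cpumap must already be allocated by the caller. */
int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
                            const libxl_bitmap *nodemap,
                            libxl_bitmap *cpumap)
{
    libxl_cputopology *tinfo;
    int i, nr_cpus;

    tinfo = libxl_get_cpu_topology(ctx, &nr_cpus);
    if (tinfo == NULL)
        return ERROR_FAIL;

    libxl_bitmap_set_none(cpumap);
    for (i = 0; i < nr_cpus; i++) {
        if (libxl_bitmap_test(nodemap, tinfo[i].node))
            libxl_bitmap_set(cpumap, i);
    }

    libxl_cputopology_list_free(tinfo, nr_cpus);
    return 0;
}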
Dario Faggioli
2012-Jun-21 09:44 UTC
Re: [PATCH 07 of 10 v2] libxl: introduce some node map helpers
On Thu, 2012-06-21 at 10:35 +0100, Ian Campbell wrote:> On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote: > > > +int libxl_nodemap_to_cpumap(libxl_ctx *ctx, > > + const libxl_bitmap *nodemap, > > + libxl_bitmap *cpumap); > > + /* populate cpumap with the cpus spanned by the nodes in nodemap */ > > Comments before proto please >Ok.> (I know libxl is a bit of a mish-mash, but > we''ll try to be consistent going forward...) >Yeah, it''s just hard to figure out which one of the various suits you can find in the file(s) for similar things that one should follow! :-P Thanks for looking at this, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2012-Jun-21 09:46 UTC
Re: [PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit
On Thu, 2012-06-21 at 10:30 +0100, Ian Campbell wrote:> > diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c > > --- a/tools/libxl/libxl_utils.c > > +++ b/tools/libxl/libxl_utils.c > > @@ -506,6 +506,35 @@ void libxl_bitmap_dispose(libxl_bitmap * > > free(map->map); > > } > > > > +void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr, > > + const libxl_bitmap *sptr) > > +{ > > + int sz; > > + > > + sz = dptr->size = sptr->size; > > + memcpy(dptr->map, sptr->map, sz * sizeof(*dptr->map)); > > I think you need either some checks/asserts on the two relative sizes or > a reallocation of the destination map. Otherwise you risk an overflow. > > Perhaps a simple assert (i.e. make it the callers problem) would be ok? >Sounds reasonable, will do that. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2012-Jun-21 09:49 UTC
Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap
On Thu, 2012-06-21 at 10:12 +0100, Ian Campbell wrote:> On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote: > > And leave to the caller the burden of knowing and remembering what kind > > of bitmap each instance of libxl_bitmap is. > > > > This is basically just some s/libxl_cpumap/libxl_bitmap/ (and some other > > related interface name substitution, e.g., libxl_for_each_cpu) in a bunch > > of files, with no real functional change involved. > > > > A specific allocation helper is introduced, besides libxl_bitmap_alloc(). > > It is called libxl_cpu_bitmap_alloc() and is meant at substituting the old > > libxl_cpumap_alloc(). It is just something easier to use in cases where one > > wants to allocate a libxl_bitmap that is going to serve as a cpu map. > > > > This is because we want to be able to deal with both cpu and NUMA node > > maps, but we don''t want to duplicate all the various helpers and wrappers. > > FWIW I''d have been perfectly happy with a bunch of > #define for_each_cpu(mumble) for_each_bit(mumble) > type stuff, but I think Ian J and yourself didn''t like those which I''m > also fine with. >Well, I''m fine with both approaches, actually. However, I''ve been asked by Ian during last round to kill the duplication as much as I can, and here it is what I came out with. :-) That being said, if it''s just a matter of adding back a couple of libxl_for_each_{set_}cpu as a define to libxl_for_each_{set_} bit, I could well do that, as it looks a "reasonable" amount of wrapper duplication to me...> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.eu.com> > > diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c > > --- a/tools/libxl/libxl_json.c > > +++ b/tools/libxl/libxl_json.c > > @@ -99,8 +99,8 @@ yajl_gen_status libxl_uuid_gen_json(yajl > > return yajl_gen_string(hand, (const unsigned char *)buf, LIBXL_UUID_FMTLEN); > > } > > > > -yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, > > - libxl_cpumap *cpumap) > > +yajl_gen_status libxl_bitmap_gen_json(yajl_gen hand, > > + libxl_bitmap *cpumap) > > Minor nit: You likely meant to rename cpumap in the argument list too. >In this case, I sure did, thanks for pointing it out.> Even with that minor nit: > > Acked-by: Ian Campbell <ian.campbell@citrix.com> >Good! I''ll fix the nit, as I have to resubmit anyway. :-P Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
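For completeness, the couple of defines being talked about would amount to no more than the sketch below; whether to add them at all is settled later in the thread.

/* Sketch of the cpu-flavoured iteration wrappers under discussion. */
#define libxl_for_each_cpu(var, map)     libxl_for_each_bit(var, map)
#define libxl_for_each_set_cpu(var, map) libxl_for_each_set_bit(var, map)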
Dario Faggioli
2012-Jun-21 10:00 UTC
Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()
On Thu, 2012-06-21 at 10:02 +0100, Ian Campbell wrote:> > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h > > --- a/tools/libxc/xenctrl.h > > +++ b/tools/libxc/xenctrl.h > ... > > +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr) > > +{ > [...] > > The hypercall buffer stuff all looks good. > > > + if (ret) > > + *nr = max_nodes; > > You could put this before the fail: label. Not that it matters. >Ok.> > diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h > > --- a/tools/libxl/libxl.h > > +++ b/tools/libxl/libxl.h > ... > > #define LIBXL_CPUTOPOLOGY_INVALID_ENTRY (~(uint32_t)0) > > libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nr); > > void libxl_cputopology_list_free(libxl_cputopology *, int nr); > > +#define LIBXL_NUMAINFO_INVALID_ENTRY (~(uint32_t)0) > > +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr); > > + /* On success, a list of nr libxl_numainfo elements is returned. > > + * That is from malloc, thus it is up to the caller to invoke > > + * libxl_cpupoolinfo_list_free() on it. > > Don''t you mean libxl_numinfo_list_free() ? > > Also normally we put the comment before the prototype. >Yes, I did, and will fix it. For the comment, again, I''ll move that up... It''s just you can find so much different "examples" in those files... :-O> > diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c > > --- a/tools/libxl/libxl_utils.c > > +++ b/tools/libxl/libxl_utils.c > > @@ -537,6 +537,11 @@ int libxl_get_max_cpus(libxl_ctx *ctx) > > return xc_get_max_cpus(ctx->xch); > > } > > > > +int libxl_get_max_nodes(libxl_ctx *ctx) > > +{ > > + return xc_get_max_nodes(ctx->xch); > > +} > > Is this needed externally to libxl or do we expect all callers to use > libxl_get_numainfo? I suppose there is no harm in exporting this either > way. >I''m not sure. What I did is to replicate what happens for libxl_get_max_cpus(), but I really don''t know whether or not they both make any sense outside libxl. It does not look that bad to me that we offer our users a chance to figure out how many cpus and/or nodes they have, without needing to call the proper libxl_get_*info(), which is quite a bit more of a burden. FWIW, I''d leave both of them public.> > diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h > > --- a/xen/include/public/sysctl.h > > +++ b/xen/include/public/sysctl.h > > @@ -484,6 +484,7 @@ typedef struct xen_sysctl_topologyinfo x > > DEFINE_XEN_GUEST_HANDLE(xen_sysctl_topologyinfo_t); > > > > /* XEN_SYSCTL_numainfo */ > > +#define INVALID_NUMAINFO_ID (~0U) > > It feels like there ought to be hunks in the hypervisor which either use > this symbol instead of a hardcoded ~0U or which remove the internal > definition in favour of this one? >Again, -topologyinfo machinery does exactly this, so I really think we either fix/change or leave as they are both of them (which of course I can do, just tell me if that is what you want). Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
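The kind of lightweight external use being argued for is, for instance, sizing per-node bookkeeping in an application without a full libxl_get_numainfo() round trip. A hypothetical caller (names made up) might look like:

/* Hypothetical application code: size a per-node array up front.
 * Assumes libxl.h is included; a non-positive return is treated as
 * "the query failed". */
#include <stdlib.h>

static uint64_t *alloc_per_node_counters(libxl_ctx *ctx, int *nr_nodes)
{
    *nr_nodes = libxl_get_max_nodes(ctx);
    if (*nr_nodes <= 0)
        return NULL;
    return calloc(*nr_nodes, sizeof(uint64_t));
}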
Ian Campbell
2012-Jun-21 10:21 UTC
Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()
> > > diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c > > > --- a/tools/libxl/libxl_utils.c > > > +++ b/tools/libxl/libxl_utils.c > > > @@ -537,6 +537,11 @@ int libxl_get_max_cpus(libxl_ctx *ctx) > > > return xc_get_max_cpus(ctx->xch); > > > } > > > > > > +int libxl_get_max_nodes(libxl_ctx *ctx) > > > +{ > > > + return xc_get_max_nodes(ctx->xch); > > > +} > > > > Is this needed externally to libxl or do we expect all callers to use > > libxl_get_numainfo? I suppose there is no harm in exporting this either > > way. > > > I''m not sure. What I did is to replicate what happens for > libxl_get_max_cpus(), but I really don''t know whether or not they both > make any sense outside libxl. It does not look that bad to me that we > offer our users a chance to figure out how many cpus and/or nodes they > have, without needing to call the proper libxl_get_*info(), which is > quite a bit more of a burden. FWIW, I''d leave both of them public.OK.> > > diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h > > > --- a/xen/include/public/sysctl.h > > > +++ b/xen/include/public/sysctl.h > > > @@ -484,6 +484,7 @@ typedef struct xen_sysctl_topologyinfo x > > > DEFINE_XEN_GUEST_HANDLE(xen_sysctl_topologyinfo_t); > > > > > > /* XEN_SYSCTL_numainfo */ > > > +#define INVALID_NUMAINFO_ID (~0U) > > > > It feels like there ought to be hunks in the hypervisor which either use > > this symbol instead of a hardcoded ~0U or which remove the internal > > definition in favour of this one? > > > Again, -topologyinfo machinery does exactly this, so I really think we > either fix/change or leave as they are both of them (which of course I > can do, just tell me if that is what you want).Lets leave it as is then. Ian.
Ian Campbell
2012-Jun-21 10:22 UTC
Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap
On Thu, 2012-06-21 at 10:49 +0100, Dario Faggioli wrote:> On Thu, 2012-06-21 at 10:12 +0100, Ian Campbell wrote: > > On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote: > > > And leave to the caller the burden of knowing and remembering what kind > > > of bitmap each instance of libxl_bitmap is. > > > > > > This is basically just some s/libxl_cpumap/libxl_bitmap/ (and some other > > > related interface name substitution, e.g., libxl_for_each_cpu) in a bunch > > > of files, with no real functional change involved. > > > > > > A specific allocation helper is introduced, besides libxl_bitmap_alloc(). > > > It is called libxl_cpu_bitmap_alloc() and is meant at substituting the old > > > libxl_cpumap_alloc(). It is just something easier to use in cases where one > > > wants to allocate a libxl_bitmap that is going to serve as a cpu map. > > > > > > This is because we want to be able to deal with both cpu and NUMA node > > > maps, but we don''t want to duplicate all the various helpers and wrappers. > > > > FWIW I''d have been perfectly happy with a bunch of > > #define for_each_cpu(mumble) for_each_bit(mumble) > > type stuff, but I think Ian J and yourself didn''t like those which I''m > > also fine with. > > > Well, I''m fine with both approaches, actually. However, I''ve been asked > by Ian during last round to kill the duplication as much as I can, and > here it is what I came out with. :-) > > That being said, if it''s just a matter of adding back a couple of > libxl_for_each_{set_}cpu as a define to libxl_for_each_{set_} bit, I > could well do that, as it looks a "reasonable" amount of wrapper > duplication to me...Lets not bother now -- it''s easier to add a wrapper in the future than to remove a bad one done in haste in the last stages of the release.
Ian Campbell
2012-Jun-21 11:40 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> If a domain does not have a VCPU affinity, try to pin it automatically to some > PCPUs. This is done taking into account the NUMA characteristics of the host. > In fact, we look for a combination of host''s NUMA nodes with enough free memory > and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes. > > Once we know which ones, among all the possible combinations, represents valid > placement candidates for a domain, use some heuistics for deciding which is theheuristics> best. For instance, smaller candidates are considered to be better, both from > the domain''s point of view (fewer memory spreading among nodes) and from the > system as a whole point of view (fewer memoy fragmentation). In case ofmemory> candidates of equal sizes (i.e., with the same number of nodes), the one with > the greater amount of memory wins, as this is also good for keeping memory > fragmentation under control. > > This all happens internally to libxl, and no API for driving the mechanism is > provided for now. This matches what xend already does. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > Changes from v1: > * This patches incorporates the changes from both "libxl, xl: enable automatic > placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering > NUMA placement candidates" from v1. > * The logic of the algorithm is basically the same as in v1, but the splitting > of it in the various functions has been completely redesigned from scratch. > * No public API for placement or candidate generation is now exposed, > everything happens within libxl, as agreed during v1 review. > * The relevant documentation have been moved near the actual functions and > features. Also, the amount and (hopefully!) the quality of the documentation > has been improved a lot, as requested. > * All the comments about using the proper libxl facilities and helpers for > allocations, etc., have been considered and applied. > * This patch still bails out from NUMA optimizations if it find out cpupools > are being utilized. It is next patch that makes the two things interact > properly, as suggested during review. > > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 > --- a/docs/man/xl.cfg.pod.5 > +++ b/docs/man/xl.cfg.pod.5 > @@ -111,8 +111,8 @@ created online and the remainder will be > > =item B<cpus="CPU-LIST"> > > -List of which cpus the guest is allowed to use. Default behavior isThere is (according to grep) a slight preference for British spelling, but I imagine we are totally inconsistent all over the place so lets not worry too much ;-)> -`all cpus`. A C<CPU-LIST> may be specified as follows: > +List of which cpus the guest is allowed to use. By default xl will (via > +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows: > > =over 4 > > @@ -132,6 +132,12 @@ run on cpu #3 of the host. > > =back > > +If this option is not specified, libxl automatically tries to place the new > +domain on the host''s NUMA nodes (provided the host has more than one NUMA > +node) by pinning it to the cpus of those nodes. An heuristical approach isI don''t think heuristical is a word, I think "A heuristic approach..." 
is fine (not sure about "A heuristic" vs "An heuristic" but A sounds more natural to me, generally the rule is An before "vowel sounds" so I guess it depends on how you pronounce the h in heuristic, which varies even depending on which bit of England you are from ;-))> +utilized with the goals of maximizing performance for the domain and, at > +the same time, achieving efficient utilization of the host''s CPUs and RAM. > + > =item B<cpu_weight=WEIGHT> > > A domain with a weight of 512 will get twice as much CPU as a domain > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.cI think it would be quite reasonable for you to create a new libxl_numa(_placement).c if you wanted too. likewise a libxl_numa.h header. Up to you.> @@ -95,6 +95,459 @@ out: > return sched; > } > > +/* > + * What follows are helpers for generating all the k-combinations > + * without repetitions of a set S with n elements in it. Formally > + * speaking, they are subsets of k distinct elements of S and, if > + * S is n elements big, the number of k-combinations is equal to > + * the binomial coefficient (n k), which means we get exactly > + * n!/(k! * (n - k)!) subsets, all of them with k elements. > + * > + * The various subset are geneated one after the other by callinggenerated> + * comb_init() first, and, after that, comb_next() > + * (n k)-1 times. An iterator is used to store the curent statuscurrent> + * of the whole generation operation (i.e., basically, the last > + * combination that has been generated). As soon as all > + * combinations have been generated, comb_next() will > + * start returning 0 instead of 1. It is of course impotant thatimportant> + * the same instance of the iterator and the same values for > + * n and k are used for each call. If that doesn''t happen, the > + * result is unspecified. > + * > + * The algorithm is a well known one,Worth a link/reference?> and it produces the > + * combinations in such a way that they (well, more precisely, > + * their indexes it the array/map representing the set) come in > + * lexicogaphic order.lexicographic> + * > + * For example, with n = 5 and k = 3, calling comb_init() > + * will generate { 0, 1, 2 }, while subsequent valid calls to > + * comb_next() will produce the following: > + * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 },^ strictly speaking this 0,1,2 isn''t produced by a call to comb_next since it came from the initial comb_init, right? IOW the first comb_next() call won''t duplicate it... I''m not going to try very hard to review the algorithm here, I trust you and I''ve got a head cold ;-)> + * { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 }, > + * { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 }, > + * { 2, 3, 4 } } > + * > + * This is used by the automatic NUMA placement logic below. > + */ > +typedef int* comb_iter_t; > + > +static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k) > +{ > + comb_iter_t new_iter; > + int i; > + > + if (n < k) > + return 0; > + > + /* First set is always { 0, 1, 2, ..., k-1 } */ > + GCNEW_ARRAY(new_iter, k); > + for (i = 0; i < k; i++) > + new_iter[i] = i; > + > + *it = new_iter; > + return 1; > +} > + > +static int comb_next(comb_iter_t it, int n, int k) > +{ > + int i; > + > + /* > + * The idea here is to find the leftmost element from where > + * we should start incrementing the indexes of the iterator. > + * This means looking for the highest index that can be increased > + * while still producing value smaller than n-1. 
In the example > + * above, when dealing with { 0, 1, 4 }, such an element is the > + * second one, as the third is already equal to 4 (which actually > + * is n-1). > + * Once we found from where to start, we increment that element > + * and overide the right-hand rest of the iterator with itsoverride> + * successors, thus achieving lexicographic ordering. > + * > + * Regarding the termination of the generation process, when we > + * manage in bringing n-k at the very first position of the iterator, > + * we know that is the last valid combination ( { 2, 3, 4 }, with > + * n - k = 5 - 2 = 2, in the example above), and thus we start > + * returning 0 as soon as we cross that border. > + */ > + for (i = k - 1; it[i] == n - k + i; i--) { > + if (i <= 0) > + return 0; > + } > + for (it[i]++, i++; i < k; i++) > + it[i] = it[i - 1] + 1; > + return 1; > +} > + > +/* NUMA automatic placement (see libxl_internal.h for details) */ > + > +/* > + * This function turns a k-combination iterator into a node map. > + * This means the bits in the node map corresponding to the indexes > + * of the given combination are the ones that will be set. > + * For example, if the iterator represents the combination { 0, 2, 4}, > + * the node map will have bits #0, #2 and #4 set to one and i thus > + * will point to node 0, node 2 and and node 4.What is "i" in this last bit ("i thus will point to")? I don''t think you are referring to the loop iterator variable. If you meant "it" then I don''t think it needs saying that a node map with bits 0, 2 and 4 set refers to nodes 0, 2 and 4.> + */ > +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k) > +{ > + int i; > + > + libxl_bitmap_set_none(nodemap); > + for (i = 0; i < k; i++) > + libxl_bitmap_set(nodemap, it[i]); > +} > + > +/* Retrieve the number of cpus that the nodes that are part of the nodemap > + * span. */ > +static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus, > + const libxl_bitmap *nodemap)nodes here == node''s ? But can''t have an apostrophe in a function which makes it ready weird to me. Perhaps "nodemap_count_cpus"?> +{ > + int i, nodes_cpus = 0; > + > + for (i = 0; i < nr_cpus; i++) { > + if (libxl_bitmap_test(nodemap, tinfo[i].node)) > + nodes_cpus++; > + } > + return nodes_cpus; > +} > +[...]> +/* Retrieve the number of domains that can potentially run on the cpus > + * the nodes that are part of the nodemap. 
*/ > +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo, > + const libxl_bitmap *nodemap) > +{ > + libxl_dominfo *dinfo = NULL; > + libxl_bitmap dom_nodemap; > + int nr_doms, nr_cpus; > + int nr_domains = 0; > + int i, j, k; > + > + dinfo = libxl_list_domain(CTX, &nr_doms); > + if (dinfo == NULL) > + return ERROR_FAIL; > + > + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) { > + nr_domains = ERROR_FAIL; > + goto out; > + } > + > + for (i = 0; i < nr_doms; i++) { > + libxl_vcpuinfo *vinfo; > + int nr_vcpus; > + > + vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus); > + if (vinfo == NULL) > + continue; > + > + libxl_bitmap_set_none(&dom_nodemap); > + for (j = 0; j < nr_vcpus; j++) { > + libxl_for_each_set_bit(k, vinfo[j].cpumap) > + libxl_bitmap_set(&dom_nodemap, tinfo[k].node); > + } > + > + libxl_for_each_set_bit(j, dom_nodemap) { > + if (libxl_bitmap_test(nodemap, j)) { > + nr_domains++; > + goto found;AKA break?> + } > + } > + found: > + libxl_vcpuinfo_list_free(vinfo, nr_vcpus); > + } > + > + out: > + libxl_bitmap_dispose(&dom_nodemap);You can come to out if libxl_nodebitmap_alloc fails -- in which case dom_nodemap is (potentially) uninitialised. You could just move out: down a line. Otherwise we''d need libxl_bitmap_init which zeroes everything such that calling dispose is safe even if you don''t call alloc. I don''t mind which.> + libxl_dominfo_list_free(dinfo, nr_doms); > + return nr_domains; > +} > + > +/* > + * This function tries to figure out if the host has a consistent number > + * of cpus along all its NUMA nodes. In fact, if that is the case, we can > + * calculate the minimum number of nodes needed for a domain by just > + * dividing its total number of vcpus by this value computed here. > + * However, we are not allowed to assume that all the nodes have the > + * same number of cpus. Therefore, in case discrepancies among different > + * nodes are found, this function just returns 0 and the caller needs > + * to sort out things in some other way.If the caller has to deal with this anyway why bother with this check and/or the other algorithm?> + */ > +static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus, > + libxl_numainfo *ninfo, int nr_nodes) > +{ > + int cpus_per_node = 0; > + int j, i; > + > + /* This makes sense iff # of PCPUs is the same for all nodes */ > + for (j = 0; j < nr_nodes; j++) { > + int curr_cpus = 0; > + > + for (i = 0; i < nr_cpus; i++) { > + if (tinfo[i].node == j) > + curr_cpus++; > + } > + /* So, if the above does not hold, turn the whole thing off! */ > + cpus_per_node = cpus_per_node == 0 ? 
curr_cpus : cpus_per_node; > + if (cpus_per_node != curr_cpus) > + return 0; > + } > + return cpus_per_node; > +} > + > +/* Get all the placement candidates satisfying some specific conditions */ > +int libxl__get_numa_candidates(libxl__gc *gc, > + uint32_t min_free_memkb, int min_cpus, > + int min_nodes, int max_nodes, > + libxl__numa_candidate *cndts[], int *nr_cndts) > +{ > + libxl__numa_candidate *new_cndts = NULL; > + libxl_cputopology *tinfo = NULL; > + libxl_numainfo *ninfo = NULL; > + libxl_bitmap nodemap; > + int nr_nodes, nr_cpus; > + int array_size, rc; > + > + /* Get platform info and prepare the map for testing the combinations */ > + ninfo = libxl_get_numainfo(CTX, &nr_nodes); > + if (ninfo == NULL) > + return ERROR_FAIL; > + /* If we don''t have at least 2 nodes, it is useless to proceed */You don''t want to return whichever of those 2 nodes meets the constraints?> + if (nr_nodes < 2) { > + rc = 0; > + goto out; > + } > + > + tinfo = libxl_get_cpu_topology(CTX, &nr_cpus); > + if (tinfo == NULL) { > + rc = ERROR_FAIL; > + goto out; > + } > + > + rc = libxl_node_bitmap_alloc(CTX, &nodemap); > + if (rc) > + goto out; > + > + /* > + * Round up and down some of the constraints. For instance, the minimum > + * number of cpus a candidate should have must at least be non-negative. > + * Regarding the minimum number of NUMA nodes, if not explicitly specified > + * (i.e., min_nodes <= 0), we try to figure out a sensible number of nodes > + * from where to start generating candidates, if possible (or just start > + * from 1 otherwise). The maximum number of nodes should not exceed the > + * number of existent NUMA nodes on the host, or the candidate genarationgeneration> + * won''t work properly. > + */ > + min_cpus = min_cpus <= 0 ? 0 : min_cpus; > + if (min_nodes <= 0) { > + int cpus_per_node; > + > + cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes); > + if (cpus_per_node == 0) > + min_nodes = 1; > + else > + min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node; > + } > + min_nodes = min_nodes > nr_nodes ? nr_nodes : min_nodes; > + if (max_nodes <= 0) > + max_nodes = nr_nodes; > + else > + max_nodes = max_nodes > nr_nodes ? nr_nodes : max_nodes; > + if (min_nodes > max_nodes) { > + rc = ERROR_INVAL; > + goto out; > + } > + > + /* Initialize the local storage for the combinations */ > + *nr_cndts = 0; > + array_size = nr_nodes; > + GCNEW_ARRAY(new_cndts, array_size); > + > + /* Generate all the combinations of any size from min_nodes to > + * max_nodes (see comb_init() and comb_next()). */ > + while (min_nodes <= max_nodes) { > + comb_iter_t comb_iter; > + int comb_ok; > + > + /* > + * And here it is. Each step of this cycle generates a combination of > + * nodes as big as min_nodes mandates. Each of these combinations is > + * checked against the constraints provided by the caller (namely, > + * amount of free memory and number of cpus) and it becomes an actual > + * placement candidate iff it passes the check. > + */ > + for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok; > + comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) { > + uint32_t nodes_free_memkb; > + int nodes_cpus; > + > + comb_get_nodemap(comb_iter, &nodemap, min_nodes); > + > + /* If there is not enough memoy in this combination, skip itmemory> + * and go generating the next one... 
*/ > + nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap); > + if (min_free_memkb > 0 && nodes_free_memkb < min_free_memkb) > + continue; > + > + /* And the same applies if this combination is short in cpus */ > + nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap); > + if (min_cpus > 0 && nodes_cpus < min_cpus) > + continue; > + > + /* > + * Conditions are met, we can add this combination to the > + * NUMA placement candidates list. We first make sure there > + * is enough space in there, and then we initialize the new > + * candidate element with the node map corresponding to the > + * combination we are dealing with. Memory allocation for > + * expanding the array that hosts the list happens in chunks > + * equal to the number of NUMA nodes in the system (to > + * avoid allocating memory each and every time we find a > + * new candidate). > + */ > + if (*nr_cndts == array_size) > + array_size += nr_nodes; > + GCREALLOC_ARRAY(new_cndts, array_size); > + > + libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]); > + libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts], > + &nodemap); > + new_cndts[*nr_cndts].nr_domains > + nodemap_to_nr_domains(gc, tinfo, &nodemap); > + new_cndts[*nr_cndts].free_memkb = nodes_free_memkb; > + new_cndts[*nr_cndts].nr_nodes = min_nodes; > + new_cndts[*nr_cndts].nr_cpus = nodes_cpus; > + > + LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, " > + "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts, > + min_nodes, new_cndts[*nr_cndts].nr_cpus, > + new_cndts[*nr_cndts].free_memkb / 1024); > + > + (*nr_cndts)++; > + } > + min_nodes++; > + } > + > + *cndts = new_cndts; > + out: > + libxl_bitmap_dispose(&nodemap);nodemap might not have been initialised?> + libxl_cputopology_list_free(tinfo, nr_cpus); > + libxl_numainfo_list_free(ninfo, nr_nodes);It looks like the nr_* may also not have been initialised, but maybe list_free is safe in that case since the associated array must necessarily be NULL?> + return rc; > +} > + > +void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts, > + libxl__numa_candidate_cmpf cmpf) > +{ > + /* Reorder candidates (see the comparison function for > + * the details on the heuristics) */ > + qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf); > +} > + > +/* > + * The NUMA placement candidates are reordered according to the following > + * heuristics: > + * - candidates involving fewer nodes come first. In case two (or > + * more) candidates span the same number of nodes, > + * - candidates with greater amount of free memory come first. In > + * case two (or more) candidates differ in their amount of free > + * memory by less than 10%, > + * - candidates with fewer domains insisting on them at the time of > + * this call come first. > + */ > +static int numa_cmpf(const void *v1, const void *v2) > +{ > + const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1; > + const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2;Are these casts necessary?> + double mem_diff = labs(c1->free_memkb - c2->free_memkb); > + double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0; > + > + if (c1->nr_nodes != c2->nr_nodes) > + return c1->nr_nodes - c2->nr_nodes; > + > + if ((mem_diff / mem_avg) * 100.0 < 10.0 &&Is all this FP math necessary? 
Using integers might have some rounding but is it significant or do we care if it gets it slightly wrong?> + c1->nr_domains != c2->nr_domains) > + return c1->nr_domains - c2->nr_domains; > + > + return c2->free_memkb - c1->free_memkb; > +} > + > +/* The actual automatic NUMA placement routine */ > +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) > +{ > + int nr_candidates = 0; > + libxl__numa_candidate *candidates = NULL; > + libxl_bitmap candidate_nodemap; > + libxl_cpupoolinfo *pinfo; > + int nr_pools, rc = 0; > + uint32_t memkb; > + > + /* First of all, if cpupools are in use, better not to mess with them */ > + pinfo = libxl_list_cpupool(CTX, &nr_pools); > + if (!pinfo) > + return ERROR_FAIL; > + if (nr_pools > 1) { > + LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); > + goto out; > + } > + > + rc = libxl_domain_need_memory(CTX, info, &memkb); > + if (rc) > + goto out; > + if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) { > + rc = ERROR_FAIL; > + goto out; > + } > + > + /* Find all the candidates with enough free memory and at least > + * as much pcpus as the domain has vcpus. */ > + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, > + &candidates, &nr_candidates); > + if (rc) > + goto out; > + > + LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates); > + > + /* No suitable placement candidates. We just return without touching the > + * domain''s info->cpumap. It will have affinity with all nodes/cpus. */ > + if (nr_candidates == 0) > + goto out; > + > + /* Bring the best candidate in front of the list --> candidates[0] */ > + if (nr_candidates > 1) > + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf);Is the start up cost of libxl__sort_numa_candidates significant enough to make that if worthwhile?> + > + /* > + * At this point, the first candidate in the array is the one we want. > + * Go for it by mapping its node map to the domain''s info->cpumap. > + */ > + libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap); > + rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap); > + if (rc) > + goto out; > + > + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " > + "%"PRIu32" KB free selected", candidates[0].nr_nodes, > + candidates[0].nr_cpus, candidates[0].free_memkb / 1024); > + > + out: > + libxl_bitmap_dispose(&candidate_nodemap); > + libxl_cpupoolinfo_list_free(pinfo, nr_pools); > + return rc; > +} > + > int libxl__build_pre(libxl__gc *gc, uint32_t domid, > libxl_domain_build_info *info, libxl__domain_build_state *state) > { > @@ -104,7 +557,22 @@ int libxl__build_pre(libxl__gc *gc, uint > uint32_t rtc_timeoffset; > > xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); > + > + /* > + * Check if the domain has any CPU affinity. If not, try to build up one. > + * In case numa_place_domain() find at least a suitable candidate, it will > + * affect info->cpumap accordingly; if it does not, it just leaves it > + * as it is. This means (unless some weird error manifests) the subsequent > + * call to libxl_set_vcpuaffinity_all() will do the actual placement, > + * whatever that turns out to be. > + */ > + if (libxl_bitmap_is_full(&info->cpumap)) { > + int rc = numa_place_domain(gc, info); > + if (rc) > + return rc;I''m not sure about this, if numa_place_domain fails for any reason would we be not better off logging and continuing without placement? 
We already do that explicitly in a couple of cases.> + } > libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); > + > xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); > if (info->type == LIBXL_DOMAIN_TYPE_PV) > xc_domain_set_memmap_limit(ctx->xch, domid, > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib > #define CTX_LOCK (libxl__ctx_lock(CTX)) > #define CTX_UNLOCK (libxl__ctx_unlock(CTX)) > > +/* > + * Automatic NUMA placement > + * > + * These functions and data structures deal with the initial placement of a > + * domain onto the host NUMA nodes. > + * > + * The key concept here is the one of "numa placement candidate" (or jusr "numajust> + * candidate", "placement candidate" or even "candidate"), which basically is aDo we really need that level of detail about what we might call it? There''s no minimum word limit you know ;-)> + * set of nodes whose characteristics have been successfully checked against > + * some specific requirements. More pecisely, a candidate is the nodemapprecisely> + * associated with one of the possible subset of the host NUMA nodes providing > + * a certain amount of free memory, or a given number of cpus, or even both > + * (depending in what the caller wants). For convenience of use, some of this > + * information are stored within the candidate itselfs, instead of always beingitself> + * dynamically computed. A candidate can consist of just one, up to all the > + * available nodes on the host (the latter being, of course, not ideal).I can''t parse this sentence.> For > + * instance, looking for a numa candidates with 2GB of free memoy means we wantmemory> + * all the possible subsets of the host NUMA nodes with, cumulatively, at least > + * 2GB of free memory. That could be possible by just using one particular > + * node, or may require more nodes, depending on the characteristics of the > + * host, on how many domains have been created already, on how big they are, > + * etc. > + * > + * The intended usage is as follows: > + * 1. by, fist of all, calling libxl__get_numa_candidates(), and specifying > + * the proper constraints to it (e.g., the amount of memory a domain need > + * as the minimum amount of free memory fo the candidates) one can buildfor> + * up a whole set of suitable placing alternatives for a domain; > + * 2. after that, one specific candidate should be chosen. That can happen > + * by looking at their various characteristics; > + * 3. the chosen candidate''s nodemap should be utilized for computing the > + * actual affinity of the domain which, given the curent NUMA supportcurrent> + * in the hypervisor, is what determines the placement of the domain''s > + * vcpus and memory. > + * > + * To make phase 2 even easier, a sorting helper function for the list of > + * candidates is povided in the form of libxl__sort_numa_candidates(). The onlyprovided> + * that is needed is defining a comparison function, containing the chriteriacriteria> + * for deciding, given two candidates, which one is ''better''. Depending on how > + * the comparison function is defined, the best candidate (where, of course, > + * best is defined with respect to the heuristics implemented in the comparison > + * function itself, libxl__numa_candidate_cmpf()) could become the first o theor> + * last element of the list. 
> + * > + * Summarizing, achieving automatic NUMA placement is just a matter of > + * obtaining the list of suitable placement candidates, perhaps asking for each > + * of them to provide at least the amount of memory the domain needs. After > + * that just implement a comparison function by means of the various helpers > + * retrieving the relevant information about the candidates themselves. > + * Finally, call the sorting helper function and use the candidate that became > + * (typically) the first element of the list fo determining the domain''sfor> + * affinity. > + */ > + > +typedef struct { > + int nr_cpus, nr_nodes; > + int nr_domains; > + uint32_t free_memkb; > + libxl_bitmap nodemap; > +} libxl__numa_candidate; > + > +/* > + * This generates the list of NUMA placement candidates satisfying some > + * specific conditions. If min_nodes and/or max_nodes are not 0, their value is > + * used to determine the minimum and maximum number of nodes that are allow to > + * be present in each candidate. If min_nodes and/or max_nodes are 0, the > + * minimum and maximum number of nodes to be used are automatically selected by > + * the implementation (and that will likely be just 1 node for the minimum and > + * the total number of existent nodes for the maximum). Re min_free_memkb and > + * min_cpu, if not 0, they specify the caller only wants candidates with at > + * least that amount of free memory and that number of cpus, respectively. If > + * min_free_memkb and/or min_cpus are 0, the candidates'' free memory and number > + * of cpus won''t be checked at all, which means a candidate will always be > + * considered suitable wrt the specific constraint. cndts is where the list of > + * exactly nr_cndts candidates is returned. Note that, in case no candidates > + * are found at all, the function returns successfully, but with nr_cndts equal > + * to zero. > + */ > +_hidden int libxl__get_numa_candidates(libxl__gc *gc, > + uint32_t min_free_memkb, int min_cpus, > + int min_nodes, int max_nodes, > + libxl__numa_candidate *cndts[], int *nr_cndts); > + > +/* allocation and deallocation for placement candidates */ > +static inline int libxl__numa_candidate_alloc(libxl__gc *gc, > + libxl__numa_candidate *cndt) > +{ > + return libxl_node_bitmap_alloc(CTX, &cndt->nodemap); > +} > +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt) > +{ > + libxl_bitmap_dispose(&cndt->nodemap); > +} > +static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts, > + int nr_cndts) > +{ > + int i; > + > + for (i = 0; i < nr_cndts; i++) > + libxl__numa_candidate_dispose(&cndts[i]); > + free(cndts); > +} > + > +/* retrieve (in nodemap) the node map associated to placement candidate cndt */ > +static inline > +void libxl__numa_candidate_get_nodemap(libxl__gc *gc, > + const libxl__numa_candidate *cndt, > + libxl_bitmap *nodemap) > +{ > + libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap); > +} > +/* set the node map of placement candidate cndt to match nodemap */ > +static inline > +void libxl__numa_candidate_put_nodemap(libxl__gc *gc, > + libxl__numa_candidate *cndt, > + const libxl_bitmap *nodemap) > +{ > + libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); > +} > + > +/* signature for the comparison function between two candidates c1 and c2 > + * (the thid parameter is provided to enable thread safety). 
*/third But there isn''t actually a third parameter so it''s a bit moot ;-)> +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2); > +/* sort the list of candidates in cndts (an array with nr_cndts elements in > + * it) using cmpf for comparing two candidates. Uses libc''s qsort(). */ > +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], > + int nr_cndts, > + libxl__numa_candidate_cmpf cmpf); > > /* > * Inserts "elm_new" into the sorted list "head".
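To make the combination generator discussed above easier to check against the { 0, 1, 2 } ... { 2, 3, 4 } example in the comment, here is a self-contained version of comb_init()/comb_next() lifted from the patch, with the libxl-specific GCNEW_ARRAY allocation replaced by plain malloc() so it can be compiled and run on its own.

#include <stdio.h>
#include <stdlib.h>

/* comb_init()/comb_next() as in the patch above, minus the gc: the
 * iterator is allocated with malloc() here instead of GCNEW_ARRAY. */
static int comb_init(int **it, int n, int k)
{
    int i, *new_iter;

    if (n < k)
        return 0;

    new_iter = malloc(k * sizeof(*new_iter));
    if (!new_iter)
        return 0;

    /* First combination is always { 0, 1, ..., k-1 } */
    for (i = 0; i < k; i++)
        new_iter[i] = i;

    *it = new_iter;
    return 1;
}

static int comb_next(int *it, int n, int k)
{
    int i;

    /* Find the rightmost index that can still grow... */
    for (i = k - 1; it[i] == n - k + i; i--) {
        if (i <= 0)
            return 0;
    }
    /* ...bump it, and rewrite everything to its right with its
     * successors, which keeps the output in lexicographic order. */
    for (it[i]++, i++; i < k; i++)
        it[i] = it[i - 1] + 1;
    return 1;
}

int main(void)
{
    int n = 5, k = 3, i, ok, *it = NULL;

    for (ok = comb_init(&it, n, k); ok; ok = comb_next(it, n, k)) {
        for (i = 0; i < k; i++)
            printf("%d ", it[i]);
        printf("\n");
    }
    free(it);
    return 0;
}

With n = 5 and k = 3 this prints the ten combinations listed in the code comment, in lexicographic order.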
Ian Campbell
2012-Jun-21 13:31 UTC
Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> In such a way that only the cpus belonging to the cpupool of the > domain being placed are considered for the placement itself. > > This happens by filtering out all the nodes in which the cpupool has > not any cpu from the placement candidates. After that -- as a cpu pooling > not necessarily happens at NUMA nodes boundaries -- we also make sure > only the actual cpus that are part of the pool are considered when > counting how much processors a placement candidate is able to provide. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -198,15 +198,27 @@ static void comb_get_nodemap(comb_iter_t > libxl_bitmap_set(nodemap, it[i]); > } > > +/* Retrieve how many nodes a nodemap spans. */ > +static int nodemap_to_nr_nodes(const libxl_bitmap *nodemap) > +{ > + int i, nr_nodes = 0; > + > + libxl_for_each_set_bit(i, *nodemap) > + nr_nodes++; > + return nr_nodes; > +} > + > /* Retrieve the number of cpus that the nodes that are part of the nodemap > - * span. */ > + * span and that are also set in suitable_cpumap. */ > static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus, > + const libxl_bitmap *suitable_cpumap, > const libxl_bitmap *nodemap) > { > int i, nodes_cpus = 0; > > for (i = 0; i < nr_cpus; i++) { > - if (libxl_bitmap_test(nodemap, tinfo[i].node)) > + if (libxl_bitmap_test(suitable_cpumap, i) && > + libxl_bitmap_test(nodemap, tinfo[i].node)) > nodes_cpus++; > } > return nodes_cpus; > @@ -311,12 +323,13 @@ static int cpus_per_node_count(libxl_cpu > int libxl__get_numa_candidates(libxl__gc *gc, > uint32_t min_free_memkb, int min_cpus, > int min_nodes, int max_nodes, > + const libxl_bitmap *suitable_cpumap, > libxl__numa_candidate *cndts[], int *nr_cndts) > { > libxl__numa_candidate *new_cndts = NULL; > libxl_cputopology *tinfo = NULL; > libxl_numainfo *ninfo = NULL; > - libxl_bitmap nodemap; > + libxl_bitmap suitable_nodemap, nodemap; > int nr_nodes, nr_cpus; > int array_size, rc; > > @@ -340,6 +353,15 @@ int libxl__get_numa_candidates(libxl__gc > if (rc) > goto out; > > + /* Allocate and prepare the map of the node that can be utilized for > + * placement, basing on the map of suitable cpus. */ > + rc = libxl_node_bitmap_alloc(CTX, &suitable_nodemap); > + if (rc) > + goto out; > + rc = libxl_cpumap_to_nodemap(CTX, suitable_cpumap, &suitable_nodemap); > + if (rc) > + goto out; > + > /* > * Round up and down some of the constraints. For instance, the minimum > * number of cpus a candidate should have must at least be non-negative. > @@ -391,9 +413,14 @@ int libxl__get_numa_candidates(libxl__gc > for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok; > comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) { > uint32_t nodes_free_memkb; > - int nodes_cpus; > + int i, nodes_cpus; > > + /* Get the nodemap for the combination and filter unwnted nodes */unwanted> comb_get_nodemap(comb_iter, &nodemap, min_nodes); > + libxl_for_each_set_bit(i, nodemap) { > + if (!libxl_bitmap_test(&suitable_nodemap, i)) > + libxl_bitmap_reset(&nodemap, i); > + } > > /* If there is not enough memoy in this combination, skip it > * and go generating the next one... 
*/ > @@ -402,7 +429,8 @@ int libxl__get_numa_candidates(libxl__gc > continue; > > /* And the same applies if this combination is short in cpus */ > - nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap); > + nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, suitable_cpumap, > + &nodemap); > if (min_cpus > 0 && nodes_cpus < min_cpus) > continue; > > @@ -427,12 +455,13 @@ int libxl__get_numa_candidates(libxl__gc > new_cndts[*nr_cndts].nr_domains > nodemap_to_nr_domains(gc, tinfo, &nodemap); > new_cndts[*nr_cndts].free_memkb = nodes_free_memkb; > - new_cndts[*nr_cndts].nr_nodes = min_nodes; > + new_cndts[*nr_cndts].nr_nodes = nodemap_to_nr_nodes(&nodemap); > new_cndts[*nr_cndts].nr_cpus = nodes_cpus; > > LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, " > "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts, > - min_nodes, new_cndts[*nr_cndts].nr_cpus, > + new_cndts[*nr_cndts].nr_nodes, > + new_cndts[*nr_cndts].nr_cpus, > new_cndts[*nr_cndts].free_memkb / 1024); > > (*nr_cndts)++; > @@ -442,6 +471,7 @@ int libxl__get_numa_candidates(libxl__gc > > *cndts = new_cndts; > out: > + libxl_bitmap_dispose(&suitable_nodemap); > libxl_bitmap_dispose(&nodemap); > libxl_cputopology_list_free(tinfo, nr_cpus); > libxl_numainfo_list_free(ninfo, nr_nodes); > @@ -485,23 +515,27 @@ static int numa_cmpf(const void *v1, con > } > > /* The actual automatic NUMA placement routine */ > -static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) > +static int numa_place_domain(libxl__gc *gc, uint32_t domid, > + libxl_domain_build_info *info) > { > int nr_candidates = 0; > libxl__numa_candidate *candidates = NULL; > libxl_bitmap candidate_nodemap; > - libxl_cpupoolinfo *pinfo; > - int nr_pools, rc = 0; > + libxl_cpupoolinfo cpupool_info; > + int i, cpupool, rc = 0; > uint32_t memkb; > > - /* First of all, if cpupools are in use, better not to mess with them */ > - pinfo = libxl_list_cpupool(CTX, &nr_pools); > - if (!pinfo) > - return ERROR_FAIL; > - if (nr_pools > 1) { > - LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); > - goto out; > - } > + /* > + * Extract the cpumap from the cpupool the domain belong to. In fact, > + * it only makes sense to consider the cpus/nodes that are in there > + * for placement. > + */ > + rc = cpupool = libxl__domain_cpupool(gc, domid); > + if (rc < 0) > + return rc; > + rc = libxl_cpupool_info(CTX, &cpupool_info, cpupool); > + if (rc) > + return rc; > > rc = libxl_domain_need_memory(CTX, info, &memkb); > if (rc) > @@ -513,7 +547,8 @@ static int numa_place_domain(libxl__gc * > > /* Find all the candidates with enough free memory and at least > * as much pcpus as the domain has vcpus. */ > - rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, > + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, > + 0, 0, &cpupool_info.cpumap, > &candidates, &nr_candidates); > if (rc) > goto out; > @@ -538,13 +573,20 @@ static int numa_place_domain(libxl__gc * > if (rc) > goto out; > > + /* Avoid trying to set the affinity to cpus that might be in the > + * nodemap but not in our cpupool. 
*/ > + libxl_for_each_set_bit(i, info->cpumap) { > + if (!libxl_bitmap_test(&cpupool_info.cpumap, i)) > + libxl_bitmap_reset(&info->cpumap, i); > + } > + > LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " > "%"PRIu32" KB free selected", candidates[0].nr_nodes, > candidates[0].nr_cpus, candidates[0].free_memkb / 1024); > > out: > libxl_bitmap_dispose(&candidate_nodemap); > - libxl_cpupoolinfo_list_free(pinfo, nr_pools); > + libxl_cpupoolinfo_dispose(&cpupool_info); > return rc; > } > > @@ -567,7 +609,7 @@ int libxl__build_pre(libxl__gc *gc, uint > * whatever that turns out to be. > */ > if (libxl_bitmap_is_full(&info->cpumap)) { > - int rc = numa_place_domain(gc, info); > + int rc = numa_place_domain(gc, domid, info); > if (rc) > return rc; > } > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -2094,14 +2094,17 @@ typedef struct { > * least that amount of free memory and that number of cpus, respectively. If > * min_free_memkb and/or min_cpus are 0, the candidates'' free memory and number > * of cpus won''t be checked at all, which means a candidate will always be > - * considered suitable wrt the specific constraint. cndts is where the list of > - * exactly nr_cndts candidates is returned. Note that, in case no candidates > - * are found at all, the function returns successfully, but with nr_cndts equal > - * to zero. > + * considered suitable wrt the specific constraint. suitable_cpumap is useful > + * for specifyin we want only the cpus in that mask to be considered whilespecifying Apart from those two spelling errors: Acked-by: Ian Campbell <ian.campbell@citrix.com>> + * generating placement candidates (for example because of cpupools). cndts is > + * where the list of exactly nr_cndts candidates is returned. Note that, in > + * case no candidates are found at all, the function returns successfully, but > + * with nr_cndts equal to zero. > */ > _hidden int libxl__get_numa_candidates(libxl__gc *gc, > uint32_t min_free_memkb, int min_cpus, > int min_nodes, int max_nodes, > + const libxl_bitmap *suitable_cpumap, > libxl__numa_candidate *cndts[], int *nr_cndts); > > /* allocation and deallocation for placement candidates */
Ian Campbell
2012-Jun-21 13:38 UTC
Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation
On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> About rationale, usage and (some small bits of) API. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > Changes from v1: > * API documentation moved close to the actual functions. > > diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown > new file mode 100644 > --- /dev/null > +++ b/docs/misc/xl-numa-placement.markdown > @@ -0,0 +1,74 @@ > +# Guest Automatic NUMA Placement in libxl and xl # > + > +## Rationale ## > + > +The Xen hypervisor deals with Non-Uniform Memory Access (NUMA]) > +machines by assigning to its domain a "node affinity", i.e., a set of NUMA > +nodes of the host from which it gets its memory allocated. > + > +NUMA awareness becomes very important as soon as many domains start running > +memory-intensive workloads on a shared host. In fact, the cost of accessing > +non node-local memory locations is very high, and the performance degradation > +is likely to be noticeable. > + > +## Guest Placement in xl ## > + > +If using xl for creating and managing guests, it is very easy to ask > +for both manual or automatic placement of them across the host''s NUMA > +nodes. > + > +Note that xm/xend does the very same thing, the only differences residing > +in the details of the heuristics adopted for the placement (see below). > + > +### Manual Guest Placement with xl ### > + > +Thanks to the "cpus=" option, it is possible to specify where a domain > +should be created and scheduled on, directly in its config file. This > +affects NUMA placement and memory accesses as the hypervisor constructs > +the node affinity of a VM basing right on its CPU affinity when it is > +created. > + > +This is very simple and effective, but requires the user/system > +administrator to explicitly specify affinities for each and every domain, > +or Xen won''t be able to enable guarantee the locality for their memory > +accesses. > + > +### Automatic Guest Placement with xl ### > + > +In case no "cpus=" option is specified in the config file, xl tries > +to figure out on its own on which node(s) the domain could fit best. > + > +First of all, it needs to find a node (or a set of nodes) that have > +enough free memory for accommodating the domain. After that, the actual > +decision on where to put the new guest happens by generating all the > +possible combinations of nodes that satisfies the above and chose among > +them according to the following heuristics: > + > + * candidates involving fewer nodes come first. In case two (or more) > + candidates span the same number of nodes, > + * candidates with greater amount of free memory come first. In case > + two (or more) candidates differ in their amount of free memory by > + less than 10%, > + * candidates with fewer domains already placed on them come first. > + > +Giving preference to small candidates ensures better performance for > +the guest, as it avoid spreading its memory among different nodes. > +Using the nodes that have the biggest amounts of free memory helps > +keeping the memory fragmentation small, from a system wide perspective. > +Finally, in case more candidates fulfil these criteria by the same > +extent, choosing the candidate that is hosting fewer domain helps > +balancing the load on the various nodes.[...] snipped the bit you said was wrong. 
It was just to be removed, not replaced, right?

> +
> +## Guest Placement within libxl ##
> +
> +xl achieves automatic NUMA placement by means of the following API
> +calls, provided by libxl.

This last seems out of date.

Assuming the bit above was to be removed, not replaced:

Acked-by: Ian Campbell <ian.campbell@citrix.com>
Dario Faggioli
2012-Jun-21 13:54 UTC
Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools
On Thu, 2012-06-21 at 14:31 +0100, Ian Campbell wrote:
> > + * considered suitable wrt the specific constraint. suitable_cpumap is useful
> > + * for specifyin we want only the cpus in that mask to be considered while
>
> specifying
>
> Apart from those two spelling errors:
>
Yeah, sorry for that. I'm very bad at spelling correctly when typing,
and this bloody keyboard is not helping! :-P

Isn't there any tool for spell checking code comments?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Jun-21 13:57 UTC
Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation
On Thu, 2012-06-21 at 14:38 +0100, Ian Campbell wrote:
> snipped the bit you said was wrong. It was just to be removed, not replaced, right?
>
Yes, it will go away, as the algorithm changed and does not do that any longer.

> > +
> > +## Guest Placement within libxl ##
> > +
> > +xl achieves automatic NUMA placement by means of the following API
> > +calls, provided by libxl.
>
> This last seems out of date.
>
Indeed, thanks.

Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Campbell
2012-Jun-21 13:58 UTC
Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools
On Thu, 2012-06-21 at 14:54 +0100, Dario Faggioli wrote:
> On Thu, 2012-06-21 at 14:31 +0100, Ian Campbell wrote:
> > > + * considered suitable wrt the specific constraint. suitable_cpumap is useful
> > > + * for specifyin we want only the cpus in that mask to be considered while
> >
> > specifying
> >
> > Apart from those two spelling errors:
> >
> Yeah, sorry for that. I'm very bad at spelling correctly when typing,
> and this bloody keyboard is not helping! :-P
>
> Isn't there any tool for spell checking code comments?

I've often thought such a thing would be useful, but I don't know of one.
TBH I only really spot the spelling errors because my mail client happens
to draw a squiggly red line under them in the quotes.

Ian.
George Dunlap
2012-Jun-21 16:16 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On 15/06/12 18:04, Dario Faggioli wrote:> If a domain does not have a VCPU affinity, try to pin it automatically to some > PCPUs. This is done taking into account the NUMA characteristics of the host. > In fact, we look for a combination of host''s NUMA nodes with enough free memory > and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes. > > Once we know which ones, among all the possible combinations, represents valid > placement candidates for a domain, use some heuistics for deciding which is the > best. For instance, smaller candidates are considered to be better, both from > the domain''s point of view (fewer memory spreading among nodes) and from the > system as a whole point of view (fewer memoy fragmentation). In case of > candidates of equal sizes (i.e., with the same number of nodes), the one with > the greater amount of memory wins, as this is also good for keeping memory > fragmentation under control. > > This all happens internally to libxl, and no API for driving the mechanism is > provided for now. This matches what xend already does. > > Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>Overall I think this approach is much better. I haven''t done an extremely detailed review, but I do have some comments below.> + /* > + * Round up and down some of the constraints. For instance, the minimum > + * number of cpus a candidate should have must at least be non-negative. > + * Regarding the minimum number of NUMA nodes, if not explicitly specified > + * (i.e., min_nodes<= 0), we try to figure out a sensible number of nodes > + * from where to start generating candidates, if possible (or just start > + * from 1 otherwise). The maximum number of nodes should not exceed the > + * number of existent NUMA nodes on the host, or the candidate genaration > + * won''t work properly. > + */ > + min_cpus = min_cpus<= 0 ? 0 : min_cpus;Wouldn''t it just make more sense to specify that "min_cpus" (and other parameters) had to be >=0? In any case, this is a very complicated way of saying "if (min_cpus<0) min_cpus = 0;".> + if (min_nodes<= 0) { > + int cpus_per_node; > + > + cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes); > + if (cpus_per_node == 0) > + min_nodes = 1; > + else > + min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node; > + }Same here; you could just write "if(!min_nodes) {..."> + min_nodes = min_nodes> nr_nodes ? nr_nodes : min_nodes; > + if (max_nodes<= 0) > + max_nodes = nr_nodes; > + else > + max_nodes = max_nodes> nr_nodes ? nr_nodes : max_nodes;if (max_nodes == 0 || max_nodes > nr_nodes) max_nodes = nr_nodes;> + > +/* > + * The NUMA placement candidates are reordered according to the following > + * heuristics: > + * - candidates involving fewer nodes come first. In case two (or > + * more) candidates span the same number of nodes, > + * - candidates with greater amount of free memory come first. In > + * case two (or more) candidates differ in their amount of free > + * memory by less than 10%,Interesting idea -- sounds pretty reasonable.> + * - candidates with fewer domains insisting on them at the time of > + * this call come first.Do you mean "existing"? 
I think "assigned to" is probably better.> + */ > +static int numa_cmpf(const void *v1, const void *v2) > +{ > + const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1; > + const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2; > + double mem_diff = labs(c1->free_memkb - c2->free_memkb); > + double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0; > + > + if (c1->nr_nodes != c2->nr_nodes) > + return c1->nr_nodes - c2->nr_nodes; > + > + if ((mem_diff / mem_avg) * 100.0< 10.0&& > + c1->nr_domains != c2->nr_domains) > + return c1->nr_domains - c2->nr_domains;I realize this isn''t a hot path, but it seems like moving into FP is really unnecessary. You can just do this: if ( ((mem_diff * 100) / mem_avg) < 10 ... One minor note: I personally think it''s a lot more readable to put the ''&&'' and ''||'' on the same line as the next item, rather than the previous item; i.e.: if ( expression_a && expression_b ) One more thing: Is there a reason why you put get_numa_candidates() in libxl_internal.h, but not sort_numa_candidates()? It seems like both or neither should go. :-) That''s all I have for now. I''m OK with the general approach, so here''s a "weak ack", so if a maintainer is happy with the code, he can check it in: Acked-by: George Dunlap <george.dunlap@eu.citrix.com> I''ll try to come back and give a more detailed code review if I get a chance. -George
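For what it's worth, an integer-only comparison function along the lines suggested here could look like the sketch below. The struct is a trimmed-down stand-in for the libxl__numa_candidate quoted earlier (the nodemap field is omitted since it plays no part in the ordering), and this is only an illustration of the suggestion, not the code that eventually went in. Using 64-bit signed intermediates keeps both the subtraction and the "* 100" safe for any uint32_t free_memkb values.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int nr_cpus, nr_nodes;
    int nr_domains;
    uint32_t free_memkb;
} candidate_t;

static int numa_cmpf_int(const void *v1, const void *v2)
{
    const candidate_t *c1 = v1;
    const candidate_t *c2 = v2;
    /* 64-bit signed math: no underflow on the difference, no overflow
     * when multiplying by 100. */
    int64_t mem_diff = (int64_t)c1->free_memkb - (int64_t)c2->free_memkb;
    int64_t mem_avg  = ((int64_t)c1->free_memkb + (int64_t)c2->free_memkb) / 2;

    if (mem_diff < 0)
        mem_diff = -mem_diff;

    /* Fewer nodes wins outright. */
    if (c1->nr_nodes != c2->nr_nodes)
        return c1->nr_nodes - c2->nr_nodes;

    /* Within ~10% of free memory of each other: fewer domains wins. */
    if (mem_avg && (mem_diff * 100) / mem_avg < 10 &&
        c1->nr_domains != c2->nr_domains)
        return c1->nr_domains - c2->nr_domains;

    /* Otherwise, more free memory wins; compare explicitly rather than
     * subtracting, so the unsigned values cannot wrap. */
    if (c2->free_memkb > c1->free_memkb)
        return 1;
    if (c2->free_memkb < c1->free_memkb)
        return -1;
    return 0;
}

It can be handed to qsort() exactly like the original numa_cmpf().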
Dario Faggioli
2012-Jun-21 16:34 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Thu, 2012-06-21 at 12:40 +0100, Ian Campbell wrote:> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 > > --- a/docs/man/xl.cfg.pod.5 > > +++ b/docs/man/xl.cfg.pod.5 > > @@ -111,8 +111,8 @@ created online and the remainder will be > > > > =item B<cpus="CPU-LIST"> > > > > -List of which cpus the guest is allowed to use. Default behavior is > > There is (according to grep) a slight preference for British spelling, > but I imagine we are totally inconsistent all over the place so lets not > worry too much ;-) >Especially considering I''m removing it! :-P (However, I''m sure it was me that put that line in some time ago, so I''ll make a note about this! :-D).> > @@ -132,6 +132,12 @@ run on cpu #3 of the host. > > > > =back > > > > +If this option is not specified, libxl automatically tries to place the new > > +domain on the host''s NUMA nodes (provided the host has more than one NUMA > > +node) by pinning it to the cpus of those nodes. An heuristical approach is > > I don''t think heuristical is a word, I think "A heuristic approach..." > is fine >Fixed (with "A heuristic approach"). Thanks.> > + * the same instance of the iterator and the same values for > > + * n and k are used for each call. If that doesn''t happen, the > > + * result is unspecified. > > + * > > + * The algorithm is a well known one, > > Worth a link/reference? >As you like... The general idea is described, for example, in this Wikipedia page http://en.wikipedia.org/wiki/Combination , the actual algorithm appeared, for instance, in D. Knuth''s "The Art of Computer Programming - Volume 4, Fascicle 3", but in many other places as well. I guess I can cite TAOCP, although I''m not sure it is actually the first, and thus the one who should be credited for this... But I guess that doesn''t matter too much, does it?> > + * > > + * For example, with n = 5 and k = 3, calling comb_init() > > + * will generate { 0, 1, 2 }, while subsequent valid calls to > > + * comb_next() will produce the following: > > + * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 }, > > ^ strictly speaking this 0,1,2 isn''t produced by a call > to comb_next since it came from the initial comb_init, > right?Yes, that''s right.> IOW the first comb_next() call won''t duplicate > it... >No, it won''t. I''ll fix this in the comment.> I''m not going to try very hard to review the algorithm here, I trust you > and I''ve got a head cold ;-) >Fine. :-)> > +/* NUMA automatic placement (see libxl_internal.h for details) */ > > + > > +/* > > + * This function turns a k-combination iterator into a node map. > > + * This means the bits in the node map corresponding to the indexes > > + * of the given combination are the ones that will be set. > > + * For example, if the iterator represents the combination { 0, 2, 4}, > > + * the node map will have bits #0, #2 and #4 set to one and i thus > > + * will point to node 0, node 2 and and node 4. > > What is "i" in this last bit ("i thus will point to")? I don''t think you > are referring to the loop iterator variable. > > If you meant "it" then I don''t think it needs saying that a node map > with bits 0, 2 and 4 set refers to nodes 0, 2 and 4. >It was actually "it", and I''m happy to kill that part of the sentence.> > +/* Retrieve the number of cpus that the nodes that are part of the nodemap > > + * span. */ > > +static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus, > > + const libxl_bitmap *nodemap) > > nodes here == node''s ? 
But can''t have an apostrophe in a function which > makes it ready weird to me. Perhaps "nodemap_count_cpus"? >Right. I wanted to stick to the nodemap_to_xxx() for those group of function, so I changed this one into "nodemap_to_nr_cpus()". Hope it is fine...> > +/* Retrieve the number of domains that can potentially run on the cpus > > + * the nodes that are part of the nodemap. */ > > +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo, > > + const libxl_bitmap *nodemap) > > +{ > > + libxl_dominfo *dinfo = NULL; > > + libxl_bitmap dom_nodemap; > > + int nr_doms, nr_cpus; > > + int nr_domains = 0; > > + int i, j, k; > > + > > + dinfo = libxl_list_domain(CTX, &nr_doms); > > + if (dinfo == NULL) > > + return ERROR_FAIL; > > + > > + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) { > > + nr_domains = ERROR_FAIL; > > + goto out; > > + } > > + > > + for (i = 0; i < nr_doms; i++) { > > + libxl_vcpuinfo *vinfo; > > + int nr_vcpus; > > + > > + vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus); > > + if (vinfo == NULL) > > + continue; > > + > > + libxl_bitmap_set_none(&dom_nodemap); > > + for (j = 0; j < nr_vcpus; j++) { > > + libxl_for_each_set_bit(k, vinfo[j].cpumap) > > + libxl_bitmap_set(&dom_nodemap, tinfo[k].node); > > + } > > + > > + libxl_for_each_set_bit(j, dom_nodemap) { > > + if (libxl_bitmap_test(nodemap, j)) { > > + nr_domains++; > > + goto found; > > AKA break? >Yes. Leftover from v1, which I forgot to adapt.> > + } > > + } > > + found: > > + libxl_vcpuinfo_list_free(vinfo, nr_vcpus); > > + } > > + > > + out: > > + libxl_bitmap_dispose(&dom_nodemap); > > You can come to out if libxl_nodebitmap_alloc fails -- in which case > dom_nodemap is (potentially) uninitialised. > > You could just move out: down a line. >Right. As I did not really like that "nr_domains = ERROR_FAIL" I changed the error handling style of this function as a whole so that now I do not need the out: label any more.> > +/* > > + * This function tries to figure out if the host has a consistent number > > + * of cpus along all its NUMA nodes. In fact, if that is the case, we can > > + * calculate the minimum number of nodes needed for a domain by just > > + * dividing its total number of vcpus by this value computed here. > > + * However, we are not allowed to assume that all the nodes have the > > + * same number of cpus. Therefore, in case discrepancies among different > > + * nodes are found, this function just returns 0 and the caller needs > > + * to sort out things in some other way. > > If the caller has to deal with this anyway why bother with this check > and/or the other algorithm? >Mmm... I''m not sure I get what you mean here. Perhaps my commenting about the whole thing was not so clear in the first place. The (George''s :-)) idea is, if we know each node has 8 CPUs, and our domain wants more than that, it does not make sense to start looking for placement solutions with just one node. Actually, if the domain wants fewer than 16 CPUs, starting looking from solutions with 2 nodes in them should be the right thing. That being said, I think it''s important we do not assume we won''t ever find some architecture with different number of CPUs in different nodes, and that is what a zero return from that function means. 
What was that you thought not to be worthwhile?> > +/* Get all the placement candidates satisfying some specific conditions */ > > +int libxl__get_numa_candidates(libxl__gc *gc, > > + uint32_t min_free_memkb, int min_cpus, > > + int min_nodes, int max_nodes, > > + libxl__numa_candidate *cndts[], int *nr_cndts) > > +{ > > + libxl__numa_candidate *new_cndts = NULL; > > + libxl_cputopology *tinfo = NULL; > > + libxl_numainfo *ninfo = NULL; > > + libxl_bitmap nodemap; > > + int nr_nodes, nr_cpus; > > + int array_size, rc; > > + > > + /* Get platform info and prepare the map for testing the combinations */ > > + ninfo = libxl_get_numainfo(CTX, &nr_nodes); > > + if (ninfo == NULL) > > + return ERROR_FAIL; > > + /* If we don''t have at least 2 nodes, it is useless to proceed */ > > You don''t want to return whichever of those 2 nodes meets the > constraints? >I actually was meaning something like "hey, there''s only _1_ node, go for it!" :-P> ... > > + *cndts = new_cndts; > > + out: > > + libxl_bitmap_dispose(&nodemap); > > nodemap might not have been initialised? >Mmm... Right. So I really think I need an init function. :-(> > + libxl_cputopology_list_free(tinfo, nr_cpus); > > + libxl_numainfo_list_free(ninfo, nr_nodes); > > It looks like the nr_* may also not have been initialised, but maybe > list_free is safe in that case since the associated array must > necessarily be NULL? >Well, probably, but as they both do a "for (i=0;i<nr_*;i++)" I think it''s safer if I "=0" those twos, is that ok?> > +/* > > + * The NUMA placement candidates are reordered according to the following > > + * heuristics: > > + * - candidates involving fewer nodes come first. In case two (or > > + * more) candidates span the same number of nodes, > > + * - candidates with greater amount of free memory come first. In > > + * case two (or more) candidates differ in their amount of free > > + * memory by less than 10%, > > + * - candidates with fewer domains insisting on them at the time of > > + * this call come first. > > + */ > > +static int numa_cmpf(const void *v1, const void *v2) > > +{ > > + const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1; > > + const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2; > > Are these casts necessary? >Will check and remove if not. Thanks.> > + double mem_diff = labs(c1->free_memkb - c2->free_memkb); > > + double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0; > > + > > + if (c1->nr_nodes != c2->nr_nodes) > > + return c1->nr_nodes - c2->nr_nodes; > > + > > + if ((mem_diff / mem_avg) * 100.0 < 10.0 && > > Is all this FP math necessary? Using integers might have some rounding > but is it significant or do we care if it gets it slightly wrong? >I think integer is fine... It''s heuristics, after all. 
I''ll go for it.> > +/* The actual automatic NUMA placement routine */ > > +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) > > +{ > > + int nr_candidates = 0; > > + libxl__numa_candidate *candidates = NULL; > > + libxl_bitmap candidate_nodemap; > > + libxl_cpupoolinfo *pinfo; > > + int nr_pools, rc = 0; > > + uint32_t memkb; > > + > > + /* First of all, if cpupools are in use, better not to mess with them */ > > + pinfo = libxl_list_cpupool(CTX, &nr_pools); > > + if (!pinfo) > > + return ERROR_FAIL; > > + if (nr_pools > 1) { > > + LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); > > + goto out; > > + } > > + > > + rc = libxl_domain_need_memory(CTX, info, &memkb); > > + if (rc) > > + goto out; > > + if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) { > > + rc = ERROR_FAIL; > > + goto out; > > + } > > + > > + /* Find all the candidates with enough free memory and at least > > + * as much pcpus as the domain has vcpus. */ > > + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, > > + &candidates, &nr_candidates); > > + if (rc) > > + goto out; > > + > > + LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates); > > + > > + /* No suitable placement candidates. We just return without touching the > > + * domain''s info->cpumap. It will have affinity with all nodes/cpus. */ > > + if (nr_candidates == 0) > > + goto out; > > + > > + /* Bring the best candidate in front of the list --> candidates[0] */ > > + if (nr_candidates > 1) > > + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf); > > Is the start up cost of libxl__sort_numa_candidates significant enough > to make that if worthwhile? >I''m not sure what you mean here, I''m afraid ... :-(> > + > > + /* > > + * Check if the domain has any CPU affinity. If not, try to build up one. > > + * In case numa_place_domain() find at least a suitable candidate, it will > > + * affect info->cpumap accordingly; if it does not, it just leaves it > > + * as it is. This means (unless some weird error manifests) the subsequent > > + * call to libxl_set_vcpuaffinity_all() will do the actual placement, > > + * whatever that turns out to be. > > + */ > > + if (libxl_bitmap_is_full(&info->cpumap)) { > > + int rc = numa_place_domain(gc, info); > > + if (rc) > > + return rc; > > I''m not sure about this, if numa_place_domain fails for any reason would > we be not better off logging and continuing without placement? We > already do that explicitly in a couple of cases. >Well, actually, if it "fails" in the sense it finds no candidates or does not manage in placing the domain, it returns 0, so we do right what you''re suggesting. I''m sure this is stated in some comment but I guess I can repeat it here. If !rc, it means we stepped into some ERROR_FAIL or something, which I think has to be handled like this, hasn''t it? Perhaps I can also add some logging about "successfully failing", i.e., not getting into any error but not being able to place the domain. If yes, that will happen _inside_ numa_place_domain() rather than here.> > + * dynamically computed. A candidate can consist of just one, up to all the > > + * available nodes on the host (the latter being, of course, not ideal). > > I can''t parse this sentence. >Sorry. 
I just meant a placement candidate is a set of nodes, so it can be one node, four nodes, or even comprise all the available nodes.> > +/* signature for the comparison function between two candidates c1 and c2 > > + * (the thid parameter is provided to enable thread safety). */ > > third > > But there isn''t actually a third parameter so it''s a bit moot ;-) > > > +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2); >You''re right. That is a leftover from where I was trying to use qsort_r() for sorting, but it turned out to not be necessary in this case. Killed. Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2012-Jun-21 16:43 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Thu, 2012-06-21 at 17:16 +0100, George Dunlap wrote:> > Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com> > Overall I think this approach is much better. >Cool, Thanks.> > + /* > > + * Round up and down some of the constraints. For instance, the minimum > > + * number of cpus a candidate should have must at least be non-negative. > > + * Regarding the minimum number of NUMA nodes, if not explicitly specified > > + * (i.e., min_nodes<= 0), we try to figure out a sensible number of nodes > > + * from where to start generating candidates, if possible (or just start > > + * from 1 otherwise). The maximum number of nodes should not exceed the > > + * number of existent NUMA nodes on the host, or the candidate genaration > > + * won''t work properly. > > + */ > > + min_cpus = min_cpus<= 0 ? 0 : min_cpus; > Wouldn''t it just make more sense to specify that "min_cpus" (and other > parameters) had to be >=0? >Yes, I think that make sense, will do.> > +/* > > + * The NUMA placement candidates are reordered according to the following > > + * heuristics: > > + * - candidates involving fewer nodes come first. In case two (or > > + * more) candidates span the same number of nodes, > > + * - candidates with greater amount of free memory come first. In > > + * case two (or more) candidates differ in their amount of free > > + * memory by less than 10%, > Interesting idea -- sounds pretty reasonable. >Time will tell... :-O> > +static int numa_cmpf(const void *v1, const void *v2) > > +{ > > + const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1; > > + const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2; > > + double mem_diff = labs(c1->free_memkb - c2->free_memkb); > > + double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0; > > + > > + if (c1->nr_nodes != c2->nr_nodes) > > + return c1->nr_nodes - c2->nr_nodes; > > + > > + if ((mem_diff / mem_avg) * 100.0< 10.0&& > > + c1->nr_domains != c2->nr_domains) > > + return c1->nr_domains - c2->nr_domains; > I realize this isn''t a hot path, but it seems like moving into FP is > really unnecessary. You can just do this: >Yeah, IanC pointed out that too. I''ll convert everything toward integer arith.> One more thing: Is there a reason why you put get_numa_candidates() in > libxl_internal.h, but not sort_numa_candidates()? It seems like both or > neither should go. :-) >diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib ... +_hidden int libxl__get_numa_candidates(libxl__gc *gc, + uint32_t min_free_memkb, int min_cpus, + int min_nodes, int max_nodes, + libxl__numa_candidate *cndts[], int *nr_cndts); + ... +/* signature for the comparison function between two candidates c1 and c2 + * (the thid parameter is provided to enable thread safety). */ +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2); +/* sort the list of candidates in cndts (an array with nr_cndts elements in + * it) using cmpf for comparing two candidates. Uses libc''s qsort(). */ +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], + int nr_cndts, + libxl__numa_candidate_cmpf cmpf); But I''m not entirely sure I understood what you meant...> That''s all I have for now. 
I''m OK with the general approach, so here''s > a "weak ack", so if a maintainer is happy with the code, he can check it in: > > Acked-by: George Dunlap <george.dunlap@eu.citrix.com> >Ok, that''s very nice. I''ll have to respin the series, so I''ll definitely address your comments and add your ack. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
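For reference, below is a minimal sketch of the kind of integer-only comparison being discussed in this sub-thread. It is illustrative only and not necessarily what ended up in the series: the struct is a stand-in for the libxl__numa_candidate fields the comparison uses, and the "within 10% of the average" check is rewritten as 20 * |m1 - m2| < m1 + m2, which follows from multiplying both sides of |m1 - m2| / ((m1 + m2) / 2) < 0.10 by the (positive) denominator, so no floating point is needed.

#include <stdint.h>

/* Stand-in for the libxl__numa_candidate fields used below
 * (illustrative only, not the real libxl declaration). */
typedef struct {
    int nr_nodes, nr_domains;
    uint64_t free_memkb;
} candidate_t;

/* Same ordering heuristics as the numa_cmpf() quoted above, but in
 * integer arithmetic: fewer nodes first; if the free memory differs by
 * less than 10% of the average, fewer domains first; otherwise more
 * free memory first. */
static int numa_cmpf_int(const candidate_t *c1, const candidate_t *c2)
{
    uint64_t m1 = c1->free_memkb, m2 = c2->free_memkb;
    uint64_t diff = m1 > m2 ? m1 - m2 : m2 - m1;

    if (c1->nr_nodes != c2->nr_nodes)
        return c1->nr_nodes - c2->nr_nodes;

    if (20 * diff < m1 + m2 && c1->nr_domains != c2->nr_domains)
        return c1->nr_domains - c2->nr_domains;

    return m1 > m2 ? -1 : (m1 < m2 ? 1 : 0);
}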
George Dunlap
2012-Jun-22 10:05 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Thu, Jun 21, 2012 at 5:43 PM, Dario Faggioli <raistlin@linux.it> wrote:>> > + * - candidates with greater amount of free memory come first. In >> > + * case two (or more) candidates differ in their amount of free >> > + * memory by less than 10%, >> Interesting idea -- sounds pretty reasonable. >> > Time will tell... :-OThe only potential concern is that it seems likely that your comparison function above will not actually generate an ordering, because the relationship it generates isn''t transitive. That is, you could have a situation where A > B, B > C, but C > A. For example:
A: freemem 1000, domains 1
B: freemem 1090, domains 2
C: freemem 1110, domains 3
In this case, A>B, because memory is within 10% but has fewer domains. B > C, because B is within 10%, and has fewer domains. But C > A, because C has more than 10% more memory than A (so its domains are not counted against it). This may give you strange results from your quicksort. But it''s likely to give something *reasonable* anyway. (And if it turns out to cause problems, a bubble sort shouldn''t be too expensive and will likely be more robust against this kind of quirkiness.) -George
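The cycle in this example can be verified with a few lines of standalone C. This is only a sketch: the comparison mirrors the 10% rule in integer form (20 * |m1 - m2| < m1 + m2) and assumes all three candidates span the same number of nodes, so the nr_nodes check is left out.

#include <stdio.h>
#include <stdlib.h>

struct cand { long free_memkb; int nr_domains; };

/* Free memory within 10% of the average means fewer domains wins,
 * otherwise more free memory wins. */
static int cmp(const struct cand *a, const struct cand *b)
{
    long diff = labs(a->free_memkb - b->free_memkb);

    if (20 * diff < a->free_memkb + b->free_memkb &&
        a->nr_domains != b->nr_domains)
        return a->nr_domains - b->nr_domains;
    return (int)(b->free_memkb - a->free_memkb);
}

int main(void)
{
    struct cand A = { 1000, 1 }, B = { 1090, 2 }, C = { 1110, 3 };

    /* Prints "-1 -1 110": A sorts before B, B sorts before C, and yet
     * C sorts before A, i.e. the relation is not transitive. */
    printf("%d %d %d\n", cmp(&A, &B), cmp(&B, &C), cmp(&A, &C));
    return 0;
}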
Ian Campbell
2012-Jun-22 10:14 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Thu, 2012-06-21 at 17:34 +0100, Dario Faggioli wrote:> On Thu, 2012-06-21 at 12:40 +0100, Ian Campbell wrote:> > > +/* > > > + * This function tries to figure out if the host has a consistent number > > > + * of cpus along all its NUMA nodes. In fact, if that is the case, we can > > > + * calculate the minimum number of nodes needed for a domain by just > > > + * dividing its total number of vcpus by this value computed here. > > > + * However, we are not allowed to assume that all the nodes have the > > > + * same number of cpus. Therefore, in case discrepancies among different > > > + * nodes are found, this function just returns 0 and the caller needs > > > + * to sort out things in some other way. > > > > If the caller has to deal with this anyway why bother with this check > > and/or the other algorithm? > > > Mmm... I''m not sure I get what you mean here. Perhaps my commenting > about the whole thing was not so clear in the first place. > > The (George''s :-)) idea is, if we know each node has 8 CPUs, and our > domain wants more than that, it does not make sense to start looking for > placement solutions with just one node. Actually, if the domain wants > fewer than 16 CPUs, starting looking from solutions with 2 nodes in them > should be the right thing. > > That being said, I think it''s important we do not assume we won''t ever > find some architecture with different number of CPUs in different nodes, > and that is what a zero return from that function means. > > What was that you thought not to be worthwhile?The comment said "this function just returns 0 and the caller needs to sort out things in some other way" so I was wondering -- well if the caller has some algorithm which can cope with this why not always use it. I think I later observed that if it returns zero the caller in this case does something basic instead, so I see how this makes sense now.> > > > +/* Get all the placement candidates satisfying some specific conditions */ > > > +int libxl__get_numa_candidates(libxl__gc *gc, > > > + uint32_t min_free_memkb, int min_cpus, > > > + int min_nodes, int max_nodes, > > > + libxl__numa_candidate *cndts[], int *nr_cndts) > > > +{ > > > + libxl__numa_candidate *new_cndts = NULL; > > > + libxl_cputopology *tinfo = NULL; > > > + libxl_numainfo *ninfo = NULL; > > > + libxl_bitmap nodemap; > > > + int nr_nodes, nr_cpus; > > > + int array_size, rc; > > > + > > > + /* Get platform info and prepare the map for testing the combinations */ > > > + ninfo = libxl_get_numainfo(CTX, &nr_nodes); > > > + if (ninfo == NULL) > > > + return ERROR_FAIL; > > > + /* If we don''t have at least 2 nodes, it is useless to proceed */ > > > > You don''t want to return whichever of those 2 nodes meets the > > constraints? > > > I actually was meaning something like "hey, there''s only _1_ node, go > for it!" :-PI misread sorry!> > > ... > > > + *cndts = new_cndts; > > > + out: > > > + libxl_bitmap_dispose(&nodemap); > > > > nodemap might not have been initialised? > > > Mmm... Right. So I really think I need an init function. :-(In general all libxl datastructures are supposed to have one, most of them are just memset(.., 0)> > > > + libxl_cputopology_list_free(tinfo, nr_cpus); > > > + libxl_numainfo_list_free(ninfo, nr_nodes); > > > > It looks like the nr_* may also not have been initialised, but maybe > > list_free is safe in that case since the associated array must > > necessarily be NULL? 
> > > Well, probably, but as they both do a "for (i=0;i<nr_*;i++)" I think > it''s safer if I "=0" those twos, is that ok?I think it would be a good idea. The alternative is that anything which returns a list has to set nr =0.> > > + /* Bring the best candidate in front of the list --> candidates[0] */ > > > + if (nr_candidates > 1) > > > + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf); > > > > Is the start up cost of libxl__sort_numa_candidates significant enough > > to make that if worthwhile? > > > I''m not sure what you mean here, I''m afraid ... :-(Is the cost of sorting the 1 element list so high it is worth avoiding. Since the sort itself would be trivial the startup costs are what would dominate.> > > > + > > > + /* > > > + * Check if the domain has any CPU affinity. If not, try to build up one. > > > + * In case numa_place_domain() find at least a suitable candidate, it will > > > + * affect info->cpumap accordingly; if it does not, it just leaves it > > > + * as it is. This means (unless some weird error manifests) the subsequent > > > + * call to libxl_set_vcpuaffinity_all() will do the actual placement, > > > + * whatever that turns out to be. > > > + */ > > > + if (libxl_bitmap_is_full(&info->cpumap)) { > > > + int rc = numa_place_domain(gc, info); > > > + if (rc) > > > + return rc; > > > > I''m not sure about this, if numa_place_domain fails for any reason would > > we be not better off logging and continuing without placement? We > > already do that explicitly in a couple of cases. > > > Well, actually, if it "fails" in the sense it finds no candidates or > does not manage in placing the domain, it returns 0, so we do right what > you''re suggesting. I''m sure this is stated in some comment but I guess I > can repeat it here. If !rc, it means we stepped into some ERROR_FAIL or > something, which I think has to be handled like this, hasn''t it?That makes sense.> Perhaps I can also add some logging about "successfully failing", i.e., > not getting into any error but not being able to place the domain. If > yes, that will happen _inside_ numa_place_domain() rather than here.Logging in that case seems wise in any case since it will have performance implications I think. Ian.
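On the init-function point above, a minimal sketch of what such a helper could look like (the name here is hypothetical; like most libxl type initialisers it would just zero the structure, which is what makes disposing it on an early exit path harmless):

#include <string.h>
#include <libxl.h>   /* libxl_bitmap, libxl_bitmap_dispose() */

/* Hypothetical helper: zero the bitmap so that a later dispose is safe
 * even if the bitmap was never actually allocated. */
static inline void numa_nodemap_init(libxl_bitmap *map)
{
    memset(map, 0, sizeof(*map));
}

/* The error-path pattern this enables (sketch):
 *
 *     libxl_bitmap nodemap;
 *     int nr_nodes = 0, nr_cpus = 0;   (so the list_free loops are no-ops)
 *
 *     numa_nodemap_init(&nodemap);
 *     ... any early "goto out" is now harmless ...
 *  out:
 *     libxl_bitmap_dispose(&nodemap);
 *     libxl_cputopology_list_free(tinfo, nr_cpus);
 *     libxl_numainfo_list_free(ninfo, nr_nodes);
 */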
Ian Jackson
2012-Jun-26 11:03 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
George Dunlap writes ("Re: [Xen-devel] [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes"):> The only potential concern is that it seems likely that your > comparison function above will not actually generate an ordering, > because the relationship it generates isn''t transitive. That is, you > could have a situation where A > B, B > C, but C > A. For example: > A: freemem 1000, domains 1 > B: freemem 1090, domains 2 > C: freemem 1110, domains 3 > > In this case, A>B, because memory is within 10% but has fewer domains. > B > C, because B is within 10%, and has fewer domains. But C > A, > because C has more than 10% more memory than A (so its domains are not > counted against it).The conventional approach to this kind of thing is to invent a score for sorting etc. Something like
    score = tuning_parameter * log(freemem) - number_of_domains
which does sort of roughly the same as your 10% rule if you squint. Or maybe you want to take the log of the number_of_domains too, which would be equivalent for sorting to
    score = freemem / (number_of_domains + other_tuning_parameter)
> This may give you strange results from your quicksort. But it''s > likely to give something *reasonable* anyway. (And if it turns out to > cause problems, a bubble sort shouldn''t be too expensive and will > likely to be more robust against this kind of quirkiness.)The behaviour of sorting functions when given broken comparison functions may be entirely arbitrary and is not guaranteed to be anything near the plausible intended orderings. If you''re using qsort you''re not guaranteed that any particular algorithm will be used, so you can''t make any assumptions about its behaviour in this case. Ian.
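To make the suggestion concrete, here is a sketch of such a score-based ordering; the weight is an arbitrary placeholder rather than a value from this thread, and the point is simply that reducing each candidate to a single scalar score always yields a transitive ordering (link with -lm for log()).

#include <math.h>
#include <stdint.h>

#define MEM_WEIGHT 3.0   /* hypothetical tuning_parameter */

/* One scalar score per candidate: more free memory raises it, more
 * domains already using those nodes lowers it. */
static double candidate_score(uint32_t free_memkb, int nr_domains)
{
    return MEM_WEIGHT * log((double)free_memkb + 1.0) - nr_domains;
}

/* Comparison helper: the candidate with the higher score sorts first. */
static int score_cmpf(uint32_t m1, int d1, uint32_t m2, int d2)
{
    double s1 = candidate_score(m1, d1);
    double s2 = candidate_score(m2, d2);

    return (s1 < s2) - (s1 > s2);
}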
Dario Faggioli
2012-Jun-26 15:20 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Tue, 2012-06-26 at 12:03 +0100, Ian Jackson wrote:> George Dunlap writes ("Re: [Xen-devel] [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes"): > > The only potential concern is that it seems likely that your > > comparison function above will not actually generate an ordering, > > because the relationship it generates isn''t transitive. That is, you > > could have a situation where A > B, B > C, but C > A. For example: > > A: freemem 1000, domains 1 > > B: freemem 1090, domains 2 > > C: freemem 1110, domains 3 > > > > In this case, A>B, because memory is within 10% but has fewer domains. > > B > C, because B is within 10%, and has fewer domains. But C > A, > > because C has more than 10% more memory than A (so its domains are not > > counted against it). > > The conventional approach to this kind of thing is to invent a score > for sorting etc. Something like > score = tuning_parameter * log(freemem) - number_of_domains > which does sort of roughly the same as your 10% rule if you squint. >Right, I thought about this, but then got a bit scared about properly mixing apple, oranges and melons (as I have number of nodes, amount of free memory and number of domains), giving the proper "weight" to each of them. Anyway, I see the potential issue George is reporting, so I''ll try to come up with some formula... Thanks and regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Campbell
2012-Jun-26 16:00 UTC
Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro
On Thu, 2012-06-21 at 09:53 +0100, Ian Campbell wrote:> On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote: > > Causing a build failure when trying to use it: > > > > xxx: error: expected '';'' before '')'' token > > xxx: error: expected statement before '')'' token > > > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > Acked-by: Ian Campbell <ian.campbell@citrix.com>This bugfix was mostly independent from the rest of the series, so I''ve committed it.> > > > > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > > --- a/tools/libxl/libxl_internal.h > > +++ b/tools/libxl/libxl_internal.h > > @@ -1972,7 +1972,7 @@ struct libxl__domain_create_state { > > #define GCREALLOC_ARRAY(var, nmemb) \ > > (assert(nmemb > 0), \ > > assert(ARRAY_SIZE_OK((var), (nmemb))), \ > > - (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var))))) > > + (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var)))) > > > > > > /* > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Dario Faggioli
2012-Jun-26 16:25 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Fri, 2012-06-22 at 11:14 +0100, Ian Campbell wrote:> > > > + /* Bring the best candidate in front of the list --> candidates[0] */ > > > > + if (nr_candidates > 1) > > > > + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf); > > > > > > Is the start up cost of libxl__sort_numa_candidates significant enough > > > to make that if worthwhile? > > > > > I''m not sure what you mean here, I''m afraid ... :-( > > Is the cost of sorting the 1 element list so high it is worth avoiding. > Since the sort itself would be trivial the startup costs are what would > dominate. >So, the answer is that I really don''t know, as I''m using a library function for doing the actual work. I really do not expect for them to be that high, and I hence can remove the if() if you think it looks ugly. :-)> > > I''m not sure about this, if numa_place_domain fails for any reason would > > > we be not better off logging and continuing without placement? We > > > already do that explicitly in a couple of cases. > > > > > Well, actually, if it "fails" in the sense it finds no candidates or > > does not manage in placing the domain, it returns 0, so we do right what > > you''re suggesting. I''m sure this is stated in some comment but I guess I > > can repeat it here. If !rc, it means we stepped into some ERROR_FAIL or > > something, which I think has to be handled like this, hasn''t it? > > That makes sense. > > > Perhaps I can also add some logging about "successfully failing", i.e., > > not getting into any error but not being able to place the domain. If > > yes, that will happen _inside_ numa_place_domain() rather than here. > > Logging in that case seems wise in any case since it will have > performance implications I think. >Ok, I''ll log something out. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Campbell
2012-Jun-26 16:26 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Tue, 2012-06-26 at 17:25 +0100, Dario Faggioli wrote:> On Fri, 2012-06-22 at 11:14 +0100, Ian Campbell wrote: > > > > > + /* Bring the best candidate in front of the list --> candidates[0] */ > > > > > + if (nr_candidates > 1) > > > > > + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf); > > > > > > > > Is the start up cost of libxl__sort_numa_candidates significant enough > > > > to make that if worthwhile? > > > > > > > I''m not sure what you mean here, I''m afraid ... :-( > > > > Is the cost of sorting the 1 element list so high it is worth avoiding. > > Since the sort itself would be trivial the startup costs are what would > > dominate. > > > So, the answer is that I really don''t know, as I''m using a library > function for doing the actual work. I really do not expect for them to > be that high, and I hence can remove the if() if you think it looks > ugly. :-)Not really ugly, it just seemed like an odd optimisation to make.> > > > > I''m not sure about this, if numa_place_domain fails for any reason would > > > > we be not better off logging and continuing without placement? We > > > > already do that explicitly in a couple of cases. > > > > > > > Well, actually, if it "fails" in the sense it finds no candidates or > > > does not manage in placing the domain, it returns 0, so we do right what > > > you''re suggesting. I''m sure this is stated in some comment but I guess I > > > can repeat it here. If !rc, it means we stepped into some ERROR_FAIL or > > > something, which I think has to be handled like this, hasn''t it? > > > > That makes sense. > > > > > Perhaps I can also add some logging about "successfully failing", i.e., > > > not getting into any error but not being able to place the domain. If > > > yes, that will happen _inside_ numa_place_domain() rather than here. > > > > Logging in that case seems wise in any case since it will have > > performance implications I think. > > > Ok, I''ll log something out. > > Thanks and Regards, > Dario >
Dario Faggioli
2012-Jun-26 16:26 UTC
Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro
On Tue, 2012-06-26 at 17:00 +0100, Ian Campbell wrote:> On Thu, 2012-06-21 at 09:53 +0100, Ian Campbell wrote: > > On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote: > > > Causing a build failure when trying to use it: > > > > > > xxx: error: expected '';'' before '')'' token > > > xxx: error: expected statement before '')'' token > > > > > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > > > Acked-by: Ian Campbell <ian.campbell@citrix.com> > > This bugfix was mostly independent from the rest of the series,Indeed. I submitted it as a separate patch the week before, but it maybe got lost. :-)> so I''ve committed it.Cool. Thanks, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Ian Jackson
2012-Jun-26 17:23 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
Ian Campbell writes ("Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes"):> On Tue, 2012-06-26 at 17:25 +0100, Dario Faggioli wrote: > > So, the answer is that I really don''t know, as I''m using a library > > function for doing the actual work. I really do not expect for them to > > be that high, and I hence can remove the if() if you think it looks > > ugly. :-) > > Not really ugly, it just seemed like an odd optimisation to make.Yes, please remove the if(). It makes the code more complicated and this is not a hot path. Thanks, Ian
Dario Faggioli
2012-Jun-27 08:15 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Tue, 2012-06-26 at 12:03 +0100, Ian Jackson wrote:> The conventional approach to this kind of thing is to invent a score > for sorting etc. Something like > score = tuning_parameter * log(freemem) - number_of_domains > which does sort of roughly the same as your 10% rule if you squint. >However, if I go for this approach, I think I should stick to floating point arith, as I''m not sure I can properly do things like normalization and take log()-s with integers... :-/ Would that be fine? Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
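One way to keep the other formula Ian suggested, score = freemem / (number_of_domains + other_tuning_parameter), entirely in integer arithmetic: for sorting only the relative order of two scores matters, and with positive denominators m1/(d1+K) > m2/(d2+K) exactly when m1*(d2+K) > m2*(d1+K). A sketch, with K a hypothetical tuning value:

#include <stdint.h>

#define K 4   /* hypothetical other_tuning_parameter */

/* Compare freemem/(nr_domains + K) for two candidates by
 * cross-multiplying, so no division or floating point is involved.
 * The candidate with the higher score sorts first. */
static int score_cmpf_int(uint64_t m1, int d1, uint64_t m2, int d2)
{
    uint64_t lhs = m1 * (uint64_t)(d2 + K);
    uint64_t rhs = m2 * (uint64_t)(d1 + K);

    return (lhs < rhs) - (lhs > rhs);
}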
Zhang, Yang Z
2012-Jun-28 07:25 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
Dario Faggioli wrote on 2012-06-16:> If a domain does not have a VCPU affinity, try to pin it automatically > to some PCPUs. This is done taking into account the NUMA characteristics > of the host. In fact, we look for a combination of host''s NUMA nodes > with enough free memory and number of PCPUs for the new domain, and pin > it to the VCPUs of those nodes. > > Once we know which ones, among all the possible combinations, represents > valid placement candidates for a domain, use some heuistics for deciding > which is the best. For instance, smaller candidates are considered to be > better, both from the domain''s point of view (fewer memory spreading > among nodes) and from the system as a whole point of view (fewer memoy > fragmentation). In case of candidates of equal sizes (i.e., with the > same number of nodes), the one with the greater amount of memory wins, > as this is also good for keeping memory fragmentation under control. > > This all happens internally to libxl, and no API for driving the mechanism is > provided for now. This matches what xend already does. We should also consider the basic IONUMA: when a guest has a device assigned, from the point of view of performance it''s better to allocate the memory from the node which the device belongs to. Furthermore, we need to consider guest NUMA (I am working on it now) and live migration. Best regards, Yang
George Dunlap
2012-Jun-28 08:36 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Thu, Jun 28, 2012 at 8:25 AM, Zhang, Yang Z <yang.z.zhang@intel.com> wrote:>> This all happens internally to libxl, and no API for driving the mechanism is >> provided for now. This matches what xend already does. > we should consider the basic IONUMA. I mean when a guest with a device assigned, > from the point of view of performance, it''s better to allocated the memory from the node > which the device belongs to. > Furthermore, we need consider the guest NUMA(I am working on it now) and live migration.Just to help set the context for these patch series: We have lots of big plans for NUMA in the 4.3 release (and are always glad for input and help!), but the current goal is just to have *something* for the upcoming 4.2 release. -George
Dario Faggioli
2012-Jun-28 10:12 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Thu, 2012-06-28 at 07:25 +0000, Zhang, Yang Z wrote:> > This all happens internally to libxl, and no API for driving the mechanism is > > provided for now. This matches what xend already does. > we should consider the basic IONUMA. I mean when a guest with a device assigned, > from the point of view of performance, it''s better to allocated the memory from the node > which the device belongs to. >That sure makes sense, and it would be a very useful extension!> Furthermore, we need consider the guest NUMA(I am working on it now) and live migration. >Great. Can you share, either here on or a separate thread some more details about what you mean by both "guest NUMA" and "live migration" in this context? As George said, that would be 4.3 material, but I''m still very interested. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
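Nothing in this series does it, but for reference the host NUMA node of an assigned PCI device can be read from sysfs on Linux, which is one place such a hint could come from. A sketch (the BDF string is whatever the caller already has for the assigned device):

#include <stdio.h>

/* Read the host NUMA node of a PCI device from sysfs (Linux only).
 * Returns the node id, or -1 if it is unknown or the read fails.
 * bdf is e.g. "0000:00:1f.2". */
static int pci_device_node(const char *bdf)
{
    char path[256];
    int node = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/numa_node", bdf);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &node) != 1)
        node = -1;
    fclose(f);
    return node;
}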
Pasi Kärkkäinen
2012-Jun-28 12:41 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Thu, Jun 28, 2012 at 12:12:09PM +0200, Dario Faggioli wrote:> On Thu, 2012-06-28 at 07:25 +0000, Zhang, Yang Z wrote: > > > This all happens internally to libxl, and no API for driving the mechanism is > > > provided for now. This matches what xend already does. > > we should consider the basic IONUMA. I mean when a guest with a device assigned, > > from the point of view of performance, it''s better to allocated the memory from the node > > which the device belongs to. > > > That sure makes sense, and it would be a very useful extension! > > > Furthermore, we need consider the guest NUMA(I am working on it now) and live migration. > > > Great. Can you share, either here on or a separate thread some more > details about what you mean by both "guest NUMA" and "live migration" in > this context? > > As George said, that would be 4.3 material, but I''m still very > interested. :-) >For the mailinglist archives.. here are the Xen guest VM NUMA slides from XenSummit 2010: http://www.slideshare.net/xen_com_mgr/nakajima-numafinal http://www.slideshare.net/xen_com_mgr/dulloor-xensummit -- Pasi
Dario Faggioli
2012-Jun-28 17:03 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Thu, 2012-06-28 at 15:41 +0300, Pasi Kärkkäinen wrote:> > > Furthermore, we need consider the guest NUMA(I am working on it now) and live migration. > > > > > Great. Can you share, either here on or a separate thread some more > > details about what you mean by both "guest NUMA" and "live migration" in > > this context? > > > > As George said, that would be 4.3 material, but I''m still very > > interested. :-) > > > > For the mailinglist archives.. >:-)> here are the Xen guest VM NUMA slides from XenSummit 2010: > http://www.slideshare.net/xen_com_mgr/nakajima-numafinal > http://www.slideshare.net/xen_com_mgr/dulloor-xensummit >Yes, I know these slides, and I''ve also looked at most of the code they''re referring to. That''s all very interesting and it''s well worthwhile to keep that in mind for future steps in the field of NUMA support. In particular, virtual topology awareness of the guest is something really important to have, and I''ll be happy to discuss/work/contribute/help on it in the near future. However, I''m still not sure I see how that ("guest NUMA") and live migration should affect (either now or during 4.3 dev cycle) the initial placement of a domain... That''s what I was asking. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Zhang, Yang Z
2012-Jun-29 05:29 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
Dario Faggioli wrote on 2012-06-29:> On Thu, 2012-06-28 at 15:41 +0300, Pasi Kärkkäinen wrote: >>>> Furthermore, we need consider the guest NUMA(I am working on it now) >>>> and live migration. >>>> >>> Great. Can you share, either here on or a separate thread some more >>> details about what you mean by both "guest NUMA" and "live migration" in >>> this context? >>> >>> As George said, that would be 4.3 material, but I'm still very >>> interested. :-) >>> >> >> For the mailinglist archives.. >> > :-) > >> here are the Xen guest VM NUMA slides from XenSummit 2010: >> http://www.slideshare.net/xen_com_mgr/nakajima-numafinal >> http://www.slideshare.net/xen_com_mgr/dulloor-xensummit >> > Yes, I know this slides, and I've also looked at most of the code > they're referring to. That's all very interesting and it's well > worthwhile to keep that in mind for future steps in the field of NUMA > support. In particular, virtual topology awareness of the guest is > something really important to have, and I'll be happy to > discuss/work/contribute/help on it in the near future. > > However, I'm still not sure I see how that ("guest NUMA") and live > migration should affect (either now or during 4.3 dev cycle) the initial > placement of a domain... That's what I was asking.When doing migration, if the target machine is able to provide the same memory framework for the guest, we need to consider it firstly. Best regards, Yang _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Zhang, Yang Z
2012-Jun-29 05:38 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
George Dunlap wrote on 2012-06-28:> On Thu, Jun 28, 2012 at 8:25 AM, Zhang, Yang Z <yang.z.zhang@intel.com> wrote: >>> This all happens internally to libxl, and no API for driving the mechanism is >>> provided for now. This matches what xend already does. >> we should consider the basic IONUMA. I mean when a guest with a device >> assigned, from the point of view of performance, it''s better to >> allocated the memory from the node which the device belongs to. >> Furthermore, we need consider the guest NUMA(I am working on it now) and > live migration. > > Just to help set the context for these patch series: We have lots of > big plans for NUMA in the 4.3 release (and are always glad for input > and help!), but the current goal is just to have *something* for the > upcoming 4.2 release.Nice to hear you guys have big plans for NUMA things. :) Best regards, Yang
Dario Faggioli
2012-Jun-29 09:38 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Fri, 2012-06-29 at 05:29 +0000, Zhang, Yang Z wrote:> > However, I''m still not sure I see how that ("guest NUMA") and live > > migration should affect (either now or during 4.3 dev cycle) the initial > > placement of a domain... That''s what I was asking. > > When doing migration, if the target machine is able to provide the same memory framework for the guest, we need to consider it firstly. >Ok, I see, and I think it makes sense. Thanks for replying. :-) Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2012-Jun-29 09:46 UTC
Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes
On Fri, 2012-06-29 at 05:38 +0000, Zhang, Yang Z wrote:> > Just to help set the context for these patch series: We have lots of > > big plans for NUMA in the 4.3 release (and are always glad for input > > and help!), but the current goal is just to have *something* for the > > upcoming 4.2 release. > > Nice to hear you guys have the big plans fo NUMA things. :) >Yes, we actually do! :-) So, as you mentioned you''re working on related stuff, do you think it would be useful to try to coordinate a bit? You know, just to avoid duplicating the efforts too much, and to achieve the best result with minimum strain... I think even just opening a new thread on this mailing list would do, but I''m open to any other solution anyone has in mind. For the record, I plan to talk about NUMA at and during XenSummit too (although I still don''t know whether my talk has been accepted or not :-D). Just let me (again, you and/or anyone else) know what your thoughts and preferences are. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel