thr3ads.net - Xen devel - [PATCH 00 of 10 v2] Automatic NUMA placement for xl [Jun 2012]

If this information is useful, please help other people find it:
Share via:

Dario Faggioli

2012-Jun-15 17:04 UTC

[PATCH 00 of 10 v2] Automatic NUMA placement for xl

Hello Everyone,

This is the second take for automaic NUMA placement for Xen 4.2, for the sake
of feature parity with xm/xend.

All the comments v1 received have been addressed. Actually, some bits of the
patchset have been deeply restructured for achieving this. Details in single
changelogs.

Here it is what the series ships:

 1/10 libxl: fix a typo in the GCREALLOC_ARRAY macro

Is just a build fix (already posted separately).

 2/10 libxl: add a new Array type to the IDL
 3/10 libxl,libxc: introduce libxl_get_numainfo()
 4/10 xl: add more NUMA information to `xl info -n''
 5/10 libxl: rename libxl_cpumap to libxl_bitmap
 6/10 libxl: expand the libxl_bitmap API a bit
 7/10 libxl: introduce some node map helpers
 
Introduce the data structures, calls and infrastructure needed for retrieving
all the information about the NUMA-ness of the system and deal with them at the
toolstack level.

 8/10 libxl: enable automatic placement of guests on NUMA nodes
 9/10 libxl: have NUMA placement deal with cpupools

Is the actual, well, ''food'' (no, I''m not calling it
''the meat'' as I''m
vegetarian!  :-D).

 10/10 Some automatic NUMA placement documentation

Puts some more details about the implementation and the usage of the new
feature directly in the tree.

Thanks a lot and Regards,
Dario

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro

Causing a build failure when trying to use it:

xxx: error: expected '';'' before '')'' token
xxx: error: expected statement before '')'' token

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -1972,7 +1972,7 @@ struct libxl__domain_create_state {
 #define GCREALLOC_ARRAY(var, nmemb)                                     \
     (assert(nmemb > 0),                                                 \
      assert(ARRAY_SIZE_OK((var), (nmemb))),                             \
-     (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var)))))
+     (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var))))
 
 
 /*

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 02 of 10 v2] libxl: add a new Array type to the IDL

And make all the required infrastructure updates to enable this.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Tested-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/gentest.py b/tools/libxl/gentest.py
--- a/tools/libxl/gentest.py
+++ b/tools/libxl/gentest.py
@@ -27,6 +27,18 @@ def gen_rand_init(ty, v, indent = "    "
     s = ""
     if isinstance(ty, idl.Enumeration):
         s += "%s = %s;\n" % (ty.pass_arg(v, parent is None),
randomize_enum(ty))
+    elif isinstance(ty, idl.Array):
+        if parent is None:
+            raise Exception("Array type must have a parent")
+        s += "%s = rand()%%8;\n" % (parent + ty.lenvar.name)
+        s += "%s = calloc(%s, sizeof(*%s));\n" % \
+            (v, parent + ty.lenvar.name, v)
+        s += "{\n"
+        s += "    int i;\n"
+        s += "    for (i=0; i<%s; i++)\n" % (parent +
ty.lenvar.name)
+        s += gen_rand_init(ty.elem_type, v+"[i]",
+                           indent + "        ", parent)
+        s += "}\n"
     elif isinstance(ty, idl.KeyedUnion):
         if parent is None:
             raise Exception("KeyedUnion type must have a parent")
diff --git a/tools/libxl/gentypes.py b/tools/libxl/gentypes.py
--- a/tools/libxl/gentypes.py
+++ b/tools/libxl/gentypes.py
@@ -11,8 +11,12 @@ def libxl_C_instance_of(ty, instancename
             return libxl_C_type_define(ty)
         else:
             return libxl_C_type_define(ty) + " " + instancename
-    else:
-        return ty.typename + " " + instancename
+
+    s = ""
+    if isinstance(ty, idl.Array):
+        s += libxl_C_instance_of(ty.lenvar.type, ty.lenvar.name) +
";\n"
+
+    return s + ty.typename + " " + instancename
 
 def libxl_C_type_define(ty, indent = ""):
     s = ""
@@ -66,6 +70,21 @@ def libxl_C_type_dispose(ty, v, indent              s +=
libxl_C_type_dispose(f.type, fexpr, indent + "    ", nparent)
             s += "    break;\n"
         s += "}\n"
+    elif isinstance(ty, idl.Array):
+        if parent is None:
+            raise Exception("Array type must have a parent")
+        if ty.elem_type.dispose_fn is not None:
+            s += "{\n"
+            s += "    int i;\n"
+            s += "    for (i=0; i<%s; i++)\n" % (parent +
ty.lenvar.name)
+            s += libxl_C_type_dispose(ty.elem_type, v+"[i]",
+                                      indent + "        ", parent)
+        if ty.dispose_fn is not None:
+            if ty.elem_type.dispose_fn is not None:
+                s += "    "
+            s += "%s(%s);\n" % (ty.dispose_fn, ty.pass_arg(v, parent
is None))
+        if ty.elem_type.dispose_fn is not None:
+            s += "}\n"
     elif isinstance(ty, idl.Struct) and (parent is None or ty.dispose_fn is
None):
         for f in [f for f in ty.fields if not f.const]:
             (nparent,fexpr) = ty.member(v, f, parent is None)
@@ -164,7 +183,24 @@ def libxl_C_type_gen_json(ty, v, indent 
     s = ""
     if parent is None:
         s += "yajl_gen_status s;\n"
-    if isinstance(ty, idl.Enumeration):
+
+    if isinstance(ty, idl.Array):
+        if parent is None:
+            raise Exception("Array type must have a parent")
+        s += "{\n"
+        s += "    int i;\n"
+        s += "    s = yajl_gen_array_open(hand);\n"
+        s += "    if (s != yajl_gen_status_ok)\n"
+        s += "        goto out;\n"
+        s += "    for (i=0; i<%s; i++) {\n" % (parent +
ty.lenvar.name)
+        s += libxl_C_type_gen_json(ty.elem_type, v+"[i]",
+                                   indent + "        ", parent)
+        s += "    }\n"
+        s += "    s = yajl_gen_array_close(hand);\n"
+        s += "    if (s != yajl_gen_status_ok)\n"
+        s += "        goto out;\n"
+        s += "}\n"
+    elif isinstance(ty, idl.Enumeration):
         s += "s = libxl__yajl_gen_enum(hand, %s_to_string(%s));\n" %
(ty.typename, ty.pass_arg(v, parent is None))
         s += "if (s != yajl_gen_status_ok)\n"
         s += "    goto out;\n"
diff --git a/tools/libxl/idl.py b/tools/libxl/idl.py
--- a/tools/libxl/idl.py
+++ b/tools/libxl/idl.py
@@ -266,6 +266,17 @@ string = Builtin("char *", namespace = N
                  json_fn = "libxl__string_gen_json",
                  autogenerate_json = False)
 
+class Array(Type):
+    """An array of the same type"""
+    def __init__(self, elem_type, lenvar_name, **kwargs):
+        kwargs.setdefault(''dispose_fn'',
''free'')
+        Type.__init__(self, namespace=elem_type.namespace,
typename=elem_type.rawname + " *", **kwargs)
+
+        lv_kwargs = dict([(x.lstrip(''lenvar_''),y) for (x,y)
in kwargs.items() if x.startswith(''lenvar_'')])
+
+        self.lenvar = Field(integer, lenvar_name, **lv_kwargs)
+        self.elem_type = elem_type
+
 class OrderedDict(dict):
     """A dictionary which remembers insertion order.
 
diff --git a/tools/libxl/idl.txt b/tools/libxl/idl.txt
--- a/tools/libxl/idl.txt
+++ b/tools/libxl/idl.txt
@@ -145,12 +145,24 @@ idl.KeyedUnion
 
  A subclass of idl.Aggregate which represents the C union type
  where the currently valid member of the union can be determined based
- upon another member in the containing type.
+ upon another member in the containing type. An idl.KeyedUnion must
+ always be a member of a containing idl.Aggregate type.
 
- The KeyedUnion.keyvar contains an idl.type the member of the
+ The KeyedUnion.keyvar contains an idl.Type the member of the
  containing type which determines the valid member of the union. The
  must be an instance of the Enumeration type.
 
+idl.Array
+
+  A class representing an array or similar elements. An idl.Array must
+  always be an idl.Field of a containing idl.Aggregate.
+
+  idl.Array.elem_type contains an idl.Type which is the type of all
+  elements of the array.
+
+  idl.Array.len_var contains an idl.Field which is added to the parent
+  idl.Aggregate and will contain the length of the array.
+
 Standard Types
 --------------
 
diff --git a/tools/ocaml/libs/xl/genwrap.py b/tools/ocaml/libs/xl/genwrap.py
--- a/tools/ocaml/libs/xl/genwrap.py
+++ b/tools/ocaml/libs/xl/genwrap.py
@@ -55,7 +55,8 @@ def ocaml_type_of(ty):
             return "int%d" % ty.width
         else:
             return "int"
-
+    elif isinstance(ty,idl.Array):
+        return "%s list" % ocaml_type_of(ty.elem_type)
     elif isinstance(ty,idl.Builtin):
         if not builtins.has_key(ty.typename):
             raise NotImplementedError("Unknown Builtin %s (%s)" %
(ty.typename, type(ty)))
diff --git a/tools/python/genwrap.py b/tools/python/genwrap.py
--- a/tools/python/genwrap.py
+++ b/tools/python/genwrap.py
@@ -4,7 +4,7 @@ import sys,os
 
 import idl
 
-(TYPE_DEFBOOL, TYPE_BOOL, TYPE_INT, TYPE_UINT, TYPE_STRING, TYPE_AGGREGATE) =
range(6)
+(TYPE_DEFBOOL, TYPE_BOOL, TYPE_INT, TYPE_UINT, TYPE_STRING, TYPE_ARRAY,
TYPE_AGGREGATE) = range(7)
 
 def py_type(ty):
     if ty == idl.bool:
@@ -18,6 +18,8 @@ def py_type(ty):
             return TYPE_INT
         else:
             return TYPE_UINT
+    if isinstance(ty, idl.Array):
+        return TYPE_ARRAY
     if isinstance(ty, idl.Aggregate):
         return TYPE_AGGREGATE
     if ty == idl.string:
@@ -74,7 +76,7 @@ def py_attrib_get(ty, f):
         l.append(''    return
genwrap__ull_get(self->obj.%s);''%f.name)
     elif t == TYPE_STRING:
         l.append(''    return
genwrap__string_get(&self->obj.%s);''%f.name)
-    elif t == TYPE_AGGREGATE:
+    elif t == TYPE_AGGREGATE or t == TYPE_ARRAY:
         l.append(''    PyErr_SetString(PyExc_NotImplementedError,
"Getting %s");''%ty.typename)
         l.append(''    return NULL;'')
     else:
@@ -105,7 +107,7 @@ def py_attrib_set(ty, f):
         l.append(''    return ret;'')
     elif t == TYPE_STRING:
         l.append(''    return genwrap__string_set(v,
&self->obj.%s);''%f.name)
-    elif t == TYPE_AGGREGATE:
+    elif t == TYPE_AGGREGATE or t == TYPE_ARRAY:
         l.append(''    PyErr_SetString(PyExc_NotImplementedError,
"Setting %s");''%ty.typename)
         l.append(''    return -1;'')
     else:

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()

Make some NUMA node information available to the toolstack. Achieve
this by means of xc_numainfo(), which exposes memory size and amount
of free memory of each node, as well as the relative distances of
each node to all the others.

For properly exposing distances we need the IDL to support arrays.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * malloc converted to libxl__zalloc(NOGC, ...).
 * The patch also accommodates some bits of what was in "libxc,
   libxl: introduce xc_nodemap_t and libxl_nodemap" which was
   removed as well, as full support for node maps at libxc
   level is not needed (yet!).

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -35,6 +35,20 @@ int xc_get_max_cpus(xc_interface *xch)
     return max_cpus;
 }
 
+int xc_get_max_nodes(xc_interface *xch)
+{
+    static int max_nodes = 0;
+    xc_physinfo_t physinfo;
+
+    if ( max_nodes )
+        return max_nodes;
+
+    if ( !xc_physinfo(xch, &physinfo) )
+        max_nodes = physinfo.max_node_id + 1;
+
+    return max_nodes;
+}
+
 int xc_get_cpumap_size(xc_interface *xch)
 {
     return (xc_get_max_cpus(xch) + 7) / 8;
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -329,6 +329,12 @@ int xc_get_cpumap_size(xc_interface *xch
 /* allocate a cpumap */
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch);
 
+ /*
+ * NODEMAP handling
+ */
+/* return maximum number of NUMA nodes the hypervisor supports */
+int xc_get_max_nodes(xc_interface *xch);
+
 /*
  * DOMAIN DEBUGGING FUNCTIONS
  */
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -3223,6 +3223,71 @@ fail:
     return ret;
 }
 
+libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr)
+{
+    xc_numainfo_t ninfo;
+    DECLARE_HYPERCALL_BUFFER(xc_node_to_memsize_t, memsize);
+    DECLARE_HYPERCALL_BUFFER(xc_node_to_memfree_t, memfree);
+    DECLARE_HYPERCALL_BUFFER(uint32_t, node_dists);
+    libxl_numainfo *ret = NULL;
+    int i, j, max_nodes;
+
+    max_nodes = libxl_get_max_nodes(ctx);
+    if (max_nodes == 0)
+    {
+        LIBXL__LOG(ctx, XTL_ERROR, "Unable to determine number of
NODES");
+        return NULL;
+    }
+
+    memsize = xc_hypercall_buffer_alloc
+        (ctx->xch, memsize, sizeof(*memsize) * max_nodes);
+    memfree = xc_hypercall_buffer_alloc
+        (ctx->xch, memfree, sizeof(*memfree) * max_nodes);
+    node_dists = xc_hypercall_buffer_alloc
+        (ctx->xch, node_dists, sizeof(*node_dists) * max_nodes * max_nodes);
+    if ((memsize == NULL) || (memfree == NULL) || (node_dists == NULL)) {
+        LIBXL__LOG_ERRNOVAL(ctx, XTL_ERROR, ENOMEM,
+                            "Unable to allocate hypercall
arguments");
+        goto fail;
+    }
+
+    set_xen_guest_handle(ninfo.node_to_memsize, memsize);
+    set_xen_guest_handle(ninfo.node_to_memfree, memfree);
+    set_xen_guest_handle(ninfo.node_to_node_distance, node_dists);
+    ninfo.max_node_index = max_nodes - 1;
+    if (xc_numainfo(ctx->xch, &ninfo) != 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting numainfo");
+        goto fail;
+    }
+
+    if (ninfo.max_node_index < max_nodes - 1)
+        max_nodes = ninfo.max_node_index + 1;
+
+    ret = libxl__zalloc(NOGC, sizeof(libxl_numainfo) * max_nodes);
+    for (i = 0; i < max_nodes; i++)
+        ret[i].dists = libxl__zalloc(NULL, sizeof(*node_dists) * max_nodes);
+
+    for (i = 0; i < max_nodes; i++) {
+#define V(mem, i) (mem[i] == INVALID_NUMAINFO_ID) ? \
+    LIBXL_NUMAINFO_INVALID_ENTRY : mem[i]
+        ret[i].size = V(memsize, i);
+        ret[i].free = V(memfree, i);
+        ret[i].num_dists = max_nodes;
+        for (j = 0; j < ret[i].num_dists; j++)
+            ret[i].dists[j] = V(node_dists, i * max_nodes + j);
+#undef V
+    }
+
+fail:
+    xc_hypercall_buffer_free(ctx->xch, memsize);
+    xc_hypercall_buffer_free(ctx->xch, memfree);
+    xc_hypercall_buffer_free(ctx->xch, node_dists);
+
+    if (ret)
+        *nr = max_nodes;
+    return ret;
+}
+
 const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx)
 {
     union {
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -532,6 +532,9 @@ int libxl_domain_preserve(libxl_ctx *ctx
 /* get max. number of cpus supported by hypervisor */
 int libxl_get_max_cpus(libxl_ctx *ctx);
 
+/* get max. number of NUMA nodes supported by hypervisor */
+int libxl_get_max_nodes(libxl_ctx *ctx);
+
 int libxl_domain_rename(libxl_ctx *ctx, uint32_t domid,
                         const char *old_name, const char *new_name);
 
@@ -769,6 +772,13 @@ int libxl_get_physinfo(libxl_ctx *ctx, l
 #define LIBXL_CPUTOPOLOGY_INVALID_ENTRY (~(uint32_t)0)
 libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nr);
 void libxl_cputopology_list_free(libxl_cputopology *, int nr);
+#define LIBXL_NUMAINFO_INVALID_ENTRY (~(uint32_t)0)
+libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr);
+  /* On success, a list of nr libxl_numainfo elements is returned.
+   * That is from malloc, thus it is up to the caller to invoke
+   * libxl_cpupoolinfo_list_free() on it.
+   */
+void libxl_numainfo_list_free(libxl_numainfo *, int nr);
 libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
                                        int *nb_vcpu, int *nrcpus);
 void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -423,6 +423,12 @@ libxl_physinfo = Struct("physinfo", [
     ("cap_hvm_directio", bool),
     ], dir=DIR_OUT)
 
+libxl_numainfo = Struct("numainfo", [
+    ("size", uint64),
+    ("free", uint64),
+    ("dists", Array(uint32, "num_dists")),
+    ], dir=DIR_OUT)
+
 libxl_cputopology = Struct("cputopology", [
     ("core", uint32),
     ("socket", uint32),
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -537,6 +537,11 @@ int libxl_get_max_cpus(libxl_ctx *ctx)
     return xc_get_max_cpus(ctx->xch);
 }
 
+int libxl_get_max_nodes(libxl_ctx *ctx)
+{
+    return xc_get_max_nodes(ctx->xch);
+}
+
 int libxl__enum_from_string(const libxl_enum_string_table *t,
                             const char *s, int *e)
 {
@@ -551,6 +556,14 @@ int libxl__enum_from_string(const libxl_
     return ERROR_FAIL;
 }
 
+void libxl_numainfo_list_free(libxl_numainfo *list, int nr)
+{
+    int i;
+    for (i = 0; i < nr; i++)
+        libxl_numainfo_dispose(&list[i]);
+    free(list);
+}
+
 void libxl_cputopology_list_free(libxl_cputopology *list, int nr)
 {
     int i;
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -484,6 +484,7 @@ typedef struct xen_sysctl_topologyinfo x
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_topologyinfo_t);
 
 /* XEN_SYSCTL_numainfo */
+#define INVALID_NUMAINFO_ID (~0U)
 struct xen_sysctl_numainfo {
     /*
      * IN: maximum addressable entry in the caller-provided arrays.

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 04 of 10 v2] xl: add more NUMA information to `xl info -n''

So that the user knows how much memory there is on each node and
how far they are from each others.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * integer division replaced by right shift.

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4254,6 +4254,36 @@ static void output_physinfo(void)
     return;
 }
 
+static void output_numainfo(void)
+{
+    libxl_numainfo *info;
+    int i, j, nr;
+
+    info = libxl_get_numainfo(ctx, &nr);
+    if (info == NULL) {
+        fprintf(stderr, "libxl_get_numainfo failed.\n");
+        return;
+    }
+
+    printf("numa_info              :\n");
+    printf("node:    memsize    memfree    distances\n");
+
+    for (i = 0; i < nr; i++) {
+        if (info[i].size != LIBXL_NUMAINFO_INVALID_ENTRY) {
+            printf("%4d:    %6"PRIu64"     %6"PRIu64" 
%d", i,
+                   info[i].size >> 20, info[i].free >> 20,
+                   info[i].dists[0]);
+            for (j = 1; j < info[i].num_dists; j++)
+                printf(",%d", info[i].dists[j]);
+            printf("\n");
+        }
+    }
+
+    libxl_numainfo_list_free(info, nr);
+
+    return;
+}
+
 static void output_topologyinfo(void)
 {
     libxl_cputopology *info;
@@ -4276,8 +4306,6 @@ static void output_topologyinfo(void)
 
     libxl_cputopology_list_free(info, nr);
 
-    printf("numa_info              : none\n");
-
     return;
 }
 
@@ -4287,8 +4315,10 @@ static void info(int numa)
 
     output_physinfo();
 
-    if (numa)
+    if (numa) {
         output_topologyinfo();
+        output_numainfo();
+    }
 
     output_xeninfo();

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap

And leave to the caller the burden of knowing and remembering what kind
of bitmap each instance of libxl_bitmap is.

This is basically just some s/libxl_cpumap/libxl_bitmap/ (and some other
related interface name substitution, e.g., libxl_for_each_cpu) in a bunch
of files, with no real functional change involved.

A specific allocation helper is introduced, besides libxl_bitmap_alloc().
It is called libxl_cpu_bitmap_alloc() and is meant at substituting the old
libxl_cpumap_alloc(). It is just something easier to use in cases where one
wants to allocate a libxl_bitmap that is going to serve as a cpu map.

This is because we want to be able to deal with both cpu and NUMA node
maps, but we don''t want to duplicate all the various helpers and
wrappers.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.eu.com>

Changes from v1:
 * this patch replaces "libxl: abstract libxl_cpumap to just
libxl_map"
   as it directly change the name of the old type instead of adding one
   more abstraction layer.

diff --git a/tools/libxl/gentest.py b/tools/libxl/gentest.py
--- a/tools/libxl/gentest.py
+++ b/tools/libxl/gentest.py
@@ -20,7 +20,7 @@ def randomize_case(s):
 def randomize_enum(e):
     return random.choice([v.name for v in e.values])
 
-handcoded = ["libxl_cpumap", "libxl_key_value_list",
+handcoded = ["libxl_bitmap", "libxl_key_value_list",
              "libxl_cpuid_policy_list",
"libxl_string_list"]
 
 def gen_rand_init(ty, v, indent = "    ", parent = None):
@@ -117,16 +117,16 @@ static void rand_bytes(uint8_t *p, size_
         p[i] = rand() % 256;
 }
 
-static void libxl_cpumap_rand_init(libxl_cpumap *cpumap)
+static void libxl_bitmap_rand_init(libxl_bitmap *bitmap)
 {
     int i;
-    cpumap->size = rand() % 16;
-    cpumap->map = calloc(cpumap->size, sizeof(*cpumap->map));
-    libxl_for_each_cpu(i, *cpumap) {
+    bitmap->size = rand() % 16;
+    bitmap->map = calloc(bitmap->size, sizeof(*bitmap->map));
+    libxl_for_each_bit(i, *bitmap) {
         if (rand() % 2)
-            libxl_cpumap_set(cpumap, i);
+            libxl_bitmap_set(bitmap, i);
         else
-            libxl_cpumap_reset(cpumap, i);
+            libxl_bitmap_reset(bitmap, i);
     }
 }
 
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -570,7 +570,7 @@ static int cpupool_info(libxl__gc *gc,
     info->poolid = xcinfo->cpupool_id;
     info->sched = xcinfo->sched_id;
     info->n_dom = xcinfo->n_dom;
-    if (libxl_cpumap_alloc(CTX, &info->cpumap))
+    if (libxl_cpu_bitmap_alloc(CTX, &info->cpumap))
         goto out;
     memcpy(info->cpumap.map, xcinfo->cpumap, info->cpumap.size);
 
@@ -3352,7 +3352,7 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ct
     }
 
     for (*nb_vcpu = 0; *nb_vcpu <= domaininfo.max_vcpu_id; ++*nb_vcpu,
++ptr) {
-        if (libxl_cpumap_alloc(ctx, &ptr->cpumap)) {
+        if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap)) {
             LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating
cpumap");
             return NULL;
         }
@@ -3375,7 +3375,7 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ct
 }
 
 int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid,
-                           libxl_cpumap *cpumap)
+                           libxl_bitmap *cpumap)
 {
     if (xc_vcpu_setaffinity(ctx->xch, domid, vcpuid, cpumap->map)) {
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting vcpu
affinity");
@@ -3385,7 +3385,7 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
 }
 
 int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
-                               unsigned int max_vcpus, libxl_cpumap *cpumap)
+                               unsigned int max_vcpus, libxl_bitmap *cpumap)
 {
     int i, rc = 0;
 
@@ -3399,7 +3399,7 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
     return rc;
 }
 
-int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap)
+int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap)
 {
     GC_INIT(ctx);
     libxl_dominfo info;
@@ -3419,7 +3419,7 @@ retry_transaction:
     for (i = 0; i <= info.vcpu_max_id; i++)
         libxl__xs_write(gc, t,
                        libxl__sprintf(gc, "%s/cpu/%u/availability",
dompath, i),
-                       "%s", libxl_cpumap_test(cpumap, i) ?
"online" : "offline");
+                       "%s", libxl_bitmap_test(cpumap, i) ?
"online" : "offline");
     if (!xs_transaction_end(ctx->xsh, t, 0)) {
         if (errno == EAGAIN)
             goto retry_transaction;
@@ -4015,7 +4015,7 @@ int libxl_tmem_freeable(libxl_ctx *ctx)
     return rc;
 }
 
-int libxl_get_freecpus(libxl_ctx *ctx, libxl_cpumap *cpumap)
+int libxl_get_freecpus(libxl_ctx *ctx, libxl_bitmap *cpumap)
 {
     int ncpus;
 
@@ -4034,7 +4034,7 @@ int libxl_get_freecpus(libxl_ctx *ctx, l
 
 int libxl_cpupool_create(libxl_ctx *ctx, const char *name,
                          libxl_scheduler sched,
-                         libxl_cpumap cpumap, libxl_uuid *uuid,
+                         libxl_bitmap cpumap, libxl_uuid *uuid,
                          uint32_t *poolid)
 {
     GC_INIT(ctx);
@@ -4057,8 +4057,8 @@ int libxl_cpupool_create(libxl_ctx *ctx,
         return ERROR_FAIL;
     }
 
-    libxl_for_each_cpu(i, cpumap)
-        if (libxl_cpumap_test(&cpumap, i)) {
+    libxl_for_each_bit(i, cpumap)
+        if (libxl_bitmap_test(&cpumap, i)) {
             rc = xc_cpupool_addcpu(ctx->xch, *poolid, i);
             if (rc) {
                 LIBXL__LOG_ERRNOVAL(ctx, LIBXL__LOG_ERROR, rc,
@@ -4093,7 +4093,7 @@ int libxl_cpupool_destroy(libxl_ctx *ctx
     int rc, i;
     xc_cpupoolinfo_t *info;
     xs_transaction_t t;
-    libxl_cpumap cpumap;
+    libxl_bitmap cpumap;
 
     info = xc_cpupool_getinfo(ctx->xch, poolid);
     if (info == NULL) {
@@ -4106,12 +4106,12 @@ int libxl_cpupool_destroy(libxl_ctx *ctx
         goto out;
 
     rc = ERROR_NOMEM;
-    if (libxl_cpumap_alloc(ctx, &cpumap))
+    if (libxl_cpu_bitmap_alloc(ctx, &cpumap))
         goto out;
 
     memcpy(cpumap.map, info->cpumap, cpumap.size);
-    libxl_for_each_cpu(i, cpumap)
-        if (libxl_cpumap_test(&cpumap, i)) {
+    libxl_for_each_bit(i, cpumap)
+        if (libxl_bitmap_test(&cpumap, i)) {
             rc = xc_cpupool_removecpu(ctx->xch, poolid, i);
             if (rc) {
                 LIBXL__LOG_ERRNOVAL(ctx, LIBXL__LOG_ERROR, rc,
@@ -4140,7 +4140,7 @@ int libxl_cpupool_destroy(libxl_ctx *ctx
     rc = 0;
 
 out1:
-    libxl_cpumap_dispose(&cpumap);
+    libxl_bitmap_dispose(&cpumap);
 out:
     xc_cpupool_infofree(ctx->xch, info);
     GC_FREE;
@@ -4208,7 +4208,7 @@ int libxl_cpupool_cpuadd_node(libxl_ctx 
 {
     int rc = 0;
     int cpu, nr;
-    libxl_cpumap freemap;
+    libxl_bitmap freemap;
     libxl_cputopology *topology;
 
     if (libxl_get_freecpus(ctx, &freemap)) {
@@ -4223,7 +4223,7 @@ int libxl_cpupool_cpuadd_node(libxl_ctx 
 
     *cpus = 0;
     for (cpu = 0; cpu < nr; cpu++) {
-        if (libxl_cpumap_test(&freemap, cpu) && (topology[cpu].node
== node) &&
+        if (libxl_bitmap_test(&freemap, cpu) && (topology[cpu].node
== node) &&
             !libxl_cpupool_cpuadd(ctx, poolid, cpu)) {
                 (*cpus)++;
         }
@@ -4232,7 +4232,7 @@ int libxl_cpupool_cpuadd_node(libxl_ctx 
 
     free(topology);
 out:
-    libxl_cpumap_dispose(&freemap);
+    libxl_bitmap_dispose(&freemap);
     return rc;
 }
 
@@ -4274,7 +4274,7 @@ int libxl_cpupool_cpuremove_node(libxl_c
         if (poolinfo[p].poolid == poolid) {
             for (cpu = 0; cpu < nr_cpus; cpu++) {
                 if ((topology[cpu].node == node) &&
-                    libxl_cpumap_test(&poolinfo[p].cpumap, cpu) &&
+                    libxl_bitmap_test(&poolinfo[p].cpumap, cpu) &&
                     !libxl_cpupool_cpuremove(ctx, poolid, cpu)) {
                         (*cpus)++;
                 }
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -285,8 +285,8 @@ typedef uint64_t libxl_ev_user;
 typedef struct {
     uint32_t size;          /* number of bytes in map */
     uint8_t *map;
-} libxl_cpumap;
-void libxl_cpumap_dispose(libxl_cpumap *map);
+} libxl_bitmap;
+void libxl_bitmap_dispose(libxl_bitmap *map);
 
 /* libxl_cpuid_policy_list is a dynamic array storing CPUID policies
  * for multiple leafs. It is terminated with an entry holding
@@ -783,10 +783,10 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ct
                                        int *nb_vcpu, int *nrcpus);
 void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr);
 int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid,
-                           libxl_cpumap *cpumap);
+                           libxl_bitmap *cpumap);
 int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
-                               unsigned int max_vcpus, libxl_cpumap *cpumap);
-int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap);
+                               unsigned int max_vcpus, libxl_bitmap *cpumap);
+int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap);
 
 libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
 
@@ -836,10 +836,10 @@ int libxl_tmem_shared_auth(libxl_ctx *ct
                            int auth);
 int libxl_tmem_freeable(libxl_ctx *ctx);
 
-int libxl_get_freecpus(libxl_ctx *ctx, libxl_cpumap *cpumap);
+int libxl_get_freecpus(libxl_ctx *ctx, libxl_bitmap *cpumap);
 int libxl_cpupool_create(libxl_ctx *ctx, const char *name,
                          libxl_scheduler sched,
-                         libxl_cpumap cpumap, libxl_uuid *uuid,
+                         libxl_bitmap cpumap, libxl_uuid *uuid,
                          uint32_t *poolid);
 int libxl_cpupool_destroy(libxl_ctx *ctx, uint32_t poolid);
 int libxl_cpupool_rename(libxl_ctx *ctx, const char *name, uint32_t poolid);
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -150,9 +150,9 @@ int libxl__domain_build_info_setdefault(
         b_info->cur_vcpus = 1;
 
     if (!b_info->cpumap.size) {
-        if (libxl_cpumap_alloc(CTX, &b_info->cpumap))
+        if (libxl_cpu_bitmap_alloc(CTX, &b_info->cpumap))
             return ERROR_NOMEM;
-        libxl_cpumap_set_any(&b_info->cpumap);
+        libxl_bitmap_set_any(&b_info->cpumap);
     }
 
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c
--- a/tools/libxl/libxl_json.c
+++ b/tools/libxl/libxl_json.c
@@ -99,8 +99,8 @@ yajl_gen_status libxl_uuid_gen_json(yajl
     return yajl_gen_string(hand, (const unsigned char *)buf,
LIBXL_UUID_FMTLEN);
 }
 
-yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand,
-                                      libxl_cpumap *cpumap)
+yajl_gen_status libxl_bitmap_gen_json(yajl_gen hand,
+                                      libxl_bitmap *cpumap)
 {
     yajl_gen_status s;
     int i;
@@ -108,8 +108,8 @@ yajl_gen_status libxl_cpumap_gen_json(ya
     s = yajl_gen_array_open(hand);
     if (s != yajl_gen_status_ok) goto out;
 
-    libxl_for_each_cpu(i, *cpumap) {
-        if (libxl_cpumap_test(cpumap, i)) {
+    libxl_for_each_bit(i, *cpumap) {
+        if (libxl_bitmap_test(cpumap, i)) {
             s = yajl_gen_integer(hand, i);
             if (s != yajl_gen_status_ok) goto out;
         }
diff --git a/tools/libxl/libxl_json.h b/tools/libxl/libxl_json.h
--- a/tools/libxl/libxl_json.h
+++ b/tools/libxl/libxl_json.h
@@ -26,7 +26,7 @@ yajl_gen_status libxl_defbool_gen_json(y
 yajl_gen_status libxl_domid_gen_json(yajl_gen hand, libxl_domid *p);
 yajl_gen_status libxl_uuid_gen_json(yajl_gen hand, libxl_uuid *p);
 yajl_gen_status libxl_mac_gen_json(yajl_gen hand, libxl_mac *p);
-yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, libxl_cpumap *p);
+yajl_gen_status libxl_bitmap_gen_json(yajl_gen hand, libxl_bitmap *p);
 yajl_gen_status libxl_cpuid_policy_list_gen_json(yajl_gen hand,
                                                  libxl_cpuid_policy_list *p);
 yajl_gen_status libxl_string_list_gen_json(yajl_gen hand, libxl_string_list
*p);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -10,7 +10,7 @@ libxl_defbool = Builtin("defbool", passb
 libxl_domid = Builtin("domid", json_fn =
"yajl_gen_integer", autogenerate_json = False)
 libxl_uuid = Builtin("uuid", passby=PASS_BY_REFERENCE)
 libxl_mac = Builtin("mac", passby=PASS_BY_REFERENCE)
-libxl_cpumap = Builtin("cpumap",
dispose_fn="libxl_cpumap_dispose", passby=PASS_BY_REFERENCE)
+libxl_bitmap = Builtin("bitmap",
dispose_fn="libxl_bitmap_dispose", passby=PASS_BY_REFERENCE)
 libxl_cpuid_policy_list = Builtin("cpuid_policy_list",
dispose_fn="libxl_cpuid_dispose", passby=PASS_BY_REFERENCE)
 
 libxl_string_list = Builtin("string_list",
dispose_fn="libxl_string_list_dispose", passby=PASS_BY_REFERENCE)
@@ -188,7 +188,7 @@ libxl_cpupoolinfo = Struct("cpupoolinfo"
     ("poolid",      uint32),
     ("sched",       libxl_scheduler),
     ("n_dom",       uint32),
-    ("cpumap",      libxl_cpumap)
+    ("cpumap",      libxl_bitmap)
     ], dir=DIR_OUT)
 
 libxl_vminfo = Struct("vminfo", [
@@ -238,7 +238,7 @@ libxl_domain_sched_params = Struct("doma
 libxl_domain_build_info = Struct("domain_build_info",[
     ("max_vcpus",       integer),
     ("cur_vcpus",       integer),
-    ("cpumap",          libxl_cpumap),
+    ("cpumap",          libxl_bitmap),
     ("tsc_mode",        libxl_tsc_mode),
     ("max_memkb",       MemKB),
     ("target_memkb",    MemKB),
@@ -399,7 +399,7 @@ libxl_vcpuinfo = Struct("vcpuinfo", [
     ("blocked", bool),
     ("running", bool),
     ("vcpu_time", uint64), # total vcpu time ran (ns)
-    ("cpumap", libxl_cpumap), # current cpu''s affinities
+    ("cpumap", libxl_bitmap), # current cpu''s affinities
     ], dir=DIR_OUT)
 
 libxl_physinfo = Struct("physinfo", [
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -489,47 +489,42 @@ int libxl_mac_to_device_nic(libxl_ctx *c
     return rc;
 }
 
-int libxl_cpumap_alloc(libxl_ctx *ctx, libxl_cpumap *cpumap)
+int libxl_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *bitmap, int n_bits)
 {
-    int max_cpus;
     int sz;
 
-    max_cpus = libxl_get_max_cpus(ctx);
-    if (max_cpus == 0)
-        return ERROR_FAIL;
-
-    sz = (max_cpus + 7) / 8;
-    cpumap->map = calloc(sz, sizeof(*cpumap->map));
-    if (!cpumap->map)
+    sz = (n_bits + 7) / 8;
+    bitmap->map = calloc(sz, sizeof(*bitmap->map));
+    if (!bitmap->map)
         return ERROR_NOMEM;
-    cpumap->size = sz;
+    bitmap->size = sz;
     return 0;
 }
 
-void libxl_cpumap_dispose(libxl_cpumap *map)
+void libxl_bitmap_dispose(libxl_bitmap *map)
 {
     free(map->map);
 }
 
-int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu)
+int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit)
 {
-    if (cpu >= cpumap->size * 8)
+    if (bit >= bitmap->size * 8)
         return 0;
-    return (cpumap->map[cpu / 8] & (1 << (cpu & 7))) ? 1 : 0;
+    return (bitmap->map[bit / 8] & (1 << (bit & 7))) ? 1 : 0;
 }
 
-void libxl_cpumap_set(libxl_cpumap *cpumap, int cpu)
+void libxl_bitmap_set(libxl_bitmap *bitmap, int bit)
 {
-    if (cpu >= cpumap->size * 8)
+    if (bit >= bitmap->size * 8)
         return;
-    cpumap->map[cpu / 8] |= 1 << (cpu & 7);
+    bitmap->map[bit / 8] |= 1 << (bit & 7);
 }
 
-void libxl_cpumap_reset(libxl_cpumap *cpumap, int cpu)
+void libxl_bitmap_reset(libxl_bitmap *bitmap, int bit)
 {
-    if (cpu >= cpumap->size * 8)
+    if (bit >= bitmap->size * 8)
         return;
-    cpumap->map[cpu / 8] &= ~(1 << (cpu & 7));
+    bitmap->map[bit / 8] &= ~(1 << (bit & 7));
 }
 
 int libxl_get_max_cpus(libxl_ctx *ctx)
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -63,25 +63,38 @@ int libxl_devid_to_device_nic(libxl_ctx 
 int libxl_vdev_to_device_disk(libxl_ctx *ctx, uint32_t domid, const char *vdev,
                                libxl_device_disk *disk);
 
-int libxl_cpumap_alloc(libxl_ctx *ctx, libxl_cpumap *cpumap);
-int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu);
-void libxl_cpumap_set(libxl_cpumap *cpumap, int cpu);
-void libxl_cpumap_reset(libxl_cpumap *cpumap, int cpu);
-static inline void libxl_cpumap_set_any(libxl_cpumap *cpumap)
+int libxl_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *bitmap, int n_bits);
+    /* Allocated bimap is from malloc, libxl_bitmap_dispose() to be
+     * called by the application when done. */
+int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit);
+void libxl_bitmap_set(libxl_bitmap *bitmap, int bit);
+void libxl_bitmap_reset(libxl_bitmap *bitmap, int bit);
+static inline void libxl_bitmap_set_any(libxl_bitmap *bitmap)
 {
-    memset(cpumap->map, -1, cpumap->size);
+    memset(bitmap->map, -1, bitmap->size);
 }
-static inline void libxl_cpumap_set_none(libxl_cpumap *cpumap)
+static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap)
 {
-    memset(cpumap->map, 0, cpumap->size);
+    memset(bitmap->map, 0, bitmap->size);
 }
-static inline int libxl_cpumap_cpu_valid(libxl_cpumap *cpumap, int cpu)
+static inline int libxl_bitmap_cpu_valid(const libxl_bitmap *bitmap, int bit)
 {
-    return cpu >= 0 && cpu < (cpumap->size * 8);
+    return bit >= 0 && bit < (bitmap->size * 8);
 }
-#define libxl_for_each_cpu(var, map) for (var = 0; var < (map).size * 8;
var++)
-#define libxl_for_each_set_cpu(v, m) for (v = 0; v < (m).size * 8; v++) \
-                                             if (libxl_cpumap_test(&(m),
v))
+#define libxl_for_each_bit(var, map) for (var = 0; var < (map).size * 8;
var++)
+#define libxl_for_each_set_bit(v, m) for (v = 0; v < (m).size * 8; v++) \
+                                             if (libxl_bitmap_test(&(m),
v))
+
+static inline int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *cpumap)
+{
+    int max_cpus;
+
+    max_cpus = libxl_get_max_cpus(ctx);
+    if (max_cpus == 0)
+        return ERROR_FAIL;
+
+    return libxl_bitmap_alloc(ctx, cpumap, max_cpus);
+}
 
 static inline uint32_t libxl__sizekb_to_mb(uint32_t s) {
     return (s + 1023) / 1024;
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -491,19 +491,19 @@ static void split_string_into_string_lis
     free(s);
 }
 
-static int vcpupin_parse(char *cpu, libxl_cpumap *cpumap)
-{
-    libxl_cpumap exclude_cpumap;
+static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
+{
+    libxl_bitmap exclude_cpumap;
     uint32_t cpuida, cpuidb;
     char *endptr, *toka, *tokb, *saveptr = NULL;
     int i, rc = 0, rmcpu;
 
     if (!strcmp(cpu, "all")) {
-        libxl_cpumap_set_any(cpumap);
+        libxl_bitmap_set_any(cpumap);
         return 0;
     }
 
-    if (libxl_cpumap_alloc(ctx, &exclude_cpumap)) {
+    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap)) {
         fprintf(stderr, "Error: Failed to allocate cpumap.\n");
         return ENOMEM;
     }
@@ -533,19 +533,19 @@ static int vcpupin_parse(char *cpu, libx
             }
         }
         while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_cpumap_set(cpumap, cpuida) :
-                         libxl_cpumap_set(&exclude_cpumap, cpuida);
+            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
+                         libxl_bitmap_set(&exclude_cpumap, cpuida);
             cpuida++;
         }
     }
 
     /* Clear all the cpus from the removal list */
-    libxl_for_each_set_cpu(i, exclude_cpumap) {
-        libxl_cpumap_reset(cpumap, i);
+    libxl_for_each_set_bit(i, exclude_cpumap) {
+        libxl_bitmap_reset(cpumap, i);
     }
 
 vcpp_out:
-    libxl_cpumap_dispose(&exclude_cpumap);
+    libxl_bitmap_dispose(&exclude_cpumap);
 
     return rc;
 }
@@ -685,7 +685,7 @@ static void parse_config_data(const char
     if (!xlu_cfg_get_list (config, "cpus", &cpus, 0, 1)) {
         int i, n_cpus = 0;
 
-        if (libxl_cpumap_alloc(ctx, &b_info->cpumap)) {
+        if (libxl_cpu_bitmap_alloc(ctx, &b_info->cpumap)) {
             fprintf(stderr, "Unable to allocate cpumap\n");
             exit(1);
         }
@@ -705,14 +705,14 @@ static void parse_config_data(const char
          * the cpumap derived from the list ensures memory is being
          * allocated on the proper nodes anyway.
          */
-        libxl_cpumap_set_none(&b_info->cpumap);
+        libxl_bitmap_set_none(&b_info->cpumap);
         while ((buf = xlu_cfg_get_listitem(cpus, n_cpus)) != NULL) {
             i = atoi(buf);
-            if (!libxl_cpumap_cpu_valid(&b_info->cpumap, i)) {
+            if (!libxl_bitmap_cpu_valid(&b_info->cpumap, i)) {
                 fprintf(stderr, "cpu %d illegal\n", i);
                 exit(1);
             }
-            libxl_cpumap_set(&b_info->cpumap, i);
+            libxl_bitmap_set(&b_info->cpumap, i);
             if (n_cpus < b_info->max_vcpus)
                 vcpu_to_pcpu[n_cpus] = i;
             n_cpus++;
@@ -721,12 +721,12 @@ static void parse_config_data(const char
     else if (!xlu_cfg_get_string (config, "cpus", &buf, 0)) {
         char *buf2 = strdup(buf);
 
-        if (libxl_cpumap_alloc(ctx, &b_info->cpumap)) {
+        if (libxl_cpu_bitmap_alloc(ctx, &b_info->cpumap)) {
             fprintf(stderr, "Unable to allocate cpumap\n");
             exit(1);
         }
 
-        libxl_cpumap_set_none(&b_info->cpumap);
+        libxl_bitmap_set_none(&b_info->cpumap);
         if (vcpupin_parse(buf2, &b_info->cpumap))
             exit(1);
         free(buf2);
@@ -1806,26 +1806,26 @@ start:
 
     /* If single vcpu to pcpu mapping was requested, honour it */
     if (vcpu_to_pcpu) {
-        libxl_cpumap vcpu_cpumap;
-
-        libxl_cpumap_alloc(ctx, &vcpu_cpumap);
+        libxl_bitmap vcpu_cpumap;
+
+        libxl_cpu_bitmap_alloc(ctx, &vcpu_cpumap);
         for (i = 0; i < d_config.b_info.max_vcpus; i++) {
 
             if (vcpu_to_pcpu[i] != -1) {
-                libxl_cpumap_set_none(&vcpu_cpumap);
-                libxl_cpumap_set(&vcpu_cpumap, vcpu_to_pcpu[i]);
+                libxl_bitmap_set_none(&vcpu_cpumap);
+                libxl_bitmap_set(&vcpu_cpumap, vcpu_to_pcpu[i]);
             } else {
-                libxl_cpumap_set_any(&vcpu_cpumap);
+                libxl_bitmap_set_any(&vcpu_cpumap);
             }
             if (libxl_set_vcpuaffinity(ctx, domid, i, &vcpu_cpumap)) {
                 fprintf(stderr, "setting affinity failed on vcpu
`%d''.\n", i);
-                libxl_cpumap_dispose(&vcpu_cpumap);
+                libxl_bitmap_dispose(&vcpu_cpumap);
                 free(vcpu_to_pcpu);
                 ret = ERROR_FAIL;
                 goto error_out;
             }
         }
-        libxl_cpumap_dispose(&vcpu_cpumap);
+        libxl_bitmap_dispose(&vcpu_cpumap);
         free(vcpu_to_pcpu); vcpu_to_pcpu = NULL;
     }
 
@@ -4063,7 +4063,7 @@ int main_vcpulist(int argc, char **argv)
 static void vcpupin(const char *d, const char *vcpu, char *cpu)
 {
     libxl_vcpuinfo *vcpuinfo;
-    libxl_cpumap cpumap;
+    libxl_bitmap cpumap;
 
     uint32_t vcpuid;
     char *endptr;
@@ -4080,7 +4080,7 @@ static void vcpupin(const char *d, const
 
     find_domain(d);
 
-    if (libxl_cpumap_alloc(ctx, &cpumap)) {
+    if (libxl_cpu_bitmap_alloc(ctx, &cpumap)) {
         goto vcpupin_out;
     }
 
@@ -4107,7 +4107,7 @@ static void vcpupin(const char *d, const
         libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu);
     }
   vcpupin_out1:
-    libxl_cpumap_dispose(&cpumap);
+    libxl_bitmap_dispose(&cpumap);
   vcpupin_out:
     ;
 }
@@ -4127,7 +4127,7 @@ static void vcpuset(const char *d, const
 {
     char *endptr;
     unsigned int max_vcpus, i;
-    libxl_cpumap cpumap;
+    libxl_bitmap cpumap;
 
     max_vcpus = strtoul(nr_vcpus, &endptr, 10);
     if (nr_vcpus == endptr) {
@@ -4137,17 +4137,17 @@ static void vcpuset(const char *d, const
 
     find_domain(d);
 
-    if (libxl_cpumap_alloc(ctx, &cpumap)) {
-        fprintf(stderr, "libxl_cpumap_alloc failed\n");
+    if (libxl_cpu_bitmap_alloc(ctx, &cpumap)) {
+        fprintf(stderr, "libxl_cpu_bitmap_alloc failed\n");
         return;
     }
     for (i = 0; i < max_vcpus; i++)
-        libxl_cpumap_set(&cpumap, i);
+        libxl_bitmap_set(&cpumap, i);
 
     if (libxl_set_vcpuonline(ctx, domid, &cpumap) < 0)
         fprintf(stderr, "libxl_set_vcpuonline failed domid=%d
max_vcpus=%d\n", domid, max_vcpus);
 
-    libxl_cpumap_dispose(&cpumap);
+    libxl_bitmap_dispose(&cpumap);
 }
 
 int main_vcpuset(int argc, char **argv)
@@ -4211,7 +4211,7 @@ static void output_physinfo(void)
     libxl_physinfo info;
     const libxl_version_info *vinfo;
     unsigned int i;
-    libxl_cpumap cpumap;
+    libxl_bitmap cpumap;
     int n = 0;
 
     if (libxl_get_physinfo(ctx, &info) != 0) {
@@ -4243,8 +4243,8 @@ static void output_physinfo(void)
         printf("sharing_used_memory    : %"PRIu64"\n",
info.sharing_used_frames / i);
     }
     if (!libxl_get_freecpus(ctx, &cpumap)) {
-        libxl_for_each_cpu(i, cpumap)
-            if (libxl_cpumap_test(&cpumap, i))
+        libxl_for_each_bit(i, cpumap)
+            if (libxl_bitmap_test(&cpumap, i))
                 n++;
         printf("free_cpus              : %d\n", n);
         free(cpumap.map);
@@ -5866,8 +5866,8 @@ int main_cpupoolcreate(int argc, char **
     XLU_ConfigList *cpus;
     XLU_ConfigList *nodes;
     int n_cpus, n_nodes, i, n;
-    libxl_cpumap freemap;
-    libxl_cpumap cpumap;
+    libxl_bitmap freemap;
+    libxl_bitmap cpumap;
     libxl_uuid uuid;
     libxl_cputopology *topology;
     int rc = -ERROR_FAIL; 
@@ -5980,7 +5980,7 @@ int main_cpupoolcreate(int argc, char **
         fprintf(stderr, "libxl_get_freecpus failed\n");
         goto out_cfg;
     }
-    if (libxl_cpumap_alloc(ctx, &cpumap)) {
+    if (libxl_cpu_bitmap_alloc(ctx, &cpumap)) {
         fprintf(stderr, "Failed to allocate cpumap\n");
         goto out_cfg;
     }
@@ -5997,8 +5997,8 @@ int main_cpupoolcreate(int argc, char **
             n = atoi(buf);
             for (i = 0; i < nr; i++) {
                 if ((topology[i].node == n) &&
-                    libxl_cpumap_test(&freemap, i)) {
-                    libxl_cpumap_set(&cpumap, i);
+                    libxl_bitmap_test(&freemap, i)) {
+                    libxl_bitmap_set(&cpumap, i);
                     n_cpus++;
                 }
             }
@@ -6016,11 +6016,11 @@ int main_cpupoolcreate(int argc, char **
         while ((buf = xlu_cfg_get_listitem(cpus, n_cpus)) != NULL) {
             i = atoi(buf);
             if ((i < 0) || (i >= freemap.size * 8) ||
-                !libxl_cpumap_test(&freemap, i)) {
+                !libxl_bitmap_test(&freemap, i)) {
                 fprintf(stderr, "cpu %d illegal or not free\n", i);
                 goto out_cfg;
             }
-            libxl_cpumap_set(&cpumap, i);
+            libxl_bitmap_set(&cpumap, i);
             n_cpus++;
         }
     } else
@@ -6118,8 +6118,8 @@ int main_cpupoollist(int argc, char **ar
                 printf("%-19s", name);
                 free(name);
                 n = 0;
-                libxl_for_each_cpu(c, poolinfo[p].cpumap)
-                    if (libxl_cpumap_test(&poolinfo[p].cpumap, c)) {
+                libxl_for_each_bit(c, poolinfo[p].cpumap)
+                    if (libxl_bitmap_test(&poolinfo[p].cpumap, c)) {
                         if (n && opt_cpus) printf(",");
                         if (opt_cpus) printf("%d", c);
                         n++;
@@ -6318,7 +6318,7 @@ int main_cpupoolnumasplit(int argc, char
     int n_cpus;
     char name[16];
     libxl_uuid uuid;
-    libxl_cpumap cpumap;
+    libxl_bitmap cpumap;
     libxl_cpupoolinfo *poolinfo;
     libxl_cputopology *topology;
     libxl_dominfo info;
@@ -6348,7 +6348,7 @@ int main_cpupoolnumasplit(int argc, char
         return -ERROR_FAIL;
     }
 
-    if (libxl_cpumap_alloc(ctx, &cpumap)) {
+    if (libxl_cpu_bitmap_alloc(ctx, &cpumap)) {
         fprintf(stderr, "Failed to allocate cpumap\n");
         libxl_cputopology_list_free(topology, n_cpus);
         return -ERROR_FAIL;
@@ -6374,7 +6374,7 @@ int main_cpupoolnumasplit(int argc, char
     for (c = 0; c < n_cpus; c++) {
         if (topology[c].node == node) {
             topology[c].node = LIBXL_CPUTOPOLOGY_INVALID_ENTRY;
-            libxl_cpumap_set(&cpumap, n);
+            libxl_bitmap_set(&cpumap, n);
             n++;
         }
     }
@@ -6396,7 +6396,7 @@ int main_cpupoolnumasplit(int argc, char
         fprintf(stderr, "failed to offline vcpus\n");
         goto out;
     }
-    libxl_cpumap_set_none(&cpumap);
+    libxl_bitmap_set_none(&cpumap);
 
     for (c = 0; c < n_cpus; c++) {
         if (topology[c].node == LIBXL_CPUTOPOLOGY_INVALID_ENTRY) {
@@ -6434,7 +6434,7 @@ int main_cpupoolnumasplit(int argc, char
 
 out:
     libxl_cputopology_list_free(topology, n_cpus);
-    libxl_cpumap_dispose(&cpumap);
+    libxl_bitmap_dispose(&cpumap);
 
     return ret;
 }
diff --git a/tools/python/xen/lowlevel/xl/xl.c
b/tools/python/xen/lowlevel/xl/xl.c
--- a/tools/python/xen/lowlevel/xl/xl.c
+++ b/tools/python/xen/lowlevel/xl/xl.c
@@ -231,14 +231,14 @@ int attrib__libxl_cpuid_policy_list_set(
     return -1;
 }
 
-int attrib__libxl_cpumap_set(PyObject *v, libxl_cpumap *pptr)
+int attrib__libxl_bitmap_set(PyObject *v, libxl_bitmap *pptr)
 {
     int i;
     long cpu;
 
     for (i = 0; i < PyList_Size(v); i++) {
         cpu = PyInt_AsLong(PyList_GetItem(v, i));
-        libxl_cpumap_set(pptr, cpu);
+        libxl_bitmap_set(pptr, cpu);
     }
     return 0;
 }
@@ -293,14 +293,14 @@ PyObject *attrib__libxl_cpuid_policy_lis
     return NULL;
 }
 
-PyObject *attrib__libxl_cpumap_get(libxl_cpumap *pptr)
+PyObject *attrib__libxl_bitmap_get(libxl_bitmap *pptr)
 {
     PyObject *cpulist = NULL;
     int i;
 
     cpulist = PyList_New(0);
-    libxl_for_each_cpu(i, *pptr) {
-        if ( libxl_cpumap_test(pptr, i) ) {
+    libxl_for_each_bit(i, *pptr) {
+        if ( libxl_bitmap_test(pptr, i) ) {
             PyObject* pyint = PyInt_FromLong(i);
 
             PyList_Append(cpulist, pyint);

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit

By adding copying and *_is_full/*_is_empty facilities.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * now libxl_is_full/empty return 1 if true and 0 if false,
   as logic (and as requested during review).

diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -506,6 +506,35 @@ void libxl_bitmap_dispose(libxl_bitmap *
     free(map->map);
 }
 
+void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr,
+                       const libxl_bitmap *sptr)
+{
+    int sz;
+
+    sz = dptr->size = sptr->size;
+    memcpy(dptr->map, sptr->map, sz * sizeof(*dptr->map));
+}
+
+int libxl_bitmap_is_full(const libxl_bitmap *bitmap)
+{
+    int i;
+
+    for (i = 0; i < bitmap->size; i++)
+        if (bitmap->map[i] != (uint8_t)-1)
+            return 0;
+   return 1;
+}
+
+int libxl_bitmap_is_empty(const libxl_bitmap *bitmap)
+{
+    int i;
+
+    for (i = 0; i < bitmap->size; i++)
+        if (bitmap->map[i])
+            return 0;
+    return 1;
+}
+
 int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit)
 {
     if (bit >= bitmap->size * 8)
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -66,6 +66,10 @@ int libxl_vdev_to_device_disk(libxl_ctx 
 int libxl_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *bitmap, int n_bits);
     /* Allocated bimap is from malloc, libxl_bitmap_dispose() to be
      * called by the application when done. */
+void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr,
+                       const libxl_bitmap *sptr);
+int libxl_bitmap_is_full(const libxl_bitmap *bitmap);
+int libxl_bitmap_is_empty(const libxl_bitmap *bitmap);
 int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit);
 void libxl_bitmap_set(libxl_bitmap *bitmap, int bit);
 void libxl_bitmap_reset(libxl_bitmap *bitmap, int bit);

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 07 of 10 v2] libxl: introduce some node map helpers

To allow for allocating a node specific libxl_bitmap (as it
is for cpu number and maps). Helper unctions to convert a node
map it its coresponding cpu map and vice versa are also
implemented.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * This patch replaces "libxl: introduce libxl_nodemap", as
   the libxl_bitmap type introduced by the previous patch is
   now used for both cpu and node maps, as requested during
   v1 review.

diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -556,6 +556,50 @@ void libxl_bitmap_reset(libxl_bitmap *bi
     bitmap->map[bit / 8] &= ~(1 << (bit & 7));
 }
 
+int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
+                            const libxl_bitmap *nodemap,
+                            libxl_bitmap *cpumap)
+{
+    libxl_cputopology *tinfo = NULL;
+    int nr_cpus, i, rc = 0;
+
+    tinfo = libxl_get_cpu_topology(ctx, &nr_cpus);
+    if (tinfo == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    libxl_bitmap_set_none(cpumap);
+    for (i = 0; i < nr_cpus; i++) {
+        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+            libxl_bitmap_set(cpumap, i);
+    }
+ out:
+    libxl_cputopology_list_free(tinfo, nr_cpus);
+    return rc;
+}
+
+int libxl_cpumap_to_nodemap(libxl_ctx *ctx,
+                            const libxl_bitmap *cpumap,
+                            libxl_bitmap *nodemap)
+{
+    libxl_cputopology *tinfo = NULL;
+    int nr_cpus, i, rc = 0;
+
+    tinfo = libxl_get_cpu_topology(ctx, &nr_cpus);
+    if (tinfo == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    libxl_bitmap_set_none(nodemap);
+    libxl_for_each_set_bit(i, *cpumap)
+        libxl_bitmap_set(nodemap, tinfo[i].node);
+ out:
+    libxl_cputopology_list_free(tinfo, nr_cpus);
+    return rc;
+}
+
 int libxl_get_max_cpus(libxl_ctx *ctx)
 {
     return xc_get_max_cpus(ctx->xch);
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -100,6 +100,27 @@ static inline int libxl_cpu_bitmap_alloc
     return libxl_bitmap_alloc(ctx, cpumap, max_cpus);
 }
 
+static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx,
+                                          libxl_bitmap *nodemap)
+{
+    int max_nodes;
+
+    max_nodes = libxl_get_max_nodes(ctx);
+    if (max_nodes == 0)
+        return ERROR_FAIL;
+
+    return libxl_bitmap_alloc(ctx, nodemap, max_nodes);
+}
+
+int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
+                            const libxl_bitmap *nodemap,
+                            libxl_bitmap *cpumap);
+    /* populate cpumap with the cpus spanned by the nodes in nodemap */
+int libxl_cpumap_to_nodemap(libxl_ctx *ctx,
+                            const libxl_bitmap *cpuemap,
+                            libxl_bitmap *nodemap);
+    /* populate nodemap with the nodes of the cpus in cpumap */
+
 static inline uint32_t libxl__sizekb_to_mb(uint32_t s) {
     return (s + 1023) / 1024;
 }

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

If a domain does not have a VCPU affinity, try to pin it automatically to some
PCPUs. This is done taking into account the NUMA characteristics of the host.
In fact, we look for a combination of host''s NUMA nodes with enough
free memory
and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes.

Once we know which ones, among all the possible combinations, represents valid
placement candidates for a domain, use some heuistics for deciding which is the
best. For instance, smaller candidates are considered to be better, both from
the domain''s point of view (fewer memory spreading among nodes) and
from the
system as a whole point of view (fewer memoy fragmentation).  In case of
candidates of equal sizes (i.e., with the same number of nodes), the one with
the greater amount of memory wins, as this is also good for keeping memory
fragmentation under control.

This all happens internally to libxl, and no API for driving the mechanism is
provided for now. This matches what xend already does.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * This patches incorporates the changes from both "libxl, xl: enable
automatic
   placement of guests on NUMA nodes" and "libxl, xl: heuristics for
reordering
   NUMA placement candidates" from v1.
 * The logic of the algorithm is basically the same as in v1, but the splitting
   of it in the various functions has been completely redesigned from scratch.
 * No public API for placement or candidate generation is now exposed,
   everything happens within libxl, as agreed during v1 review.
 * The relevant documentation have been moved near the actual functions and
   features. Also, the amount and (hopefully!) the quality of the documentation
   has been improved a lot, as requested.
 * All the comments about using the proper libxl facilities and helpers for
   allocations, etc., have been considered and applied.
 * This patch still bails out from NUMA optimizations if it find out cpupools
   are being utilized. It is next patch that makes the two things interact
   properly, as suggested during review.

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -111,8 +111,8 @@ created online and the remainder will be
 
 =item B<cpus="CPU-LIST">
 
-List of which cpus the guest is allowed to use. Default behavior is
-`all cpus`. A C<CPU-LIST> may be specified as follows:
+List of which cpus the guest is allowed to use. By default xl will (via
+libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as
follows:
 
 =over 4
 
@@ -132,6 +132,12 @@ run on cpu #3 of the host.
 
 =back
 
+If this option is not specified, libxl automatically tries to place the new
+domain on the host''s NUMA nodes (provided the host has more than one
NUMA
+node) by pinning it to the cpus of those nodes. An heuristical approach is
+utilized with the goals of maximizing performance for the domain and, at
+the same time, achieving efficient utilization of the host''s CPUs and
RAM.
+
 =item B<cpu_weight=WEIGHT>
 
 A domain with a weight of 512 will get twice as much CPU as a domain
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -95,6 +95,459 @@ out:
     return sched;
 }
 
+/*
+ * What follows are helpers for generating all the k-combinations
+ * without repetitions of a set S with n elements in it. Formally
+ * speaking, they are subsets of k distinct elements of S and, if
+ * S is n elements big, the number of k-combinations is equal to
+ * the binomial coefficient (n k), which means we get exactly
+ * n!/(k! * (n - k)!) subsets, all of them with k elements.
+ *
+ * The various subset are geneated one after the other by calling
+ * comb_init() first, and, after that, comb_next()
+ * (n k)-1 times. An iterator is used to store the curent status
+ * of the whole generation operation (i.e., basically, the last
+ * combination that has been generated). As soon as all
+ * combinations have been generated, comb_next() will
+ * start returning 0 instead of 1. It is of course impotant that
+ * the same instance of the iterator and the same values for
+ * n and k are used for each call. If that doesn''t happen, the
+ * result is unspecified.
+ *
+ * The algorithm is a well known one, and it produces the
+ * combinations in such a way that they (well, more precisely,
+ * their indexes it the array/map representing the set) come in
+ * lexicogaphic order.
+ *
+ * For example, with n = 5 and k = 3, calling comb_init()
+ * will generate { 0, 1, 2 }, while subsequent valid calls to
+ * comb_next() will produce the following:
+ * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 },
+ *   { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 },
+ *   { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 },
+ *   { 2, 3, 4 } }
+ *
+ * This is used by the automatic NUMA placement logic below.
+ */
+typedef int* comb_iter_t;
+
+static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k)
+{
+    comb_iter_t new_iter;
+    int i;
+
+    if (n < k)
+        return 0;
+
+    /* First set is always { 0, 1, 2, ..., k-1 } */
+    GCNEW_ARRAY(new_iter, k);
+    for (i = 0; i < k; i++)
+        new_iter[i] = i;
+
+    *it = new_iter;
+    return 1;
+}
+
+static int comb_next(comb_iter_t it, int n, int k)
+{
+    int i;
+
+    /*
+     * The idea here is to find the leftmost element from where
+     * we should start incrementing the indexes of the iterator.
+     * This means looking for the highest index that can be increased
+     * while still producing value smaller than n-1. In the example
+     * above, when dealing with { 0, 1, 4 }, such an element is the
+     * second one, as the third is already equal to 4 (which actually
+     * is n-1).
+     * Once we found from where to start, we increment that element
+     * and overide the right-hand rest of the iterator with its
+     * successors, thus achieving lexicographic ordering.
+     *
+     * Regarding the termination of the generation process, when we
+     * manage in bringing n-k at the very first position of the iterator,
+     * we know that is the last valid combination ( { 2, 3, 4 }, with
+     * n - k = 5 - 2 = 2, in the example above), and thus we start
+     * returning 0 as soon as we cross that border.
+     */
+    for (i = k - 1; it[i] == n - k + i; i--) {
+        if (i <= 0)
+            return 0;
+    }
+    for (it[i]++, i++; i < k; i++)
+        it[i] = it[i - 1] + 1;
+    return 1;
+}
+
+/* NUMA automatic placement (see libxl_internal.h for details) */
+
+/*
+ * This function turns a k-combination iterator into a node map.
+ * This means the bits in the node map corresponding to the indexes
+ * of the given combination are the ones that will be set.
+ * For example, if the iterator represents the combination { 0, 2, 4},
+ * the node map will have bits #0, #2 and #4 set to one and i thus
+ * will point to node 0, node 2 and and node 4.
+ */
+static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
+{
+    int i;
+
+    libxl_bitmap_set_none(nodemap);
+    for (i = 0; i < k; i++)
+        libxl_bitmap_set(nodemap, it[i]);
+}
+
+/* Retrieve the number of cpus that the nodes that are part of the nodemap
+ * span. */
+static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus,
+                                 const libxl_bitmap *nodemap)
+{
+    int i, nodes_cpus = 0;
+
+    for (i = 0; i < nr_cpus; i++) {
+        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+            nodes_cpus++;
+    }
+    return nodes_cpus;
+}
+
+/* Retrieve the amount of free memory within the nodemap */
+static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo,
+                                      libxl_bitmap *nodemap)
+{
+    uint32_t free_memkb = 0;
+    int i;
+
+    libxl_for_each_set_bit(i, *nodemap)
+        free_memkb += ninfo[i].free / 1024;
+
+    return free_memkb;
+}
+
+/* Retrieve the number of domains that can potentially run on the cpus
+ * the nodes that are part of the nodemap. */
+static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo,
+                                 const libxl_bitmap *nodemap)
+{
+    libxl_dominfo *dinfo = NULL;
+    libxl_bitmap dom_nodemap;
+    int nr_doms, nr_cpus;
+    int nr_domains = 0;
+    int i, j, k;
+
+    dinfo = libxl_list_domain(CTX, &nr_doms);
+    if (dinfo == NULL)
+        return ERROR_FAIL;
+
+    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) {
+        nr_domains = ERROR_FAIL;
+        goto out;
+    }
+
+    for (i = 0; i < nr_doms; i++) {
+        libxl_vcpuinfo *vinfo;
+        int nr_vcpus;
+
+        vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus,
&nr_cpus);
+        if (vinfo == NULL)
+            continue;
+
+        libxl_bitmap_set_none(&dom_nodemap);
+        for (j = 0; j < nr_vcpus; j++) {
+            libxl_for_each_set_bit(k, vinfo[j].cpumap)
+                libxl_bitmap_set(&dom_nodemap, tinfo[k].node);
+        }
+
+        libxl_for_each_set_bit(j, dom_nodemap) {
+            if (libxl_bitmap_test(nodemap, j)) {
+                nr_domains++;
+                goto found;
+            }
+        }
+ found:
+        libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
+    }
+
+ out:
+    libxl_bitmap_dispose(&dom_nodemap);
+    libxl_dominfo_list_free(dinfo, nr_doms);
+    return nr_domains;
+}
+
+/*
+ * This function tries to figure out if the host has a consistent number
+ * of cpus along all its NUMA nodes. In fact, if that is the case, we can
+ * calculate the minimum number of nodes needed for a domain by just
+ * dividing its total number of vcpus by this value computed here.
+ * However, we are not allowed to assume that all the nodes have the
+ * same number of cpus. Therefore, in case discrepancies among different
+ * nodes are found, this function just returns 0 and the caller needs
+ * to sort out things in some other way.
+ */
+static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus,
+                               libxl_numainfo *ninfo, int nr_nodes)
+{
+    int cpus_per_node = 0;
+    int j, i;
+
+    /* This makes sense iff # of PCPUs is the same for all nodes */
+    for (j = 0; j < nr_nodes; j++) {
+        int curr_cpus = 0;
+
+        for (i = 0; i < nr_cpus; i++) {
+            if (tinfo[i].node == j)
+                curr_cpus++;
+        }
+        /* So, if the above does not hold, turn the whole thing off! */
+        cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node;
+        if (cpus_per_node != curr_cpus)
+            return 0;
+    }
+    return cpus_per_node;
+}
+
+/* Get all the placement candidates satisfying some specific conditions */
+int libxl__get_numa_candidates(libxl__gc *gc,
+                               uint32_t min_free_memkb, int min_cpus,
+                               int min_nodes, int max_nodes,
+                               libxl__numa_candidate *cndts[], int *nr_cndts)
+{
+    libxl__numa_candidate *new_cndts = NULL;
+    libxl_cputopology *tinfo = NULL;
+    libxl_numainfo *ninfo = NULL;
+    libxl_bitmap nodemap;
+    int nr_nodes, nr_cpus;
+    int array_size, rc;
+
+    /* Get platform info and prepare the map for testing the combinations */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (ninfo == NULL)
+        return ERROR_FAIL;
+    /* If we don''t have at least 2 nodes, it is useless to proceed */
+    if (nr_nodes < 2) {
+        rc = 0;
+        goto out;
+    }
+
+    tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
+    if (tinfo == NULL) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    rc = libxl_node_bitmap_alloc(CTX, &nodemap);
+    if (rc)
+        goto out;
+
+    /*
+     * Round up and down some of the constraints. For instance, the minimum
+     * number of cpus a candidate should have must at least be non-negative.
+     * Regarding the minimum number of NUMA nodes, if not explicitly specified
+     * (i.e., min_nodes <= 0), we try to figure out a sensible number of
nodes
+     * from where to start generating candidates, if possible (or just start
+     * from 1 otherwise). The maximum number of nodes should not exceed the
+     * number of existent NUMA nodes on the host, or the candidate genaration
+     * won''t work properly.
+     */
+    min_cpus = min_cpus <= 0 ? 0 : min_cpus;
+    if (min_nodes <= 0) {
+        int cpus_per_node;
+
+        cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes);
+        if (cpus_per_node == 0)
+            min_nodes = 1;
+        else
+            min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
+    }
+    min_nodes = min_nodes > nr_nodes ? nr_nodes : min_nodes;
+    if (max_nodes <= 0)
+        max_nodes = nr_nodes;
+    else
+        max_nodes = max_nodes > nr_nodes ? nr_nodes : max_nodes;
+    if (min_nodes > max_nodes) {
+        rc = ERROR_INVAL;
+        goto out;
+    }
+
+    /* Initialize the local storage for the combinations */
+    *nr_cndts = 0;
+    array_size = nr_nodes;
+    GCNEW_ARRAY(new_cndts, array_size);
+
+    /* Generate all the combinations of any size from min_nodes to
+     * max_nodes (see comb_init() and comb_next()). */
+    while (min_nodes <= max_nodes) {
+        comb_iter_t comb_iter;
+        int comb_ok;
+
+        /*
+	 * And here it is. Each step of this cycle generates a combination of
+	 * nodes as big as min_nodes mandates.  Each of these combinations is
+	 * checked against the constraints provided by the caller (namely,
+	 * amount of free memory and number of cpus) and it becomes an actual
+	 * placement candidate iff it passes the check.
+         */
+        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes);
comb_ok;
+             comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
+            uint32_t nodes_free_memkb;
+            int nodes_cpus;
+
+            comb_get_nodemap(comb_iter, &nodemap, min_nodes);
+
+            /* If there is not enough memoy in this combination, skip it
+             * and go generating the next one... */
+            nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap);
+            if (min_free_memkb > 0 && nodes_free_memkb <
min_free_memkb)
+                continue;
+
+            /* And the same applies if this combination is short in cpus */
+            nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap);
+            if (min_cpus > 0 && nodes_cpus < min_cpus)
+                continue;
+
+            /*
+             * Conditions are met, we can add this combination to the
+             * NUMA placement candidates list. We first make sure there
+             * is enough space in there, and then we initialize the new
+             * candidate element with the node map corresponding to the
+             * combination we are dealing with. Memory allocation for
+             * expanding the array that hosts the list happens in chunks
+             * equal to the number of NUMA nodes in the system (to
+             * avoid allocating memory each and every time we find a
+             * new candidate).
+             */
+            if (*nr_cndts == array_size)
+                array_size += nr_nodes;
+            GCREALLOC_ARRAY(new_cndts, array_size);
+
+            libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
+            libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts],
+                                              &nodemap);
+            new_cndts[*nr_cndts].nr_domains +                                  
nodemap_to_nr_domains(gc, tinfo, &nodemap);
+            new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
+            new_cndts[*nr_cndts].nr_nodes = min_nodes;
+            new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
+
+            LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d,
"
+                       "nr_cpus=%d, free_memkb=%"PRIu32"",
*nr_cndts,
+                       min_nodes, new_cndts[*nr_cndts].nr_cpus,
+                       new_cndts[*nr_cndts].free_memkb / 1024);
+
+            (*nr_cndts)++;
+        }
+        min_nodes++;
+    }
+
+    *cndts = new_cndts;
+ out:
+    libxl_bitmap_dispose(&nodemap);
+    libxl_cputopology_list_free(tinfo, nr_cpus);
+    libxl_numainfo_list_free(ninfo, nr_nodes);
+    return rc;
+}
+
+void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts,
+                                 libxl__numa_candidate_cmpf cmpf)
+{
+    /* Reorder candidates (see the comparison function for
+     * the details on the heuristics) */
+    qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf);
+}
+
+/*
+ * The NUMA placement candidates are reordered according to the following
+ * heuristics:
+ *  - candidates involving fewer nodes come first. In case two (or
+ *    more) candidates span the same number of nodes,
+ *  - candidates with greater amount of free memory come first. In
+ *    case two (or more) candidates differ in their amount of free
+ *    memory by less than 10%,
+ *  - candidates with fewer domains insisting on them at the time of
+ *    this call come first.
+ */
+static int numa_cmpf(const void *v1, const void *v2)
+{
+    const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1;
+    const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2;
+    double mem_diff = labs(c1->free_memkb - c2->free_memkb);
+    double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
+
+    if (c1->nr_nodes != c2->nr_nodes)
+        return c1->nr_nodes - c2->nr_nodes;
+
+    if ((mem_diff / mem_avg) * 100.0 < 10.0 &&
+        c1->nr_domains != c2->nr_domains)
+        return c1->nr_domains - c2->nr_domains;
+
+    return c2->free_memkb - c1->free_memkb;
+}
+
+/* The actual automatic NUMA placement routine */
+static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
+{
+    int nr_candidates = 0;
+    libxl__numa_candidate *candidates = NULL;
+    libxl_bitmap candidate_nodemap;
+    libxl_cpupoolinfo *pinfo;
+    int nr_pools, rc = 0;
+    uint32_t memkb;
+
+    /* First of all, if cpupools are in use, better not to mess with them */
+    pinfo = libxl_list_cpupool(CTX, &nr_pools);
+    if (!pinfo)
+        return ERROR_FAIL;
+    if (nr_pools > 1) {
+        LOG(NOTICE, "skipping NUMA placement as cpupools are in
use");
+        goto out;
+    }
+
+    rc = libxl_domain_need_memory(CTX, info, &memkb);
+    if (rc)
+        goto out;
+    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) {
+        rc = ERROR_FAIL;
+        goto out;
+    }
+
+    /* Find all the candidates with enough free memory and at least
+     * as much pcpus as the domain has vcpus.  */
+    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
+                                    &candidates, &nr_candidates);
+    if (rc)
+        goto out;
+
+    LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates);
+
+    /* No suitable placement candidates. We just return without touching the
+     * domain''s info->cpumap. It will have affinity with all
nodes/cpus. */
+    if (nr_candidates == 0)
+        goto out;
+
+    /* Bring the best candidate in front of the list --> candidates[0] */
+    if (nr_candidates > 1)
+        libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf);
+
+    /*
+     * At this point, the first candidate in the array is the one we want.
+     * Go for it by mapping its node map to the domain''s
info->cpumap.
+     */
+    libxl__numa_candidate_get_nodemap(gc, &candidates[0],
&candidate_nodemap);
+    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap,
&info->cpumap);
+    if (rc)
+        goto out;
+
+    LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and
"
+                "%"PRIu32" KB free selected",
candidates[0].nr_nodes,
+                candidates[0].nr_cpus, candidates[0].free_memkb / 1024);
+
+ out:
+    libxl_bitmap_dispose(&candidate_nodemap);
+    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
+    return rc;
+}
+
 int libxl__build_pre(libxl__gc *gc, uint32_t domid,
               libxl_domain_build_info *info, libxl__domain_build_state *state)
 {
@@ -104,7 +557,22 @@ int libxl__build_pre(libxl__gc *gc, uint
     uint32_t rtc_timeoffset;
 
     xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
+
+    /*
+     * Check if the domain has any CPU affinity. If not, try to build up one.
+     * In case numa_place_domain() find at least a suitable candidate, it will
+     * affect info->cpumap accordingly; if it does not, it just leaves it
+     * as it is. This means (unless some weird error manifests) the subsequent
+     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
+     * whatever that turns out to be.
+     */
+    if (libxl_bitmap_is_full(&info->cpumap)) {
+        int rc = numa_place_domain(gc, info);
+        if (rc)
+            return rc;
+    }
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus,
&info->cpumap);
+
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
LIBXL_MAXMEM_CONSTANT);
     if (info->type == LIBXL_DOMAIN_TYPE_PV)
         xc_domain_set_memmap_limit(ctx->xch, domid,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib
 #define CTX_LOCK (libxl__ctx_lock(CTX))
 #define CTX_UNLOCK (libxl__ctx_unlock(CTX))
 
+/*
+ * Automatic NUMA placement
+ *
+ * These functions and data structures deal with the initial placement of a
+ * domain onto the host NUMA nodes.
+ *
+ * The key concept here is the one of "numa placement candidate" (or
jusr "numa
+ * candidate", "placement candidate" or even
"candidate"), which basically is a
+ * set of nodes whose characteristics have been successfully checked against
+ * some specific requirements. More pecisely, a candidate is the nodemap
+ * associated with one of the possible subset of the host NUMA nodes providing
+ * a certain amount of free memory, or a given number of cpus, or even both
+ * (depending in what the caller wants). For convenience of use, some of this
+ * information are stored within the candidate itselfs, instead of always being
+ * dynamically computed. A candidate can consist of just one, up to all the
+ * available nodes on the host (the latter being, of course, not ideal).  For
+ * instance, looking for a numa candidates with 2GB of free memoy means we want
+ * all the possible subsets of the host NUMA nodes with, cumulatively, at least
+ * 2GB of free memory. That could be possible by just using one particular
+ * node, or may require more nodes, depending on the characteristics of the
+ * host, on how many domains have been created already, on how big they are,
+ * etc.
+ *
+ * The intended usage is as follows:
+ *  1. by, fist of all, calling libxl__get_numa_candidates(), and specifying
+ *     the proper constraints to it (e.g., the amount of memory a domain need
+ *     as the minimum amount of free memory fo the candidates) one can build
+ *     up a whole set of suitable placing alternatives for a domain;
+ *  2. after that, one specific candidate should be chosen. That can happen
+ *     by looking at their various characteristics;
+ *  3. the chosen candidate''s nodemap should be utilized for computing
the
+ *     actual affinity of the domain which, given the curent NUMA support
+ *     in the hypervisor, is what determines the placement of the
domain''s
+ *     vcpus and memory.
+ *
+ * To make phase 2 even easier, a sorting helper function for the list of
+ * candidates is povided in the form of libxl__sort_numa_candidates(). The only
+ * that is needed is defining a comparison function, containing the chriteria
+ * for deciding, given two candidates, which one is ''better''.
Depending on how
+ * the comparison function is defined, the best candidate (where, of course,
+ * best is defined with respect to the heuristics implemented in the comparison
+ * function itself, libxl__numa_candidate_cmpf()) could become the first o the
+ * last element of the list.
+ *
+ * Summarizing, achieving automatic NUMA placement is just a matter of
+ * obtaining the list of suitable placement candidates, perhaps asking for each
+ * of them to provide at least the amount of memory the domain needs. After
+ * that just implement a comparison function by means of the various helpers
+ * retrieving the relevant information about the candidates themselves.
+ * Finally, call the sorting helper function and use the candidate that became
+ * (typically) the first element of the list fo determining the
domain''s
+ * affinity.
+ */
+
+typedef struct {
+    int nr_cpus, nr_nodes;
+    int nr_domains;
+    uint32_t free_memkb;
+    libxl_bitmap nodemap;
+} libxl__numa_candidate;
+
+/*
+ * This generates the list of NUMA placement candidates satisfying some
+ * specific conditions. If min_nodes and/or max_nodes are not 0, their value is
+ * used to determine the minimum and maximum number of nodes that are allow to
+ * be present in each candidate. If min_nodes and/or max_nodes are 0, the
+ * minimum and maximum number of nodes to be used are automatically selected by
+ * the implementation (and that will likely be just 1 node for the minimum and
+ * the total number of existent nodes for the maximum). Re min_free_memkb and
+ * min_cpu, if not 0, they specify the caller only wants candidates with at
+ * least that amount of free memory and that number of cpus, respectively. If
+ * min_free_memkb and/or min_cpus are 0, the candidates'' free memory
and number
+ * of cpus won''t be checked at all, which means a candidate will
always be
+ * considered suitable wrt the specific constraint.  cndts is where the list of
+ * exactly nr_cndts candidates is returned. Note that, in case no candidates
+ * are found at all, the function returns successfully, but with nr_cndts equal
+ * to zero.
+ */
+_hidden int libxl__get_numa_candidates(libxl__gc *gc,
+                                uint32_t min_free_memkb, int min_cpus,
+                                int min_nodes, int max_nodes,
+                                libxl__numa_candidate *cndts[], int *nr_cndts);
+
+/* allocation and deallocation for placement candidates */
+static inline int libxl__numa_candidate_alloc(libxl__gc *gc,
+                                              libxl__numa_candidate *cndt)
+{
+    return libxl_node_bitmap_alloc(CTX, &cndt->nodemap);
+}
+static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt)
+{
+    libxl_bitmap_dispose(&cndt->nodemap);
+}
+static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts,
+                                                  int nr_cndts)
+{
+    int i;
+
+    for (i = 0; i < nr_cndts; i++)
+        libxl__numa_candidate_dispose(&cndts[i]);
+    free(cndts);
+}
+
+/* retrieve (in nodemap) the node map associated to placement candidate cndt */
+static inline
+void libxl__numa_candidate_get_nodemap(libxl__gc *gc,
+                                       const libxl__numa_candidate *cndt,
+                                       libxl_bitmap *nodemap)
+{
+    libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap);
+}
+/* set the node map of placement candidate cndt to match nodemap */
+static inline
+void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
+                                       libxl__numa_candidate *cndt,
+                                       const libxl_bitmap *nodemap)
+{
+    libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
+}
+
+/* signature for the comparison function between two candidates c1 and c2
+ * (the thid parameter is provided to enable thread safety). */
+typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
+/* sort the list of candidates in cndts (an array with nr_cndts elements in
+ * it) using cmpf for comparing two candidates. Uses libc''s qsort().
*/
+_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
+                                         int nr_cndts,
+                                         libxl__numa_candidate_cmpf cmpf);
 
 /*
  * Inserts "elm_new" into the sorted list "head".

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools

In such a way that only the cpus belonging to the cpupool of the
domain being placed are considered for the placement itself.

This happens by filtering out all the nodes in which the cpupool has
not any cpu from the placement candidates. After that -- as a cpu pooling
not necessarily happens at NUMA nodes boundaries -- we also make sure
only the actual cpus that are part of the pool are considered when
counting how much processors a placement candidate is able to provide.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -198,15 +198,27 @@ static void comb_get_nodemap(comb_iter_t
         libxl_bitmap_set(nodemap, it[i]);
 }
 
+/* Retrieve how many nodes a nodemap spans. */
+static int nodemap_to_nr_nodes(const libxl_bitmap *nodemap)
+{
+    int i, nr_nodes = 0;
+
+    libxl_for_each_set_bit(i, *nodemap)
+        nr_nodes++;
+    return nr_nodes;
+}
+
 /* Retrieve the number of cpus that the nodes that are part of the nodemap
- * span. */
+ * span and that are also set in suitable_cpumap. */
 static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus,
+                                 const libxl_bitmap *suitable_cpumap,
                                  const libxl_bitmap *nodemap)
 {
     int i, nodes_cpus = 0;
 
     for (i = 0; i < nr_cpus; i++) {
-        if (libxl_bitmap_test(nodemap, tinfo[i].node))
+        if (libxl_bitmap_test(suitable_cpumap, i) &&
+            libxl_bitmap_test(nodemap, tinfo[i].node))
             nodes_cpus++;
     }
     return nodes_cpus;
@@ -311,12 +323,13 @@ static int cpus_per_node_count(libxl_cpu
 int libxl__get_numa_candidates(libxl__gc *gc,
                                uint32_t min_free_memkb, int min_cpus,
                                int min_nodes, int max_nodes,
+                               const libxl_bitmap *suitable_cpumap,
                                libxl__numa_candidate *cndts[], int *nr_cndts)
 {
     libxl__numa_candidate *new_cndts = NULL;
     libxl_cputopology *tinfo = NULL;
     libxl_numainfo *ninfo = NULL;
-    libxl_bitmap nodemap;
+    libxl_bitmap suitable_nodemap, nodemap;
     int nr_nodes, nr_cpus;
     int array_size, rc;
 
@@ -340,6 +353,15 @@ int libxl__get_numa_candidates(libxl__gc
     if (rc)
         goto out;
 
+    /* Allocate and prepare the map of the node that can be utilized for
+     * placement, basing on the map of suitable cpus. */
+    rc = libxl_node_bitmap_alloc(CTX, &suitable_nodemap);
+    if (rc)
+        goto out;
+    rc = libxl_cpumap_to_nodemap(CTX, suitable_cpumap, &suitable_nodemap);
+    if (rc)
+        goto out;
+
     /*
      * Round up and down some of the constraints. For instance, the minimum
      * number of cpus a candidate should have must at least be non-negative.
@@ -391,9 +413,14 @@ int libxl__get_numa_candidates(libxl__gc
         for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes);
comb_ok;
              comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
             uint32_t nodes_free_memkb;
-            int nodes_cpus;
+            int i, nodes_cpus;
 
+            /* Get the nodemap for the combination and filter unwnted nodes */
             comb_get_nodemap(comb_iter, &nodemap, min_nodes);
+            libxl_for_each_set_bit(i, nodemap) {
+                if (!libxl_bitmap_test(&suitable_nodemap, i))
+                    libxl_bitmap_reset(&nodemap, i);
+            }
 
             /* If there is not enough memoy in this combination, skip it
              * and go generating the next one... */
@@ -402,7 +429,8 @@ int libxl__get_numa_candidates(libxl__gc
                 continue;
 
             /* And the same applies if this combination is short in cpus */
-            nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap);
+            nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, suitable_cpumap,
+                                               &nodemap);
             if (min_cpus > 0 && nodes_cpus < min_cpus)
                 continue;
 
@@ -427,12 +455,13 @@ int libxl__get_numa_candidates(libxl__gc
             new_cndts[*nr_cndts].nr_domains                                    
nodemap_to_nr_domains(gc, tinfo, &nodemap);
             new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
-            new_cndts[*nr_cndts].nr_nodes = min_nodes;
+            new_cndts[*nr_cndts].nr_nodes = nodemap_to_nr_nodes(&nodemap);
             new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
 
             LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d,
"
                        "nr_cpus=%d, free_memkb=%"PRIu32"",
*nr_cndts,
-                       min_nodes, new_cndts[*nr_cndts].nr_cpus,
+                       new_cndts[*nr_cndts].nr_nodes,
+                       new_cndts[*nr_cndts].nr_cpus,
                        new_cndts[*nr_cndts].free_memkb / 1024);
 
             (*nr_cndts)++;
@@ -442,6 +471,7 @@ int libxl__get_numa_candidates(libxl__gc
 
     *cndts = new_cndts;
  out:
+    libxl_bitmap_dispose(&suitable_nodemap);
     libxl_bitmap_dispose(&nodemap);
     libxl_cputopology_list_free(tinfo, nr_cpus);
     libxl_numainfo_list_free(ninfo, nr_nodes);
@@ -485,23 +515,27 @@ static int numa_cmpf(const void *v1, con
 }
 
 /* The actual automatic NUMA placement routine */
-static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
+static int numa_place_domain(libxl__gc *gc, uint32_t domid,
+                             libxl_domain_build_info *info)
 {
     int nr_candidates = 0;
     libxl__numa_candidate *candidates = NULL;
     libxl_bitmap candidate_nodemap;
-    libxl_cpupoolinfo *pinfo;
-    int nr_pools, rc = 0;
+    libxl_cpupoolinfo cpupool_info;
+    int i, cpupool, rc = 0;
     uint32_t memkb;
 
-    /* First of all, if cpupools are in use, better not to mess with them */
-    pinfo = libxl_list_cpupool(CTX, &nr_pools);
-    if (!pinfo)
-        return ERROR_FAIL;
-    if (nr_pools > 1) {
-        LOG(NOTICE, "skipping NUMA placement as cpupools are in
use");
-        goto out;
-    }
+    /*
+     * Extract the cpumap from the cpupool the domain belong to. In fact,
+     * it only makes sense to consider the cpus/nodes that are in there
+     * for placement.
+     */
+    rc = cpupool = libxl__domain_cpupool(gc, domid);
+    if (rc < 0)
+        return rc;
+    rc = libxl_cpupool_info(CTX, &cpupool_info, cpupool);
+    if (rc)
+        return rc;
 
     rc = libxl_domain_need_memory(CTX, info, &memkb);
     if (rc)
@@ -513,7 +547,8 @@ static int numa_place_domain(libxl__gc *
 
     /* Find all the candidates with enough free memory and at least
      * as much pcpus as the domain has vcpus.  */
-    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
+    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus,
+                                    0, 0, &cpupool_info.cpumap,
                                     &candidates, &nr_candidates);
     if (rc)
         goto out;
@@ -538,13 +573,20 @@ static int numa_place_domain(libxl__gc *
     if (rc)
         goto out;
 
+    /* Avoid trying to set the affinity to cpus that might be in the
+     * nodemap but not in our cpupool. */
+    libxl_for_each_set_bit(i, info->cpumap) {
+        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
+            libxl_bitmap_reset(&info->cpumap, i);
+    }
+
     LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and
"
                 "%"PRIu32" KB free selected",
candidates[0].nr_nodes,
                 candidates[0].nr_cpus, candidates[0].free_memkb / 1024);
 
  out:
     libxl_bitmap_dispose(&candidate_nodemap);
-    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
+    libxl_cpupoolinfo_dispose(&cpupool_info);
     return rc;
 }
 
@@ -567,7 +609,7 @@ int libxl__build_pre(libxl__gc *gc, uint
      * whatever that turns out to be.
      */
     if (libxl_bitmap_is_full(&info->cpumap)) {
-        int rc = numa_place_domain(gc, info);
+        int rc = numa_place_domain(gc, domid, info);
         if (rc)
             return rc;
     }
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2094,14 +2094,17 @@ typedef struct {
  * least that amount of free memory and that number of cpus, respectively. If
  * min_free_memkb and/or min_cpus are 0, the candidates'' free memory
and number
  * of cpus won''t be checked at all, which means a candidate will
always be
- * considered suitable wrt the specific constraint.  cndts is where the list of
- * exactly nr_cndts candidates is returned. Note that, in case no candidates
- * are found at all, the function returns successfully, but with nr_cndts equal
- * to zero.
+ * considered suitable wrt the specific constraint. suitable_cpumap is useful
+ * for specifyin we want only the cpus in that mask to be considered while
+ * generating placement candidates (for example because of cpupools). cndts is
+ * where the list of exactly nr_cndts candidates is returned. Note that, in
+ * case no candidates are found at all, the function returns successfully, but
+ * with nr_cndts equal to zero.
  */
 _hidden int libxl__get_numa_candidates(libxl__gc *gc,
                                 uint32_t min_free_memkb, int min_cpus,
                                 int min_nodes, int max_nodes,
+                                const libxl_bitmap *suitable_cpumap,
                                 libxl__numa_candidate *cndts[], int *nr_cndts);
 
 /* allocation and deallocation for placement candidates */

Dario Faggioli

2012-Jun-15 17:04 UTC

head link

[PATCH 10 of 10 v2] Some automatic NUMA placement documentation

About rationale, usage and (some small bits of) API.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Changes from v1:
 * API documentation moved close to the actual functions.

diff --git a/docs/misc/xl-numa-placement.markdown
b/docs/misc/xl-numa-placement.markdown
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdown
@@ -0,0 +1,74 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+The Xen hypervisor deals with Non-Uniform Memory Access (NUMA])
+machines by assigning to its domain a "node affinity", i.e., a set of
NUMA
+nodes of the host from which it gets its memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start running
+memory-intensive workloads on a shared host. In fact, the cost of accessing
+non node-local memory locations is very high, and the performance degradation
+is likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask
+for both manual or automatic placement of them across the host''s NUMA
+nodes.
+
+Note that xm/xend does the very same thing, the only differences residing
+in the details of the heuristics adopted for the placement (see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify where a
domain
+should be created and scheduled on, directly in its config file. This
+affects NUMA placement and memory accesses as the hypervisor constructs
+the node affinity of a VM basing right on its CPU affinity when it is
+created.
+
+This is very simple and effective, but requires the user/system
+administrator to explicitly specify affinities for each and every domain,
+or Xen won''t be able to enable guarantee the locality for their memory
+accesses.
+
+### Automatic Guest Placement with xl ###
+
+In case no "cpus=" option is specified in the config file, xl tries
+to figure out on its own on which node(s) the domain could fit best.
+
+First of all, it needs to find a node (or a set of nodes) that have
+enough free memory for accommodating the domain. After that, the actual
+decision on where to put the new guest happens by generating all the
+possible combinations of nodes that satisfies the above and chose among
+them according to the following heuristics:
+
+  *  candidates involving fewer nodes come first. In case two (or more)
+     candidates span the same number of nodes,
+  *  candidates with greater amount of free memory come first. In case
+     two (or more) candidates differ in their amount of free memory by
+     less than 10%,
+  *  candidates with fewer domains already placed on them come first.
+
+Giving preference to small candidates ensures better performance for
+the guest, as it avoid spreading its memory among different nodes.
+Using the nodes that have the biggest amounts of free memory helps
+keeping the memory fragmentation small, from a system wide perspective.
+Finally, in case more candidates fulfil these criteria by the same
+extent, choosing the candidate that is hosting fewer domain helps
+balancing the load on the various nodes.
+
+The last step is figuring out whether the selected candidate contains
+at least as much CPUs as the number of VCPUs of the VM. The current
+solution for the case when this is not verified is just to add some
+more nodes, until the condition turns into being true. When doing
+this, the nodes with the least possible distance from the ones
+already in the nodemap are considered.
+
+## Guest Placement within libxl ##
+
+xl achieves automatic NUMA placement by means of the following API
+calls, provided by libxl.
+
+

Dario Faggioli

2012-Jun-18 15:54 UTC

head link

Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation

On Fri, 2012-06-15 at 19:04 +0200, Dario Faggioli wrote:> +### Automatic Guest Placement with xl ###
> +
> +In case no "cpus=" option is specified in the config file, xl
tries
> +to figure out on its own on which node(s) the domain could fit best.
> +
> +First of all, it needs to find a node (or a set of nodes) that have
> +enough free memory for accommodating the domain. After that, the actual
> +decision on where to put the new guest happens by generating all the
> +possible combinations of nodes that satisfies the above and chose among
> +them according to the following heuristics:
> +
> +  *  candidates involving fewer nodes come first. In case two (or more)
> +     candidates span the same number of nodes,
> +  *  candidates with greater amount of free memory come first. In case
> +     two (or more) candidates differ in their amount of free memory by
> +     less than 10%,
> +  *  candidates with fewer domains already placed on them come first.
> +
> +Giving preference to small candidates ensures better performance for
> +the guest, as it avoid spreading its memory among different nodes.
> +Using the nodes that have the biggest amounts of free memory helps
> +keeping the memory fragmentation small, from a system wide perspective.
> +Finally, in case more candidates fulfil these criteria by the same
> +extent, choosing the candidate that is hosting fewer domain helps
> +balancing the load on the various nodes.
> +
This part below is basically a leftover from previous version. Things
does not work like that any longer. Basically, what we do is looking for
all the nodes, or a sets of nodes, that have enough free memory and at
least as much pCPUs as the domain has vCPUs and then apply the
heuristics outlined above.> +The last step is figuring out whether the selected candidate contains
> +at least as much CPUs as the number of VCPUs of the VM. The current
> +solution for the case when this is not verified is just to add some
> +more nodes, until the condition turns into being true. When doing
> +this, the nodes with the least possible distance from the ones
> +already in the nodemap are considered.
> +
Sorry, (already fixed in my local patchqueue)
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Ian Campbell

2012-Jun-21 08:53 UTC

head link

Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> Causing a build failure when trying to use it:
> 
> xxx: error: expected '';'' before '')''
token
> xxx: error: expected statement before '')'' token
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -1972,7 +1972,7 @@ struct libxl__domain_create_state {
>  #define GCREALLOC_ARRAY(var, nmemb)                                     \
>      (assert(nmemb > 0),                                                
\
>       assert(ARRAY_SIZE_OK((var), (nmemb))),                             \
> -     (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var)))))
> +     (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var))))
>  
> 
>  /*

Ian Campbell

2012-Jun-21 09:02 UTC

head link

Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> Make some NUMA node information available to the toolstack. Achieve
> this by means of xc_numainfo(), which exposes memory size and amount
> of free memory of each node, as well as the relative distances of
> each node to all the others.
> 
> For properly exposing distances we need the IDL to support arrays.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> Changes from v1:
>  * malloc converted to libxl__zalloc(NOGC, ...).
>  * The patch also accommodates some bits of what was in "libxc,
>    libxl: introduce xc_nodemap_t and libxl_nodemap" which was
>    removed as well, as full support for node maps at libxc
>    level is not needed (yet!).
> 
> diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
> --- a/tools/libxc/xc_misc.c
> +++ b/tools/libxc/xc_misc.c
> @@ -35,6 +35,20 @@ int xc_get_max_cpus(xc_interface *xch)
>      return max_cpus;
>  }
>  
> +int xc_get_max_nodes(xc_interface *xch)
> +{
> +    static int max_nodes = 0;
> +    xc_physinfo_t physinfo;
> +
> +    if ( max_nodes )
> +        return max_nodes;
> +
> +    if ( !xc_physinfo(xch, &physinfo) )
> +        max_nodes = physinfo.max_node_id + 1;
> +
> +    return max_nodes;
> +}
> +
>  int xc_get_cpumap_size(xc_interface *xch)
>  {
>      return (xc_get_max_cpus(xch) + 7) / 8;
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -329,6 +329,12 @@ int xc_get_cpumap_size(xc_interface *xch
>  /* allocate a cpumap */
>  xc_cpumap_t xc_cpumap_alloc(xc_interface *xch);
>  
> + /*
> + * NODEMAP handling
> + */
> +/* return maximum number of NUMA nodes the hypervisor supports */
> +int xc_get_max_nodes(xc_interface *xch);
> +
>  /*
>   * DOMAIN DEBUGGING FUNCTIONS
>   */
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -3223,6 +3223,71 @@ fail:
>      return ret;
>  }
>  
> +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr)
> +{[...]

The hypercall buffer stuff all looks good.
> +    if (ret)
> +        *nr = max_nodes;
You could put this before the fail: label. Not that it matters.
> +    return ret;
> +}
> +
>  const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx)
>  {
>      union {
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -532,6 +532,9 @@ int libxl_domain_preserve(libxl_ctx *ctx
>  /* get max. number of cpus supported by hypervisor */
>  int libxl_get_max_cpus(libxl_ctx *ctx);
>  
> +/* get max. number of NUMA nodes supported by hypervisor */
> +int libxl_get_max_nodes(libxl_ctx *ctx);
> +
>  int libxl_domain_rename(libxl_ctx *ctx, uint32_t domid,
>                          const char *old_name, const char *new_name);
>  
> @@ -769,6 +772,13 @@ int libxl_get_physinfo(libxl_ctx *ctx, l
>  #define LIBXL_CPUTOPOLOGY_INVALID_ENTRY (~(uint32_t)0)
>  libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nr);
>  void libxl_cputopology_list_free(libxl_cputopology *, int nr);
> +#define LIBXL_NUMAINFO_INVALID_ENTRY (~(uint32_t)0)
> +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr);
> +  /* On success, a list of nr libxl_numainfo elements is returned.
> +   * That is from malloc, thus it is up to the caller to invoke
> +   * libxl_cpupoolinfo_list_free() on it.
Don''t you mean libxl_numinfo_list_free() ?

Also normally we put the comment before the prototype.
> +   */
> +void libxl_numainfo_list_free(libxl_numainfo *, int nr);
>  libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
>                                         int *nb_vcpu, int *nrcpus);
>  void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr);
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -423,6 +423,12 @@ libxl_physinfo = Struct("physinfo", [
>      ("cap_hvm_directio", bool),
>      ], dir=DIR_OUT)
>  
> +libxl_numainfo = Struct("numainfo", [
> +    ("size", uint64),
> +    ("free", uint64),
> +    ("dists", Array(uint32, "num_dists")),
This is an array of distances from this node to each other node? That
could be worth a comment.

> +    ], dir=DIR_OUT)
> +
>  libxl_cputopology = Struct("cputopology", [
>      ("core", uint32),
>      ("socket", uint32),
> diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
> --- a/tools/libxl/libxl_utils.c
> +++ b/tools/libxl/libxl_utils.c
> @@ -537,6 +537,11 @@ int libxl_get_max_cpus(libxl_ctx *ctx)
>      return xc_get_max_cpus(ctx->xch);
>  }
>  
> +int libxl_get_max_nodes(libxl_ctx *ctx)
> +{
> +    return xc_get_max_nodes(ctx->xch);
> +}
Is this needed externally to libxl or do we expect all callers to use
libxl_get_numainfo? I suppose there is no harm in exporting this either
way.
> +
>  int libxl__enum_from_string(const libxl_enum_string_table *t,
>                              const char *s, int *e)
>  {
> @@ -551,6 +556,14 @@ int libxl__enum_from_string(const libxl_
>      return ERROR_FAIL;
>  }
>  
> +void libxl_numainfo_list_free(libxl_numainfo *list, int nr)
> +{
> +    int i;
> +    for (i = 0; i < nr; i++)
> +        libxl_numainfo_dispose(&list[i]);
> +    free(list);
> +}
> +
>  void libxl_cputopology_list_free(libxl_cputopology *list, int nr)
>  {
>      int i;
> diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -484,6 +484,7 @@ typedef struct xen_sysctl_topologyinfo x
>  DEFINE_XEN_GUEST_HANDLE(xen_sysctl_topologyinfo_t);
>  
>  /* XEN_SYSCTL_numainfo */
> +#define INVALID_NUMAINFO_ID (~0U)
It feels like there ought to be hunks in the hypervisor which either use
this symbol instead of a hardcoded ~0U or which remove the internal
definition in favour of this one?

Ian Campbell

2012-Jun-21 09:04 UTC

head link

Re: [PATCH 04 of 10 v2] xl: add more NUMA information to `xl info -n''

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> So that the user knows how much memory there is on each node and
> how far they are from each others.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> Changes from v1:
>  * integer division replaced by right shift.
> 
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -4254,6 +4254,36 @@ static void output_physinfo(void)
>      return;
>  }
>  
> +static void output_numainfo(void)
> +{
> +    libxl_numainfo *info;
> +    int i, j, nr;
> +
> +    info = libxl_get_numainfo(ctx, &nr);
> +    if (info == NULL) {
> +        fprintf(stderr, "libxl_get_numainfo failed.\n");
> +        return;
> +    }
> +
> +    printf("numa_info              :\n");
> +    printf("node:    memsize    memfree    distances\n");
> +
> +    for (i = 0; i < nr; i++) {
> +        if (info[i].size != LIBXL_NUMAINFO_INVALID_ENTRY) {
> +            printf("%4d:    %6"PRIu64"    
%6"PRIu64"      %d", i,
> +                   info[i].size >> 20, info[i].free >> 20,
> +                   info[i].dists[0]);
> +            for (j = 1; j < info[i].num_dists; j++)
> +                printf(",%d", info[i].dists[j]);
> +            printf("\n");
> +        }
> +    }
> +
> +    libxl_numainfo_list_free(info, nr);
> +
> +    return;
> +}
> +
>  static void output_topologyinfo(void)
>  {
>      libxl_cputopology *info;
> @@ -4276,8 +4306,6 @@ static void output_topologyinfo(void)
>  
>      libxl_cputopology_list_free(info, nr);
>  
> -    printf("numa_info              : none\n");
> -
>      return;
>  }
>  
> @@ -4287,8 +4315,10 @@ static void info(int numa)
>  
>      output_physinfo();
>  
> -    if (numa)
> +    if (numa) {
>          output_topologyinfo();
> +        output_numainfo();
> +    }
>  
>      output_xeninfo();
>

Ian Campbell

2012-Jun-21 09:12 UTC

head link

Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> And leave to the caller the burden of knowing and remembering what kind
> of bitmap each instance of libxl_bitmap is.
> 
> This is basically just some s/libxl_cpumap/libxl_bitmap/ (and some other
> related interface name substitution, e.g., libxl_for_each_cpu) in a bunch
> of files, with no real functional change involved.
> 
> A specific allocation helper is introduced, besides libxl_bitmap_alloc().
> It is called libxl_cpu_bitmap_alloc() and is meant at substituting the old
> libxl_cpumap_alloc(). It is just something easier to use in cases where one
> wants to allocate a libxl_bitmap that is going to serve as a cpu map.
> 
> This is because we want to be able to deal with both cpu and NUMA node
> maps, but we don''t want to duplicate all the various helpers and
wrappers.
FWIW I''d have been perfectly happy with a bunch of 
	#define for_each_cpu(mumble) for_each_bit(mumble)
type stuff, but I think Ian J and yourself didn''t like those which
I''m
also fine with.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.eu.com>
> diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c
> --- a/tools/libxl/libxl_json.c
> +++ b/tools/libxl/libxl_json.c
> @@ -99,8 +99,8 @@ yajl_gen_status libxl_uuid_gen_json(yajl
>      return yajl_gen_string(hand, (const unsigned char *)buf,
LIBXL_UUID_FMTLEN);
>  }
> 
> -yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand,
> -                                      libxl_cpumap *cpumap)
> +yajl_gen_status libxl_bitmap_gen_json(yajl_gen hand,
> +                                      libxl_bitmap *cpumap)
Minor nit: You likely meant to rename cpumap in the argument list too.
>  {
>      yajl_gen_status s;
>      int i;
> @@ -108,8 +108,8 @@ yajl_gen_status libxl_cpumap_gen_json(ya
>      s = yajl_gen_array_open(hand);
>      if (s != yajl_gen_status_ok) goto out;
> 
> -    libxl_for_each_cpu(i, *cpumap) {
> -        if (libxl_cpumap_test(cpumap, i)) {
> +    libxl_for_each_bit(i, *cpumap) {
> +        if (libxl_bitmap_test(cpumap, i)) {
>              s = yajl_gen_integer(hand, i);
>              if (s != yajl_gen_status_ok) goto out;
>          }
Even with that minor nit:

Acked-by: Ian Campbell <ian.campbell@citrix.com>

Ian Campbell

2012-Jun-21 09:30 UTC

head link

Re: [PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> By adding copying and *_is_full/*_is_empty facilities.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> Changes from v1:
>  * now libxl_is_full/empty return 1 if true and 0 if false,
>    as logic (and as requested during review).
> 
> diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
> --- a/tools/libxl/libxl_utils.c
> +++ b/tools/libxl/libxl_utils.c
> @@ -506,6 +506,35 @@ void libxl_bitmap_dispose(libxl_bitmap *
>      free(map->map);
>  }
>  
> +void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr,
> +                       const libxl_bitmap *sptr)
> +{
> +    int sz;
> +
> +    sz = dptr->size = sptr->size;
> +    memcpy(dptr->map, sptr->map, sz * sizeof(*dptr->map));
I think you need either some checks/asserts on the two relative sizes or
a reallocation of the destination map. Otherwise you risk an overflow.

Perhaps a simple assert (i.e. make it the callers problem) would be ok?
> +}
> +
> +int libxl_bitmap_is_full(const libxl_bitmap *bitmap)
> +{
> +    int i;
> +
> +    for (i = 0; i < bitmap->size; i++)
> +        if (bitmap->map[i] != (uint8_t)-1)
> +            return 0;
> +   return 1;
> +}
> +
> +int libxl_bitmap_is_empty(const libxl_bitmap *bitmap)
> +{
> +    int i;
> +
> +    for (i = 0; i < bitmap->size; i++)
> +        if (bitmap->map[i])
> +            return 0;
> +    return 1;
> +}
> +
>  int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit)
>  {
>      if (bit >= bitmap->size * 8)
> diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
> --- a/tools/libxl/libxl_utils.h
> +++ b/tools/libxl/libxl_utils.h
> @@ -66,6 +66,10 @@ int libxl_vdev_to_device_disk(libxl_ctx 
>  int libxl_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *bitmap, int n_bits);
>      /* Allocated bimap is from malloc, libxl_bitmap_dispose() to be
>       * called by the application when done. */
> +void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr,
> +                       const libxl_bitmap *sptr);
> +int libxl_bitmap_is_full(const libxl_bitmap *bitmap);
> +int libxl_bitmap_is_empty(const libxl_bitmap *bitmap);
>  int libxl_bitmap_test(const libxl_bitmap *bitmap, int bit);
>  void libxl_bitmap_set(libxl_bitmap *bitmap, int bit);
>  void libxl_bitmap_reset(libxl_bitmap *bitmap, int bit);

Ian Campbell

2012-Jun-21 09:35 UTC

head link

Re: [PATCH 07 of 10 v2] libxl: introduce some node map helpers

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:
> +int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
> +                            const libxl_bitmap *nodemap,
> +                            libxl_bitmap *cpumap);
> +    /* populate cpumap with the cpus spanned by the nodes in nodemap */
Comments before proto please (I know libxl is a bit of a mish-mash, but
we''ll try to be consistent going forward...)

Otherwise: Acked-by: Ian Campbell <ian.campbell@citrix.com>
> +int libxl_cpumap_to_nodemap(libxl_ctx *ctx,
> +                            const libxl_bitmap *cpuemap,
> +                            libxl_bitmap *nodemap);
> +    /* populate nodemap with the nodes of the cpus in cpumap */
> +
>  static inline uint32_t libxl__sizekb_to_mb(uint32_t s) {
>      return (s + 1023) / 1024;
>  }

Dario Faggioli

2012-Jun-21 09:44 UTC

head link

Re: [PATCH 07 of 10 v2] libxl: introduce some node map helpers

On Thu, 2012-06-21 at 10:35 +0100, Ian Campbell wrote:> On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:
> 
> > +int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
> > +                            const libxl_bitmap *nodemap,
> > +                            libxl_bitmap *cpumap);
> > +    /* populate cpumap with the cpus spanned by the nodes in nodemap
*/
> 
> Comments before proto please 
>Ok.
> (I know libxl is a bit of a mish-mash, but
> we''ll try to be consistent going forward...)
> Yeah, it''s just hard to figure out which one of the various suits you
can find in the file(s) for similar things that one should follow! :-P

Thanks for looking at this,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Dario Faggioli

2012-Jun-21 09:46 UTC

head link

Re: [PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit

On Thu, 2012-06-21 at 10:30 +0100, Ian Campbell wrote:> > diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
> > --- a/tools/libxl/libxl_utils.c
> > +++ b/tools/libxl/libxl_utils.c
> > @@ -506,6 +506,35 @@ void libxl_bitmap_dispose(libxl_bitmap *
> >      free(map->map);
> >  }
> >  
> > +void libxl_bitmap_copy(libxl_ctx *ctx, libxl_bitmap *dptr,
> > +                       const libxl_bitmap *sptr)
> > +{
> > +    int sz;
> > +
> > +    sz = dptr->size = sptr->size;
> > +    memcpy(dptr->map, sptr->map, sz * sizeof(*dptr->map));
> 
> I think you need either some checks/asserts on the two relative sizes or
> a reallocation of the destination map. Otherwise you risk an overflow.
> 
> Perhaps a simple assert (i.e. make it the callers problem) would be ok?
> Sounds reasonable, will do that.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Dario Faggioli

2012-Jun-21 09:49 UTC

head link

Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap

On Thu, 2012-06-21 at 10:12 +0100, Ian Campbell wrote:> On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:
> > And leave to the caller the burden of knowing and remembering what
kind
> > of bitmap each instance of libxl_bitmap is.
> > 
> > This is basically just some s/libxl_cpumap/libxl_bitmap/ (and some
other
> > related interface name substitution, e.g., libxl_for_each_cpu) in a
bunch
> > of files, with no real functional change involved.
> > 
> > A specific allocation helper is introduced, besides
libxl_bitmap_alloc().
> > It is called libxl_cpu_bitmap_alloc() and is meant at substituting the
old
> > libxl_cpumap_alloc(). It is just something easier to use in cases
where one
> > wants to allocate a libxl_bitmap that is going to serve as a cpu map.
> > 
> > This is because we want to be able to deal with both cpu and NUMA node
> > maps, but we don''t want to duplicate all the various helpers
and wrappers.
> 
> FWIW I''d have been perfectly happy with a bunch of 
> 	#define for_each_cpu(mumble) for_each_bit(mumble)
> type stuff, but I think Ian J and yourself didn''t like those which
I''m
> also fine with.
> Well, I''m fine with both approaches, actually. However, I''ve
been asked
by Ian during last round to kill the duplication as much as I can, and
here it is what I came out with. :-)

That being said, if it''s just a matter of adding back a couple of
libxl_for_each_{set_}cpu as a define to libxl_for_each_{set_} bit, I
could well do that, as it looks a "reasonable" amount of wrapper
duplication to me...
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.eu.com>
> > diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c
> > --- a/tools/libxl/libxl_json.c
> > +++ b/tools/libxl/libxl_json.c
> > @@ -99,8 +99,8 @@ yajl_gen_status libxl_uuid_gen_json(yajl
> >      return yajl_gen_string(hand, (const unsigned char *)buf,
LIBXL_UUID_FMTLEN);
> >  }
> > 
> > -yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand,
> > -                                      libxl_cpumap *cpumap)
> > +yajl_gen_status libxl_bitmap_gen_json(yajl_gen hand,
> > +                                      libxl_bitmap *cpumap)
> 
> Minor nit: You likely meant to rename cpumap in the argument list too.
> In this case, I sure did, thanks for pointing it out.
> Even with that minor nit:
> 
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> Good!

I''ll fix the nit, as I have to resubmit anyway. :-P

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Dario Faggioli

2012-Jun-21 10:00 UTC

head link

Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()

On Thu, 2012-06-21 at 10:02 +0100, Ian Campbell wrote:> > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> > --- a/tools/libxc/xenctrl.h
> > +++ b/tools/libxc/xenctrl.h
> ...
> > +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr)
> > +{
> [...]
> 
> The hypercall buffer stuff all looks good.
> 
> > +    if (ret)
> > +        *nr = max_nodes;
> 
> You could put this before the fail: label. Not that it matters.
> Ok.
> > diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> > --- a/tools/libxl/libxl.h
> > +++ b/tools/libxl/libxl.h
> ...
> >  #define LIBXL_CPUTOPOLOGY_INVALID_ENTRY (~(uint32_t)0)
> >  libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nr);
> >  void libxl_cputopology_list_free(libxl_cputopology *, int nr);
> > +#define LIBXL_NUMAINFO_INVALID_ENTRY (~(uint32_t)0)
> > +libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr);
> > +  /* On success, a list of nr libxl_numainfo elements is returned.
> > +   * That is from malloc, thus it is up to the caller to invoke
> > +   * libxl_cpupoolinfo_list_free() on it.
> 
> Don''t you mean libxl_numinfo_list_free() ?
> 
> Also normally we put the comment before the prototype.
> Yes, I did, and will fix it. For the comment, again, I''ll move that
up... It''s just you can find so much different "examples" in
those
files... :-O
> > diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
> > --- a/tools/libxl/libxl_utils.c
> > +++ b/tools/libxl/libxl_utils.c
> > @@ -537,6 +537,11 @@ int libxl_get_max_cpus(libxl_ctx *ctx)
> >      return xc_get_max_cpus(ctx->xch);
> >  }
> >  
> > +int libxl_get_max_nodes(libxl_ctx *ctx)
> > +{
> > +    return xc_get_max_nodes(ctx->xch);
> > +}
> 
> Is this needed externally to libxl or do we expect all callers to use
> libxl_get_numainfo? I suppose there is no harm in exporting this either
> way.
> I''m not sure. What I did is to replicate what happens for
libxl_get_max_cpus(), but I really don''t know whether or not they both
make any sense outside libxl. It does not look that bad to me that we
offer our users a chance to figure out how many cpus and/or nodes they
have, without needing to call the proper libxl_get_*info(), which is
quite a bit more of a burden. FWIW, I''d leave both of them public.

> > diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
> > --- a/xen/include/public/sysctl.h
> > +++ b/xen/include/public/sysctl.h
> > @@ -484,6 +484,7 @@ typedef struct xen_sysctl_topologyinfo x
> >  DEFINE_XEN_GUEST_HANDLE(xen_sysctl_topologyinfo_t);
> >  
> >  /* XEN_SYSCTL_numainfo */
> > +#define INVALID_NUMAINFO_ID (~0U)
> 
> It feels like there ought to be hunks in the hypervisor which either use
> this symbol instead of a hardcoded ~0U or which remove the internal
> definition in favour of this one?
> Again, -topologyinfo machinery does exactly this, so I really think we
either fix/change or leave as they are both of them (which of course I
can do, just tell me if that is what you want).

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Ian Campbell

2012-Jun-21 10:21 UTC

head link

Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()

> > > diff --git a/tools/libxl/libxl_utils.c
b/tools/libxl/libxl_utils.c
> > > --- a/tools/libxl/libxl_utils.c
> > > +++ b/tools/libxl/libxl_utils.c
> > > @@ -537,6 +537,11 @@ int libxl_get_max_cpus(libxl_ctx *ctx)
> > >      return xc_get_max_cpus(ctx->xch);
> > >  }
> > >  
> > > +int libxl_get_max_nodes(libxl_ctx *ctx)
> > > +{
> > > +    return xc_get_max_nodes(ctx->xch);
> > > +}
> > 
> > Is this needed externally to libxl or do we expect all callers to use
> > libxl_get_numainfo? I suppose there is no harm in exporting this
either
> > way.
> > 
> I''m not sure. What I did is to replicate what happens for
> libxl_get_max_cpus(), but I really don''t know whether or not they
both
> make any sense outside libxl. It does not look that bad to me that we
> offer our users a chance to figure out how many cpus and/or nodes they
> have, without needing to call the proper libxl_get_*info(), which is
> quite a bit more of a burden. FWIW, I''d leave both of them public.
OK.
> > > diff --git a/xen/include/public/sysctl.h
b/xen/include/public/sysctl.h
> > > --- a/xen/include/public/sysctl.h
> > > +++ b/xen/include/public/sysctl.h
> > > @@ -484,6 +484,7 @@ typedef struct xen_sysctl_topologyinfo x
> > >  DEFINE_XEN_GUEST_HANDLE(xen_sysctl_topologyinfo_t);
> > >  
> > >  /* XEN_SYSCTL_numainfo */
> > > +#define INVALID_NUMAINFO_ID (~0U)
> > 
> > It feels like there ought to be hunks in the hypervisor which either
use
> > this symbol instead of a hardcoded ~0U or which remove the internal
> > definition in favour of this one?
> > 
> Again, -topologyinfo machinery does exactly this, so I really think we
> either fix/change or leave as they are both of them (which of course I
> can do, just tell me if that is what you want).
Lets leave it as is then.

Ian.

Ian Campbell

2012-Jun-21 10:22 UTC

head link

Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap

On Thu, 2012-06-21 at 10:49 +0100, Dario Faggioli wrote:> On Thu, 2012-06-21 at 10:12 +0100, Ian Campbell wrote:
> > On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:
> > > And leave to the caller the burden of knowing and remembering
what kind
> > > of bitmap each instance of libxl_bitmap is.
> > > 
> > > This is basically just some s/libxl_cpumap/libxl_bitmap/ (and
some other
> > > related interface name substitution, e.g., libxl_for_each_cpu) in
a bunch
> > > of files, with no real functional change involved.
> > > 
> > > A specific allocation helper is introduced, besides
libxl_bitmap_alloc().
> > > It is called libxl_cpu_bitmap_alloc() and is meant at
substituting the old
> > > libxl_cpumap_alloc(). It is just something easier to use in cases
where one
> > > wants to allocate a libxl_bitmap that is going to serve as a cpu
map.
> > > 
> > > This is because we want to be able to deal with both cpu and NUMA
node
> > > maps, but we don''t want to duplicate all the various
helpers and wrappers.
> > 
> > FWIW I''d have been perfectly happy with a bunch of 
> > 	#define for_each_cpu(mumble) for_each_bit(mumble)
> > type stuff, but I think Ian J and yourself didn''t like those
which I''m
> > also fine with.
> > 
> Well, I''m fine with both approaches, actually. However,
I''ve been asked
> by Ian during last round to kill the duplication as much as I can, and
> here it is what I came out with. :-)
> 
> That being said, if it''s just a matter of adding back a couple of
> libxl_for_each_{set_}cpu as a define to libxl_for_each_{set_} bit, I
> could well do that, as it looks a "reasonable" amount of wrapper
> duplication to me...
Lets not bother now -- it''s easier to add a wrapper in the future than
to remove a bad one done in haste in the last stages of the release.

Ian Campbell

2012-Jun-21 11:40 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> If a domain does not have a VCPU affinity, try to pin it automatically to
some
> PCPUs. This is done taking into account the NUMA characteristics of the
host.
> In fact, we look for a combination of host''s NUMA nodes with
enough free memory
> and number of PCPUs for the new domain, and pin it to the VCPUs of those
nodes.
> 
> Once we know which ones, among all the possible combinations, represents
valid
> placement candidates for a domain, use some heuistics for deciding which is
the
                                              heuristics
> best. For instance, smaller candidates are considered to be better, both
from
> the domain''s point of view (fewer memory spreading among nodes)
and from the
> system as a whole point of view (fewer memoy fragmentation).  In case of
                                         memory
> candidates of equal sizes (i.e., with the same number of nodes), the one
with
> the greater amount of memory wins, as this is also good for keeping memory
> fragmentation under control.
> 
> This all happens internally to libxl, and no API for driving the mechanism
is
> provided for now. This matches what xend already does.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> Changes from v1:
>  * This patches incorporates the changes from both "libxl, xl: enable
automatic
>    placement of guests on NUMA nodes" and "libxl, xl: heuristics
for reordering
>    NUMA placement candidates" from v1.
>  * The logic of the algorithm is basically the same as in v1, but the
splitting
>    of it in the various functions has been completely redesigned from
scratch.
>  * No public API for placement or candidate generation is now exposed,
>    everything happens within libxl, as agreed during v1 review.
>  * The relevant documentation have been moved near the actual functions and
>    features. Also, the amount and (hopefully!) the quality of the
documentation
>    has been improved a lot, as requested.
>  * All the comments about using the proper libxl facilities and helpers for
>    allocations, etc., have been considered and applied.
>  * This patch still bails out from NUMA optimizations if it find out
cpupools
>    are being utilized. It is next patch that makes the two things interact
>    properly, as suggested during review.
> 
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -111,8 +111,8 @@ created online and the remainder will be
> 
>  =item B<cpus="CPU-LIST">
> 
> -List of which cpus the guest is allowed to use. Default behavior is
There is (according to grep) a slight preference for British spelling,
but I imagine we are totally inconsistent all over the place so lets not
worry too much ;-)
> -`all cpus`. A C<CPU-LIST> may be specified as follows:
> +List of which cpus the guest is allowed to use. By default xl will (via
> +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as
follows:
> 
>  =over 4
> 
> @@ -132,6 +132,12 @@ run on cpu #3 of the host.
> 
>  =back
> 
> +If this option is not specified, libxl automatically tries to place the
new
> +domain on the host''s NUMA nodes (provided the host has more than
one NUMA
> +node) by pinning it to the cpus of those nodes. An heuristical approach is
I don''t think heuristical is a word, I think "A heuristic
approach..."
is fine

(not sure about "A heuristic" vs "An heuristic" but A sounds
more
natural to me, generally the rule is An before "vowel sounds" so I
guess
it depends on how you pronounce the h in heuristic, which varies even
depending on which bit of England you are from ;-))
> +utilized with the goals of maximizing performance for the domain and, at
> +the same time, achieving efficient utilization of the host''s CPUs
and RAM.
> +
>  =item B<cpu_weight=WEIGHT>
> 
>  A domain with a weight of 512 will get twice as much CPU as a domain
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
I think it would be quite reasonable for you to create a new
libxl_numa(_placement).c if you wanted too. likewise a libxl_numa.h
header. Up to you.
> @@ -95,6 +95,459 @@ out:
>      return sched;
>  }
> 
> +/*
> + * What follows are helpers for generating all the k-combinations
> + * without repetitions of a set S with n elements in it. Formally
> + * speaking, they are subsets of k distinct elements of S and, if
> + * S is n elements big, the number of k-combinations is equal to
> + * the binomial coefficient (n k), which means we get exactly
> + * n!/(k! * (n - k)!) subsets, all of them with k elements.
> + *
> + * The various subset are geneated one after the other by calling
                            generated
> + * comb_init() first, and, after that, comb_next()
> + * (n k)-1 times. An iterator is used to store the curent status
                                                      current
> + * of the whole generation operation (i.e., basically, the last
> + * combination that has been generated). As soon as all
> + * combinations have been generated, comb_next() will
> + * start returning 0 instead of 1. It is of course impotant that
                                                      important
> + * the same instance of the iterator and the same values for
> + * n and k are used for each call. If that doesn''t happen, the
> + * result is unspecified.
> + *
> + * The algorithm is a well known one,
Worth a link/reference?
>  and it produces the
> + * combinations in such a way that they (well, more precisely,
> + * their indexes it the array/map representing the set) come in
> + * lexicogaphic order.
      lexicographic
> + *
> + * For example, with n = 5 and k = 3, calling comb_init()
> + * will generate { 0, 1, 2 }, while subsequent valid calls to
> + * comb_next() will produce the following:
> + * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 },
                ^ strictly speaking this 0,1,2 isn''t produced by a call
                to comb_next since it came from the initial comb_init,
                right? IOW the first comb_next() call won''t duplicate
                it...

I''m not going to try very hard to review the algorithm here, I trust
you
and I''ve got a head cold ;-)
> + *   { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 },
> + *   { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 },
> + *   { 2, 3, 4 } }
> + *
> + * This is used by the automatic NUMA placement logic below.
> + */
> +typedef int* comb_iter_t;
> +
> +static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k)
> +{
> +    comb_iter_t new_iter;
> +    int i;
> +
> +    if (n < k)
> +        return 0;
> +
> +    /* First set is always { 0, 1, 2, ..., k-1 } */
> +    GCNEW_ARRAY(new_iter, k);
> +    for (i = 0; i < k; i++)
> +        new_iter[i] = i;
> +
> +    *it = new_iter;
> +    return 1;
> +}
> +
> +static int comb_next(comb_iter_t it, int n, int k)
> +{
> +    int i;
> +
> +    /*
> +     * The idea here is to find the leftmost element from where
> +     * we should start incrementing the indexes of the iterator.
> +     * This means looking for the highest index that can be increased
> +     * while still producing value smaller than n-1. In the example
> +     * above, when dealing with { 0, 1, 4 }, such an element is the
> +     * second one, as the third is already equal to 4 (which actually
> +     * is n-1).
> +     * Once we found from where to start, we increment that element
> +     * and overide the right-hand rest of the iterator with its
              override
> +     * successors, thus achieving lexicographic ordering.
> +     *
> +     * Regarding the termination of the generation process, when we
> +     * manage in bringing n-k at the very first position of the iterator,
> +     * we know that is the last valid combination ( { 2, 3, 4 }, with
> +     * n - k = 5 - 2 = 2, in the example above), and thus we start
> +     * returning 0 as soon as we cross that border.
> +     */
> +    for (i = k - 1; it[i] == n - k + i; i--) {
> +        if (i <= 0)
> +            return 0;
> +    }
> +    for (it[i]++, i++; i < k; i++)
> +        it[i] = it[i - 1] + 1;
> +    return 1;
> +}
> +
> +/* NUMA automatic placement (see libxl_internal.h for details) */
> +
> +/*
> + * This function turns a k-combination iterator into a node map.
> + * This means the bits in the node map corresponding to the indexes
> + * of the given combination are the ones that will be set.
> + * For example, if the iterator represents the combination { 0, 2, 4},
> + * the node map will have bits #0, #2 and #4 set to one and i thus
> + * will point to node 0, node 2 and and node 4.
What is "i" in this last bit ("i thus will point to")? I
don''t think you
are referring to the loop iterator variable.

If you meant "it" then I don''t think it needs saying that a
node map
with bits 0, 2 and 4 set refers to nodes 0, 2 and 4.
> + */
> +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k)
> +{
> +    int i;
> +
> +    libxl_bitmap_set_none(nodemap);
> +    for (i = 0; i < k; i++)
> +        libxl_bitmap_set(nodemap, it[i]);
> +}
> +
> +/* Retrieve the number of cpus that the nodes that are part of the nodemap
> + * span. */
> +static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus,
> +                                 const libxl_bitmap *nodemap)
nodes here == node''s ? But can''t have an apostrophe in a
function which
makes it ready weird to me. Perhaps "nodemap_count_cpus"?
> +{
> +    int i, nodes_cpus = 0;
> +
> +    for (i = 0; i < nr_cpus; i++) {
> +        if (libxl_bitmap_test(nodemap, tinfo[i].node))
> +            nodes_cpus++;
> +    }
> +    return nodes_cpus;
> +}
> +
[...]> +/* Retrieve the number of domains that can potentially run on the cpus
> + * the nodes that are part of the nodemap. */
> +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo,
> +                                 const libxl_bitmap *nodemap)
> +{
> +    libxl_dominfo *dinfo = NULL;
> +    libxl_bitmap dom_nodemap;
> +    int nr_doms, nr_cpus;
> +    int nr_domains = 0;
> +    int i, j, k;
> +
> +    dinfo = libxl_list_domain(CTX, &nr_doms);
> +    if (dinfo == NULL)
> +        return ERROR_FAIL;
> +
> +    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) {
> +        nr_domains = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    for (i = 0; i < nr_doms; i++) {
> +        libxl_vcpuinfo *vinfo;
> +        int nr_vcpus;
> +
> +        vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus,
&nr_cpus);
> +        if (vinfo == NULL)
> +            continue;
> +
> +        libxl_bitmap_set_none(&dom_nodemap);
> +        for (j = 0; j < nr_vcpus; j++) {
> +            libxl_for_each_set_bit(k, vinfo[j].cpumap)
> +                libxl_bitmap_set(&dom_nodemap, tinfo[k].node);
> +        }
> +
> +        libxl_for_each_set_bit(j, dom_nodemap) {
> +            if (libxl_bitmap_test(nodemap, j)) {
> +                nr_domains++;
> +                goto found;
AKA break?
> +            }
> +        }
> + found:
> +        libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
> +    }
> +
> + out:
> +    libxl_bitmap_dispose(&dom_nodemap);
You can come to out if libxl_nodebitmap_alloc fails -- in which case
dom_nodemap is (potentially) uninitialised.

You could just move out: down a line.

Otherwise we''d need libxl_bitmap_init which zeroes everything such that
calling dispose is safe even if you don''t call alloc. I don''t
mind
which.
> +    libxl_dominfo_list_free(dinfo, nr_doms);
> +    return nr_domains;
> +}
> +
> +/*
> + * This function tries to figure out if the host has a consistent number
> + * of cpus along all its NUMA nodes. In fact, if that is the case, we can
> + * calculate the minimum number of nodes needed for a domain by just
> + * dividing its total number of vcpus by this value computed here.
> + * However, we are not allowed to assume that all the nodes have the
> + * same number of cpus. Therefore, in case discrepancies among different
> + * nodes are found, this function just returns 0 and the caller needs
> + * to sort out things in some other way.
If the caller has to deal with this anyway why bother with this check
and/or the other algorithm?
> + */
> +static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus,
> +                               libxl_numainfo *ninfo, int nr_nodes)
> +{
> +    int cpus_per_node = 0;
> +    int j, i;
> +
> +    /* This makes sense iff # of PCPUs is the same for all nodes */
> +    for (j = 0; j < nr_nodes; j++) {
> +        int curr_cpus = 0;
> +
> +        for (i = 0; i < nr_cpus; i++) {
> +            if (tinfo[i].node == j)
> +                curr_cpus++;
> +        }
> +        /* So, if the above does not hold, turn the whole thing off! */
> +        cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node;
> +        if (cpus_per_node != curr_cpus)
> +            return 0;
> +    }
> +    return cpus_per_node;
> +}
> +
> +/* Get all the placement candidates satisfying some specific conditions */
> +int libxl__get_numa_candidates(libxl__gc *gc,
> +                               uint32_t min_free_memkb, int min_cpus,
> +                               int min_nodes, int max_nodes,
> +                               libxl__numa_candidate *cndts[], int
*nr_cndts)
> +{
> +    libxl__numa_candidate *new_cndts = NULL;
> +    libxl_cputopology *tinfo = NULL;
> +    libxl_numainfo *ninfo = NULL;
> +    libxl_bitmap nodemap;
> +    int nr_nodes, nr_cpus;
> +    int array_size, rc;
> +
> +    /* Get platform info and prepare the map for testing the combinations
*/
> +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> +    if (ninfo == NULL)
> +        return ERROR_FAIL;
> +    /* If we don''t have at least 2 nodes, it is useless to
proceed */
You don''t want to return whichever of those 2 nodes meets the
constraints?
> +    if (nr_nodes < 2) {
> +        rc = 0;
> +        goto out;
> +    }
> +
> +    tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
> +    if (tinfo == NULL) {
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    rc = libxl_node_bitmap_alloc(CTX, &nodemap);
> +    if (rc)
> +        goto out;
> +
> +    /*
> +     * Round up and down some of the constraints. For instance, the
minimum
> +     * number of cpus a candidate should have must at least be
non-negative.
> +     * Regarding the minimum number of NUMA nodes, if not explicitly
specified
> +     * (i.e., min_nodes <= 0), we try to figure out a sensible number
of nodes
> +     * from where to start generating candidates, if possible (or just
start
> +     * from 1 otherwise). The maximum number of nodes should not exceed
the
> +     * number of existent NUMA nodes on the host, or the candidate
genaration
                                                                     generation
> +     * won''t work properly.
> +     */
> +    min_cpus = min_cpus <= 0 ? 0 : min_cpus;
> +    if (min_nodes <= 0) {
> +        int cpus_per_node;
> +
> +        cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo,
nr_nodes);
> +        if (cpus_per_node == 0)
> +            min_nodes = 1;
> +        else
> +            min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
> +    }
> +    min_nodes = min_nodes > nr_nodes ? nr_nodes : min_nodes;
> +    if (max_nodes <= 0)
> +        max_nodes = nr_nodes;
> +    else
> +        max_nodes = max_nodes > nr_nodes ? nr_nodes : max_nodes;
> +    if (min_nodes > max_nodes) {
> +        rc = ERROR_INVAL;
> +        goto out;
> +    }
> +
> +    /* Initialize the local storage for the combinations */
> +    *nr_cndts = 0;
> +    array_size = nr_nodes;
> +    GCNEW_ARRAY(new_cndts, array_size);
> +
> +    /* Generate all the combinations of any size from min_nodes to
> +     * max_nodes (see comb_init() and comb_next()). */
> +    while (min_nodes <= max_nodes) {
> +        comb_iter_t comb_iter;
> +        int comb_ok;
> +
> +        /*
> +        * And here it is. Each step of this cycle generates a combination
of
> +        * nodes as big as min_nodes mandates.  Each of these combinations
is
> +        * checked against the constraints provided by the caller (namely,
> +        * amount of free memory and number of cpus) and it becomes an
actual
> +        * placement candidate iff it passes the check.
> +         */
> +        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes);
comb_ok;
> +             comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
> +            uint32_t nodes_free_memkb;
> +            int nodes_cpus;
> +
> +            comb_get_nodemap(comb_iter, &nodemap, min_nodes);
> +
> +            /* If there is not enough memoy in this combination, skip it
                                        memory
> +             * and go generating the next one... */
> +            nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap);
> +            if (min_free_memkb > 0 && nodes_free_memkb <
min_free_memkb)
> +                continue;
> +
> +            /* And the same applies if this combination is short in cpus
*/
> +            nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus,
&nodemap);
> +            if (min_cpus > 0 && nodes_cpus < min_cpus)
> +                continue;
> +
> +            /*
> +             * Conditions are met, we can add this combination to the
> +             * NUMA placement candidates list. We first make sure there
> +             * is enough space in there, and then we initialize the new
> +             * candidate element with the node map corresponding to the
> +             * combination we are dealing with. Memory allocation for
> +             * expanding the array that hosts the list happens in chunks
> +             * equal to the number of NUMA nodes in the system (to
> +             * avoid allocating memory each and every time we find a
> +             * new candidate).
> +             */
> +            if (*nr_cndts == array_size)
> +                array_size += nr_nodes;
> +            GCREALLOC_ARRAY(new_cndts, array_size);
> +
> +            libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]);
> +            libxl__numa_candidate_put_nodemap(gc,
&new_cndts[*nr_cndts],
> +                                              &nodemap);
> +            new_cndts[*nr_cndts].nr_domains > +                        
nodemap_to_nr_domains(gc, tinfo, &nodemap);
> +            new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
> +            new_cndts[*nr_cndts].nr_nodes = min_nodes;
> +            new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
> +
> +            LOG(DEBUG, "NUMA placement candidate #%d found:
nr_nodes=%d, "
> +                       "nr_cpus=%d,
free_memkb=%"PRIu32"", *nr_cndts,
> +                       min_nodes, new_cndts[*nr_cndts].nr_cpus,
> +                       new_cndts[*nr_cndts].free_memkb / 1024);
> +
> +            (*nr_cndts)++;
> +        }
> +        min_nodes++;
> +    }
> +
> +    *cndts = new_cndts;
> + out:
> +    libxl_bitmap_dispose(&nodemap);
nodemap might not have been initialised?
> +    libxl_cputopology_list_free(tinfo, nr_cpus);
> +    libxl_numainfo_list_free(ninfo, nr_nodes);
It looks like the nr_* may also not have been initialised, but maybe
list_free is safe in that case since the associated array must
necessarily be NULL?
> +    return rc;
> +}
> +
> +void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int
nr_cndts,
> +                                 libxl__numa_candidate_cmpf cmpf)
> +{
> +    /* Reorder candidates (see the comparison function for
> +     * the details on the heuristics) */
> +    qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf);
> +}
> +
> +/*
> + * The NUMA placement candidates are reordered according to the following
> + * heuristics:
> + *  - candidates involving fewer nodes come first. In case two (or
> + *    more) candidates span the same number of nodes,
> + *  - candidates with greater amount of free memory come first. In
> + *    case two (or more) candidates differ in their amount of free
> + *    memory by less than 10%,
> + *  - candidates with fewer domains insisting on them at the time of
> + *    this call come first.
> + */
> +static int numa_cmpf(const void *v1, const void *v2)
> +{
> +    const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1;
> +    const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2;
Are these casts necessary?
> +    double mem_diff = labs(c1->free_memkb - c2->free_memkb);
> +    double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
> +
> +    if (c1->nr_nodes != c2->nr_nodes)
> +        return c1->nr_nodes - c2->nr_nodes;
> +
> +    if ((mem_diff / mem_avg) * 100.0 < 10.0 &&
Is all this FP math necessary? Using integers might have some rounding
but is it significant or do we care if it gets it slightly wrong?
> +        c1->nr_domains != c2->nr_domains)
> +        return c1->nr_domains - c2->nr_domains;
> +
> +    return c2->free_memkb - c1->free_memkb;
> +}
> +
> +/* The actual automatic NUMA placement routine */
> +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
> +{
> +    int nr_candidates = 0;
> +    libxl__numa_candidate *candidates = NULL;
> +    libxl_bitmap candidate_nodemap;
> +    libxl_cpupoolinfo *pinfo;
> +    int nr_pools, rc = 0;
> +    uint32_t memkb;
> +
> +    /* First of all, if cpupools are in use, better not to mess with them
*/
> +    pinfo = libxl_list_cpupool(CTX, &nr_pools);
> +    if (!pinfo)
> +        return ERROR_FAIL;
> +    if (nr_pools > 1) {
> +        LOG(NOTICE, "skipping NUMA placement as cpupools are in
use");
> +        goto out;
> +    }
> +
> +    rc = libxl_domain_need_memory(CTX, info, &memkb);
> +    if (rc)
> +        goto out;
> +    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) {
> +        rc = ERROR_FAIL;
> +        goto out;
> +    }
> +
> +    /* Find all the candidates with enough free memory and at least
> +     * as much pcpus as the domain has vcpus.  */
> +    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
> +                                    &candidates, &nr_candidates);
> +    if (rc)
> +        goto out;
> +
> +    LOG(DETAIL, "%d NUMA placement candidates found",
nr_candidates);
> +
> +    /* No suitable placement candidates. We just return without touching
the
> +     * domain''s info->cpumap. It will have affinity with all
nodes/cpus. */
> +    if (nr_candidates == 0)
> +        goto out;
> +
> +    /* Bring the best candidate in front of the list --> candidates[0]
*/
> +    if (nr_candidates > 1)
> +        libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf);
Is the start up cost of libxl__sort_numa_candidates significant enough
to make that if worthwhile?
> +
> +    /*
> +     * At this point, the first candidate in the array is the one we want.
> +     * Go for it by mapping its node map to the domain''s
info->cpumap.
> +     */
> +    libxl__numa_candidate_get_nodemap(gc, &candidates[0],
&candidate_nodemap);
> +    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap,
&info->cpumap);
> +    if (rc)
> +        goto out;
> +
> +    LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and
"
> +                "%"PRIu32" KB free selected",
candidates[0].nr_nodes,
> +                candidates[0].nr_cpus, candidates[0].free_memkb / 1024);
> +
> + out:
> +    libxl_bitmap_dispose(&candidate_nodemap);
> +    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
> +    return rc;
> +}
> +
>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                libxl_domain_build_info *info, libxl__domain_build_state
*state)
>  {
> @@ -104,7 +557,22 @@ int libxl__build_pre(libxl__gc *gc, uint
>      uint32_t rtc_timeoffset;
> 
>      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> +
> +    /*
> +     * Check if the domain has any CPU affinity. If not, try to build up
one.
> +     * In case numa_place_domain() find at least a suitable candidate, it
will
> +     * affect info->cpumap accordingly; if it does not, it just leaves
it
> +     * as it is. This means (unless some weird error manifests) the
subsequent
> +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> +     * whatever that turns out to be.
> +     */
> +    if (libxl_bitmap_is_full(&info->cpumap)) {
> +        int rc = numa_place_domain(gc, info);
> +        if (rc)
> +            return rc;
I''m not sure about this, if numa_place_domain fails for any reason
would
we be not better off logging and continuing without placement? We
already do that explicitly in a couple of cases.
> +    }
>      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus,
&info->cpumap);
> +
>      xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
LIBXL_MAXMEM_CONSTANT);
>      if (info->type == LIBXL_DOMAIN_TYPE_PV)
>          xc_domain_set_memmap_limit(ctx->xch, domid,
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib
>  #define CTX_LOCK (libxl__ctx_lock(CTX))
>  #define CTX_UNLOCK (libxl__ctx_unlock(CTX))
> 
> +/*
> + * Automatic NUMA placement
> + *
> + * These functions and data structures deal with the initial placement of
a
> + * domain onto the host NUMA nodes.
> + *
> + * The key concept here is the one of "numa placement candidate"
(or jusr "numa
                                                                        just
> + * candidate", "placement candidate" or even
"candidate"), which basically is a
Do we really need that level of detail about what we might call it?
There''s no minimum word limit you know ;-)
> + * set of nodes whose characteristics have been successfully checked
against
> + * some specific requirements. More pecisely, a candidate is the nodemap
                                       precisely
> + * associated with one of the possible subset of the host NUMA nodes
providing
> + * a certain amount of free memory, or a given number of cpus, or even
both
> + * (depending in what the caller wants). For convenience of use, some of
this
> + * information are stored within the candidate itselfs, instead of always
being
                                                  itself
> + * dynamically computed. A candidate can consist of just one, up to all
the
> + * available nodes on the host (the latter being, of course, not ideal).
I can''t parse this sentence.
>   For
> + * instance, looking for a numa candidates with 2GB of free memoy means we
want
                                                              memory
> + * all the possible subsets of the host NUMA nodes with, cumulatively, at
least
> + * 2GB of free memory. That could be possible by just using one particular
> + * node, or may require more nodes, depending on the characteristics of
the
> + * host, on how many domains have been created already, on how big they
are,
> + * etc.
> + *
> + * The intended usage is as follows:
> + *  1. by, fist of all, calling libxl__get_numa_candidates(), and
specifying
> + *     the proper constraints to it (e.g., the amount of memory a domain
need
> + *     as the minimum amount of free memory fo the candidates) one can
build
                                               for
> + *     up a whole set of suitable placing alternatives for a domain;
> + *  2. after that, one specific candidate should be chosen. That can
happen
> + *     by looking at their various characteristics;
> + *  3. the chosen candidate''s nodemap should be utilized for
computing the
> + *     actual affinity of the domain which, given the curent NUMA support
                                                         current
> + *     in the hypervisor, is what determines the placement of the
domain''s
> + *     vcpus and memory.
> + *
> + * To make phase 2 even easier, a sorting helper function for the list of
> + * candidates is povided in the form of libxl__sort_numa_candidates(). The
only
                    provided
> + * that is needed is defining a comparison function, containing the
chriteria
                                                                       criteria
> + * for deciding, given two candidates, which one is
''better''. Depending on how
> + * the comparison function is defined, the best candidate (where, of
course,
> + * best is defined with respect to the heuristics implemented in the
comparison
> + * function itself, libxl__numa_candidate_cmpf()) could become the first o
the
                                                                           or
> + * last element of the list.
> + *
> + * Summarizing, achieving automatic NUMA placement is just a matter of
> + * obtaining the list of suitable placement candidates, perhaps asking for
each
> + * of them to provide at least the amount of memory the domain needs.
After
> + * that just implement a comparison function by means of the various
helpers
> + * retrieving the relevant information about the candidates themselves.
> + * Finally, call the sorting helper function and use the candidate that
became
> + * (typically) the first element of the list fo determining the
domain''s
                                               for
> + * affinity.
> + */
> +
> +typedef struct {
> +    int nr_cpus, nr_nodes;
> +    int nr_domains;
> +    uint32_t free_memkb;
> +    libxl_bitmap nodemap;
> +} libxl__numa_candidate;
> +
> +/*
> + * This generates the list of NUMA placement candidates satisfying some
> + * specific conditions. If min_nodes and/or max_nodes are not 0, their
value is
> + * used to determine the minimum and maximum number of nodes that are
allow to
> + * be present in each candidate. If min_nodes and/or max_nodes are 0, the
> + * minimum and maximum number of nodes to be used are automatically
selected by
> + * the implementation (and that will likely be just 1 node for the minimum
and
> + * the total number of existent nodes for the maximum). Re min_free_memkb
and
> + * min_cpu, if not 0, they specify the caller only wants candidates with
at
> + * least that amount of free memory and that number of cpus, respectively.
If
> + * min_free_memkb and/or min_cpus are 0, the candidates'' free
memory and number
> + * of cpus won''t be checked at all, which means a candidate will
always be
> + * considered suitable wrt the specific constraint.  cndts is where the
list of
> + * exactly nr_cndts candidates is returned. Note that, in case no
candidates
> + * are found at all, the function returns successfully, but with nr_cndts
equal
> + * to zero.
> + */
> +_hidden int libxl__get_numa_candidates(libxl__gc *gc,
> +                                uint32_t min_free_memkb, int min_cpus,
> +                                int min_nodes, int max_nodes,
> +                                libxl__numa_candidate *cndts[], int
*nr_cndts);
> +
> +/* allocation and deallocation for placement candidates */
> +static inline int libxl__numa_candidate_alloc(libxl__gc *gc,
> +                                              libxl__numa_candidate *cndt)
> +{
> +    return libxl_node_bitmap_alloc(CTX, &cndt->nodemap);
> +}
> +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate
*cndt)
> +{
> +    libxl_bitmap_dispose(&cndt->nodemap);
> +}
> +static inline void libxl__numacandidate_list_free(libxl__numa_candidate
*cndts,
> +                                                  int nr_cndts)
> +{
> +    int i;
> +
> +    for (i = 0; i < nr_cndts; i++)
> +        libxl__numa_candidate_dispose(&cndts[i]);
> +    free(cndts);
> +}
> +
> +/* retrieve (in nodemap) the node map associated to placement candidate
cndt */
> +static inline
> +void libxl__numa_candidate_get_nodemap(libxl__gc *gc,
> +                                       const libxl__numa_candidate *cndt,
> +                                       libxl_bitmap *nodemap)
> +{
> +    libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap);
> +}
> +/* set the node map of placement candidate cndt to match nodemap */
> +static inline
> +void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
> +                                       libxl__numa_candidate *cndt,
> +                                       const libxl_bitmap *nodemap)
> +{
> +    libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
> +}
> +
> +/* signature for the comparison function between two candidates c1 and c2
> + * (the thid parameter is provided to enable thread safety). */
           third

But there isn''t actually a third parameter so it''s a bit moot
;-)
> +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
> +/* sort the list of candidates in cndts (an array with nr_cndts elements
in
> + * it) using cmpf for comparing two candidates. Uses libc''s
qsort(). */
> +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
> +                                         int nr_cndts,
> +                                         libxl__numa_candidate_cmpf cmpf);
> 
>  /*
>   * Inserts "elm_new" into the sorted list "head".

Ian Campbell

2012-Jun-21 13:31 UTC

head link

Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> In such a way that only the cpus belonging to the cpupool of the
> domain being placed are considered for the placement itself.
> 
> This happens by filtering out all the nodes in which the cpupool has
> not any cpu from the placement candidates. After that -- as a cpu pooling
> not necessarily happens at NUMA nodes boundaries -- we also make sure
> only the actual cpus that are part of the pool are considered when
> counting how much processors a placement candidate is able to provide.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -198,15 +198,27 @@ static void comb_get_nodemap(comb_iter_t
>          libxl_bitmap_set(nodemap, it[i]);
>  }
>  
> +/* Retrieve how many nodes a nodemap spans. */
> +static int nodemap_to_nr_nodes(const libxl_bitmap *nodemap)
> +{
> +    int i, nr_nodes = 0;
> +
> +    libxl_for_each_set_bit(i, *nodemap)
> +        nr_nodes++;
> +    return nr_nodes;
> +}
> +
>  /* Retrieve the number of cpus that the nodes that are part of the nodemap
> - * span. */
> + * span and that are also set in suitable_cpumap. */
>  static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int nr_cpus,
> +                                 const libxl_bitmap *suitable_cpumap,
>                                   const libxl_bitmap *nodemap)
>  {
>      int i, nodes_cpus = 0;
>  
>      for (i = 0; i < nr_cpus; i++) {
> -        if (libxl_bitmap_test(nodemap, tinfo[i].node))
> +        if (libxl_bitmap_test(suitable_cpumap, i) &&
> +            libxl_bitmap_test(nodemap, tinfo[i].node))
>              nodes_cpus++;
>      }
>      return nodes_cpus;
> @@ -311,12 +323,13 @@ static int cpus_per_node_count(libxl_cpu
>  int libxl__get_numa_candidates(libxl__gc *gc,
>                                 uint32_t min_free_memkb, int min_cpus,
>                                 int min_nodes, int max_nodes,
> +                               const libxl_bitmap *suitable_cpumap,
>                                 libxl__numa_candidate *cndts[], int
*nr_cndts)
>  {
>      libxl__numa_candidate *new_cndts = NULL;
>      libxl_cputopology *tinfo = NULL;
>      libxl_numainfo *ninfo = NULL;
> -    libxl_bitmap nodemap;
> +    libxl_bitmap suitable_nodemap, nodemap;
>      int nr_nodes, nr_cpus;
>      int array_size, rc;
>  
> @@ -340,6 +353,15 @@ int libxl__get_numa_candidates(libxl__gc
>      if (rc)
>          goto out;
>  
> +    /* Allocate and prepare the map of the node that can be utilized for
> +     * placement, basing on the map of suitable cpus. */
> +    rc = libxl_node_bitmap_alloc(CTX, &suitable_nodemap);
> +    if (rc)
> +        goto out;
> +    rc = libxl_cpumap_to_nodemap(CTX, suitable_cpumap,
&suitable_nodemap);
> +    if (rc)
> +        goto out;
> +
>      /*
>       * Round up and down some of the constraints. For instance, the
minimum
>       * number of cpus a candidate should have must at least be
non-negative.
> @@ -391,9 +413,14 @@ int libxl__get_numa_candidates(libxl__gc
>          for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes);
comb_ok;
>               comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) {
>              uint32_t nodes_free_memkb;
> -            int nodes_cpus;
> +            int i, nodes_cpus;
>  
> +            /* Get the nodemap for the combination and filter unwnted
nodes */
                                                                 unwanted
>              comb_get_nodemap(comb_iter, &nodemap, min_nodes);
> +            libxl_for_each_set_bit(i, nodemap) {
> +                if (!libxl_bitmap_test(&suitable_nodemap, i))
> +                    libxl_bitmap_reset(&nodemap, i);
> +            }
>  
>              /* If there is not enough memoy in this combination, skip it
>               * and go generating the next one... */
> @@ -402,7 +429,8 @@ int libxl__get_numa_candidates(libxl__gc
>                  continue;
>  
>              /* And the same applies if this combination is short in cpus
*/
> -            nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus,
&nodemap);
> +            nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus,
suitable_cpumap,
> +                                               &nodemap);
>              if (min_cpus > 0 && nodes_cpus < min_cpus)
>                  continue;
>  
> @@ -427,12 +455,13 @@ int libxl__get_numa_candidates(libxl__gc
>              new_cndts[*nr_cndts].nr_domains >                          
nodemap_to_nr_domains(gc, tinfo, &nodemap);
>              new_cndts[*nr_cndts].free_memkb = nodes_free_memkb;
> -            new_cndts[*nr_cndts].nr_nodes = min_nodes;
> +            new_cndts[*nr_cndts].nr_nodes =
nodemap_to_nr_nodes(&nodemap);
>              new_cndts[*nr_cndts].nr_cpus = nodes_cpus;
>  
>              LOG(DEBUG, "NUMA placement candidate #%d found:
nr_nodes=%d, "
>                         "nr_cpus=%d,
free_memkb=%"PRIu32"", *nr_cndts,
> -                       min_nodes, new_cndts[*nr_cndts].nr_cpus,
> +                       new_cndts[*nr_cndts].nr_nodes,
> +                       new_cndts[*nr_cndts].nr_cpus,
>                         new_cndts[*nr_cndts].free_memkb / 1024);
>  
>              (*nr_cndts)++;
> @@ -442,6 +471,7 @@ int libxl__get_numa_candidates(libxl__gc
>  
>      *cndts = new_cndts;
>   out:
> +    libxl_bitmap_dispose(&suitable_nodemap);
>      libxl_bitmap_dispose(&nodemap);
>      libxl_cputopology_list_free(tinfo, nr_cpus);
>      libxl_numainfo_list_free(ninfo, nr_nodes);
> @@ -485,23 +515,27 @@ static int numa_cmpf(const void *v1, con
>  }
>  
>  /* The actual automatic NUMA placement routine */
> -static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info)
> +static int numa_place_domain(libxl__gc *gc, uint32_t domid,
> +                             libxl_domain_build_info *info)
>  {
>      int nr_candidates = 0;
>      libxl__numa_candidate *candidates = NULL;
>      libxl_bitmap candidate_nodemap;
> -    libxl_cpupoolinfo *pinfo;
> -    int nr_pools, rc = 0;
> +    libxl_cpupoolinfo cpupool_info;
> +    int i, cpupool, rc = 0;
>      uint32_t memkb;
>  
> -    /* First of all, if cpupools are in use, better not to mess with them
*/
> -    pinfo = libxl_list_cpupool(CTX, &nr_pools);
> -    if (!pinfo)
> -        return ERROR_FAIL;
> -    if (nr_pools > 1) {
> -        LOG(NOTICE, "skipping NUMA placement as cpupools are in
use");
> -        goto out;
> -    }
> +    /*
> +     * Extract the cpumap from the cpupool the domain belong to. In fact,
> +     * it only makes sense to consider the cpus/nodes that are in there
> +     * for placement.
> +     */
> +    rc = cpupool = libxl__domain_cpupool(gc, domid);
> +    if (rc < 0)
> +        return rc;
> +    rc = libxl_cpupool_info(CTX, &cpupool_info, cpupool);
> +    if (rc)
> +        return rc;
>  
>      rc = libxl_domain_need_memory(CTX, info, &memkb);
>      if (rc)
> @@ -513,7 +547,8 @@ static int numa_place_domain(libxl__gc *
>  
>      /* Find all the candidates with enough free memory and at least
>       * as much pcpus as the domain has vcpus.  */
> -    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0,
> +    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus,
> +                                    0, 0, &cpupool_info.cpumap,
>                                      &candidates, &nr_candidates);
>      if (rc)
>          goto out;
> @@ -538,13 +573,20 @@ static int numa_place_domain(libxl__gc *
>      if (rc)
>          goto out;
>  
> +    /* Avoid trying to set the affinity to cpus that might be in the
> +     * nodemap but not in our cpupool. */
> +    libxl_for_each_set_bit(i, info->cpumap) {
> +        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
> +            libxl_bitmap_reset(&info->cpumap, i);
> +    }
> +
>      LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and
"
>                  "%"PRIu32" KB free selected",
candidates[0].nr_nodes,
>                  candidates[0].nr_cpus, candidates[0].free_memkb / 1024);
>  
>   out:
>      libxl_bitmap_dispose(&candidate_nodemap);
> -    libxl_cpupoolinfo_list_free(pinfo, nr_pools);
> +    libxl_cpupoolinfo_dispose(&cpupool_info);
>      return rc;
>  }
>  
> @@ -567,7 +609,7 @@ int libxl__build_pre(libxl__gc *gc, uint
>       * whatever that turns out to be.
>       */
>      if (libxl_bitmap_is_full(&info->cpumap)) {
> -        int rc = numa_place_domain(gc, info);
> +        int rc = numa_place_domain(gc, domid, info);
>          if (rc)
>              return rc;
>      }
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2094,14 +2094,17 @@ typedef struct {
>   * least that amount of free memory and that number of cpus, respectively.
If
>   * min_free_memkb and/or min_cpus are 0, the candidates'' free
memory and number
>   * of cpus won''t be checked at all, which means a candidate will
always be
> - * considered suitable wrt the specific constraint.  cndts is where the
list of
> - * exactly nr_cndts candidates is returned. Note that, in case no
candidates
> - * are found at all, the function returns successfully, but with nr_cndts
equal
> - * to zero.
> + * considered suitable wrt the specific constraint. suitable_cpumap is
useful
> + * for specifyin we want only the cpus in that mask to be considered while
         specifying

Apart from those two spelling errors:
Acked-by: Ian Campbell <ian.campbell@citrix.com>
> + * generating placement candidates (for example because of cpupools).
cndts is
> + * where the list of exactly nr_cndts candidates is returned. Note that,
in
> + * case no candidates are found at all, the function returns successfully,
but
> + * with nr_cndts equal to zero.
>   */
>  _hidden int libxl__get_numa_candidates(libxl__gc *gc,
>                                  uint32_t min_free_memkb, int min_cpus,
>                                  int min_nodes, int max_nodes,
> +                                const libxl_bitmap *suitable_cpumap,
>                                  libxl__numa_candidate *cndts[], int
*nr_cndts);
>  
>  /* allocation and deallocation for placement candidates */

Ian Campbell

2012-Jun-21 13:38 UTC

head link

Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation

On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:> About rationale, usage and (some small bits of) API.
> 
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> Changes from v1:
>  * API documentation moved close to the actual functions.
> 
> diff --git a/docs/misc/xl-numa-placement.markdown
b/docs/misc/xl-numa-placement.markdown
> new file mode 100644
> --- /dev/null
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -0,0 +1,74 @@
> +# Guest Automatic NUMA Placement in libxl and xl #
> +
> +## Rationale ##
> +
> +The Xen hypervisor deals with Non-Uniform Memory Access (NUMA])
> +machines by assigning to its domain a "node affinity", i.e., a
set of NUMA
> +nodes of the host from which it gets its memory allocated.
> +
> +NUMA awareness becomes very important as soon as many domains start
running
> +memory-intensive workloads on a shared host. In fact, the cost of
accessing
> +non node-local memory locations is very high, and the performance
degradation
> +is likely to be noticeable.
> +
> +## Guest Placement in xl ##
> +
> +If using xl for creating and managing guests, it is very easy to ask
> +for both manual or automatic placement of them across the host''s
NUMA
> +nodes.
> +
> +Note that xm/xend does the very same thing, the only differences residing
> +in the details of the heuristics adopted for the placement (see below).
> +
> +### Manual Guest Placement with xl ###
> +
> +Thanks to the "cpus=" option, it is possible to specify where a
domain
> +should be created and scheduled on, directly in its config file. This
> +affects NUMA placement and memory accesses as the hypervisor constructs
> +the node affinity of a VM basing right on its CPU affinity when it is
> +created.
> +
> +This is very simple and effective, but requires the user/system
> +administrator to explicitly specify affinities for each and every domain,
> +or Xen won''t be able to enable guarantee the locality for their
memory
> +accesses.
> +
> +### Automatic Guest Placement with xl ###
> +
> +In case no "cpus=" option is specified in the config file, xl
tries
> +to figure out on its own on which node(s) the domain could fit best.
> +
> +First of all, it needs to find a node (or a set of nodes) that have
> +enough free memory for accommodating the domain. After that, the actual
> +decision on where to put the new guest happens by generating all the
> +possible combinations of nodes that satisfies the above and chose among
> +them according to the following heuristics:
> +
> +  *  candidates involving fewer nodes come first. In case two (or more)
> +     candidates span the same number of nodes,
> +  *  candidates with greater amount of free memory come first. In case
> +     two (or more) candidates differ in their amount of free memory by
> +     less than 10%,
> +  *  candidates with fewer domains already placed on them come first.
> +
> +Giving preference to small candidates ensures better performance for
> +the guest, as it avoid spreading its memory among different nodes.
> +Using the nodes that have the biggest amounts of free memory helps
> +keeping the memory fragmentation small, from a system wide perspective.
> +Finally, in case more candidates fulfil these criteria by the same
> +extent, choosing the candidate that is hosting fewer domain helps
> +balancing the load on the various nodes.
[...]

snipped the bit you said was wrong. It was just be remove not replaced, right?

> +
> +## Guest Placement within libxl ##
> +
> +xl achieves automatic NUMA placement by means of the following API
> +calls, provided by libxl.
This last seems out of date.

Assuming the bit above was to be removed not replaced:
Acked-by: Ian Campbell <ian.campbell@citrix.com>

Dario Faggioli

2012-Jun-21 13:54 UTC

head link

Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools

On Thu, 2012-06-21 at 14:31 +0100, Ian Campbell wrote:> > + * considered suitable wrt the specific constraint. suitable_cpumap
is useful
> > + * for specifyin we want only the cpus in that mask to be considered
while
> 
>          specifying
> 
> Apart from those two spelling errors:
>Yeah, sorry for that. I''m very bad in spelling correctly when typing,
and this bloody keyboard is not helping! :-P

Isn''t there any tool for spell checking code comments?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Dario Faggioli

2012-Jun-21 13:57 UTC

head link

Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation

On Thu, 2012-06-21 at 14:38 +0100, Ian Campbell wrote:> snipped the bit you said was wrong. It was just be remove not replaced,
right?
> Yes, it will go away, as the algorithm changed and does not do that any
longer.
> > +
> > +## Guest Placement within libxl ##
> > +
> > +xl achieves automatic NUMA placement by means of the following API
> > +calls, provided by libxl.
> 
> This last seems out of date.
> Indeed, thanks.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Ian Campbell

2012-Jun-21 13:58 UTC

head link

Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools

On Thu, 2012-06-21 at 14:54 +0100, Dario Faggioli wrote:> On Thu, 2012-06-21 at 14:31 +0100, Ian Campbell wrote:
> > > + * considered suitable wrt the specific constraint.
suitable_cpumap is useful
> > > + * for specifyin we want only the cpus in that mask to be
considered while
> > 
> >          specifying
> > 
> > Apart from those two spelling errors:
> >
> Yeah, sorry for that. I''m very bad in spelling correctly when
typing,
> and this bloody keyboard is not helping! :-P
> 
> Isn''t there any tool for spell checking code comments?
I''ve often thought such a thing would be useful, but I don''t
know of
one.

TBH I only really spot the spelling errors because my mail client
happens to draw a squiggly red line under them in the quotes.

Ian.

George Dunlap

2012-Jun-21 16:16 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On 15/06/12 18:04, Dario Faggioli wrote:> If a domain does not have a VCPU affinity, try to pin it automatically to
some
> PCPUs. This is done taking into account the NUMA characteristics of the
host.
> In fact, we look for a combination of host''s NUMA nodes with
enough free memory
> and number of PCPUs for the new domain, and pin it to the VCPUs of those
nodes.
>
> Once we know which ones, among all the possible combinations, represents
valid
> placement candidates for a domain, use some heuistics for deciding which is
the
> best. For instance, smaller candidates are considered to be better, both
from
> the domain''s point of view (fewer memory spreading among nodes)
and from the
> system as a whole point of view (fewer memoy fragmentation).  In case of
> candidates of equal sizes (i.e., with the same number of nodes), the one
with
> the greater amount of memory wins, as this is also good for keeping memory
> fragmentation under control.
>
> This all happens internally to libxl, and no API for driving the mechanism
is
> provided for now. This matches what xend already does.
>
> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>Overall I think this approach is much better.  I haven''t done an 
extremely detailed review, but I do have some comments
below.> +    /*
> +     * Round up and down some of the constraints. For instance, the
minimum
> +     * number of cpus a candidate should have must at least be
non-negative.
> +     * Regarding the minimum number of NUMA nodes, if not explicitly
specified
> +     * (i.e., min_nodes<= 0), we try to figure out a sensible number of
nodes
> +     * from where to start generating candidates, if possible (or just
start
> +     * from 1 otherwise). The maximum number of nodes should not exceed
the
> +     * number of existent NUMA nodes on the host, or the candidate
genaration
> +     * won''t work properly.
> +     */
> +    min_cpus = min_cpus<= 0 ? 0 : min_cpus;Wouldn''t it just make more sense to specify that "min_cpus"
(and other
parameters) had to be >=0?

In any case, this is a very complicated way of saying "if (min_cpus<0) 
min_cpus = 0;".> +    if (min_nodes<= 0) {
> +        int cpus_per_node;
> +
> +        cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo,
nr_nodes);
> +        if (cpus_per_node == 0)
> +            min_nodes = 1;
> +        else
> +            min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node;
> +    }Same here; you could just write "if(!min_nodes)
{..."> +    min_nodes = min_nodes>  nr_nodes ? nr_nodes : min_nodes;
> +    if (max_nodes<= 0)
> +        max_nodes = nr_nodes;
> +    else
> +        max_nodes = max_nodes>  nr_nodes ? nr_nodes : max_nodes;if (max_nodes == 0 || max_nodes > nr_nodes) max_nodes =
nr_nodes;> +
> +/*
> + * The NUMA placement candidates are reordered according to the following
> + * heuristics:
> + *  - candidates involving fewer nodes come first. In case two (or
> + *    more) candidates span the same number of nodes,
> + *  - candidates with greater amount of free memory come first. In
> + *    case two (or more) candidates differ in their amount of free
> + *    memory by less than 10%,
Interesting idea -- sounds pretty reasonable.> + *  - candidates with fewer domains insisting on them at the time of
> + *    this call come first.Do you mean "existing"?  I think "assigned to" is probably
better.> + */
> +static int numa_cmpf(const void *v1, const void *v2)
> +{
> +    const libxl__numa_candidate *c1 = (const libxl__numa_candidate*) v1;
> +    const libxl__numa_candidate *c2 = (const libxl__numa_candidate*) v2;
> +    double mem_diff = labs(c1->free_memkb - c2->free_memkb);
> +    double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
> +
> +    if (c1->nr_nodes != c2->nr_nodes)
> +        return c1->nr_nodes - c2->nr_nodes;
> +
> +    if ((mem_diff / mem_avg) * 100.0<  10.0&&
> +        c1->nr_domains != c2->nr_domains)
> +        return c1->nr_domains - c2->nr_domains;I realize this isn''t a hot path, but it seems like moving into FP is 
really unnecessary.  You can just do this:

if ( ((mem_diff * 100) / mem_avg) < 10 ...

One minor note: I personally think it''s a lot more readable to put the 
''&&'' and ''||'' on the same line as
the next item, rather than the
previous item; i.e.:
  if ( expression_a
&& expression_b )

One more thing: Is there a reason why you put get_numa_candidates() in 
libxl_internal.h, but not sort_numa_candidates()?  It seems like both or 
neither should go. :-)

That''s all I have for now.  I''m OK with the general approach,
so here''s
a "weak ack", so if a maintainer is happy with the code, he can check
it in:

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

I''ll try to come back and give a more detailed code review if I get a 
chance.

  -George

Dario Faggioli

2012-Jun-21 16:34 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Thu, 2012-06-21 at 12:40 +0100, Ian Campbell wrote:> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> > --- a/docs/man/xl.cfg.pod.5
> > +++ b/docs/man/xl.cfg.pod.5
> > @@ -111,8 +111,8 @@ created online and the remainder will be
> > 
> >  =item B<cpus="CPU-LIST">
> > 
> > -List of which cpus the guest is allowed to use. Default behavior is
> 
> There is (according to grep) a slight preference for British spelling,
> but I imagine we are totally inconsistent all over the place so lets not
> worry too much ;-)
> Especially considering I''m removing it! :-P (However, I''m sure
it was me
that put that line in some time ago, so I''ll make a note about
this! :-D).
> > @@ -132,6 +132,12 @@ run on cpu #3 of the host.
> > 
> >  =back
> > 
> > +If this option is not specified, libxl automatically tries to place
the new
> > +domain on the host''s NUMA nodes (provided the host has more
than one NUMA
> > +node) by pinning it to the cpus of those nodes. An heuristical
approach is
> 
> I don''t think heuristical is a word, I think "A heuristic
approach..."
> is fine
> Fixed (with "A heuristic approach"). Thanks.
> > + * the same instance of the iterator and the same values for
> > + * n and k are used for each call. If that doesn''t happen,
the
> > + * result is unspecified.
> > + *
> > + * The algorithm is a well known one,
> 
> Worth a link/reference?
> As you like... The general idea is described, for example, in this
Wikipedia page http://en.wikipedia.org/wiki/Combination , the actual
algorithm appeared, for instance, in D. Knuth''s "The Art of
Computer
Programming - Volume 4, Fascicle 3", but in many other places as well. I
guess I can cite TAOCP, although I''m not sure it is actually the first,
and thus the one who should be credited for this... But I guess that
doesn''t matter too much, does it?
> > + *
> > + * For example, with n = 5 and k = 3, calling comb_init()
> > + * will generate { 0, 1, 2 }, while subsequent valid calls to
> > + * comb_next() will produce the following:
> > + * { { 0, 1, 2 }, { 0, 1, 3 }, { 0, 1, 4 },
> 
>                 ^ strictly speaking this 0,1,2 isn''t produced by a
call
>                 to comb_next since it came from the initial comb_init,
>                 right? Yes, that''s right.
> IOW the first comb_next() call won''t duplicate
>                 it...
> No, it won''t. I''ll fix this in the comment.
> I''m not going to try very hard to review the algorithm here, I
trust you
> and I''ve got a head cold ;-)
> Fine. :-)
> > +/* NUMA automatic placement (see libxl_internal.h for details) */
> > +
> > +/*
> > + * This function turns a k-combination iterator into a node map.
> > + * This means the bits in the node map corresponding to the indexes
> > + * of the given combination are the ones that will be set.
> > + * For example, if the iterator represents the combination { 0, 2,
4},
> > + * the node map will have bits #0, #2 and #4 set to one and i thus
> > + * will point to node 0, node 2 and and node 4.
> 
> What is "i" in this last bit ("i thus will point to")?
I don''t think you
> are referring to the loop iterator variable.
> 
> If you meant "it" then I don''t think it needs saying
that a node map
> with bits 0, 2 and 4 set refers to nodes 0, 2 and 4.
> It was actually "it", and I''m happy to kill that part of the
sentence.
> > +/* Retrieve the number of cpus that the nodes that are part of the
nodemap
> > + * span. */
> > +static int nodemap_to_nodes_cpus(libxl_cputopology *tinfo, int
nr_cpus,
> > +                                 const libxl_bitmap *nodemap)
> 
> nodes here == node''s ? But can''t have an apostrophe in a
function which
> makes it ready weird to me. Perhaps "nodemap_count_cpus"?
> Right. I wanted to stick to the nodemap_to_xxx() for those group of
function, so I changed this one into "nodemap_to_nr_cpus()". Hope it
is
fine...
> > +/* Retrieve the number of domains that can potentially run on the
cpus
> > + * the nodes that are part of the nodemap. */
> > +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology
*tinfo,
> > +                                 const libxl_bitmap *nodemap)
> > +{
> > +    libxl_dominfo *dinfo = NULL;
> > +    libxl_bitmap dom_nodemap;
> > +    int nr_doms, nr_cpus;
> > +    int nr_domains = 0;
> > +    int i, j, k;
> > +
> > +    dinfo = libxl_list_domain(CTX, &nr_doms);
> > +    if (dinfo == NULL)
> > +        return ERROR_FAIL;
> > +
> > +    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap) < 0) {
> > +        nr_domains = ERROR_FAIL;
> > +        goto out;
> > +    }
> > +
> > +    for (i = 0; i < nr_doms; i++) {
> > +        libxl_vcpuinfo *vinfo;
> > +        int nr_vcpus;
> > +
> > +        vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus,
&nr_cpus);
> > +        if (vinfo == NULL)
> > +            continue;
> > +
> > +        libxl_bitmap_set_none(&dom_nodemap);
> > +        for (j = 0; j < nr_vcpus; j++) {
> > +            libxl_for_each_set_bit(k, vinfo[j].cpumap)
> > +                libxl_bitmap_set(&dom_nodemap, tinfo[k].node);
> > +        }
> > +
> > +        libxl_for_each_set_bit(j, dom_nodemap) {
> > +            if (libxl_bitmap_test(nodemap, j)) {
> > +                nr_domains++;
> > +                goto found;
> 
> AKA break?
> 
Yes. Leftover from v1, which I forgot to adapt.> > +            }
> > +        }
> > + found:
> > +        libxl_vcpuinfo_list_free(vinfo, nr_vcpus);
> > +    }
> > +
> > + out:
> > +    libxl_bitmap_dispose(&dom_nodemap);
> 
> You can come to out if libxl_nodebitmap_alloc fails -- in which case
> dom_nodemap is (potentially) uninitialised.
>
> You could just move out: down a line.
> Right. As I did not really like that "nr_domains = ERROR_FAIL" I
changed
the error handling style of this function as a whole so that now I do
not need the out: label any more.
> > +/*
> > + * This function tries to figure out if the host has a consistent
number
> > + * of cpus along all its NUMA nodes. In fact, if that is the case, we
can
> > + * calculate the minimum number of nodes needed for a domain by just
> > + * dividing its total number of vcpus by this value computed here.
> > + * However, we are not allowed to assume that all the nodes have the
> > + * same number of cpus. Therefore, in case discrepancies among
different
> > + * nodes are found, this function just returns 0 and the caller needs
> > + * to sort out things in some other way.
> 
> If the caller has to deal with this anyway why bother with this check
> and/or the other algorithm?
> Mmm... I''m not sure I get what you mean here. Perhaps my commenting
about the whole thing was not so clear in the first place.

The (George''s :-)) idea is, if we know each node has 8 CPUs, and our
domain wants more than that, it does not make sense to start looking for
placement solutions with just one node. Actually, if the domain wants
fewer than 16 CPUs, starting looking from solutions with 2 nodes in them
should be the right thing.

That being said, I think it''s important we do not assume we
won''t ever
find some architecture with different number of CPUs in different nodes,
and that is what a zero return from that function means.

What was that you thought not to be worthwhile?
> > +/* Get all the placement candidates satisfying some specific
conditions */
> > +int libxl__get_numa_candidates(libxl__gc *gc,
> > +                               uint32_t min_free_memkb, int min_cpus,
> > +                               int min_nodes, int max_nodes,
> > +                               libxl__numa_candidate *cndts[], int
*nr_cndts)
> > +{
> > +    libxl__numa_candidate *new_cndts = NULL;
> > +    libxl_cputopology *tinfo = NULL;
> > +    libxl_numainfo *ninfo = NULL;
> > +    libxl_bitmap nodemap;
> > +    int nr_nodes, nr_cpus;
> > +    int array_size, rc;
> > +
> > +    /* Get platform info and prepare the map for testing the
combinations */
> > +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> > +    if (ninfo == NULL)
> > +        return ERROR_FAIL;
> > +    /* If we don''t have at least 2 nodes, it is useless to
proceed */
> 
> You don''t want to return whichever of those 2 nodes meets the
> constraints?
> I actually was meaning something like "hey, there''s only _1_ node,
go
for it!" :-P
> ...
> > +    *cndts = new_cndts;
> > + out:
> > +    libxl_bitmap_dispose(&nodemap);
> 
> nodemap might not have been initialised?
>Mmm... Right. So I really think I need an init function. :-(
> > +    libxl_cputopology_list_free(tinfo, nr_cpus);
> > +    libxl_numainfo_list_free(ninfo, nr_nodes);
> 
> It looks like the nr_* may also not have been initialised, but maybe
> list_free is safe in that case since the associated array must
> necessarily be NULL?
> Well, probably, but as they both do a "for (i=0;i<nr_*;i++)" I
think
it''s safer if I "=0" those twos, is that ok?
> > +/*
> > + * The NUMA placement candidates are reordered according to the
following
> > + * heuristics:
> > + *  - candidates involving fewer nodes come first. In case two (or
> > + *    more) candidates span the same number of nodes,
> > + *  - candidates with greater amount of free memory come first. In
> > + *    case two (or more) candidates differ in their amount of free
> > + *    memory by less than 10%,
> > + *  - candidates with fewer domains insisting on them at the time of
> > + *    this call come first.
> > + */
> > +static int numa_cmpf(const void *v1, const void *v2)
> > +{
> > +    const libxl__numa_candidate *c1 = (const libxl__numa_candidate*)
v1;
> > +    const libxl__numa_candidate *c2 = (const libxl__numa_candidate*)
v2;
> 
> Are these casts necessary?
> Will check and remove if not. Thanks.
> > +    double mem_diff = labs(c1->free_memkb - c2->free_memkb);
> > +    double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
> > +
> > +    if (c1->nr_nodes != c2->nr_nodes)
> > +        return c1->nr_nodes - c2->nr_nodes;
> > +
> > +    if ((mem_diff / mem_avg) * 100.0 < 10.0 &&
> 
> Is all this FP math necessary? Using integers might have some rounding
> but is it significant or do we care if it gets it slightly wrong?
> I think integer is fine... It''s heuristics, after all. I''ll go
for it.
> > +/* The actual automatic NUMA placement routine */
> > +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info
*info)
> > +{
> > +    int nr_candidates = 0;
> > +    libxl__numa_candidate *candidates = NULL;
> > +    libxl_bitmap candidate_nodemap;
> > +    libxl_cpupoolinfo *pinfo;
> > +    int nr_pools, rc = 0;
> > +    uint32_t memkb;
> > +
> > +    /* First of all, if cpupools are in use, better not to mess with
them */
> > +    pinfo = libxl_list_cpupool(CTX, &nr_pools);
> > +    if (!pinfo)
> > +        return ERROR_FAIL;
> > +    if (nr_pools > 1) {
> > +        LOG(NOTICE, "skipping NUMA placement as cpupools are in
use");
> > +        goto out;
> > +    }
> > +
> > +    rc = libxl_domain_need_memory(CTX, info, &memkb);
> > +    if (rc)
> > +        goto out;
> > +    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) {
> > +        rc = ERROR_FAIL;
> > +        goto out;
> > +    }
> > +
> > +    /* Find all the candidates with enough free memory and at least
> > +     * as much pcpus as the domain has vcpus.  */
> > +    rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0,
0,
> > +                                    &candidates,
&nr_candidates);
> > +    if (rc)
> > +        goto out;
> > +
> > +    LOG(DETAIL, "%d NUMA placement candidates found",
nr_candidates);
> > +
> > +    /* No suitable placement candidates. We just return without
touching the
> > +     * domain''s info->cpumap. It will have affinity with
all nodes/cpus. */
> > +    if (nr_candidates == 0)
> > +        goto out;
> > +
> > +    /* Bring the best candidate in front of the list -->
candidates[0] */
> > +    if (nr_candidates > 1)
> > +        libxl__sort_numa_candidates(candidates, nr_candidates,
numa_cmpf);
> 
> Is the start up cost of libxl__sort_numa_candidates significant enough
> to make that if worthwhile?
> I''m not sure what you mean here, I''m afraid ... :-(
> > +
> > +    /*
> > +     * Check if the domain has any CPU affinity. If not, try to build
up one.
> > +     * In case numa_place_domain() find at least a suitable
candidate, it will
> > +     * affect info->cpumap accordingly; if it does not, it just
leaves it
> > +     * as it is. This means (unless some weird error manifests) the
subsequent
> > +     * call to libxl_set_vcpuaffinity_all() will do the actual
placement,
> > +     * whatever that turns out to be.
> > +     */
> > +    if (libxl_bitmap_is_full(&info->cpumap)) {
> > +        int rc = numa_place_domain(gc, info);
> > +        if (rc)
> > +            return rc;
> 
> I''m not sure about this, if numa_place_domain fails for any reason
would
> we be not better off logging and continuing without placement? We
> already do that explicitly in a couple of cases.
> Well, actually, if it "fails" in the sense it finds no candidates or
does not manage in placing the domain, it returns 0, so we do right what
you''re suggesting. I''m sure this is stated in some comment but
I guess I
can repeat it here. If !rc, it means we stepped into some ERROR_FAIL or
something, which I think has to be handled like this, hasn''t it?

Perhaps I can also add some logging about "successfully failing",
i.e.,
not getting into any error but not being able to place the domain. If
yes, that will happen _inside_ numa_place_domain() rather than here.
> > + * dynamically computed. A candidate can consist of just one, up to
all the
> > + * available nodes on the host (the latter being, of course, not
ideal).
> 
> I can''t parse this sentence.
> Sorry. I just meant a placement candidate is a set of nodes, so it can
be one node, four nodes, or even comprise all the available nodes.
> > +/* signature for the comparison function between two candidates c1
and c2
> > + * (the thid parameter is provided to enable thread safety). */
> 
>            third
> 
> But there isn''t actually a third parameter so it''s a bit
moot ;-)
> 
> > +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void
*v2);
>You''re right. That is a leftover from where I was trying to use
qsort_r() for sorting, but it turned out to not be necessary in this
case. Killed.

Thanks again and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Dario Faggioli

2012-Jun-21 16:43 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Thu, 2012-06-21 at 17:16 +0100, George Dunlap wrote:> > Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
> Overall I think this approach is much better.  
>Cool, Thanks.
> > +    /*
> > +     * Round up and down some of the constraints. For instance, the
minimum
> > +     * number of cpus a candidate should have must at least be
non-negative.
> > +     * Regarding the minimum number of NUMA nodes, if not explicitly
specified
> > +     * (i.e., min_nodes<= 0), we try to figure out a sensible
number of nodes
> > +     * from where to start generating candidates, if possible (or
just start
> > +     * from 1 otherwise). The maximum number of nodes should not
exceed the
> > +     * number of existent NUMA nodes on the host, or the candidate
genaration
> > +     * won''t work properly.
> > +     */
> > +    min_cpus = min_cpus<= 0 ? 0 : min_cpus;
> Wouldn''t it just make more sense to specify that
"min_cpus" (and other
> parameters) had to be >=0?
> Yes, I think that make sense, will do.
> > +/*
> > + * The NUMA placement candidates are reordered according to the
following
> > + * heuristics:
> > + *  - candidates involving fewer nodes come first. In case two (or
> > + *    more) candidates span the same number of nodes,
> > + *  - candidates with greater amount of free memory come first. In
> > + *    case two (or more) candidates differ in their amount of free
> > + *    memory by less than 10%,
> Interesting idea -- sounds pretty reasonable.
>Time will tell... :-O
> > +static int numa_cmpf(const void *v1, const void *v2)
> > +{
> > +    const libxl__numa_candidate *c1 = (const libxl__numa_candidate*)
v1;
> > +    const libxl__numa_candidate *c2 = (const libxl__numa_candidate*)
v2;
> > +    double mem_diff = labs(c1->free_memkb - c2->free_memkb);
> > +    double mem_avg = (c1->free_memkb + c2->free_memkb) / 2.0;
> > +
> > +    if (c1->nr_nodes != c2->nr_nodes)
> > +        return c1->nr_nodes - c2->nr_nodes;
> > +
> > +    if ((mem_diff / mem_avg) * 100.0<  10.0&&
> > +        c1->nr_domains != c2->nr_domains)
> > +        return c1->nr_domains - c2->nr_domains;
> I realize this isn''t a hot path, but it seems like moving into FP
is
> really unnecessary.  You can just do this:
> Yeah, IanC pointed out that too. I''ll convert everything toward integer
arith.
> One more thing: Is there a reason why you put get_numa_candidates() in 
> libxl_internal.h, but not sort_numa_candidates()?  It seems like both or 
> neither should go. :-)
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib
...
+_hidden int libxl__get_numa_candidates(libxl__gc *gc,
+                                uint32_t min_free_memkb, int min_cpus,
+                                int min_nodes, int max_nodes,
+                                libxl__numa_candidate *cndts[], int *nr_cndts);
+
...
+/* signature for the comparison function between two candidates c1 and c2
+ * (the thid parameter is provided to enable thread safety). */
+typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2);
+/* sort the list of candidates in cndts (an array with nr_cndts elements in
+ * it) using cmpf for comparing two candidates. Uses libc''s qsort().
*/
+_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
+                                         int nr_cndts,
+                                         libxl__numa_candidate_cmpf cmpf);

But I''m not entirely sure I understood what you meant...
> That''s all I have for now.  I''m OK with the general
approach, so here''s
> a "weak ack", so if a maintainer is happy with the code, he can
check it in:
> 
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> Ok, that''s very nice. I''ll have to respin the series, so
I''ll definitely
address your comments and add your ack. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

George Dunlap

2012-Jun-22 10:05 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Thu, Jun 21, 2012 at 5:43 PM, Dario Faggioli <raistlin@linux.it>
wrote:>> > + *  - candidates with greater amount of free memory come first.
In
>> > + *    case two (or more) candidates differ in their amount of
free
>> > + *    memory by less than 10%,
>> Interesting idea -- sounds pretty reasonable.
>>
> Time will tell... :-O
The only potential concern is that it seems likely that your
comparison function above will not actually generate an ordering,
because the relationship it generates isn''t transitive.  That is, you
could have a situation where A > B, B > C, but C > A.  For example:
A: freemem 1000, domains 1
B: freemem 1090, domains 2
C: freemem 1110, domains 3

In this case, A>B, because memory is within 10% but has fewer domains.
 B > C, because B is within 10%, and has fewer domains.  But C > A,
because C has more than 10% more memory than A (so its domains are not
counted against it).

 This may give you strange results from your quicksort.  But it''s
likely to give something *reasonable* anyway.  (And if it turns out to
cause problems, a bubble sort shouldn''t be too expensive and will
likely to be more robust against this kind of quirkiness.)

 -George

Ian Campbell

2012-Jun-22 10:14 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Thu, 2012-06-21 at 17:34 +0100, Dario Faggioli wrote:> On Thu, 2012-06-21 at 12:40 +0100, Ian Campbell wrote:
> > > +/*
> > > + * This function tries to figure out if the host has a
consistent number
> > > + * of cpus along all its NUMA nodes. In fact, if that is the
case, we can
> > > + * calculate the minimum number of nodes needed for a domain by
just
> > > + * dividing its total number of vcpus by this value computed
here.
> > > + * However, we are not allowed to assume that all the nodes have
the
> > > + * same number of cpus. Therefore, in case discrepancies among
different
> > > + * nodes are found, this function just returns 0 and the caller
needs
> > > + * to sort out things in some other way.
> > 
> > If the caller has to deal with this anyway why bother with this check
> > and/or the other algorithm?
> > 
> Mmm... I''m not sure I get what you mean here. Perhaps my
commenting
> about the whole thing was not so clear in the first place.
> 
> The (George''s :-)) idea is, if we know each node has 8 CPUs, and
our
> domain wants more than that, it does not make sense to start looking for
> placement solutions with just one node. Actually, if the domain wants
> fewer than 16 CPUs, starting looking from solutions with 2 nodes in them
> should be the right thing.
> 
> That being said, I think it''s important we do not assume we
won''t ever
> find some architecture with different number of CPUs in different nodes,
> and that is what a zero return from that function means.
> 
> What was that you thought not to be worthwhile?
The comment said "this function just returns 0 and the caller needs to
sort out things in some other way" so I was wondering -- well if the
caller has some algorithm which can cope with this why not always use
it. 

I think I later observed that if it returns zero the caller in this case
does something basic instead, so I see how this makes sense now.
> 
> > > +/* Get all the placement candidates satisfying some specific
conditions */
> > > +int libxl__get_numa_candidates(libxl__gc *gc,
> > > +                               uint32_t min_free_memkb, int
min_cpus,
> > > +                               int min_nodes, int max_nodes,
> > > +                               libxl__numa_candidate *cndts[],
int *nr_cndts)
> > > +{
> > > +    libxl__numa_candidate *new_cndts = NULL;
> > > +    libxl_cputopology *tinfo = NULL;
> > > +    libxl_numainfo *ninfo = NULL;
> > > +    libxl_bitmap nodemap;
> > > +    int nr_nodes, nr_cpus;
> > > +    int array_size, rc;
> > > +
> > > +    /* Get platform info and prepare the map for testing the
combinations */
> > > +    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
> > > +    if (ninfo == NULL)
> > > +        return ERROR_FAIL;
> > > +    /* If we don''t have at least 2 nodes, it is useless
to proceed */
> > 
> > You don''t want to return whichever of those 2 nodes meets the
> > constraints?
> > 
> I actually was meaning something like "hey, there''s only _1_
node, go
> for it!" :-P
I misread sorry!
> 
> > ...
> > > +    *cndts = new_cndts;
> > > + out:
> > > +    libxl_bitmap_dispose(&nodemap);
> > 
> > nodemap might not have been initialised?
> >
> Mmm... Right. So I really think I need an init function. :-(
In general all libxl datastructures are supposed to have one, most of
them are just memset(.., 0)
> 
> > > +    libxl_cputopology_list_free(tinfo, nr_cpus);
> > > +    libxl_numainfo_list_free(ninfo, nr_nodes);
> > 
> > It looks like the nr_* may also not have been initialised, but maybe
> > list_free is safe in that case since the associated array must
> > necessarily be NULL?
> > 
> Well, probably, but as they both do a "for (i=0;i<nr_*;i++)" I
think
> it''s safer if I "=0" those twos, is that ok?
I think it would be a good idea.

The alternative is that anything which returns a list has to set nr =0.
> > > +    /* Bring the best candidate in front of the list -->
candidates[0] */
> > > +    if (nr_candidates > 1)
> > > +        libxl__sort_numa_candidates(candidates, nr_candidates,
numa_cmpf);
> > 
> > Is the start up cost of libxl__sort_numa_candidates significant enough
> > to make that if worthwhile?
> > 
> I''m not sure what you mean here, I''m afraid ... :-(
Is the cost of sorting the 1 element list so high it is worth avoiding.
Since the sort itself would be trivial the startup costs are what would
dominate.
> 
> > > +
> > > +    /*
> > > +     * Check if the domain has any CPU affinity. If not, try to
build up one.
> > > +     * In case numa_place_domain() find at least a suitable
candidate, it will
> > > +     * affect info->cpumap accordingly; if it does not, it
just leaves it
> > > +     * as it is. This means (unless some weird error manifests)
the subsequent
> > > +     * call to libxl_set_vcpuaffinity_all() will do the actual
placement,
> > > +     * whatever that turns out to be.
> > > +     */
> > > +    if (libxl_bitmap_is_full(&info->cpumap)) {
> > > +        int rc = numa_place_domain(gc, info);
> > > +        if (rc)
> > > +            return rc;
> > 
> > I''m not sure about this, if numa_place_domain fails for any
reason would
> > we be not better off logging and continuing without placement? We
> > already do that explicitly in a couple of cases.
> > 
> Well, actually, if it "fails" in the sense it finds no candidates
or
> does not manage in placing the domain, it returns 0, so we do right what
> you''re suggesting. I''m sure this is stated in some
comment but I guess I
> can repeat it here. If !rc, it means we stepped into some ERROR_FAIL or
> something, which I think has to be handled like this, hasn''t it?
That makes sense.
> Perhaps I can also add some logging about "successfully failing",
i.e.,
> not getting into any error but not being able to place the domain. If
> yes, that will happen _inside_ numa_place_domain() rather than here.
Logging in that case seems wise in any case since it will have
performance implications I think.

Ian.

Ian Jackson

2012-Jun-26 11:03 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

George Dunlap writes ("Re: [Xen-devel] [PATCH 08 of 10 v2] libxl: enable
automatic placement of guests on NUMA nodes"):> The only potential concern is that it seems likely that your
> comparison function above will not actually generate an ordering,
> because the relationship it generates isn''t transitive.  That is,
you
> could have a situation where A > B, B > C, but C > A.  For
example:
> A: freemem 1000, domains 1
> B: freemem 1090, domains 2
> C: freemem 1110, domains 3
> 
> In this case, A>B, because memory is within 10% but has fewer domains.
>  B > C, because B is within 10%, and has fewer domains.  But C > A,
> because C has more than 10% more memory than A (so its domains are not
> counted against it).
The conventional approach to this kind of thing is to invent a score
for sorting etc.  Something like
    score = tuning_parameter * log(freemem) - number_of_domains
which does sort of roughly the same as your 10% rule if you squint.

Or maybe you want to take the log of the number_of_domains too, which
would be equivalent for sorting to
    score = freemem / (number_of_domains + other_tuning_parameter)
>  This may give you strange results from your quicksort.  But it''s
> likely to give something *reasonable* anyway.  (And if it turns out to
> cause problems, a bubble sort shouldn''t be too expensive and will
> likely to be more robust against this kind of quirkiness.)
The behaviour of sorting functions when giving broken comparison
functions may be entirely arbitrary and is not guaranteed to be
anything near the plausible intended orderings.  If you''re using qsort
you''re not guaranteed that any particular algorithm will be used, so
you can''t make any assumptions about its behaviour in this case.

Ian.

Dario Faggioli

2012-Jun-26 15:20 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Tue, 2012-06-26 at 12:03 +0100, Ian Jackson wrote:> George Dunlap writes ("Re: [Xen-devel] [PATCH 08 of 10 v2] libxl:
enable automatic placement of guests on NUMA nodes"):
> > The only potential concern is that it seems likely that your
> > comparison function above will not actually generate an ordering,
> > because the relationship it generates isn''t transitive.  That
is, you
> > could have a situation where A > B, B > C, but C > A.  For
example:
> > A: freemem 1000, domains 1
> > B: freemem 1090, domains 2
> > C: freemem 1110, domains 3
> > 
> > In this case, A>B, because memory is within 10% but has fewer
domains.
> >  B > C, because B is within 10%, and has fewer domains.  But C >
A,
> > because C has more than 10% more memory than A (so its domains are not
> > counted against it).
> 
> The conventional approach to this kind of thing is to invent a score
> for sorting etc.  Something like
>     score = tuning_parameter * log(freemem) - number_of_domains
> which does sort of roughly the same as your 10% rule if you squint.
> Right, I thought about this, but then got a bit scared about properly
mixing apple, oranges and melons (as I have number of nodes, amount of
free memory and number of domains), giving the proper "weight" to each
of them.

Anyway, I see the potential issue George is reporting, so I''ll try to
come up with some formula...

Thanks and regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Ian Campbell

2012-Jun-26 16:00 UTC

head link

Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro

On Thu, 2012-06-21 at 09:53 +0100, Ian Campbell wrote:> On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:
> > Causing a build failure when trying to use it:
> > 
> > xxx: error: expected '';'' before
'')'' token
> > xxx: error: expected statement before '')'' token
> > 
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
This bugfix was mostly independent from the rest of the series, so I''ve
committed it.
> 
> > 
> > diff --git a/tools/libxl/libxl_internal.h
b/tools/libxl/libxl_internal.h
> > --- a/tools/libxl/libxl_internal.h
> > +++ b/tools/libxl/libxl_internal.h
> > @@ -1972,7 +1972,7 @@ struct libxl__domain_create_state {
> >  #define GCREALLOC_ARRAY(var, nmemb)                                  
\
> >      (assert(nmemb > 0),                                           
\
> >       assert(ARRAY_SIZE_OK((var), (nmemb))),                          
\
> > -     (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var)))))
> > +     (var) = libxl__realloc((gc), (var), (nmemb)*sizeof(*(var))))
> >  
> > 
> >  /*
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

Dario Faggioli

2012-Jun-26 16:25 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Fri, 2012-06-22 at 11:14 +0100, Ian Campbell wrote:> > > > +    /* Bring the best candidate in front of the list -->
candidates[0] */
> > > > +    if (nr_candidates > 1)
> > > > +        libxl__sort_numa_candidates(candidates,
nr_candidates, numa_cmpf);
> > > 
> > > Is the start up cost of libxl__sort_numa_candidates significant
enough
> > > to make that if worthwhile?
> > > 
> > I''m not sure what you mean here, I''m afraid ... :-(
> 
> Is the cost of sorting the 1 element list so high it is worth avoiding.
> Since the sort itself would be trivial the startup costs are what would
> dominate.
> So, the answer is that I really don''t know, as I''m using a
library
function for doing the actual work. I really do not expect for them to
be that high, and I hence can remove the if() if you think it looks
ugly. :-)
> > > I''m not sure about this, if numa_place_domain fails for
any reason would
> > > we be not better off logging and continuing without placement? We
> > > already do that explicitly in a couple of cases.
> > > 
> > Well, actually, if it "fails" in the sense it finds no
candidates or
> > does not manage in placing the domain, it returns 0, so we do right
what
> > you''re suggesting. I''m sure this is stated in some
comment but I guess I
> > can repeat it here. If !rc, it means we stepped into some ERROR_FAIL
or
> > something, which I think has to be handled like this, hasn''t
it?
> 
> That makes sense.
> 
> > Perhaps I can also add some logging about "successfully
failing", i.e.,
> > not getting into any error but not being able to place the domain. If
> > yes, that will happen _inside_ numa_place_domain() rather than here.
> 
> Logging in that case seems wise in any case since it will have
> performance implications I think.
> Ok, I''ll log something out.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Ian Campbell

2012-Jun-26 16:26 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Tue, 2012-06-26 at 17:25 +0100, Dario Faggioli wrote:> On Fri, 2012-06-22 at 11:14 +0100, Ian Campbell wrote:
> > > > > +    /* Bring the best candidate in front of the list
--> candidates[0] */
> > > > > +    if (nr_candidates > 1)
> > > > > +        libxl__sort_numa_candidates(candidates,
nr_candidates, numa_cmpf);
> > > > 
> > > > Is the start up cost of libxl__sort_numa_candidates
significant enough
> > > > to make that if worthwhile?
> > > > 
> > > I''m not sure what you mean here, I''m afraid ...
:-(
> > 
> > Is the cost of sorting the 1 element list so high it is worth
avoiding.
> > Since the sort itself would be trivial the startup costs are what
would
> > dominate.
> > 
> So, the answer is that I really don''t know, as I''m using
a library
> function for doing the actual work. I really do not expect for them to
> be that high, and I hence can remove the if() if you think it looks
> ugly. :-)
Not really ugly, it just seemed like an odd optimisation to make.
> 
> > > > I''m not sure about this, if numa_place_domain fails
for any reason would
> > > > we be not better off logging and continuing without
placement? We
> > > > already do that explicitly in a couple of cases.
> > > > 
> > > Well, actually, if it "fails" in the sense it finds no
candidates or
> > > does not manage in placing the domain, it returns 0, so we do
right what
> > > you''re suggesting. I''m sure this is stated in
some comment but I guess I
> > > can repeat it here. If !rc, it means we stepped into some
ERROR_FAIL or
> > > something, which I think has to be handled like this,
hasn''t it?
> > 
> > That makes sense.
> > 
> > > Perhaps I can also add some logging about "successfully
failing", i.e.,
> > > not getting into any error but not being able to place the
domain. If
> > > yes, that will happen _inside_ numa_place_domain() rather than
here.
> > 
> > Logging in that case seems wise in any case since it will have
> > performance implications I think.
> > 
> Ok, I''ll log something out.
> 
> Thanks and Regards,
> Dario
>

Dario Faggioli

2012-Jun-26 16:26 UTC

head link

Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro

On Tue, 2012-06-26 at 17:00 +0100, Ian Campbell wrote:> On Thu, 2012-06-21 at 09:53 +0100, Ian Campbell wrote:
> > On Fri, 2012-06-15 at 18:04 +0100, Dario Faggioli wrote:
> > > Causing a build failure when trying to use it:
> > > 
> > > xxx: error: expected '';'' before
'')'' token
> > > xxx: error: expected statement before '')'' token
> > > 
> > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > 
> > Acked-by: Ian Campbell <ian.campbell@citrix.com>
> 
> This bugfix was mostly independent from the rest of the series, Indeed. I submitted it as a separate patch the week before, but it maybe
got lost. :-)
> so I''ve committed it.Cool.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Ian Jackson

2012-Jun-26 17:23 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Ian Campbell writes ("Re: [PATCH 08 of 10 v2] libxl: enable automatic
placement of guests on NUMA nodes"):> On Tue, 2012-06-26 at 17:25 +0100, Dario Faggioli wrote:
> > So, the answer is that I really don''t know, as I''m
using a library
> > function for doing the actual work. I really do not expect for them to
> > be that high, and I hence can remove the if() if you think it looks
> > ugly. :-)
> 
> Not really ugly, it just seemed like an odd optimisation to make.
Yes, please remove the if().  It makes the code more complicated and
this is not a hot path.

Thanks,
Ian

Dario Faggioli

2012-Jun-27 08:15 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Tue, 2012-06-26 at 12:03 +0100, Ian Jackson wrote:> The conventional approach to this kind of thing is to invent a score
> for sorting etc.  Something like
>     score = tuning_parameter * log(freemem) - number_of_domains
> which does sort of roughly the same as your 10% rule if you squint.
>However, if I go for this approach, I think I should stick to floating
point arith, as I''m not sure I can properly to things like
normalization
and take log()-s with integers... :-/

Would that be fine?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Zhang, Yang Z

2012-Jun-28 07:25 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Dario Faggioli wrote on 2012-06-16:> If a domain does not have a VCPU affinity, try to pin it automatically
> to some PCPUs. This is done taking into account the NUMA characteristics
> of the host. In fact, we look for a combination of host''s NUMA
nodes
> with enough free memory and number of PCPUs for the new domain, and pin
> it to the VCPUs of those nodes.
> 
> Once we know which ones, among all the possible combinations, represents
> valid placement candidates for a domain, use some heuistics for deciding
> which is the best. For instance, smaller candidates are considered to be
> better, both from the domain''s point of view (fewer memory
spreading
> among nodes) and from the system as a whole point of view (fewer memoy
> fragmentation).  In case of candidates of equal sizes (i.e., with the
> same number of nodes), the one with the greater amount of memory wins,
> as this is also good for keeping memory fragmentation under control.
> 
> This all happens internally to libxl, and no API for driving the mechanism
is
> provided for now. This matches what xend already does.we should consider the basic IONUMA. I mean when a guest with a device assigned,
from the point of view of performance, it''s better to allocated the
memory from the node
which the device belongs to.
Furthermore, we need consider the guest NUMA(I am working on it now) and live
migration.

Best regards,
Yang
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> Changes from v1:
>  * This patches incorporates the changes from both "libxl, xl: enable
automatic
>    placement of guests on NUMA nodes" and "libxl, xl: heuristics
for
>    reordering NUMA placement candidates" from v1. * The logic of the
>    algorithm is basically the same as in v1, but the splitting of it in
>    the various functions has been completely redesigned from scratch. *
>    No public API for placement or candidate generation is now exposed,
>    everything happens within libxl, as agreed during v1 review. * The
>    relevant documentation have been moved near the actual functions and
>    features. Also, the amount and (hopefully!) the quality of the
>    documentation has been improved a lot, as requested. * All the
>    comments about using the proper libxl facilities and helpers for
>    allocations, etc., have been considered and applied. * This patch
>    still bails out from NUMA optimizations if it find out cpupools are
>    being utilized. It is next patch that makes the two things interact
>    properly, as suggested during review.
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -111,8 +111,8 @@ created online and the remainder will be
> 
>  =item B<cpus="CPU-LIST">
> -List of which cpus the guest is allowed to use. Default behavior is
> -`all cpus`. A C<CPU-LIST> may be specified as follows:
> +List of which cpus the guest is allowed to use. By default xl will (via
> +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as
follows:
> 
>  =over 4
> @@ -132,6 +132,12 @@ run on cpu #3 of the host.
> 
>  =back
> +If this option is not specified, libxl automatically tries to place the
> new +domain on the host''s NUMA nodes (provided the host has more
than
> one NUMA +node) by pinning it to the cpus of those nodes. An heuristical
> approach is +utilized with the goals of maximizing performance for the
> domain and, at +the same time, achieving efficient utilization of the
> host''s CPUs and RAM. +
>  =item B<cpu_weight=WEIGHT>
>  
>  A domain with a weight of 512 will get twice as much CPU as a domain
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -95,6 +95,459 @@ out:
>      return sched;
>  }
> +/* + * What follows are helpers for generating all the k-combinations +
> * without repetitions of a set S with n elements in it. Formally + *
> speaking, they are subsets of k distinct elements of S and, if + * S is
> n elements big, the number of k-combinations is equal to + * the
> binomial coefficient (n k), which means we get exactly + * n!/(k! * (n -
> k)!) subsets, all of them with k elements. + * + * The various subset
> are geneated one after the other by calling + * comb_init() first, and,
> after that, comb_next() + * (n k)-1 times. An iterator is used to store
> the curent status + * of the whole generation operation (i.e.,
> basically, the last + * combination that has been generated). As soon as
> all + * combinations have been generated, comb_next() will + * start
> returning 0 instead of 1. It is of course impotant that + * the same
> instance of the iterator and the same values for + * n and k are used
> for each call. If that doesn''t happen, the + * result is
unspecified. +
> * + * The algorithm is a well known one, and it produces the + *
> combinations in such a way that they (well, more precisely, + * their
> indexes it the array/map representing the set) come in + * lexicogaphic
> order. + * + * For example, with n = 5 and k = 3, calling comb_init() +
> * will generate { 0, 1, 2 }, while subsequent valid calls to + *
> comb_next() will produce the following: + * { { 0, 1, 2 }, { 0, 1, 3 },
> { 0, 1, 4 }, + *   { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 }, + *   { 1, 2,
> 3 }, { 1, 2, 4 }, { 1, 3, 4 }, + *   { 2, 3, 4 } } + * + * This is used
> by the automatic NUMA placement logic below. + */ +typedef int*
> comb_iter_t; + +static int comb_init(libxl__gc *gc, comb_iter_t *it, int
> n, int k) +{ +    comb_iter_t new_iter; +    int i; + +    if (n < k) + 
>       return 0; + +    /* First set is always { 0, 1, 2, ..., k-1 } */ +
>    GCNEW_ARRAY(new_iter, k); +    for (i = 0; i < k; i++) +       
> new_iter[i] = i; + +    *it = new_iter; +    return 1; +} + +static int
> comb_next(comb_iter_t it, int n, int k) +{ +    int i; + +    /* +     *
> The idea here is to find the leftmost element from where +     * we
> should start incrementing the indexes of the iterator. +     * This
> means looking for the highest index that can be increased +     * while
> still producing value smaller than n-1. In the example +     * above,
> when dealing with { 0, 1, 4 }, such an element is the +     * second
> one, as the third is already equal to 4 (which actually +     * is n-1).
> +     * Once we found from where to start, we increment that element +  
>   * and overide the right-hand rest of the iterator with its +     *
> successors, thus achieving lexicographic ordering. +     * +     *
> Regarding the termination of the generation process, when we +     *
> manage in bringing n-k at the very first position of the iterator, +    
> * we know that is the last valid combination ( { 2, 3, 4 }, with +     *
> n - k = 5 - 2 = 2, in the example above), and thus we start +     *
> returning 0 as soon as we cross that border. +     */ +    for (i = k -
> 1; it[i] == n - k + i; i--) { +        if (i <= 0) +            return
> 0; +    } +    for (it[i]++, i++; i < k; i++) +        it[i] = it[i - 1]
> + 1; +    return 1; +} + +/* NUMA automatic placement (see
> libxl_internal.h for details) */ + +/* + * This function turns a
> k-combination iterator into a node map. + * This means the bits in the
> node map corresponding to the indexes + * of the given combination are
> the ones that will be set. + * For example, if the iterator represents
> the combination { 0, 2, 4}, + * the node map will have bits #0, #2 and
> #4 set to one and i thus + * will point to node 0, node 2 and and node
> 4. + */ +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap
> *nodemap, int k) +{ +    int i; + +    libxl_bitmap_set_none(nodemap); +
>    for (i = 0; i < k; i++) +        libxl_bitmap_set(nodemap, it[i]); +}
> + +/* Retrieve the number of cpus that the nodes that are part of the
> nodemap + * span. */ +static int nodemap_to_nodes_cpus(libxl_cputopology
> *tinfo, int nr_cpus, +                                 const
> libxl_bitmap *nodemap) +{ +    int i, nodes_cpus = 0; + +    for (i = 0;
> i < nr_cpus; i++) { +        if (libxl_bitmap_test(nodemap,
> tinfo[i].node)) +            nodes_cpus++; +    } +    return
> nodes_cpus; +} + +/* Retrieve the amount of free memory within the
> nodemap */ +static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo,
> +                                      libxl_bitmap *nodemap) +{ +   
> uint32_t free_memkb = 0; +    int i; + +    libxl_for_each_set_bit(i,
> *nodemap) +        free_memkb += ninfo[i].free / 1024; + +    return
> free_memkb; +} + +/* Retrieve the number of domains that can potentially
> run on the cpus + * the nodes that are part of the nodemap. */ +static
> int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo, +    
>                             const libxl_bitmap *nodemap) +{ +   
> libxl_dominfo *dinfo = NULL; +    libxl_bitmap dom_nodemap; +    int
> nr_doms, nr_cpus; +    int nr_domains = 0; +    int i, j, k; + +   
> dinfo = libxl_list_domain(CTX, &nr_doms); +    if (dinfo == NULL) +
>   return ERROR_FAIL; + +    if (libxl_node_bitmap_alloc(CTX,
> &dom_nodemap) < 0) { +        nr_domains = ERROR_FAIL; +        goto
> out; +    } + +    for (i = 0; i < nr_doms; i++) { +       
> libxl_vcpuinfo *vinfo; +        int nr_vcpus; + +        vinfo >
libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus); +        if
> (vinfo == NULL) +            continue; + +       
> libxl_bitmap_set_none(&dom_nodemap); +        for (j = 0; j <
nr_vcpus;
> j++) { +            libxl_for_each_set_bit(k, vinfo[j].cpumap) +        
>        libxl_bitmap_set(&dom_nodemap, tinfo[k].node); +        } + +
>    libxl_for_each_set_bit(j, dom_nodemap) { +            if
> (libxl_bitmap_test(nodemap, j)) { +                nr_domains++; +      
>          goto found; +            } +        } + found: +       
> libxl_vcpuinfo_list_free(vinfo, nr_vcpus); +    } + + out: +   
> libxl_bitmap_dispose(&dom_nodemap); +    libxl_dominfo_list_free(dinfo,
> nr_doms); +    return nr_domains; +} + +/* + * This function tries to
> figure out if the host has a consistent number + * of cpus along all its
> NUMA nodes. In fact, if that is the case, we can + * calculate the
> minimum number of nodes needed for a domain by just + * dividing its
> total number of vcpus by this value computed here. + * However, we are
> not allowed to assume that all the nodes have the + * same number of
> cpus. Therefore, in case discrepancies among different + * nodes are
> found, this function just returns 0 and the caller needs + * to sort out
> things in some other way. + */ +static int
> cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus, +            
>                   libxl_numainfo *ninfo, int nr_nodes) +{ +    int
> cpus_per_node = 0; +    int j, i; + +    /* This makes sense iff # of
> PCPUs is the same for all nodes */ +    for (j = 0; j < nr_nodes; j++) {
> +        int curr_cpus = 0; + +        for (i = 0; i < nr_cpus; i++) { +
>            if (tinfo[i].node == j) +                curr_cpus++; +      
>  } +        /* So, if the above does not hold, turn the whole thing off!
> */ +        cpus_per_node = cpus_per_node == 0 ? curr_cpus :
> cpus_per_node; +        if (cpus_per_node != curr_cpus) +           
> return 0; +    } +    return cpus_per_node; +} + +/* Get all the
> placement candidates satisfying some specific conditions */ +int
> libxl__get_numa_candidates(libxl__gc *gc, +                             
>  uint32_t min_free_memkb, int min_cpus, +                              
> int min_nodes, int max_nodes, +                              
> libxl__numa_candidate *cndts[], int *nr_cndts) +{ +   
> libxl__numa_candidate *new_cndts = NULL; +    libxl_cputopology *tinfo >
NULL; +    libxl_numainfo *ninfo = NULL; +    libxl_bitmap nodemap; +
> int nr_nodes, nr_cpus; +    int array_size, rc; + +    /* Get platform
> info and prepare the map for testing the combinations */ +    ninfo >
libxl_get_numainfo(CTX, &nr_nodes); +    if (ninfo == NULL) +
> return ERROR_FAIL; +    /* If we don''t have at least 2 nodes, it
is
> useless to proceed */ +    if (nr_nodes < 2) { +        rc = 0; +       
> goto out; +    } + +    tinfo = libxl_get_cpu_topology(CTX, &nr_cpus);
+
>    if (tinfo == NULL) { +        rc = ERROR_FAIL; +        goto out; +  
>  } + +    rc = libxl_node_bitmap_alloc(CTX, &nodemap); +    if (rc) +
>     goto out; + +    /* +     * Round up and down some of the
> constraints. For instance, the minimum +     * number of cpus a
> candidate should have must at least be non-negative. +     * Regarding
> the minimum number of NUMA nodes, if not explicitly specified +     *
> (i.e., min_nodes <= 0), we try to figure out a sensible number of nodes
> +     * from where to start generating candidates, if possible (or just
> start +     * from 1 otherwise). The maximum number of nodes should not
> exceed the +     * number of existent NUMA nodes on the host, or the
> candidate genaration +     * won''t work properly. +     */ +   
min_cpus
> = min_cpus <= 0 ? 0 : min_cpus; +    if (min_nodes <= 0) { +       
int
> cpus_per_node; + +        cpus_per_node = cpus_per_node_count(tinfo,
> nr_cpus, ninfo, nr_nodes); +        if (cpus_per_node == 0) +           
> min_nodes = 1; +        else +            min_nodes = (min_cpus +
> cpus_per_node - 1) / cpus_per_node; +    } +    min_nodes = min_nodes >
> nr_nodes ? nr_nodes : min_nodes; +    if (max_nodes <= 0) +       
> max_nodes = nr_nodes; +    else +        max_nodes = max_nodes >
> nr_nodes ? nr_nodes : max_nodes; +    if (min_nodes > max_nodes) { +    
>    rc = ERROR_INVAL; +        goto out; +    } + +    /* Initialize the
> local storage for the combinations */ +    *nr_cndts = 0; +   
> array_size = nr_nodes; +    GCNEW_ARRAY(new_cndts, array_size); + +   
> /* Generate all the combinations of any size from min_nodes to +     *
> max_nodes (see comb_init() and comb_next()). */ +    while (min_nodes
<> max_nodes) { +        comb_iter_t comb_iter; +        int comb_ok; + +
>      /* +	 * And here it is. Each step of this cycle generates a
> combination of +	 * nodes as big as min_nodes mandates.  Each of these
> combinations is +	 * checked against the constraints provided by the
> caller (namely, +	 * amount of free memory and number of cpus) and it
> becomes an actual +	 * placement candidate iff it passes the check. +   
>      */ +        for (comb_ok = comb_init(gc, &comb_iter, nr_nodes,
> min_nodes); comb_ok; +             comb_ok = comb_next(comb_iter,
> nr_nodes, min_nodes)) { +            uint32_t nodes_free_memkb; +       
>     int nodes_cpus; + +            comb_get_nodemap(comb_iter,
&nodemap,
> min_nodes); + +            /* If there is not enough memoy in this
> combination, skip it +             * and go generating the next one...
> */ +            nodes_free_memkb = nodemap_to_free_memkb(ninfo,
> &nodemap); +            if (min_free_memkb > 0 &&
nodes_free_memkb <
> min_free_memkb) +                continue; + +            /* And the
> same applies if this combination is short in cpus */ +           
> nodes_cpus = nodemap_to_nodes_cpus(tinfo, nr_cpus, &nodemap); +
>   if (min_cpus > 0 && nodes_cpus < min_cpus) +               
continue;
> + +            /* +             * Conditions are met, we can add this
> combination to the +             * NUMA placement candidates list. We
> first make sure there +             * is enough space in there, and then
> we initialize the new +             * candidate element with the node
> map corresponding to the +             * combination we are dealing
> with. Memory allocation for +             * expanding the array that
> hosts the list happens in chunks +             * equal to the number of
> NUMA nodes in the system (to +             * avoid allocating memory
> each and every time we find a +             * new candidate). +         
>    */ +            if (*nr_cndts == array_size) +               
> array_size += nr_nodes; +            GCREALLOC_ARRAY(new_cndts,
> array_size); + +            libxl__numa_candidate_alloc(gc,
> &new_cndts[*nr_cndts]); +           
> libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts], +
>                                    &nodemap); +           
> new_cndts[*nr_cndts].nr_domains = +                                   
> nodemap_to_nr_domains(gc, tinfo, &nodemap); +           
> new_cndts[*nr_cndts].free_memkb = nodes_free_memkb; +           
> new_cndts[*nr_cndts].nr_nodes = min_nodes; +           
> new_cndts[*nr_cndts].nr_cpus = nodes_cpus; + +            LOG(DEBUG,
> "NUMA placement candidate #%d found: nr_nodes=%d, " +
>    "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts, +
>  min_nodes, new_cndts[*nr_cndts].nr_cpus, +                      
> new_cndts[*nr_cndts].free_memkb / 1024); + +            (*nr_cndts)++; +
>        } +        min_nodes++; +    } + +    *cndts = new_cndts; + out:
> +    libxl_bitmap_dispose(&nodemap); +   
> libxl_cputopology_list_free(tinfo, nr_cpus); +   
> libxl_numainfo_list_free(ninfo, nr_nodes); +    return rc; +} + +void
> libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts,
> +                                 libxl__numa_candidate_cmpf cmpf) +{ + 
>   /* Reorder candidates (see the comparison function for +     * the
> details on the heuristics) */ +    qsort(cndts, nr_cndts,
> sizeof(cndts[0]), cmpf); +} + +/* + * The NUMA placement candidates are
> reordered according to the following + * heuristics: + *  - candidates
> involving fewer nodes come first. In case two (or + *    more)
> candidates span the same number of nodes, + *  - candidates with greater
> amount of free memory come first. In + *    case two (or more)
> candidates differ in their amount of free + *    memory by less than
> 10%, + *  - candidates with fewer domains insisting on them at the time
> of + *    this call come first. + */ +static int numa_cmpf(const void
> *v1, const void *v2) +{ +    const libxl__numa_candidate *c1 = (const
> libxl__numa_candidate*) v1; +    const libxl__numa_candidate *c2 >
(const libxl__numa_candidate*) v2; +    double mem_diff >
labs(c1->free_memkb - c2->free_memkb); +    double mem_avg >
(c1->free_memkb + c2->free_memkb) / 2.0; + +    if (c1->nr_nodes !>
c2->nr_nodes) +        return c1->nr_nodes - c2->nr_nodes; + +    if
> ((mem_diff / mem_avg) * 100.0 < 10.0 && +       
c1->nr_domains !> c2->nr_domains) +        return c1->nr_domains -
c2->nr_domains; + +
> return c2->free_memkb - c1->free_memkb; +} + +/* The actual automatic
> NUMA placement routine */ +static int numa_place_domain(libxl__gc *gc,
> libxl_domain_build_info *info) +{ +    int nr_candidates = 0; +   
> libxl__numa_candidate *candidates = NULL; +    libxl_bitmap
> candidate_nodemap; +    libxl_cpupoolinfo *pinfo; +    int nr_pools, rc
> = 0; +    uint32_t memkb; + +    /* First of all, if cpupools are in
> use, better not to mess with them */ +    pinfo >
libxl_list_cpupool(CTX, &nr_pools); +    if (!pinfo) +        return
> ERROR_FAIL; +    if (nr_pools > 1) { +        LOG(NOTICE, "skipping
NUMA
> placement as cpupools are in use"); +        goto out; +    } + +   
rc
> = libxl_domain_need_memory(CTX, info, &memkb); +    if (rc) +       
> goto out; +    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap)) {
+
>        rc = ERROR_FAIL; +        goto out; +    } + +    /* Find all the
> candidates with enough free memory and at least +     * as much pcpus as
> the domain has vcpus.  */ +    rc = libxl__get_numa_candidates(gc,
> memkb, info->max_vcpus, 0, 0, +                                   
> &candidates, &nr_candidates); +    if (rc) +        goto out; + +
> LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates);
+ +
> /* No suitable placement candidates. We just return without touching the
> +     * domain''s info->cpumap. It will have affinity with all
> nodes/cpus. */ +    if (nr_candidates == 0) +        goto out; + +    /*
> Bring the best candidate in front of the list --> candidates[0] */ +   
> if (nr_candidates > 1) +        libxl__sort_numa_candidates(candidates,
> nr_candidates, numa_cmpf); + +    /* +     * At this point, the first
> candidate in the array is the one we want. +     * Go for it by mapping
> its node map to the domain''s info->cpumap. +     */ +   
> libxl__numa_candidate_get_nodemap(gc, &candidates[0],
> &candidate_nodemap); +    rc = libxl_nodemap_to_cpumap(CTX,
> &candidate_nodemap, &info->cpumap); +    if (rc) +        goto
out; + +
>   LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and
" +
>              "%"PRIu32" KB free selected",
candidates[0].nr_nodes, +
>            candidates[0].nr_cpus, candidates[0].free_memkb / 1024); + +
> out: +    libxl_bitmap_dispose(&candidate_nodemap); +   
> libxl_cpupoolinfo_list_free(pinfo, nr_pools); +    return rc; +} +
>  int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>                libxl_domain_build_info *info, libxl__domain_build_state
> *state)
>  {
> @@ -104,7 +557,22 @@ int libxl__build_pre(libxl__gc *gc, uint
>      uint32_t rtc_timeoffset;
>      
>      xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
> +
> +    /*
> +     * Check if the domain has any CPU affinity. If not, try to build up
one.
> +     * In case numa_place_domain() find at least a suitable candidate, it
will
> +     * affect info->cpumap accordingly; if it does not, it just leaves
it
> +     * as it is. This means (unless some weird error manifests) the
subsequent
> +     * call to libxl_set_vcpuaffinity_all() will do the actual placement,
> +     * whatever that turns out to be.
> +     */
> +    if (libxl_bitmap_is_full(&info->cpumap)) {
> +        int rc = numa_place_domain(gc, info);
> +        if (rc)
> +            return rc;
> +    }
>      libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus,
>      &info->cpumap); + xc_domain_setmaxmem(ctx->xch, domid,
>      info->target_memkb + LIBXL_MAXMEM_CONSTANT); if (info->type
=>      LIBXL_DOMAIN_TYPE_PV)
>          xc_domain_set_memmap_limit(ctx->xch, domid,
> diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
> --- a/tools/libxl/libxl_internal.h
> +++ b/tools/libxl/libxl_internal.h
> @@ -2021,6 +2021,134 @@ static inline void libxl__ctx_unlock(lib
>  #define CTX_LOCK (libxl__ctx_lock(CTX))
>  #define CTX_UNLOCK (libxl__ctx_unlock(CTX))
> +/* + * Automatic NUMA placement + * + * These functions and data
> structures deal with the initial placement of a + * domain onto the host
> NUMA nodes. + * + * The key concept here is the one of "numa placement
> candidate" (or jusr "numa + * candidate", "placement
candidate" or even
> "candidate"), which basically is a + * set of nodes whose
> characteristics have been successfully checked against + * some specific
> requirements. More pecisely, a candidate is the nodemap + * associated
> with one of the possible subset of the host NUMA nodes providing + * a
> certain amount of free memory, or a given number of cpus, or even both +
> * (depending in what the caller wants). For convenience of use, some of
> this + * information are stored within the candidate itselfs, instead of
> always being + * dynamically computed. A candidate can consist of just
> one, up to all the + * available nodes on the host (the latter being, of
> course, not ideal).  For + * instance, looking for a numa candidates
> with 2GB of free memoy means we want + * all the possible subsets of the
> host NUMA nodes with, cumulatively, at least + * 2GB of free memory.
> That could be possible by just using one particular + * node, or may
> require more nodes, depending on the characteristics of the + * host, on
> how many domains have been created already, on how big they are, + *
> etc. + * + * The intended usage is as follows: + *  1. by, fist of all,
> calling libxl__get_numa_candidates(), and specifying + *     the proper
> constraints to it (e.g., the amount of memory a domain need + *     as
> the minimum amount of free memory fo the candidates) one can build + *  
>   up a whole set of suitable placing alternatives for a domain; + *  2.
> after that, one specific candidate should be chosen. That can happen + *
>     by looking at their various characteristics; + *  3. the chosen
> candidate''s nodemap should be utilized for computing the + *    
actual
> affinity of the domain which, given the curent NUMA support + *     in
> the hypervisor, is what determines the placement of the domain''s +
*
> vcpus and memory. + * + * To make phase 2 even easier, a sorting helper
> function for the list of + * candidates is povided in the form of
> libxl__sort_numa_candidates(). The only + * that is needed is defining a
> comparison function, containing the chriteria + * for deciding, given
> two candidates, which one is ''better''. Depending on how +
* the
> comparison function is defined, the best candidate (where, of course, +
> * best is defined with respect to the heuristics implemented in the
> comparison + * function itself, libxl__numa_candidate_cmpf()) could
> become the first o the + * last element of the list. + * + *
> Summarizing, achieving automatic NUMA placement is just a matter of + *
> obtaining the list of suitable placement candidates, perhaps asking for
> each + * of them to provide at least the amount of memory the domain
> needs. After + * that just implement a comparison function by means of
> the various helpers + * retrieving the relevant information about the
> candidates themselves. + * Finally, call the sorting helper function and
> use the candidate that became + * (typically) the first element of the
> list fo determining the domain''s + * affinity. + */ + +typedef
struct {
> +    int nr_cpus, nr_nodes; +    int nr_domains; +    uint32_t
> free_memkb; +    libxl_bitmap nodemap; +} libxl__numa_candidate; + +/* +
> * This generates the list of NUMA placement candidates satisfying some +
> * specific conditions. If min_nodes and/or max_nodes are not 0, their
> value is + * used to determine the minimum and maximum number of nodes
> that are allow to + * be present in each candidate. If min_nodes and/or
> max_nodes are 0, the + * minimum and maximum number of nodes to be used
> are automatically selected by + * the implementation (and that will
> likely be just 1 node for the minimum and + * the total number of
> existent nodes for the maximum). Re min_free_memkb and + * min_cpu, if
> not 0, they specify the caller only wants candidates with at + * least
> that amount of free memory and that number of cpus, respectively. If + *
> min_free_memkb and/or min_cpus are 0, the candidates'' free memory
and
> number + * of cpus won''t be checked at all, which means a
candidate will
> always be + * considered suitable wrt the specific constraint.  cndts is
> where the list of + * exactly nr_cndts candidates is returned. Note
> that, in case no candidates + * are found at all, the function returns
> successfully, but with nr_cndts equal + * to zero. + */ +_hidden int
> libxl__get_numa_candidates(libxl__gc *gc, +                             
>   uint32_t min_free_memkb, int min_cpus, +                              
>  int min_nodes, int max_nodes, +                               
> libxl__numa_candidate *cndts[], int *nr_cndts); + +/* allocation and
> deallocation for placement candidates */ +static inline int
> libxl__numa_candidate_alloc(libxl__gc *gc, +                            
>                  libxl__numa_candidate *cndt) +{ +    return
> libxl_node_bitmap_alloc(CTX, &cndt->nodemap); +} +static inline void
> libxl__numa_candidate_dispose(libxl__numa_candidate *cndt) +{ +   
> libxl_bitmap_dispose(&cndt->nodemap); +} +static inline void
> libxl__numacandidate_list_free(libxl__numa_candidate *cndts, +          
>                                        int nr_cndts) +{ +    int i; + + 
>   for (i = 0; i < nr_cndts; i++) +       
> libxl__numa_candidate_dispose(&cndts[i]); +    free(cndts); +} + +/*
> retrieve (in nodemap) the node map associated to placement candidate
> cndt */ +static inline +void libxl__numa_candidate_get_nodemap(libxl__gc
> *gc, +                                       const libxl__numa_candidate
> *cndt, +                                       libxl_bitmap *nodemap) +{
> +    libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap); +} +/* set the
> node map of placement candidate cndt to match nodemap */ +static inline
> +void libxl__numa_candidate_put_nodemap(libxl__gc *gc, +                
>                       libxl__numa_candidate *cndt, +                    
>                   const libxl_bitmap *nodemap) +{ +   
> libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); +} + +/* signature
for
> the comparison function between two candidates c1 and c2 + * (the thid
> parameter is provided to enable thread safety). */ +typedef int
> (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2); +/* sort
> the list of candidates in cndts (an array with nr_cndts elements in + *
> it) using cmpf for comparing two candidates. Uses libc''s qsort().
*/
> +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[],
> +                                         int nr_cndts, +               
>                          libxl__numa_candidate_cmpf cmpf);
> 
>  /*
>   * Inserts "elm_new" into the sorted list "head".
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

George Dunlap

2012-Jun-28 08:36 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Thu, Jun 28, 2012 at 8:25 AM, Zhang, Yang Z <yang.z.zhang@intel.com>
wrote:>> This all happens internally to libxl, and no API for driving the
mechanism is
>> provided for now. This matches what xend already does.
> we should consider the basic IONUMA. I mean when a guest with a device
assigned,
> from the point of view of performance, it''s better to allocated
the memory from the node
> which the device belongs to.
> Furthermore, we need consider the guest NUMA(I am working on it now) and
live migration.
Just to help set the context for these patch series: We have lots of
big plans for NUMA in the 4.3 release (and are always glad for input
and help!), but the current goal is just to have *something* for the
upcoming 4.2 release.

 -George

Dario Faggioli

2012-Jun-28 10:12 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Thu, 2012-06-28 at 07:25 +0000, Zhang, Yang Z wrote:> > This all happens internally to libxl, and no API for driving the
mechanism is
> > provided for now. This matches what xend already does.
> we should consider the basic IONUMA. I mean when a guest with a device
assigned,
> from the point of view of performance, it''s better to allocated
the memory from the node
> which the device belongs to.
>That sure makes sense, and it would be a very useful extension!
> Furthermore, we need consider the guest NUMA(I am working on it now) and
live migration.
> Great. Can you share, either here on or a separate thread some more
details about what you mean by both "guest NUMA" and "live
migration" in
this context?

As George said, that would be 4.3 material, but I''m still very
interested. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Pasi Kärkkäinen

2012-Jun-28 12:41 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Thu, Jun 28, 2012 at 12:12:09PM +0200, Dario Faggioli
wrote:> On Thu, 2012-06-28 at 07:25 +0000, Zhang, Yang Z wrote:
> > > This all happens internally to libxl, and no API for driving the
mechanism is
> > > provided for now. This matches what xend already does.
> > we should consider the basic IONUMA. I mean when a guest with a device
assigned,
> > from the point of view of performance, it''s better to
allocated the memory from the node
> > which the device belongs to.
> >
> That sure makes sense, and it would be a very useful extension!
> 
> > Furthermore, we need consider the guest NUMA(I am working on it now)
and live migration.
> > 
> Great. Can you share, either here on or a separate thread some more
> details about what you mean by both "guest NUMA" and "live
migration" in
> this context?
> 
> As George said, that would be 4.3 material, but I''m still very
> interested. :-)
> 
For the mailinglist archives.. 

here are the Xen guest VM NUMA slides from XenSummit 2010:
http://www.slideshare.net/xen_com_mgr/nakajima-numafinal
http://www.slideshare.net/xen_com_mgr/dulloor-xensummit

-- Pasi

Dario Faggioli

2012-Jun-28 17:03 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Thu, 2012-06-28 at 15:41 +0300, Pasi Kärkkäinen
wrote:> > > Furthermore, we need consider the guest NUMA(I am working on it
now) and live migration.
> > > 
> > Great. Can you share, either here on or a separate thread some more
> > details about what you mean by both "guest NUMA" and
"live migration" in
> > this context?
> > 
> > As George said, that would be 4.3 material, but I''m still
very
> > interested. :-)
> > 
> 
> For the mailinglist archives.. 
> :-)
> here are the Xen guest VM NUMA slides from XenSummit 2010:
> http://www.slideshare.net/xen_com_mgr/nakajima-numafinal
> http://www.slideshare.net/xen_com_mgr/dulloor-xensummit
> Yes, I know this slides, and I''ve also looked at most of the code
they''re referring to. That''s all very interesting and
it''s well
worthwhile to keep that in mind for future steps in the field of NUMA
support. In particular, virtual topology awareness of the guest is
something really important to have, and I''ll be happy to
discuss/work/contribute/help on it in the near future.

However, I''m still not sure I see how that ("guest NUMA") and
live
migration should affect (either now or during 4.3 dev cycle) the initial
placement of a domain... That''s what I was asking.


Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Zhang, Yang Z

2012-Jun-29 05:29 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Dario Faggioli wrote on 2012-06-29:> On Thu, 2012-06-28 at 15:41 +0300, Pasi Kärkkäinen wrote:
>>>> Furthermore, we need consider the guest NUMA(I am working on it
now)
>>>> and live migration.
>>>> 
>>> Great. Can you share, either here on or a separate thread some more
>>> details about what you mean by both "guest NUMA" and
"live migration" in
>>> this context?
>>> 
>>> As George said, that would be 4.3 material, but I'm still very
>>> interested. :-)
>>> 
>> 
>> For the mailinglist archives..
>> 
> :-)
> 
>> here are the Xen guest VM NUMA slides from XenSummit 2010:
>> http://www.slideshare.net/xen_com_mgr/nakajima-numafinal
>> http://www.slideshare.net/xen_com_mgr/dulloor-xensummit
>> 
> Yes, I know this slides, and I've also looked at most of the code
> they're referring to. That's all very interesting and it's well
> worthwhile to keep that in mind for future steps in the field of NUMA
> support. In particular, virtual topology awareness of the guest is
> something really important to have, and I'll be happy to
> discuss/work/contribute/help on it in the near future.
> 
> However, I'm still not sure I see how that ("guest NUMA") and
live
> migration should affect (either now or during 4.3 dev cycle) the initial
> placement of a domain... That's what I was asking.
When doing migration, if the target machine is able to provide the same memory
framework for the guest, we need to consider it firstly.

Best regards,
Yang

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Zhang, Yang Z

2012-Jun-29 05:38 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

George Dunlap wrote on 2012-06-28:> On Thu, Jun 28, 2012 at 8:25 AM, Zhang, Yang Z
<yang.z.zhang@intel.com> wrote:
>>> This all happens internally to libxl, and no API for driving the
mechanism is
>>> provided for now. This matches what xend already does.
>> we should consider the basic IONUMA. I mean when a guest with a device
>> assigned, from the point of view of performance, it''s better
to
>> allocated the memory from the node which the device belongs to.
>> Furthermore, we need consider the guest NUMA(I am working on it now)
and
> live migration.
> 
> Just to help set the context for these patch series: We have lots of
> big plans for NUMA in the 4.3 release (and are always glad for input
> and help!), but the current goal is just to have *something* for the
> upcoming 4.2 release.
Nice to hear you guys have the big plans fo NUMA things. :)

Best regards,
Yang

Dario Faggioli

2012-Jun-29 09:38 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Fri, 2012-06-29 at 05:29 +0000, Zhang, Yang Z wrote: > > However, I''m still not sure I see how that ("guest
NUMA") and live
> > migration should affect (either now or during 4.3 dev cycle) the
initial
> > placement of a domain... That''s what I was asking.
> 
> When doing migration, if the target machine is able to provide the same
memory framework for the guest, we need to consider it firstly.
>Ok, I see, and I think it make sense. Thanks for replying. :-)

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Dario Faggioli

2012-Jun-29 09:46 UTC

head link

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

On Fri, 2012-06-29 at 05:38 +0000, Zhang, Yang Z wrote: > > Just to help set the context for these patch series: We have lots of
> > big plans for NUMA in the 4.3 release (and are always glad for input
> > and help!), but the current goal is just to have *something* for the
> > upcoming 4.2 release.
> 
> Nice to hear you guys have the big plans fo NUMA things. :)
> Yes, we actually do! :-)

So, as you mentioned you''re working on related stuff, do you think it
would be useful to try to coordinate a bit? You know, just to avoid
duplicating the efforts too much, and achieving the best result with
minimum strain...

I think even just opening a new thread on this mailing list would do,
but I''m open to any other solution anyone has in mind. For the records,
I plan to talk about NUMA at and during XenSummit too (although I still
don''t know whether my talk has been accepted or not :-D).

Just let me (again, you and/or anyone else) know your thoughts and
preferences are. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Xen devel - Jun 2012 - [PATCH 00 of 10 v2] Automatic NUMA placement for xl

[PATCH 00 of 10 v2] Automatic NUMA placement for xl

[PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro

[PATCH 02 of 10 v2] libxl: add a new Array type to the IDL

[PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()

[PATCH 04 of 10 v2] xl: add more NUMA information to `xl info -n''

[PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap

[PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit

[PATCH 07 of 10 v2] libxl: introduce some node map helpers

[PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

[PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools

[PATCH 10 of 10 v2] Some automatic NUMA placement documentation

Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation

Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro

Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()

Re: [PATCH 04 of 10 v2] xl: add more NUMA information to `xl info -n''

Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap

Re: [PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit

Re: [PATCH 07 of 10 v2] libxl: introduce some node map helpers

Re: [PATCH 07 of 10 v2] libxl: introduce some node map helpers

Re: [PATCH 06 of 10 v2] libxl: expand the libxl_bitmap API a bit

Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap

Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()

Re: [PATCH 03 of 10 v2] libxl, libxc: introduce libxl_get_numainfo()

Re: [PATCH 05 of 10 v2] libxl: rename libxl_cpumap to libxl_bitmap

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools

Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation

Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools

Re: [PATCH 10 of 10 v2] Some automatic NUMA placement documentation

Re: [PATCH 09 of 10 v2] libxl: have NUMA placement deal with cpupools

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 01 of 10 v2] libxl: fix a typo in the GCREALLOC_ARRAY macro

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes

Re: [PATCH 08 of 10 v2] libxl: enable automatic placement of guests on NUMA nodes