Dario Faggioli
2012-Jul-10 15:03 UTC
[PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl
Hello again, This is a repost of what remained un-committed of my NUMA series. The splat on 1-node-only systems has been resolved as per Ian Campbell's suggestion (thanks!), and that was the only thing I'm aware of that was preventing these last patches from being checked in. While at it, I also addressed all the other observations made during review, namely typos and rewording of some sentences, as well as an improved filtering solution that avoids generating a lot of potentially duplicate placement candidates when cpupools are being used. All the patches have already been acked, although with some minor nits that should be gone now. If you have any further comments, I'm here to answer. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Jul-10 15:03 UTC
[PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes
If a domain does not have a VCPU affinity, try to pin it automatically to some PCPUs. This is done taking into account the NUMA characteristics of the host. In fact, we look for a combination of the host's NUMA nodes with enough free memory and enough PCPUs for the new domain, and pin it to the VCPUs of those nodes. Once we know which ones, among all the possible combinations, represent valid placement candidates for a domain, we use some heuristics for deciding which is the best. For instance, smaller candidates are considered to be better, both from the domain's point of view (less spreading of its memory among nodes) and from the point of view of the system as a whole (less memory fragmentation). In case of candidates of equal sizes (i.e., with the same number of nodes), the amount of free memory and the number of domains already assigned to their nodes are considered. Very often, candidates with a greater amount of free memory are the ones we want, as this is also good for keeping memory fragmentation under control. However, if the difference in the amount of free memory between two candidates is small, the number of assigned domains might be what decides which candidate wins. This all happens internally to libxl, and no API for driving the mechanism is provided for now. This matches what xend already does. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> --- Changes from v3: * Fixed a double free on non-NUMA systems (namely, when just 1 node exists). * Reworded an ambiguous sentence in xl's man page, as requested during review. * Do not special case single node systems in libxl__get_numa_candidates(), to avoid generating a spurious warning on them. Changes from v2: * Lots of typos. * Clarified some comments, as requested during (ijc's) review. * Added some more information/references for the combination generation algorithm. * nodemap_to_nodes_cpus() function renamed to nodemap_to_nr_cpus(). * libxl_bitmap_init() used to make sure we do not try to free random memory on failure paths of functions that allocate a libxl_bitmap. * Always invoke libxl__sort_numa_candidates(), even if we get there with just 1 candidate, as requested during review. * Simplified the if-s that check for input parameter consistency in libxl__get_numa_candidates(), as requested during (gwd's) review. * Comparison function for candidates changed so that it now provides total ordering, as requested during review. It is still using FP arithmetic, though. Also, I think that just putting the difference between the amount of free memory and the difference between the number of assigned domains of two candidates into a single formula (after normalizing and weighting them) is both clear and effective enough. * Function definitions moved to a NUMA specific source file (libxl_numa.c), as suggested during review. Changes from v1: * This patch incorporates the changes from both "libxl, xl: enable automatic placement of guests on NUMA nodes" and "libxl, xl: heuristics for reordering NUMA placement candidates" from v1. * The logic of the algorithm is basically the same as in v1, but its splitting into the various functions has been completely redesigned from scratch. * No public API for placement or candidate generation is now exposed, everything happens within libxl, as agreed during v1 review. * The relevant documentation has been moved near the actual functions and features. Also, the amount and (hopefully!) the quality of the documentation has been improved a lot, as requested. 
* All the comments about using the proper libxl facilities and helpers for allocations, etc., have been considered and applied. * This patch still bails out from NUMA optimizations if it find out cpupools are being utilized. It is next patch that makes the two things interact properly, as suggested during review. diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -111,8 +111,8 @@ created online and the remainder will be =item B<cpus="CPU-LIST"> -List of which cpus the guest is allowed to use. Default behavior is -`all cpus`. A C<CPU-LIST> may be specified as follows: +List of which cpus the guest is allowed to use. By default xl will (via +libxl) pick some cpus (see below). A C<CPU-LIST> may be specified as follows: =over 4 @@ -132,6 +132,12 @@ run on cpu #3 of the host. =back +If this option is not specified, libxl automatically tries to place the new +domain on the host''s NUMA nodes (provided the host has more than one NUMA +node) by pinning it to the cpus of those nodes. A heuristic approach is +utilized with the goals of maximizing performance for the domain and, at +the same time, achieving efficient utilization of the host''s CPUs and RAM. + =item B<cpu_weight=WEIGHT> A domain with a weight of 512 will get twice as much CPU as a domain diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -66,7 +66,7 @@ LIBXL_LIBS += -lyajl -lm LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \ libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \ libxl_internal.o libxl_utils.o libxl_uuid.o \ - libxl_json.o libxl_aoutils.o \ + libxl_json.o libxl_aoutils.o libxl_numa.o \ libxl_save_callout.o _libxl_save_msgs_callout.o \ libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y) LIBXL_OBJS += _libxl_types.o libxl_flask.o _libxl_types_internal.o diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -98,6 +98,106 @@ out: return sched; } +/* Subtract two values and translate the result in [0, 1] */ +static double normalized_diff(double a, double b) +{ +#define max(a, b) (a > b ? a : b) + if (!a && a == b) + return 0.0; + return (a - b) / max(a, b); +} + +/* + * The NUMA placement candidates are reordered according to the following + * heuristics: + * - candidates involving fewer nodes come first. In case two (or + * more) candidates span the same number of nodes, + * - the amount of free memory and the number of domains assigned to the + * candidates are considered. In doing that, candidates with greater + * amount of free memory and fewer domains assigned to them are preferred, + * with free memory "weighting" three times as much as number of domains. + */ +static int numa_cmpf(const void *v1, const void *v2) +{ + const libxl__numa_candidate *c1 = v1; + const libxl__numa_candidate *c2 = v2; +#define sign(a) a > 0 ? 1 : a < 0 ? 
-1 : 0 + double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb); + double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains); + + if (c1->nr_nodes != c2->nr_nodes) + return c1->nr_nodes - c2->nr_nodes; + + return sign(3*freememkb_diff + nrdomains_diff); +} + +/* The actual automatic NUMA placement routine */ +static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) +{ + int nr_candidates = 0; + libxl__numa_candidate *candidates = NULL; + libxl_bitmap candidate_nodemap; + libxl_cpupoolinfo *pinfo; + int nr_pools, rc = 0; + uint32_t memkb; + + libxl_bitmap_init(&candidate_nodemap); + + /* First of all, if cpupools are in use, better not to mess with them */ + pinfo = libxl_list_cpupool(CTX, &nr_pools); + if (!pinfo) + return ERROR_FAIL; + if (nr_pools > 1) { + LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); + goto out; + } + + rc = libxl_domain_need_memory(CTX, info, &memkb); + if (rc) + goto out; + if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) { + rc = ERROR_FAIL; + goto out; + } + + /* Find all the candidates with enough free memory and at least + * as much pcpus as the domain has vcpus. */ + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, + &candidates, &nr_candidates); + if (rc) + goto out; + + LOG(DETAIL, "%d NUMA placement candidates found", nr_candidates); + + /* No suitable placement candidates. We just return without touching the + * domain''s info->cpumap. It will have affinity with all nodes/cpus. */ + if (nr_candidates == 0) { + LOG(NOTICE, "NUMA placement failed, performance might be affected"); + goto out; + } + + /* Bring the best candidate in front of the list --> candidates[0] */ + libxl__sort_numa_candidates(candidates, nr_candidates, numa_cmpf); + + /* + * At this point, the first candidate in the array is the one we want. + * Go for it by mapping its node map to the domain''s info->cpumap. + */ + libxl__numa_candidate_get_nodemap(gc, &candidates[0], &candidate_nodemap); + rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap); + if (rc) + goto out; + + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " + "%"PRIu32" KB free selected", candidates[0].nr_nodes, + candidates[0].nr_cpus, candidates[0].free_memkb / 1024); + + out: + libxl_bitmap_dispose(&candidate_nodemap); + libxl_cpupoolinfo_list_free(pinfo, nr_pools); + return rc; +} + int libxl__build_pre(libxl__gc *gc, uint32_t domid, libxl_domain_build_info *info, libxl__domain_build_state *state) { @@ -107,7 +207,22 @@ int libxl__build_pre(libxl__gc *gc, uint uint32_t rtc_timeoffset; xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); + + /* + * Check if the domain has any CPU affinity. If not, try to build up one. + * In case numa_place_domain() find at least a suitable candidate, it will + * affect info->cpumap accordingly; if it does not, it just leaves it + * as it is. This means (unless some weird error manifests) the subsequent + * call to libxl_set_vcpuaffinity_all() will do the actual placement, + * whatever that turns out to be. 
+ */ + if (libxl_bitmap_is_full(&info->cpumap)) { + int rc = numa_place_domain(gc, info); + if (rc) + return rc; + } libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); + xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); if (info->type == LIBXL_DOMAIN_TYPE_PV) xc_domain_set_memmap_limit(ctx->xch, domid, diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2216,6 +2216,140 @@ static inline void libxl__ctx_unlock(lib #define CTX_LOCK (libxl__ctx_lock(CTX)) #define CTX_UNLOCK (libxl__ctx_unlock(CTX)) +/* + * Automatic NUMA placement + * + * These functions and data structures deal with the initial placement of a + * domain onto the host NUMA nodes. + * + * The key concept here is the one of "NUMA placement candidate", which is + * basically a set of nodes whose characteristics have been successfully + * checked against some specific requirements. More precisely, a candidate is + * the nodemap associated with one of the possible subset of the host NUMA + * nodes providing a certain amount of free memory, or a given number of cpus, + * or even both (depending in what the caller wants). For convenience of use, + * some of this information are stored within the candidate itself, instead of + * always being dynamically computed. A single node is a valid placement + * candidate, but it is also possible for a candidate to contain all the nodes + * of the host. The fewer nodes there are in a candidate, the better + * performance a domain placed onto it should get. For instance, looking for a + * numa candidates with 2GB of free memory means we want all the possible + * subsets of the host NUMA nodes with, cumulatively, at least 2GB of free + * memory. That could be possible by just using one particular node, or may + * require more nodes, depending on the characteristics of the host, on how + * many domains have been created already, on how big they are, etc. + * + * The intended usage is as follows: + * 1. by, fist of all, calling libxl__get_numa_candidates(), and specifying + * the proper constraints to it (e.g., the amount of memory a domain need + * as the minimum amount of free memory for the candidates) one can build + * up a whole set of suitable placing alternatives for a domain; + * 2. after that, one specific candidate should be chosen. That can happen + * by looking at their various characteristics; + * 3. the chosen candidate''s nodemap should be utilized for computing the + * actual affinity of the domain which, given the current NUMA support + * in the hypervisor, is what determines the placement of the domain''s + * vcpus and memory. + * + * To make phase 2 even easier, a sorting helper function for the list of + * candidates is provided in the form of libxl__sort_numa_candidates(). The + * only that is needed is defining a comparison function, containing the + * criteria for deciding, given two candidates, which one is ''better''. + * Depending on how the comparison function is defined, the best candidate + * (where, of course, best is defined with respect to the heuristics + * implemented in the comparison function itself, libxl__numa_candidate_cmpf()) + * could become the first or the last element of the list. + * + * Summarizing, achieving automatic NUMA placement is just a matter of + * obtaining the list of suitable placement candidates, perhaps asking for each + * of them to provide at least the amount of memory the domain needs. 
After + * that just implement a comparison function by means of the various helpers + * retrieving the relevant information about the candidates themselves. + * Finally, call the sorting helper function and use the candidate that became + * (typically) the first element of the list for determining the domain''s + * affinity. + */ + +typedef struct { + int nr_cpus, nr_nodes; + int nr_domains; + uint32_t free_memkb; + libxl_bitmap nodemap; +} libxl__numa_candidate; + +/* + * This generates the list of NUMA placement candidates satisfying some + * specific conditions. If min_nodes and/or max_nodes are not 0, their value is + * used to determine the minimum and maximum number of nodes that are allow to + * be present in each candidate. If min_nodes and/or max_nodes are 0, the + * minimum and maximum number of nodes to be used are automatically selected by + * the implementation (and that will likely be just 1 node for the minimum and + * the total number of existent nodes for the maximum). Re min_free_memkb and + * min_cpu, if not 0, it means the caller only wants candidates with at + * least that amount of free memory and that number of cpus, respectively. If + * min_free_memkb and/or min_cpus are 0, the candidates'' free memory and number + * of cpus won''t be checked at all, which means a candidate will always be + * considered suitable wrt the specific constraint. cndts is where the list of + * exactly nr_cndts candidates is returned. Note that, in case no candidates + * are found at all, the function returns successfully, but with nr_cndts equal + * to zero. + */ +_hidden int libxl__get_numa_candidates(libxl__gc *gc, + uint32_t min_free_memkb, int min_cpus, + int min_nodes, int max_nodes, + libxl__numa_candidate *cndts[], int *nr_cndts); + +/* Initialization, allocation and deallocation for placement candidates */ +static inline void libxl__numa_candidate_init(libxl__numa_candidate *cndt) +{ + cndt->free_memkb = 0; + cndt->nr_cpus = cndt->nr_nodes = cndt->nr_domains = 0; + libxl_bitmap_init(&cndt->nodemap); +} + +static inline int libxl__numa_candidate_alloc(libxl__gc *gc, + libxl__numa_candidate *cndt) +{ + return libxl_node_bitmap_alloc(CTX, &cndt->nodemap, 0); +} +static inline void libxl__numa_candidate_dispose(libxl__numa_candidate *cndt) +{ + libxl_bitmap_dispose(&cndt->nodemap); +} +static inline void libxl__numacandidate_list_free(libxl__numa_candidate *cndts, + int nr_cndts) +{ + int i; + + for (i = 0; i < nr_cndts; i++) + libxl__numa_candidate_dispose(&cndts[i]); + free(cndts); +} + +/* Retrieve (in nodemap) the node map associated to placement candidate cndt */ +static inline +void libxl__numa_candidate_get_nodemap(libxl__gc *gc, + const libxl__numa_candidate *cndt, + libxl_bitmap *nodemap) +{ + libxl_bitmap_copy(CTX, nodemap, &cndt->nodemap); +} +/* Set the node map of placement candidate cndt to match nodemap */ +static inline +void libxl__numa_candidate_put_nodemap(libxl__gc *gc, + libxl__numa_candidate *cndt, + const libxl_bitmap *nodemap) +{ + libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap); +} + +/* Signature for the comparison function between two candidates c1 and c2 */ +typedef int (*libxl__numa_candidate_cmpf)(const void *v1, const void *v2); +/* Sort the list of candidates in cndts (an array with nr_cndts elements in + * it) using cmpf for comparing two candidates. Uses libc''s qsort(). */ +_hidden void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], + int nr_cndts, + libxl__numa_candidate_cmpf cmpf); /* * Inserts "elm_new" into the sorted list "head". 
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c new file mode 100644 --- /dev/null +++ b/tools/libxl/libxl_numa.c @@ -0,0 +1,377 @@ +/* + * Copyright (C) 2012 Citrix Ltd. + * Author Dario Faggioli <dario.faggioli@citrix.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as published + * by the Free Software Foundation; version 2.1 only. with the special + * exception on linking described in file LICENSE. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser General Public License for more details. + */ + +#include "libxl_osdeps.h" /* must come before any other headers */ + +#include <glob.h> + +#include "libxl_internal.h" + +/* + * What follows are helpers for generating all the k-combinations + * without repetitions of a set S with n elements in it. Formally + * speaking, they are subsets of k distinct elements of S and, if + * S is n elements big, the number of k-combinations is equal to + * the binomial coefficient (n k), which means we get exactly + * n!/(k! * (n - k)!) subsets, all of them with k elements. + * + * The various subset are generated one after the other by calling + * comb_init() first, and, after that, comb_next() + * (n k)-1 times. An iterator is used to store the current status + * of the whole generation operation (i.e., basically, the last + * combination that has been generated). As soon as all + * combinations have been generated, comb_next() will + * start returning 0 instead of 1. It is of course important that + * the same instance of the iterator and the same values for + * n and k are used for each call. If that doesn''t happen, the + * result is unspecified. + * + * The algorithm is a well known one (see, for example, D. Knuth''s "The + * Art of Computer Programming - Volume 4, Fascicle 3" and it produces + * the combinations in such a way that they (well, more precisely, + * their indexes it the array/map representing the set) come with + * lexicographic ordering. + * + * For example, with n = 5 and k = 3, calling comb_init() + * will generate { 0, 1, 2 }, while subsequent valid calls to + * comb_next() will produce the following: + * { { 0, 1, 3 }, { 0, 1, 4 }, + * { 0, 2, 3 }, { 0, 2, 4 }, { 0, 3, 4 }, + * { 1, 2, 3 }, { 1, 2, 4 }, { 1, 3, 4 }, + * { 2, 3, 4 } } + * + * This is used by the automatic NUMA placement logic below. + */ +typedef int* comb_iter_t; + +static int comb_init(libxl__gc *gc, comb_iter_t *it, int n, int k) +{ + comb_iter_t new_iter; + int i; + + if (n < k) + return 0; + + /* First set is always { 0, 1, 2, ..., k-1 } */ + GCNEW_ARRAY(new_iter, k); + for (i = 0; i < k; i++) + new_iter[i] = i; + + *it = new_iter; + return 1; +} + +static int comb_next(comb_iter_t it, int n, int k) +{ + int i; + + /* + * The idea here is to find the leftmost element from where + * we should start incrementing the indexes of the iterator. + * This means looking for the highest index that can be increased + * while still producing value smaller than n-1. In the example + * above, when dealing with { 0, 1, 4 }, such an element is the + * second one, as the third is already equal to 4 (which actually + * is n-1). 
+ * Once we found from where to start, we increment that element + * and override the right-hand rest of the iterator with its + * successors, thus achieving lexicographic ordering. + * + * Regarding the termination of the generation process, when we + * manage in bringing n-k at the very first position of the iterator, + * we know that is the last valid combination ( { 2, 3, 4 }, with + * n - k = 5 - 2 = 2, in the example above), and thus we start + * returning 0 as soon as we cross that border. + */ + for (i = k - 1; it[i] == n - k + i; i--) { + if (i <= 0) + return 0; + } + for (it[i]++, i++; i < k; i++) + it[i] = it[i - 1] + 1; + return 1; +} + +/* NUMA automatic placement (see libxl_internal.h for details) */ + +/* + * This function turns a k-combination iterator into a node map. + * This means the bits in the node map corresponding to the indexes + * of the given combination are the ones that will be set. + * For example, if the iterator represents the combination { 0, 2, 4}, + * the node map will have bits #0, #2 and #4 set. + */ +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k) +{ + int i; + + libxl_bitmap_set_none(nodemap); + for (i = 0; i < k; i++) + libxl_bitmap_set(nodemap, it[i]); +} + +/* Retrieve the number of cpus that the nodes that are part of the nodemap + * span. */ +static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus, + const libxl_bitmap *nodemap) +{ + int i, nodes_cpus = 0; + + for (i = 0; i < nr_cpus; i++) { + if (libxl_bitmap_test(nodemap, tinfo[i].node)) + nodes_cpus++; + } + return nodes_cpus; +} + +/* Retrieve the amount of free memory within the nodemap */ +static uint32_t nodemap_to_free_memkb(libxl_numainfo *ninfo, + libxl_bitmap *nodemap) +{ + uint32_t free_memkb = 0; + int i; + + libxl_for_each_set_bit(i, *nodemap) + free_memkb += ninfo[i].free / 1024; + + return free_memkb; +} + +/* Retrieve the number of domains that can potentially run on the cpus + * the nodes that are part of the nodemap. */ +static int nodemap_to_nr_domains(libxl__gc *gc, libxl_cputopology *tinfo, + const libxl_bitmap *nodemap) +{ + libxl_dominfo *dinfo = NULL; + libxl_bitmap dom_nodemap; + int nr_doms, nr_cpus; + int nr_domains = 0; + int i, j, k; + + dinfo = libxl_list_domain(CTX, &nr_doms); + if (dinfo == NULL) + return ERROR_FAIL; + + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { + libxl_dominfo_list_free(dinfo, nr_doms); + return ERROR_FAIL; + } + + for (i = 0; i < nr_doms; i++) { + libxl_vcpuinfo *vinfo; + int nr_vcpus; + + vinfo = libxl_list_vcpu(CTX, dinfo[i].domid, &nr_vcpus, &nr_cpus); + if (vinfo == NULL) + continue; + + libxl_bitmap_set_none(&dom_nodemap); + for (j = 0; j < nr_vcpus; j++) { + libxl_for_each_set_bit(k, vinfo[j].cpumap) + libxl_bitmap_set(&dom_nodemap, tinfo[k].node); + } + + libxl_for_each_set_bit(j, dom_nodemap) { + if (libxl_bitmap_test(nodemap, j)) { + nr_domains++; + break; + } + } + + libxl_vcpuinfo_list_free(vinfo, nr_vcpus); + } + + libxl_bitmap_dispose(&dom_nodemap); + libxl_dominfo_list_free(dinfo, nr_doms); + return nr_domains; +} + +/* + * This function tries to figure out if the host has a consistent number + * of cpus along all its NUMA nodes. In fact, if that is the case, we can + * calculate the minimum number of nodes needed for a domain by just + * dividing its total number of vcpus by this value computed here. + * However, we are not allowed to assume that all the nodes have the + * same number of cpus. 
Therefore, in case discrepancies among different + * nodes are found, this function just returns 0, for the caller to know + * it shouldn''t rely on this ''optimization'', and sort out things in some + * other way (most likely, just start trying with candidates with just + * one node). + */ +static int cpus_per_node_count(libxl_cputopology *tinfo, int nr_cpus, + libxl_numainfo *ninfo, int nr_nodes) +{ + int cpus_per_node = 0; + int j, i; + + /* This makes sense iff # of PCPUs is the same for all nodes */ + for (j = 0; j < nr_nodes; j++) { + int curr_cpus = 0; + + for (i = 0; i < nr_cpus; i++) { + if (tinfo[i].node == j) + curr_cpus++; + } + /* So, if the above does not hold, turn the whole thing off! */ + cpus_per_node = cpus_per_node == 0 ? curr_cpus : cpus_per_node; + if (cpus_per_node != curr_cpus) + return 0; + } + return cpus_per_node; +} + +/* Get all the placement candidates satisfying some specific conditions */ +int libxl__get_numa_candidates(libxl__gc *gc, + uint32_t min_free_memkb, int min_cpus, + int min_nodes, int max_nodes, + libxl__numa_candidate *cndts[], int *nr_cndts) +{ + libxl__numa_candidate *new_cndts = NULL; + libxl_cputopology *tinfo = NULL; + libxl_numainfo *ninfo = NULL; + int nr_nodes = 0, nr_cpus = 0; + libxl_bitmap nodemap; + int array_size, rc; + + libxl_bitmap_init(&nodemap); + + /* Get platform info and prepare the map for testing the combinations */ + ninfo = libxl_get_numainfo(CTX, &nr_nodes); + if (ninfo == NULL) + return ERROR_FAIL; + + tinfo = libxl_get_cpu_topology(CTX, &nr_cpus); + if (tinfo == NULL) { + rc = ERROR_FAIL; + goto out; + } + + rc = libxl_node_bitmap_alloc(CTX, &nodemap, 0); + if (rc) + goto out; + + /* + * If the minimum number of NUMA nodes is not explicitly specified + * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes + * from where to start generating candidates, if possible (or just start + * from 1 otherwise). The maximum number of nodes should not exceed the + * number of existent NUMA nodes on the host, or the candidate generation + * won''t work properly. + */ + if (!min_nodes) { + int cpus_per_node; + + cpus_per_node = cpus_per_node_count(tinfo, nr_cpus, ninfo, nr_nodes); + if (cpus_per_node == 0) + min_nodes = 1; + else + min_nodes = (min_cpus + cpus_per_node - 1) / cpus_per_node; + } + if (min_nodes > nr_nodes) + min_nodes = nr_nodes; + if (!max_nodes || max_nodes > nr_nodes) + max_nodes = nr_nodes; + if (min_nodes > max_nodes) { + rc = ERROR_INVAL; + goto out; + } + + /* Initialize the local storage for the combinations */ + *nr_cndts = 0; + array_size = nr_nodes; + GCNEW_ARRAY(new_cndts, array_size); + + /* Generate all the combinations of any size from min_nodes to + * max_nodes (see comb_init() and comb_next()). */ + while (min_nodes <= max_nodes) { + comb_iter_t comb_iter; + int comb_ok; + + /* + * And here it is. Each step of this cycle generates a combination of + * nodes as big as min_nodes mandates. Each of these combinations is + * checked against the constraints provided by the caller (namely, + * amount of free memory and number of cpus) and it becomes an actual + * placement candidate iff it passes the check. + */ + for (comb_ok = comb_init(gc, &comb_iter, nr_nodes, min_nodes); comb_ok; + comb_ok = comb_next(comb_iter, nr_nodes, min_nodes)) { + uint32_t nodes_free_memkb; + int nodes_cpus; + + comb_get_nodemap(comb_iter, &nodemap, min_nodes); + + /* If there is not enough memory in this combination, skip it + * and go generating the next one... 
*/ + nodes_free_memkb = nodemap_to_free_memkb(ninfo, &nodemap); + if (min_free_memkb && nodes_free_memkb < min_free_memkb) + continue; + + /* And the same applies if this combination is short in cpus */ + nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap); + if (min_cpus && nodes_cpus < min_cpus) + continue; + + /* + * Conditions are met, we can add this combination to the + * NUMA placement candidates list. We first make sure there + * is enough space in there, and then we initialize the new + * candidate element with the node map corresponding to the + * combination we are dealing with. Memory allocation for + * expanding the array that hosts the list happens in chunks + * equal to the number of NUMA nodes in the system (to + * avoid allocating memory each and every time we find a + * new candidate). + */ + if (*nr_cndts == array_size) + array_size += nr_nodes; + GCREALLOC_ARRAY(new_cndts, array_size); + + libxl__numa_candidate_alloc(gc, &new_cndts[*nr_cndts]); + libxl__numa_candidate_put_nodemap(gc, &new_cndts[*nr_cndts], + &nodemap); + new_cndts[*nr_cndts].nr_domains + nodemap_to_nr_domains(gc, tinfo, &nodemap); + new_cndts[*nr_cndts].free_memkb = nodes_free_memkb; + new_cndts[*nr_cndts].nr_nodes = min_nodes; + new_cndts[*nr_cndts].nr_cpus = nodes_cpus; + + LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, " + "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts, + min_nodes, new_cndts[*nr_cndts].nr_cpus, + new_cndts[*nr_cndts].free_memkb / 1024); + + (*nr_cndts)++; + } + min_nodes++; + } + + *cndts = new_cndts; + out: + libxl_bitmap_dispose(&nodemap); + libxl_cputopology_list_free(tinfo, nr_cpus); + libxl_numainfo_list_free(ninfo, nr_nodes); + return rc; +} + +void libxl__sort_numa_candidates(libxl__numa_candidate cndts[], int nr_cndts, + libxl__numa_candidate_cmpf cmpf) +{ + /* Reorder candidates (see the comparison function for + * the details on the heuristics) */ + qsort(cndts, nr_cndts, sizeof(cndts[0]), cmpf); +} + +
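A stand-alone sketch of the comb_init()/comb_next() iterator described in the comments above, assuming plain malloc() in place of libxl's GCNEW_ARRAY(); the main() driver and output formatting are illustrative only. It prints the C(5,3) = 10 subsets in the lexicographic order the patch comment uses as its example:

#include <stdio.h>
#include <stdlib.h>

typedef int *comb_iter_t;

/* First combination is always { 0, 1, ..., k-1 }; returns 0 if n < k. */
static int comb_init(comb_iter_t *it, int n, int k)
{
    int i;

    if (n < k)
        return 0;
    *it = malloc(k * sizeof(**it));
    if (!*it)
        return 0;
    for (i = 0; i < k; i++)
        (*it)[i] = i;
    return 1;
}

/* Advance to the next combination in lexicographic order; 0 when done. */
static int comb_next(comb_iter_t it, int n, int k)
{
    int i;

    /* Find the rightmost index that can still be incremented... */
    for (i = k - 1; it[i] == n - k + i; i--) {
        if (i <= 0)
            return 0;
    }
    /* ...bump it and rewrite everything to its right with its successors. */
    for (it[i]++, i++; i < k; i++)
        it[i] = it[i - 1] + 1;
    return 1;
}

int main(void)
{
    comb_iter_t it = NULL;
    int i, ok, n = 5, k = 3;

    /* Prints { 0 1 2 }, { 0 1 3 }, { 0 1 4 }, ..., { 2 3 4 }. */
    for (ok = comb_init(&it, n, k); ok; ok = comb_next(it, n, k)) {
        printf("{ ");
        for (i = 0; i < k; i++)
            printf("%d ", it[i]);
        printf("}\n");
    }
    free(it);
    return 0;
}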
Dario Faggioli
2012-Jul-10 15:03 UTC
[PATCH 2 of 3 v4/leftover] libxl: have NUMA placement deal with cpupools
In such a way that only the cpus belonging to the cpupool of the domain being placed are considered for the placement itself. This happens by filtering out all the nodes in which the cpupool has not any cpu from the placement candidates. After that -- as a cpu pooling not necessarily happens at NUMA nodes boundaries -- we also make sure only the actual cpus that are part of the pool are considered when counting how much processors a placement candidate provides. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> --- Changes from v3: * fixed a double free on non-NUMA systems (namely, when just 1 node exists). * Changed the filtering logic: now only the meaningful combinations and nodemap are being considered. Changes from v2: * fixed typos in comments. diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -132,25 +132,29 @@ static int numa_cmpf(const void *v1, con } /* The actual automatic NUMA placement routine */ -static int numa_place_domain(libxl__gc *gc, libxl_domain_build_info *info) +static int numa_place_domain(libxl__gc *gc, uint32_t domid, + libxl_domain_build_info *info) { int nr_candidates = 0; libxl__numa_candidate *candidates = NULL; libxl_bitmap candidate_nodemap; - libxl_cpupoolinfo *pinfo; - int nr_pools, rc = 0; + libxl_cpupoolinfo cpupool_info; + int i, cpupool, rc = 0; uint32_t memkb; libxl_bitmap_init(&candidate_nodemap); - /* First of all, if cpupools are in use, better not to mess with them */ - pinfo = libxl_list_cpupool(CTX, &nr_pools); - if (!pinfo) - return ERROR_FAIL; - if (nr_pools > 1) { - LOG(NOTICE, "skipping NUMA placement as cpupools are in use"); - goto out; - } + /* + * Extract the cpumap from the cpupool the domain belong to. In fact, + * it only makes sense to consider the cpus/nodes that are in there + * for placement. + */ + rc = cpupool = libxl__domain_cpupool(gc, domid); + if (rc < 0) + return rc; + rc = libxl_cpupool_info(CTX, &cpupool_info, cpupool); + if (rc) + return rc; rc = libxl_domain_need_memory(CTX, info, &memkb); if (rc) @@ -162,7 +166,8 @@ static int numa_place_domain(libxl__gc * /* Find all the candidates with enough free memory and at least * as much pcpus as the domain has vcpus. */ - rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, 0, 0, + rc = libxl__get_numa_candidates(gc, memkb, info->max_vcpus, + 0, 0, &cpupool_info.cpumap, &candidates, &nr_candidates); if (rc) goto out; @@ -188,13 +193,20 @@ static int numa_place_domain(libxl__gc * if (rc) goto out; + /* Avoid trying to set the affinity to cpus that might be in the + * nodemap but not in our cpupool. */ + libxl_for_each_set_bit(i, info->cpumap) { + if (!libxl_bitmap_test(&cpupool_info.cpumap, i)) + libxl_bitmap_reset(&info->cpumap, i); + } + LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " "%"PRIu32" KB free selected", candidates[0].nr_nodes, candidates[0].nr_cpus, candidates[0].free_memkb / 1024); out: libxl_bitmap_dispose(&candidate_nodemap); - libxl_cpupoolinfo_list_free(pinfo, nr_pools); + libxl_cpupoolinfo_dispose(&cpupool_info); return rc; } @@ -217,7 +229,7 @@ int libxl__build_pre(libxl__gc *gc, uint * whatever that turns out to be. 
*/ if (libxl_bitmap_is_full(&info->cpumap)) { - int rc = numa_place_domain(gc, info); + int rc = numa_place_domain(gc, domid, info); if (rc) return rc; } diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h --- a/tools/libxl/libxl_internal.h +++ b/tools/libxl/libxl_internal.h @@ -2289,14 +2289,17 @@ typedef struct { * least that amount of free memory and that number of cpus, respectively. If * min_free_memkb and/or min_cpus are 0, the candidates'' free memory and number * of cpus won''t be checked at all, which means a candidate will always be - * considered suitable wrt the specific constraint. cndts is where the list of - * exactly nr_cndts candidates is returned. Note that, in case no candidates - * are found at all, the function returns successfully, but with nr_cndts equal - * to zero. + * considered suitable wrt the specific constraint. suitable_cpumap is useful + * for specifying we want only the cpus in that mask to be considered while + * generating placement candidates (for example because of cpupools). cndts is + * where the list of exactly nr_cndts candidates is returned. Note that, in + * case no candidates are found at all, the function returns successfully, but + * with nr_cndts equal to zero. */ _hidden int libxl__get_numa_candidates(libxl__gc *gc, uint32_t min_free_memkb, int min_cpus, int min_nodes, int max_nodes, + const libxl_bitmap *suitable_cpumap, libxl__numa_candidate *cndts[], int *nr_cndts); /* Initialization, allocation and deallocation for placement candidates */ diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -107,30 +107,48 @@ static int comb_next(comb_iter_t it, int /* NUMA automatic placement (see libxl_internal.h for details) */ /* - * This function turns a k-combination iterator into a node map. - * This means the bits in the node map corresponding to the indexes - * of the given combination are the ones that will be set. - * For example, if the iterator represents the combination { 0, 2, 4}, - * the node map will have bits #0, #2 and #4 set. + * This function turns a k-combination iterator into a node map, + * given another map, telling us which nodes should be considered. + * + * This means the bits that are set in suitable_nodemap and that + * corresponds to the indexes of the given combination are the ones + * that will be set in nodemap. + * + * For example, given a fully set suitable_nodemap, if the iterator + * represents the combination { 0, 2, 4}, nodmeap will have bits #0, + * #2 and #4 set. + * On the other hand, if, say, suitable_nodemap=01011011, the same + * iterator will cause bits #1, #4 and #7 of nodemap to be set. */ -static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *nodemap, int k) +static void comb_get_nodemap(comb_iter_t it, libxl_bitmap *suitable_nodemap, + libxl_bitmap *nodemap, int k) { - int i; + int i, m = 0, n = 0; libxl_bitmap_set_none(nodemap); - for (i = 0; i < k; i++) - libxl_bitmap_set(nodemap, it[i]); + libxl_for_each_set_bit(i, *suitable_nodemap) { + /* Check wether the n-th set bit of suitable_nodemap + * matches with the m-th element of the iterator (and, + * only if it does, advance to the next one) */ + if (m < k && n == it[m]) { + libxl_bitmap_set(nodemap, i); + m++; + } + n++; + } } /* Retrieve the number of cpus that the nodes that are part of the nodemap - * span. */ + * span and are also set in suitable_cpumap. 
*/ static int nodemap_to_nr_cpus(libxl_cputopology *tinfo, int nr_cpus, + const libxl_bitmap *suitable_cpumap, const libxl_bitmap *nodemap) { int i, nodes_cpus = 0; for (i = 0; i < nr_cpus; i++) { - if (libxl_bitmap_test(nodemap, tinfo[i].node)) + if (libxl_bitmap_test(suitable_cpumap, i) && + libxl_bitmap_test(nodemap, tinfo[i].node)) nodes_cpus++; } return nodes_cpus; @@ -236,15 +254,17 @@ static int cpus_per_node_count(libxl_cpu int libxl__get_numa_candidates(libxl__gc *gc, uint32_t min_free_memkb, int min_cpus, int min_nodes, int max_nodes, + const libxl_bitmap *suitable_cpumap, libxl__numa_candidate *cndts[], int *nr_cndts) { libxl__numa_candidate *new_cndts = NULL; libxl_cputopology *tinfo = NULL; libxl_numainfo *ninfo = NULL; int nr_nodes = 0, nr_cpus = 0; - libxl_bitmap nodemap; + libxl_bitmap suitable_nodemap, nodemap; int array_size, rc; + libxl_bitmap_init(&suitable_nodemap); libxl_bitmap_init(&nodemap); /* Get platform info and prepare the map for testing the combinations */ @@ -262,6 +282,15 @@ int libxl__get_numa_candidates(libxl__gc if (rc) goto out; + /* Allocate and prepare the map of the node that can be utilized for + * placement, basing on the map of suitable cpus. */ + rc = libxl_node_bitmap_alloc(CTX, &suitable_nodemap, 0); + if (rc) + goto out; + rc = libxl_cpumap_to_nodemap(CTX, suitable_cpumap, &suitable_nodemap); + if (rc) + goto out; + /* * If the minimum number of NUMA nodes is not explicitly specified * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes @@ -287,6 +316,13 @@ int libxl__get_numa_candidates(libxl__gc rc = ERROR_INVAL; goto out; } + /* We also need to be sure we do not exceed the number of + * nodes we are allowed to use. */ + if (libxl_bitmap_count_set(&suitable_nodemap) < max_nodes) { + max_nodes = libxl_bitmap_count_set(&suitable_nodemap); + if (min_nodes > max_nodes) + min_nodes = max_nodes; + } /* Initialize the local storage for the combinations */ *nr_cndts = 0; @@ -311,7 +347,10 @@ int libxl__get_numa_candidates(libxl__gc uint32_t nodes_free_memkb; int nodes_cpus; - comb_get_nodemap(comb_iter, &nodemap, min_nodes); + /* Get the nodemap for the combination, only considering + * suitable nodes. */ + comb_get_nodemap(comb_iter, &suitable_nodemap, + &nodemap, min_nodes); /* If there is not enough memory in this combination, skip it * and go generating the next one... 
*/ @@ -320,7 +359,8 @@ int libxl__get_numa_candidates(libxl__gc continue; /* And the same applies if this combination is short in cpus */ - nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, &nodemap); + nodes_cpus = nodemap_to_nr_cpus(tinfo, nr_cpus, suitable_cpumap, + &nodemap); if (min_cpus && nodes_cpus < min_cpus) continue; @@ -345,12 +385,13 @@ int libxl__get_numa_candidates(libxl__gc new_cndts[*nr_cndts].nr_domains nodemap_to_nr_domains(gc, tinfo, &nodemap); new_cndts[*nr_cndts].free_memkb = nodes_free_memkb; - new_cndts[*nr_cndts].nr_nodes = min_nodes; + new_cndts[*nr_cndts].nr_nodes = libxl_bitmap_count_set(&nodemap); new_cndts[*nr_cndts].nr_cpus = nodes_cpus; LOG(DEBUG, "NUMA placement candidate #%d found: nr_nodes=%d, " "nr_cpus=%d, free_memkb=%"PRIu32"", *nr_cndts, - min_nodes, new_cndts[*nr_cndts].nr_cpus, + new_cndts[*nr_cndts].nr_nodes, + new_cndts[*nr_cndts].nr_cpus, new_cndts[*nr_cndts].free_memkb / 1024); (*nr_cndts)++; @@ -360,6 +401,7 @@ int libxl__get_numa_candidates(libxl__gc *cndts = new_cndts; out: + libxl_bitmap_dispose(&suitable_nodemap); libxl_bitmap_dispose(&nodemap); libxl_cputopology_list_free(tinfo, nr_cpus); libxl_numainfo_list_free(ninfo, nr_nodes);
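The index-to-set-bit translation that the reworked comb_get_nodemap() performs can be shown in isolation. The sketch below is not libxl code: a plain unsigned mask stands in for libxl_bitmap (bit 0 meaning node 0) and comb_to_mask() is a hypothetical helper, but the logic is the same: the k-combination indexes select only among the set bits of the "suitable" mask.

#include <stdio.h>

/* Map a k-combination of indexes into the *set bits* of 'suitable'
 * onto a node mask: the m-th selected index picks the m-th set bit. */
static unsigned comb_to_mask(unsigned suitable, const int *it, int k)
{
    unsigned out = 0;
    int bit, m = 0, n = 0;

    for (bit = 0; bit < (int)(8 * sizeof(suitable)); bit++) {
        if (!(suitable & (1u << bit)))
            continue;
        /* 'bit' is the n-th set bit of 'suitable': take it iff the
         * combination selects position n. */
        if (m < k && n == it[m]) {
            out |= 1u << bit;
            m++;
        }
        n++;
    }
    return out;
}

int main(void)
{
    unsigned suitable = 0x5b;   /* nodes 0, 1, 3, 4 and 6 are suitable */
    int it[] = { 0, 2, 4 };     /* pick the 1st, 3rd and 5th suitable node */

    /* Prints 0x49, i.e. nodes 0, 3 and 6 selected. */
    printf("nodemap = 0x%02x\n", comb_to_mask(suitable, it, 3));
    return 0;
}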
Dario Faggioli
2012-Jul-10 15:03 UTC
[PATCH 3 of 3 v4/leftover] Some automatic NUMA placement documentation
About rationale, usage and (some small bits of) API. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> --- Changes from v3: * typos and rewording of some sentences, as suggested during review. Changes from v1: * API documentation moved close to the actual functions. diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown new file mode 100644 --- /dev/null +++ b/docs/misc/xl-numa-placement.markdown @@ -0,0 +1,89 @@ +# Guest Automatic NUMA Placement in libxl and xl # + +## Rationale ## + +NUMA means the memory access times of a program running on a CPU depend on +the relative distance between that CPU and that memory. In fact, most of the +NUMA systems are built in such a way that each processor has its local memory, +on which it can operate very fast. On the other hand, getting and storing data +from and on remote memory (that is, memory local to some other processor) is +much more complex and slow. On these machines, a NUMA node is usually defined +as a set of processor cores (typically a physical CPU package) and the memory +directly attached to the set of cores. + +The Xen hypervisor deals with Non-Uniform Memory Access (NUMA) machines by +assigning to its domains a "node affinity", i.e., a set of NUMA nodes of the +host from which they get their memory allocated. + +NUMA awareness becomes very important as soon as many domains start running +memory-intensive workloads on a shared host. In fact, the cost of accessing non +node-local memory locations is very high, and the performance degradation is +likely to be noticeable. + +## Guest Placement in xl ## + +If using xl for creating and managing guests, it is very easy to ask for either +manual or automatic placement of them across the host's NUMA nodes. + +Note that xm/xend does the very same thing, the only differences residing in +the details of the heuristics adopted for the placement (see below). + +### Manual Guest Placement with xl ### + +Thanks to the "cpus=" option, it is possible to specify where a domain should +be created and scheduled on, directly in its config file. This affects NUMA +placement and memory accesses, as the hypervisor constructs the node affinity of +a VM based right on its CPU affinity when it is created. + +This is very simple and effective, but requires the user/system administrator +to explicitly specify affinities for each and every domain, or Xen won't be +able to guarantee the locality of their memory accesses. + +It is also possible to deal with NUMA by partitioning the system using cpupools +(available in the upcoming release of Xen, 4.2). Again, this could be "The +Right Answer" for many needs and occasions, but has to be carefully +considered and manually set up. + +### Automatic Guest Placement with xl ### + +If no "cpus=" option is specified in the config file, libxl tries to figure out +on its own on which node(s) the domain could fit best. It is worthwhile noting +that optimally fitting a set of VMs on the NUMA nodes of a host is an +incarnation of the Bin Packing Problem. In fact, the various VMs with different +memory sizes are the items to be packed, and the host nodes are the bins. That +is known to be NP-hard; we will therefore be using some heuristics. + +The first thing to do is finding a node, or even a set of nodes, that have +enough free memory and enough physical CPUs for accommodating the new +domain. 
The idea is to find a spot for the domain with at least as much free +memory as it has configured, and as many pCPUs as it has vCPUs. After that, +the actual decision on which solution to go for happens according to the +following heuristics: + + * candidates involving fewer nodes come first. In case two (or more) + candidates span the same number of nodes, + * the amount of free memory and the number of domains assigned to the + candidates are considered. In doing that, candidates with a greater amount + of free memory and fewer assigned domains are preferred, with free memory + "weighting" three times as much as the number of domains. + +Giving preference to candidates with fewer nodes ensures better performance for +the guest, as it avoids spreading its memory among different nodes. Favouring +the nodes that have the largest amounts of free memory helps keep memory +fragmentation small, from a system wide perspective. However, if more +candidates fulfil these criteria to roughly the same extent, considering the number +of domains the candidates are "hosting" helps balance the load on the various +nodes. + +## Guest Placement within libxl ## + +xl achieves automatic NUMA placement just because libxl does it internally. +No API is provided (yet) for interacting with this feature and modifying +the library behaviour regarding automatic placement; it just happens +by default if no affinity is specified (as it is with xm/xend). + +For actually looking at and maybe tweaking the mechanism and the algorithms it +uses, everything is implemented as a set of libxl internal interfaces and facilities. +Look at the comment "Automatic NUMA placement" in libxl\_internal.h. + +Note this may change in future versions of Xen/libxl.
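The ordering heuristics listed above map quite directly onto a qsort() comparison function. The following is a stand-alone sketch of that idea, mirroring the numa_cmpf()/normalized_diff() pair from patch 1 but using a hypothetical struct candidate in place of libxl__numa_candidate: fewer nodes win outright, and ties are broken by free memory (weighted three times) and number of assigned domains.

#include <stdio.h>
#include <stdlib.h>

struct candidate {
    int nr_nodes, nr_domains;
    unsigned long free_memkb;
};

/* Map the difference between two quantities into [-1, 1], so that
 * free memory and domain counts can be mixed in a single formula. */
static double normalized_diff(double a, double b)
{
    double m = a > b ? a : b;

    if (!a && a == b)
        return 0.0;
    return (a - b) / m;
}

/* Candidates with fewer nodes come first; among equally sized ones,
 * more free memory and fewer assigned domains are preferred. */
static int candidate_cmp(const void *v1, const void *v2)
{
    const struct candidate *c1 = v1, *c2 = v2;
    double memdiff = normalized_diff(c2->free_memkb, c1->free_memkb);
    double domdiff = normalized_diff(c1->nr_domains, c2->nr_domains);
    double d;

    if (c1->nr_nodes != c2->nr_nodes)
        return c1->nr_nodes - c2->nr_nodes;

    d = 3 * memdiff + domdiff;
    return d > 0 ? 1 : d < 0 ? -1 : 0;
}

int main(void)
{
    struct candidate cndts[] = {
        { 2, 1, 4u << 20 },   /* 2 nodes, 1 domain,  4 GB free */
        { 1, 3, 2u << 20 },   /* 1 node,  3 domains, 2 GB free */
        { 1, 0, 2u << 20 },   /* 1 node,  0 domains, 2 GB free */
    };

    qsort(cndts, 3, sizeof(cndts[0]), candidate_cmp);

    /* Prints "1 nodes, 0 domains": the idle single-node candidate wins. */
    printf("%d nodes, %d domains\n", cndts[0].nr_nodes, cndts[0].nr_domains);
    return 0;
}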
Dario Faggioli
2012-Jul-16 17:03 UTC
Re: [PATCH 0 of 3 v4/leftover] Automatic NUMA placement for xl
On Tue, 2012-07-10 at 17:03 +0200, Dario Faggioli wrote:> Hello again, >Hi, This seems to not have been considered yet... If that is so, I'd like to post a new version, in which I fix a small bug that is present in this version (patch 2/3). Please consider that repost for review or inclusion (subject will be "[... v5/leftover] Automatic NUMA placement for xl"). Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2012-Jul-17 15:55 UTC
Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes
Dario Faggioli writes ("[PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"):> If a domain does not have a VCPU affinity, try to pin it automatically to some > PCPUs. This is done taking into account the NUMA characteristics of the host. > In fact, we look for a combination of host''s NUMA nodes with enough free memory > and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes.Thanks for this admirably clear patch. Can you please rewrap your commit messages to around 70 lines ? Many VCSs indent them in the log in some situations, and as you see here mail programs indent them when quoting too. Indeed I would generally prefer to wrap code to ~75 rather than 80 but I know that''s a bit controversial.> +/* Subtract two values and translate the result in [0, 1] */ > +static double normalized_diff(double a, double b) > +{ > +#define max(a, b) (a > b ? a : b) > + if (!a && a == b) > + return 0.0; > + return (a - b) / max(a, b); > +}1. This macro max() should be in libxl_internal.h. 2. It should be MAX so people are warned it''s a macro 3. It should have all the necessary ()s for macro precedence safety> +static int numa_cmpf(const void *v1, const void *v2) > +{ > + const libxl__numa_candidate *c1 = v1; > + const libxl__numa_candidate *c2 = v2; > +#define sign(a) a > 0 ? 1 : a < 0 ? -1 : 0Again.> + double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb); > + double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains); > + > + if (c1->nr_nodes != c2->nr_nodes) > + return c1->nr_nodes - c2->nr_nodes; > + > + return sign(3*freememkb_diff + nrdomains_diff);The reason you need what sign() does is that you need to convert from double to int, I guess.> + > + /* > + * Check if the domain has any CPU affinity. If not, try to build up one. > + * In case numa_place_domain() find at least a suitable candidate, it will > + * affect info->cpumap accordingly; if it does not, it just leaves it > + * as it is. This means (unless some weird error manifests) the subsequent > + * call to libxl_set_vcpuaffinity_all() will do the actual placement, > + * whatever that turns out to be. > + */ > + if (libxl_bitmap_is_full(&info->cpumap)) { > + int rc = numa_place_domain(gc, info); > + if (rc) > + return rc; > + }I guess it would be preferable to do this only if the bitmap was full by default, so that setting the bitmap explicitly to all cpus still works. I''m not sure that that''s essential to have, though.> + /* > + * Conditions are met, we can add this combination to the > + * NUMA placement candidates list. We first make sure there > + * is enough space in there, and then we initialize the new > + * candidate element with the node map corresponding to the > + * combination we are dealing with. Memory allocation for > + * expanding the array that hosts the list happens in chunks > + * equal to the number of NUMA nodes in the system (to > + * avoid allocating memory each and every time we find a > + * new candidate). > + */ > + if (*nr_cndts == array_size) > + array_size += nr_nodes; > + GCREALLOC_ARRAY(new_cndts, array_size);This part of the algorithm is quadratic in the number of combinations divided by the number of nodes. So the algorithm is O( ( C( nr_nodes, min_nodes ) / min_nodes )^2 ) which is quite bad really. 
At the very least this needs to be an exponential allocation, eg + array_size += nr_nodes + array_size; But if you didn't insist on creating the whole list and sorting it, you would avoid this allocation entirely, wouldn't you ? Should we be worried that this algorithm will be too slow even if it involves just O( C(nr_nodes,min_nodes) ) iterations ? Thanks, Ian.
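For reference, the "exponential allocation" being suggested is the usual grow-geometrically amortization. A minimal sketch of that growth policy, using plain realloc() and made-up names rather than libxl's GCREALLOC_ARRAY machinery:

#include <stdlib.h>
#include <string.h>

struct cndt_array {
    void *elems;
    size_t elem_size, used, allocated;
};

/* Doubling the capacity on each growth keeps the total amount of copying
 * linear in the number of appended elements, instead of quadratic as with
 * fixed-size (e.g. nr_nodes) increments. */
static int cndt_array_append(struct cndt_array *a, const void *elem)
{
    if (a->used == a->allocated) {
        size_t new_alloc = a->allocated ? 2 * a->allocated : 8;
        void *p = realloc(a->elems, new_alloc * a->elem_size);

        if (!p)
            return -1;
        a->elems = p;
        a->allocated = new_alloc;
    }
    memcpy((char *)a->elems + a->used * a->elem_size, elem, a->elem_size);
    a->used++;
    return 0;
}

The quoted suggestion, array_size += nr_nodes + array_size, has the same effect of roughly doubling the capacity each time it runs out.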
Ian Campbell
2012-Jul-17 15:59 UTC
Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes
On Tue, 2012-07-17 at 16:55 +0100, Ian Jackson wrote:> > +/* Subtract two values and translate the result in [0, 1] */ > > +static double normalized_diff(double a, double b) > > +{ > > +#define max(a, b) (a > b ? a : b) > > + if (!a && a == b) > > + return 0.0; > > + return (a - b) / max(a, b); > > +} > > 1. This macro max() should be in libxl_internal.h. > 2. It should be MAX so people are warned it's a macro > 3. It should have all the necessary ()s for macro precedence safety You probably also want to do the necessary tricks to avoid multiple evaluation of the arguments?
Ian Jackson
2012-Jul-17 18:01 UTC
Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes
Ian Campbell writes ("Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"):> On Tue, 2012-07-17 at 16:55 +0100, Ian Jackson wrote: > > 1. This macro max() should be in libxl_internal.h. > > 2. It should be MAX so people are warned it''s a macro > > 3. It should have all the necessary ()s for macro precedence safety > > You probably also want to do the necessary tricks to avoid multiple > evaluation of the arguments?Those tricks are a pain. I think in general macros are `allowed'' to do this provided their names are SHOUTY. That''s why it should be called `MAX''. Ian.
Dario Faggioli
2012-Jul-17 22:15 UTC
Re: [PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes
On Tue, 2012-07-17 at 16:55 +0100, Ian Jackson wrote:> Dario Faggioli writes ("[PATCH 1 of 3 v4/leftover] libxl: enable automatic placement of guests on NUMA nodes"): > > If a domain does not have a VCPU affinity, try to pin it automatically to some > > PCPUs. This is done taking into account the NUMA characteristics of the host. > > In fact, we look for a combination of host''s NUMA nodes with enough free memory > > and number of PCPUs for the new domain, and pin it to the VCPUs of those nodes. > > Thanks for this admirably clear patch. >Thanks to you for looking at it.> Can you please rewrap your commit messages to around 70 lines ? Many > VCSs indent them in the log in some situations, and as you see here > mail programs indent them when quoting too. >That should have happened in the first place. As I''m reposting, I''ll take extra attention to that, sorry.> > +/* Subtract two values and translate the result in [0, 1] */ > > +static double normalized_diff(double a, double b) > > +{ > > +#define max(a, b) (a > b ? a : b) > > + if (!a && a == b) > > + return 0.0; > > + return (a - b) / max(a, b); > > +} > > 1. This macro max() should be in libxl_internal.h. > 2. It should be MAX so people are warned it''s a macro > 3. It should have all the necessary ()s for macro precedence safety >Ok, will do that.> > + double freememkb_diff = normalized_diff(c2->free_memkb, c1->free_memkb); > > + double nrdomains_diff = normalized_diff(c1->nr_domains, c2->nr_domains); > > + > > + if (c1->nr_nodes != c2->nr_nodes) > > + return c1->nr_nodes - c2->nr_nodes; > > + > > + return sign(3*freememkb_diff + nrdomains_diff); > > The reason you need what sign() does is that you need to convert from > double to int, I guess. >Mostly for that and to make what''s happening even more clear.> > + > > + /* > > + * Check if the domain has any CPU affinity. If not, try to build up one. > > + * In case numa_place_domain() find at least a suitable candidate, it will > > + * affect info->cpumap accordingly; if it does not, it just leaves it > > + * as it is. This means (unless some weird error manifests) the subsequent > > + * call to libxl_set_vcpuaffinity_all() will do the actual placement, > > + * whatever that turns out to be. > > + */ > > + if (libxl_bitmap_is_full(&info->cpumap)) { > > + int rc = numa_place_domain(gc, info); > > + if (rc) > > + return rc; > > + } > > I guess it would be preferable to do this only if the bitmap was full > by default, so that setting the bitmap explicitly to all cpus still > works. > > I''m not sure that that''s essential to have, though. >I was thinking about this right in this days, and I think it should be exactly as you say, as one need a mechanism for disabling this thing as a whole. I really don''t think it should take to much to put something together, even as a separate, future, patch, if these get checked-in. Thanks.> > + /* > > + * Conditions are met, we can add this combination to the > > + * NUMA placement candidates list. We first make sure there > > + * is enough space in there, and then we initialize the new > > + * candidate element with the node map corresponding to the > > + * combination we are dealing with. Memory allocation for > > + * expanding the array that hosts the list happens in chunks > > + * equal to the number of NUMA nodes in the system (to > > + * avoid allocating memory each and every time we find a > > + * new candidate). 
> > + */ > > + if (*nr_cndts == array_size) > > + array_size += nr_nodes; > > + GCREALLOC_ARRAY(new_cndts, array_size); > > This part of the algorithm is quadratic in the number of combinations > divided by the number of nodes. So the algorithm is > O( ( C( nr_nodes, min_nodes ) / min_nodes )^2 ) > which is quite bad really. >I might well be wrong, but I was thinking of it as something like this: O( C(nr_nodes,(nr_nodes/2)) * nr_nodes ) That's because the external while() is repeated, at most, nr_nodes times (if min_nodes=1 and max_nodes=nr_nodes). Each of these steps hosts a for() which visits all the combinations, the maximum number of which ISTR to be C(nr_nodes,(nr_nodes/2)). I'm not sure what you meant when putting min_nodes up there in your formula (min_nodes is likely to be 1 in most of the cases...), so I can't get numbers and compare them, but it looked (looks?) a bit less bad to me... Or did I make some obvious mistake I'm not seeing right now? :-(> At the very least this needs to be an exponential allocation, eg > + array_size += nr_nodes + array_size; > > But if you didn't insist on creating the whole list and sorting it, > you would avoid this allocation entirely, wouldn't you ? >I'll kill the separation between candidate identification and sorting: that's easy, quick, and will make George happy as well. :-)> Should we be worried that this algorithm will be too slow even if it > involves just > O( C(nr_nodes,min_nodes) ) > iterations ? >I'm commenting about this in the other thread you opened... Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
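Either way, the size of the search space is easy to put numbers on. A small stand-alone program (illustrative only) that prints, for a few node counts, the total number of combinations visited when min_nodes=1 and max_nodes=nr_nodes (i.e. 2^n - 1) next to the largest single step C(n, n/2):

#include <stdio.h>

/* Binomial coefficient; exact for the small n we care about here. */
static unsigned long long binom(int n, int k)
{
    unsigned long long r = 1;
    int i;

    for (i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

int main(void)
{
    int n;

    for (n = 2; n <= 16; n *= 2)
        printf("nr_nodes=%2d  total combinations=%-6llu  C(n, n/2)=%llu\n",
               n, (1ULL << n) - 1, binom(n, n / 2));
    return 0;
}

With 8 nodes that is 255 combinations in total; at 16 nodes it is already 65535, which is where building and sorting the whole candidate list starts to look expensive.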