Dario Faggioli
2012-Oct-19 14:34 UTC
[PATCH 0 of 3 v2] Some small NUMA placement improvements
Hi again,

Take 2 of this series, with the only comment about strstr()-vs-strncmp() in
patch 2 addressed, and all three patches already Acked-by George.

Just as a quick reminder, the series is about:
 - use node distances information during automatic placement (patch 1);
 - let the user specify a minimum and a maximum number of NUMA nodes they
   want the automatic placement to use (patch 2);
 - enhance the syntax of the "cpus=" config switch so that full nodes can
   be specified instead of just single CPUs (patch 3).

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-19 14:34 UTC
[PATCH 1 of 3 v2] libxl: take node distances into account during NUMA placement
In fact, among placement candidates with the same number of nodes, the
closer the various nodes are to each other, the better the performance
for a domain placed there.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -105,6 +105,9 @@ out:
  * - the number of vcpus runnable on the candidates is considered, and
  *   candidates with fewer of them are preferred. If two candidate have
  *   the same number of runnable vcpus,
+ * - the sum of the node distances in the candidates is considered, and
+ *   candidates with smaller total distance are preferred. If total
+ *   distance is the same for the two candidates,
  * - the amount of free memory in the candidates is considered, and the
  *   candidate with greater amount of it is preferred.
  *
@@ -114,6 +117,10 @@ out:
  * overloading large (from a memory POV) nodes. That's right the effect
  * that counting the vcpus able to run on the nodes tries to prevent.
  *
+ * The relative distance between the nodes in a candidate is considered,
+ * as the closer the nodes, the better for the domain ending up on that
+ * candidate.
+ *
  * Note that this completely ignore the number of nodes each candidate span,
  * as the fact that fewer nodes is better is already accounted for in the
  * algorithm.
@@ -124,6 +131,9 @@ static int numa_cmpf(const libxl__numa_c
     if (c1->nr_vcpus != c2->nr_vcpus)
         return c1->nr_vcpus - c2->nr_vcpus;
 
+    if (c1->dists_sum != c2->dists_sum)
+        return c1->dists_sum - c2->dists_sum;
+
     return c2->free_memkb - c1->free_memkb;
 }
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2735,6 +2735,7 @@ static inline void libxl__ctx_unlock(lib
 typedef struct {
     int nr_cpus, nr_nodes;
     int nr_vcpus;
+    int dists_sum;
     uint32_t free_memkb;
     libxl_bitmap nodemap;
 } libxl__numa_candidate;
 
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -218,6 +218,40 @@ static int nodemap_to_nr_vcpus(libxl__gc
     return nr_vcpus;
 }
 
+/* Sum the relative distances of nodes in the nodemap to help find
+ * out which candidate is the "tightest" one. */
+static int nodemap_to_dists_sum(libxl_numainfo *ninfo, libxl_bitmap *nodemap)
+{
+    int tot_dist = 0;
+    int i, j, a = 0, b;
+
+    for (i = 0; i < libxl_bitmap_count_set(nodemap); i++) {
+        while (!libxl_bitmap_test(nodemap, a))
+            a++;
+
+        /* As it is usually non-zero, we do take the latency of
+         * a node to itself into account. */
+        b = a;
+        for (j = 0; j < libxl_bitmap_count_set(nodemap) - i; j++) {
+            while (!libxl_bitmap_test(nodemap, b))
+                b++;
+
+            /*
+             * In most architectures, going from node A to node B costs
+             * exactly as much as going from B to A does. However, let's
+             * not rely on this and consider both contributions, just to
+             * be ready for whatever the future might reserve for us.
+             */
+            tot_dist += ninfo[a].dists[b];
+            tot_dist += ninfo[b].dists[a];
+            b++;
+        }
+        a++;
+    }
+
+    return tot_dist;
+}
+
 /*
  * This function tries to figure out if the host has a consistent number
  * of cpus along all its NUMA nodes. In fact, if that is the case, we can
 
@@ -415,6 +449,7 @@ int libxl__get_numa_candidate(libxl__gc
          */
         libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
         new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
+        new_cndt.dists_sum = nodemap_to_dists_sum(ninfo, &nodemap);
         new_cndt.free_memkb = nodes_free_memkb;
         new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
         new_cndt.nr_cpus = nodes_cpus;
@@ -430,12 +465,14 @@ int libxl__get_numa_candidate(libxl__gc
 
             LOG(DEBUG, "New best NUMA placement candidate found: "
                        "nr_nodes=%d, nr_cpus=%d, nr_vcpus=%d, "
-                       "free_memkb=%"PRIu32"", new_cndt.nr_nodes,
-                       new_cndt.nr_cpus, new_cndt.nr_vcpus,
+                       "dists_sum=%d, free_memkb=%"PRIu32"",
+                       new_cndt.nr_nodes, new_cndt.nr_cpus,
+                       new_cndt.nr_vcpus, new_cndt.dists_sum,
                        new_cndt.free_memkb / 1024);
 
             libxl__numa_candidate_put_nodemap(gc, cndt_out, &nodemap);
             cndt_out->nr_vcpus = new_cndt.nr_vcpus;
+            cndt_out->dists_sum = new_cndt.dists_sum;
             cndt_out->free_memkb = new_cndt.free_memkb;
             cndt_out->nr_nodes = new_cndt.nr_nodes;
             cndt_out->nr_cpus = new_cndt.nr_cpus;
Dario Faggioli
2012-Oct-19 14:34 UTC
[PATCH 2 of 3 v2] libxl, xl: user can ask for min and max nodes to use during placement
And the placement algorithm will stick to that, or fail. This happens by
adding support for "minnodes=" and "maxnodes=" in the domain config file
parser.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -133,6 +133,18 @@ the same time, achieving efficient utili
 
 =back
 
+=item B<minnodes=N>
+
+Tells libxl to place the new domain on at least `N` nodes. This is only
+effective if automatic NUMA placement occurs (i.e., if no `cpus=` option
+is specified).
+
+=item B<maxnodes=M>
+
+Tells libxl to place the new domain on at most `M` nodes. This is only
+effective if automatic NUMA placement occurs (i.e., if no `cpus=` option
+is specified).
+
 =head3 CPU Scheduling
 
 =over 4
 
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault(
 
     libxl_defbool_setdefault(&b_info->numa_placement, true);
 
+    if (!b_info->min_nodes)
+        b_info->min_nodes = 0;
+    if (!b_info->max_nodes)
+        b_info->max_nodes = 0;
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -174,7 +174,8 @@ static int numa_place_domain(libxl__gc *
     /* Find the best candidate with enough free memory and at least
      * as much pcpus as the domain has vcpus.
      */
     rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus,
-                                   0, 0, &cpupool_info.cpumap,
+                                   info->min_nodes, info->max_nodes,
+                                   &cpupool_info.cpumap,
                                    numa_cmpf, &candidate, &found);
     if (rc)
         goto out;
 
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -262,6 +262,8 @@ libxl_domain_build_info = Struct("domain
     ("avail_vcpus", libxl_bitmap),
     ("cpumap", libxl_bitmap),
     ("numa_placement", libxl_defbool),
+    ("min_nodes", integer),
+    ("max_nodes", integer),
     ("tsc_mode", libxl_tsc_mode),
     ("max_memkb", MemKB),
     ("target_memkb", MemKB),
 
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -731,6 +731,11 @@ static void parse_config_data(const char
         libxl_defbool_set(&b_info->numa_placement, false);
     }
 
+    if (!xlu_cfg_get_long (config, "minnodes", &l, 0))
+        b_info->min_nodes = l;
+    if (!xlu_cfg_get_long (config, "maxnodes", &l, 0))
+        b_info->max_nodes = l;
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
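With the patch applied, bounding the number of nodes the placer may use would look something like the following guest config fragment (hypothetical example; names and values are illustrative). Note that `cpus=` must be absent, since specifying it disables automatic placement and hence these options.

```
# Hypothetical xl guest config: let automatic NUMA placement
# spread the domain over at least 1 and at most 2 nodes.
name     = "numa-test"
memory   = 4096
vcpus    = 8
minnodes = 1
maxnodes = 2
```

Leaving both options at their default of 0 keeps the current behaviour, i.e. no constraint on the candidate size.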
Dario Faggioli
2012-Oct-19 14:34 UTC
[PATCH 3 of 3 v2] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...
In both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

---
Changes since v1:
 * strstr() replaced with strncmp().

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -109,7 +109,7 @@ some cpus on its own (see below). A C<CP
 
 =over 4
 
-=item "all"
+=item "all" (or "nodes:all")
 
 To allow all the vcpus of the guest to run on all the cpus on the host.
 
@@ -117,6 +117,14 @@ To allow all the vcpus of the guest to r
 
 To allow all the vcpus of the guest to run on cpus 0,2,3,5.
 
+=item "nodes:0-3,^node:2"
+
+To allow all the vcpus of the guest to run on the cpus belonging to
+the NUMA nodes 0,1,3 of the host. Notice that it is possible to combine
+this syntax with the one above. That is, something like "1,node:2,^6"
+is possible and means all the vcpus of the guest will run on cpu 1 plus
+on all the cpus of node 2, but never on cpu 6.
+
 =item ["2", "3"] (or [2, 3])
 
 To ask for specific vcpu mapping. That means (in this example), vcpu #0
 
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -506,31 +506,58 @@ static void split_string_into_string_lis
 
 static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
 {
-    libxl_bitmap exclude_cpumap;
-    uint32_t cpuida, cpuidb;
+    libxl_bitmap nodemap, cpu_nodemap;
+    libxl_bitmap exclude_cpumap, exclude_nodemap;
+    uint32_t ida, idb;
     char *endptr, *toka, *tokb, *saveptr = NULL;
-    int i, rc = 0, rmcpu;
-
-    if (!strcmp(cpu, "all")) {
+    int i, rc = 0, isnot, isnode;
+
+    if (!strcmp(cpu, "all") || !strcmp(cpu, "nodes:all")) {
         libxl_bitmap_set_any(cpumap);
         return 0;
     }
 
-    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) {
+    libxl_bitmap_init(&cpu_nodemap);
+    libxl_bitmap_init(&nodemap);
+    libxl_bitmap_init(&exclude_nodemap);
+
+    rc = libxl_node_bitmap_alloc(ctx, &cpu_nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_node_bitmap_alloc(ctx, &nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_node_bitmap_alloc(ctx, &exclude_nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0);
+    if (rc) {
         fprintf(stderr, "Error: Failed to allocate cpumap.\n");
-        return ENOMEM;
+        goto vcpp_out;
     }
 
     for (toka = strtok_r(cpu, ",", &saveptr); toka;
          toka = strtok_r(NULL, ",", &saveptr)) {
-        rmcpu = 0;
+        isnot = 0; isnode = 0;
         if (*toka == '^') {
-            /* This (These) Cpu(s) will be removed from the map */
+            /* This (These) Cpu(s)/Node(s) will be removed from the map */
             toka++;
-            rmcpu = 1;
-        }
-        /* Extract a valid (range of) cpu(s) */
-        cpuida = cpuidb = strtoul(toka, &endptr, 10);
+            isnot = 1;
+        }
+        /* Check if we're dealing with a full node */
+        if (!strncmp(toka, "node:", 5) || !strncmp(toka, "nodes:", 6)) {
+            toka += 5 + (toka[4] == 's');
+            isnode = 1;
+        }
+        /* Extract a valid (range of) cpu(s) or node(s) */
+        ida = idb = strtoul(toka, &endptr, 10);
         if (endptr == toka) {
             fprintf(stderr, "Error: Invalid argument.\n");
             rc = EINVAL;
@@ -538,27 +565,48 @@ static int vcpupin_parse(char *cpu, libx
         }
         if (*endptr == '-') {
             tokb = endptr + 1;
-            cpuidb = strtoul(tokb, &endptr, 10);
-            if (endptr == tokb || cpuida > cpuidb) {
+            idb = strtoul(tokb, &endptr, 10);
+            if (endptr == tokb || ida > idb) {
                 fprintf(stderr, "Error: Invalid argument.\n");
                 rc = EINVAL;
                 goto vcpp_out;
             }
         }
-        while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
-                         libxl_bitmap_set(&exclude_cpumap, cpuida);
-            cpuida++;
-        }
-    }
-
-    /* Clear all the cpus from the removal list */
+        while (ida <= idb) {
+            if (!isnode)
+                isnot == 0 ? libxl_bitmap_set(cpumap, ida) :
+                             libxl_bitmap_set(&exclude_cpumap, ida);
+            else
+                isnot == 0 ? libxl_bitmap_set(&nodemap, ida) :
+                             libxl_bitmap_set(&exclude_nodemap, ida);
+            ida++;
+        }
+    }
+
+    /* Add the cpus that have been specified via "node:" items */
+    rc = libxl_nodemap_to_cpumap(ctx, &nodemap, &cpu_nodemap);
+    if (rc)
+        goto vcpp_out;
+    libxl_for_each_set_bit(i, cpu_nodemap) {
+        libxl_bitmap_set(cpumap, i);
+    }
+
+    /* Clear all the cpus from the removal cpu and node lists */
     libxl_for_each_set_bit(i, exclude_cpumap) {
         libxl_bitmap_reset(cpumap, i);
     }
+    rc = libxl_nodemap_to_cpumap(ctx, &exclude_nodemap, &cpu_nodemap);
+    if (rc)
+        goto vcpp_out;
+    libxl_for_each_set_bit(i, cpu_nodemap) {
+        libxl_bitmap_reset(cpumap, i);
+    }
 
  vcpp_out:
     libxl_bitmap_dispose(&exclude_cpumap);
+    libxl_bitmap_dispose(&exclude_nodemap);
+    libxl_bitmap_dispose(&nodemap);
+    libxl_bitmap_dispose(&cpu_nodemap);
 
     return rc;
 }
Dario Faggioli
2012-Oct-19 17:00 UTC
Re: [PATCH 1 of 3 v2] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 16:34 +0200, Dario Faggioli wrote:
> In fact, among placement candidates with the same number of nodes, the
> closer the various nodes are to each others, the better the performances
> for a domain placed there.
>
So, it seems we need to discuss a bit more about this patch... Please,
ignore it for now.

OTOH, patches 2 and 3 can be considered for further review/ack/commit,
even without this one, as they are pretty much unrelated to it (and to
each other). I'd expect them to apply cleanly, but if I need to rebase
and resend, let me know.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Ian Campbell
2012-Nov-12 16:26 UTC
Re: [PATCH 2 of 3 v2] libxl, xl: user can ask for min and max nodes to use during placement
On Fri, 2012-10-19 at 15:34 +0100, Dario Faggioli wrote:
> @@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault(
> 
>     libxl_defbool_setdefault(&b_info->numa_placement, true);
> 
> +    if (!b_info->min_nodes)
> +        b_info->min_nodes = 0;
> +    if (!b_info->max_nodes)
> +        b_info->max_nodes = 0;

Aren't both of these if statements tautologous?

Ian.
Dario Faggioli
2012-Nov-12 16:34 UTC
Re: [PATCH 2 of 3 v2] libxl, xl: user can ask for min and max nodes to use during placement
On Mon, 2012-11-12 at 16:26 +0000, Ian Campbell wrote:
> On Fri, 2012-10-19 at 15:34 +0100, Dario Faggioli wrote:
> > @@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault(
> > 
> >     libxl_defbool_setdefault(&b_info->numa_placement, true);
> > 
> > +    if (!b_info->min_nodes)
> > +        b_info->min_nodes = 0;
> > +    if (!b_info->max_nodes)
> > +        b_info->max_nodes = 0;
> 
> Aren't both of these if statements tautologous?
>
Looks like they are. I'll double check this before reposting.

I think what I wanted to do was making sure to zero those new fields,
if not otherwise initialized by someone else... and, of course, I got
that wrong! :-P

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)