Dario Faggioli
2012-Oct-16 17:26 UTC
[PATCH 0 of 3] Some small NUMA placement improvements
Hi everyone,

This series implements some small improvements in how NUMA is dealt with at the libxl and xl level, which I had pending in one of my queues. More specifically, it enables actually using node distance information during automatic placement, and it allows users to ask for a minimum and a maximum number of NUMA nodes they want the automatic placement to use. A little less related to _automatic_ placement, it also enhances the syntax of the "cpus=" config switch (as well as that of the `xl vcpu-pin' command), so that full nodes can be specified instead of just single CPUs.

This is all about, on the one hand, continuing to improve the performance of the NUMA placement mechanism and, on the other, starting to give users some knobs to control it.

Comments appreciated, as usual. :-)

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-16 17:26 UTC
[PATCH 1 of 3] libxl: take node distances into account during NUMA placement
In fact, among placement candidates with the same number of nodes, the closer the various nodes are to each other, the better the performance for a domain placed there.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -105,6 +105,9 @@ out:
  * - the number of vcpus runnable on the candidates is considered, and
  *   candidates with fewer of them are preferred. If two candidates have
  *   the same number of runnable vcpus,
+ * - the sum of the node distances in the candidates is considered, and
+ *   candidates with smaller total distance are preferred. If total
+ *   distance is the same for the two candidates,
  * - the amount of free memory in the candidates is considered, and the
  *   candidate with greater amount of it is preferred.
  *
@@ -114,6 +117,10 @@ out:
  * overloading large (from a memory POV) nodes. That's right the effect
  * that counting the vcpus able to run on the nodes tries to prevent.
  *
+ * The relative distance within the nodes in the candidates is considered
+ * as well, as the closer the nodes, the better for the domain ending up
+ * on the candidate.
+ *
  * Note that this completely ignores the number of nodes each candidate
  * spans, as the fact that fewer nodes is better is already accounted for
  * in the algorithm.
@@ -124,6 +131,9 @@ static int numa_cmpf(const libxl__numa_c
     if (c1->nr_vcpus != c2->nr_vcpus)
         return c1->nr_vcpus - c2->nr_vcpus;

+    if (c1->dists_sum != c2->dists_sum)
+        return c1->dists_sum - c2->dists_sum;
+
     return c2->free_memkb - c1->free_memkb;
 }

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2732,6 +2732,7 @@ static inline void libxl__ctx_unlock(lib
 typedef struct {
     int nr_cpus, nr_nodes;
     int nr_vcpus;
+    int dists_sum;
     uint32_t free_memkb;
     libxl_bitmap nodemap;
 } libxl__numa_candidate;
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -218,6 +218,40 @@ static int nodemap_to_nr_vcpus(libxl__gc
     return nr_vcpus;
 }

+/* Sum the relative distances of nodes in the nodemap to help finding
+ * out which candidate is the "tightest" one. */
+static int nodemap_to_dists_sum(libxl_numainfo *ninfo, libxl_bitmap *nodemap)
+{
+    int tot_dist = 0;
+    int i, j, a = 0, b;
+
+    for (i = 0; i < libxl_bitmap_count_set(nodemap); i++) {
+        while (!libxl_bitmap_test(nodemap, a))
+            a++;
+
+        /* As it is usually non-zero, we do take the latency of
+         * a node to itself into account. */
+        b = a;
+        for (j = 0; j < libxl_bitmap_count_set(nodemap) - i; j++) {
+            while (!libxl_bitmap_test(nodemap, b))
+                b++;
+
+            /*
+             * In most architectures, going from node A to node B costs
+             * exactly as much as going from B to A does. However, let's
+             * not rely on this and consider both contributions, just to
+             * be ready for whatever the future might have in store for us.
+             */
+            tot_dist += ninfo[a].dists[b];
+            tot_dist += ninfo[b].dists[a];
+            b++;
+        }
+        a++;
+    }
+
+    return tot_dist;
+}
+
 /*
  * This function tries to figure out if the host has a consistent number
  * of cpus along all its NUMA nodes. In fact, if that is the case, we can
@@ -415,6 +449,7 @@ int libxl__get_numa_candidate(libxl__gc
      */
     libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
     new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
+    new_cndt.dists_sum = nodemap_to_dists_sum(ninfo, &nodemap);
     new_cndt.free_memkb = nodes_free_memkb;
     new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
     new_cndt.nr_cpus = nodes_cpus;
@@ -430,12 +465,14 @@ int libxl__get_numa_candidate(libxl__gc

     LOG(DEBUG, "New best NUMA placement candidate found: "
                "nr_nodes=%d, nr_cpus=%d, nr_vcpus=%d, "
-               "free_memkb=%"PRIu32"", new_cndt.nr_nodes,
-               new_cndt.nr_cpus, new_cndt.nr_vcpus,
+               "dists_sum=%d, free_memkb=%"PRIu32"",
+               new_cndt.nr_nodes, new_cndt.nr_cpus,
+               new_cndt.nr_vcpus, new_cndt.dists_sum,
                new_cndt.free_memkb / 1024);

     libxl__numa_candidate_put_nodemap(gc, cndt_out, &nodemap);
     cndt_out->nr_vcpus = new_cndt.nr_vcpus;
+    cndt_out->dists_sum = new_cndt.dists_sum;
     cndt_out->free_memkb = new_cndt.free_memkb;
     cndt_out->nr_nodes = new_cndt.nr_nodes;
     cndt_out->nr_cpus = new_cndt.nr_cpus;
Dario Faggioli
2012-Oct-16 17:26 UTC
[PATCH 2 of 3] libxl, xl: user can ask for min and max nodes to use during placement
And the placement algorithm will stick to that, or fail. This is done by adding support for "minnodes=" and "maxnodes=" to the domain config file parser.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -133,6 +133,18 @@ the same time, achieving efficient utili

 =back

+=item B<minnodes=N>
+
+Tells libxl to place the new domain on at least `N` nodes. This is only
+effective if automatic NUMA placement occurs (i.e., if no `cpus=` option
+is specified).
+
+=item B<maxnodes=M>
+
+Tells libxl to place the new domain on at most `M` nodes. This is only
+effective if automatic NUMA placement occurs (i.e., if no `cpus=` option
+is specified).
+
 =head3 CPU Scheduling

 =over 4
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault(

     libxl_defbool_setdefault(&b_info->numa_placement, true);

+    if (!b_info->min_nodes)
+        b_info->min_nodes = 0;
+    if (!b_info->max_nodes)
+        b_info->max_nodes = 0;
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -174,7 +174,8 @@ static int numa_place_domain(libxl__gc *
     /* Find the best candidate with enough free memory and at least
      * as many pcpus as the domain has vcpus.
      */
     rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus,
-                                   0, 0, &cpupool_info.cpumap,
+                                   info->min_nodes, info->max_nodes,
+                                   &cpupool_info.cpumap,
                                    numa_cmpf, &candidate, &found);
     if (rc)
         goto out;
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -262,6 +262,8 @@ libxl_domain_build_info = Struct("domain
     ("avail_vcpus", libxl_bitmap),
     ("cpumap", libxl_bitmap),
     ("numa_placement", libxl_defbool),
+    ("min_nodes", integer),
+    ("max_nodes", integer),
     ("tsc_mode", libxl_tsc_mode),
     ("max_memkb", MemKB),
     ("target_memkb", MemKB),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -731,6 +731,11 @@ static void parse_config_data(const char
         libxl_defbool_set(&b_info->numa_placement, false);
     }

+    if (!xlu_cfg_get_long (config, "minnodes", &l, 0))
+        b_info->min_nodes = l;
+    if (!xlu_cfg_get_long (config, "maxnodes", &l, 0))
+        b_info->max_nodes = l;
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
Dario Faggioli
2012-Oct-16 17:26 UTC
[PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...

In both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -109,7 +109,7 @@ some cpus on its own (see below). A C<CP

 =over 4

-=item "all"
+=item "all" (or "nodes:all")

 To allow all the vcpus of the guest to run on all the cpus on the host.

@@ -117,6 +117,14 @@ To allow all the vcpus of the guest to r

 To allow all the vcpus of the guest to run on cpus 0,2,3,5.

+=item "nodes:0-3,^node:2"
+
+To allow all the vcpus of the guest to run on the cpus belonging to
+NUMA nodes 0,1,3 of the host. Notice that it is possible to combine
+this syntax with the one above: something like "1,node:2,^6" is
+possible, and means all the vcpus of the guest will run on cpu 1 plus
+all the cpus of node 2, but never on cpu 6.
+
 =item ["2", "3"] (or [2, 3])

 To ask for specific vcpu mapping. That means (in this example), vcpu #0
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -506,31 +506,58 @@ static void split_string_into_string_lis

 static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
 {
-    libxl_bitmap exclude_cpumap;
-    uint32_t cpuida, cpuidb;
+    libxl_bitmap nodemap, cpu_nodemap;
+    libxl_bitmap exclude_cpumap, exclude_nodemap;
+    uint32_t ida, idb;
     char *endptr, *toka, *tokb, *saveptr = NULL;
-    int i, rc = 0, rmcpu;
-
-    if (!strcmp(cpu, "all")) {
+    int i, rc = 0, isnot, isnode;
+
+    if (!strcmp(cpu, "all") || !strcmp(cpu, "nodes:all")) {
         libxl_bitmap_set_any(cpumap);
         return 0;
     }

-    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) {
+    libxl_bitmap_init(&cpu_nodemap);
+    libxl_bitmap_init(&nodemap);
+    libxl_bitmap_init(&exclude_nodemap);
+    libxl_bitmap_init(&exclude_cpumap);
+
+    rc = libxl_node_bitmap_alloc(ctx, &cpu_nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_node_bitmap_alloc(ctx, &nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_node_bitmap_alloc(ctx, &exclude_nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0);
+    if (rc) {
         fprintf(stderr, "Error: Failed to allocate cpumap.\n");
-        return ENOMEM;
+        goto vcpp_out;
     }

     for (toka = strtok_r(cpu, ",", &saveptr); toka;
          toka = strtok_r(NULL, ",", &saveptr)) {
-        rmcpu = 0;
+        isnot = 0; isnode = 0;
         if (*toka == '^') {
-            /* This (These) Cpu(s) will be removed from the map */
+            /* This (These) Cpu(s)/Node(s) will be removed from the map */
             toka++;
-            rmcpu = 1;
-        }
-        /* Extract a valid (range of) cpu(s) */
-        cpuida = cpuidb = strtoul(toka, &endptr, 10);
+            isnot = 1;
+        }
+        /* Check if we're dealing with a full node */
+        if (strstr(toka, "node:") == toka || strstr(toka, "nodes:") == toka) {
+            toka += 5 + (toka[4] == 's');
+            isnode = 1;
+        }
+        /* Extract a valid (range of) cpu(s) or node(s) */
+        ida = idb = strtoul(toka, &endptr, 10);
         if (endptr == toka) {
             fprintf(stderr, "Error: Invalid argument.\n");
             rc = EINVAL;
@@ -538,27 +565,48 @@ static int vcpupin_parse(char *cpu, libx
         }
         if (*endptr == '-') {
             tokb = endptr + 1;
-            cpuidb = strtoul(tokb, &endptr, 10);
-            if (endptr == tokb || cpuida > cpuidb) {
+            idb = strtoul(tokb, &endptr, 10);
+            if (endptr == tokb || ida > idb) {
                 fprintf(stderr, "Error: Invalid argument.\n");
                 rc = EINVAL;
                 goto vcpp_out;
             }
         }
-        while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
-                         libxl_bitmap_set(&exclude_cpumap, cpuida);
-            cpuida++;
-        }
-    }
-
-    /* Clear all the cpus from the removal list */
+        while (ida <= idb) {
+            if (!isnode)
+                isnot == 0 ? libxl_bitmap_set(cpumap, ida) :
+                             libxl_bitmap_set(&exclude_cpumap, ida);
+            else
+                isnot == 0 ? libxl_bitmap_set(&nodemap, ida) :
+                             libxl_bitmap_set(&exclude_nodemap, ida);
+            ida++;
+        }
+    }
+
+    /* Add the cpus that have been specified via "node:" items */
+    rc = libxl_nodemap_to_cpumap(ctx, &nodemap, &cpu_nodemap);
+    if (rc)
+        goto vcpp_out;
+    libxl_for_each_set_bit(i, cpu_nodemap) {
+        libxl_bitmap_set(cpumap, i);
+    }
+
+    /* Clear all the cpus from the removal cpu and node lists */
     libxl_for_each_set_bit(i, exclude_cpumap) {
         libxl_bitmap_reset(cpumap, i);
     }
+    rc = libxl_nodemap_to_cpumap(ctx, &exclude_nodemap, &cpu_nodemap);
+    if (rc)
+        goto vcpp_out;
+    libxl_for_each_set_bit(i, cpu_nodemap) {
+        libxl_bitmap_reset(cpumap, i);
+    }

 vcpp_out:
     libxl_bitmap_dispose(&exclude_cpumap);
+    libxl_bitmap_dispose(&exclude_nodemap);
+    libxl_bitmap_dispose(&nodemap);
+    libxl_bitmap_dispose(&cpu_nodemap);

     return rc;
 }
Ian Campbell
2012-Oct-18 11:23 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Tue, 2012-10-16 at 18:26 +0100, Dario Faggioli wrote:
> +            isnot = 1;
> +        }
> +        /* Check if we're dealing with a full node */
> +        if (strstr(toka, "node:") == toka || strstr(toka, "nodes:") == toka) {

Isn't strstr(FOO, "thing") == FOO just an expensive way of writing strncmp(FOO, "thing", sizeof(...))?

Ian
Dario Faggioli
2012-Oct-18 13:11 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Thu, 2012-10-18 at 12:23 +0100, Ian Campbell wrote:
> On Tue, 2012-10-16 at 18:26 +0100, Dario Faggioli wrote:
>
> > +            isnot = 1;
> > +        }
> > +        /* Check if we're dealing with a full node */
> > +        if (strstr(toka, "node:") == toka || strstr(toka, "nodes:") == toka) {
>
> Isn't strstr(FOO, "thing") == FOO just an expensive way of writing
> strncmp(FOO, "thing", sizeof(...))?

Mmm... what I wanted was to make sure that "thing" not only is part of FOO, but also sits at the very beginning of that same string. That's why I exploited the fact that strstr() returns a pointer to the beginning of the substring "thing" within FOO.

Now that you mention it, I think I can actually achieve the same with strncmp, limiting the comparison to strlen("thing"). If that is what you meant, the answer is yes, I can do that (and I agree it's probably better).

Thanks for looking into this,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Ian Campbell
2012-Oct-18 13:15 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Thu, 2012-10-18 at 14:11 +0100, Dario Faggioli wrote:
> On Thu, 2012-10-18 at 12:23 +0100, Ian Campbell wrote:
> > Isn't strstr(FOO, "thing") == FOO just an expensive way of writing
> > strncmp(FOO, "thing", sizeof(...))?
>
> Mmm... what I wanted was to make sure that "thing" not only is part of
> FOO, but also sits at the very beginning of that same string. That's why
> I exploited the fact that strstr() returns a pointer to the beginning of
> the substring "thing" within FOO.
>
> Now that you mention it, I think I can actually achieve the same
> with strncmp, limiting the comparison to strlen("thing"). If that is
> what you meant, the answer is yes, I can do that (and I agree it's
> probably better).

Right. With the strstr method, if there are 1000 characters in the input before the "thing", you've wasted a bunch of time scanning for it just to throw it away, whereas with str*cmp you would give up after the first non-matching char.

Ian.
Dario Faggioli
2012-Oct-18 13:18 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Thu, 2012-10-18 at 14:15 +0100, Ian Campbell wrote:
> Right. With the strstr method, if there are 1000 characters in the input
> before the "thing", you've wasted a bunch of time scanning for it just
> to throw it away, whereas with str*cmp you would give up after the first
> non-matching char.

Ok, I'll go for it.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2012-Oct-18 15:17 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> In fact, among placement candidates with the same number of nodes, the > closer the various nodes are to each others, the better the performances > for a domain placed there.Looks good overall -- my only worry is the N^2 nature of the algorithm. We''re already doing some big combinatorial thing to generate the candidates, right? And now we''re doing N^2 for each candidate? Suppose we get an ARM system with 4096 cores and 128 NUMA nodes? If Xen 4.4 doesn''t come out until March 2014, there will still be distros using 4.3 through mid-2015. I seem to remember having a discussion about this issue already, but I can''t remember what the outcome was... -George> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -105,6 +105,9 @@ out: > * - the number of vcpus runnable on the candidates is considered, and > * candidates with fewer of them are preferred. If two candidate have > * the same number of runnable vcpus, > + * - the sum of the node distances in the candidates is considered, and > + * candidates with smaller total distance are preferred. If total > + * distance is the same for the two candidatess, > * - the amount of free memory in the candidates is considered, and the > * candidate with greater amount of it is preferred. > * > @@ -114,6 +117,10 @@ out: > * overloading large (from a memory POV) nodes. That''s right the effect > * that counting the vcpus able to run on the nodes tries to prevent. > * > + * The relative distance within the nodes in the candidates is considered > + * as the closer the nodes, the better for the domain ending up on the > + * candidate. > + * > * Note that this completely ignore the number of nodes each candidate span, > * as the fact that fewer nodes is better is already accounted for in the > * algorithm. 
> @@ -124,6 +131,9 @@ static int numa_cmpf(const libxl__numa_c > if (c1->nr_vcpus != c2->nr_vcpus) > return c1->nr_vcpus - c2->nr_vcpus; > > + if (c1->dists_sum != c2->dists_sum) > + return c1->dists_sum - c2->dists_sum; > + > return c2->free_memkb - c1->free_memkb; > } > > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -2732,6 +2732,7 @@ static inline void libxl__ctx_unlock(lib > typedef struct { > int nr_cpus, nr_nodes; > int nr_vcpus; > + int dists_sum; > uint32_t free_memkb; > libxl_bitmap nodemap; > } libxl__numa_candidate; > diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c > --- a/tools/libxl/libxl_numa.c > +++ b/tools/libxl/libxl_numa.c > @@ -218,6 +218,40 @@ static int nodemap_to_nr_vcpus(libxl__gc > return nr_vcpus; > } > > +/* Sum the relative distances of nodes in the nodemap to help finding > + * out which candidate is the "tightest" one. */ > +static int nodemap_to_dists_sum(libxl_numainfo *ninfo, libxl_bitmap *nodemap) > +{ > + int tot_dist = 0; > + int i, j, a = 0, b; > + > + for (i = 0; i < libxl_bitmap_count_set(nodemap); i++) { > + while (!libxl_bitmap_test(nodemap, a)) > + a++; > + > + /* As it is usually non-zero, we do take the latency of > + * of a node to itself into account. */ > + b = a; > + for (j = 0; j < libxl_bitmap_count_set(nodemap) - i; j++) { > + while (!libxl_bitmap_test(nodemap, b)) > + b++; > + > + /* > + * In most architectures, going from node A to node B costs > + * exactly as much as going from B to A does. However, let''s > + * not rely on this and consider both contributions, just to > + * be ready for everything future might reserve for us. > + */ > + tot_dist += ninfo[a].dists[b]; > + tot_dist += ninfo[b].dists[a]; > + b++; > + } > + a++; > + } > + > + return tot_dist; > +} > + > /* > * This function tries to figure out if the host has a consistent number > * of cpus along all its NUMA nodes. 
In fact, if that is the case, we can > @@ -415,6 +449,7 @@ int libxl__get_numa_candidate(libxl__gc > */ > libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap); > new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap); > + new_cndt.dists_sum = nodemap_to_dists_sum(ninfo, &nodemap); > new_cndt.free_memkb = nodes_free_memkb; > new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap); > new_cndt.nr_cpus = nodes_cpus; > @@ -430,12 +465,14 @@ int libxl__get_numa_candidate(libxl__gc > > LOG(DEBUG, "New best NUMA placement candidate found: " > "nr_nodes=%d, nr_cpus=%d, nr_vcpus=%d, " > - "free_memkb=%"PRIu32"", new_cndt.nr_nodes, > - new_cndt.nr_cpus, new_cndt.nr_vcpus, > + "dists_sum=%d, free_memkb=%"PRIu32"", > + new_cndt.nr_nodes, new_cndt.nr_cpus, > + new_cndt.nr_vcpus, new_cndt.dists_sum, > new_cndt.free_memkb / 1024); > > libxl__numa_candidate_put_nodemap(gc, cndt_out, &nodemap); > cndt_out->nr_vcpus = new_cndt.nr_vcpus; > + cndt_out->dists_sum = new_cndt.dists_sum; > cndt_out->free_memkb = new_cndt.free_memkb; > cndt_out->nr_nodes = new_cndt.nr_nodes; > cndt_out->nr_cpus = new_cndt.nr_cpus; > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
George Dunlap
2012-Oct-18 15:21 UTC
Re: [PATCH 2 of 3] libxl, xl: user can ask for min and max nodes to use during placement
On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> And the placement algorithm will stick to that, or fail. This happens > adding support for "minnodes=" and "maxnodes=" in the domain config > file parser. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>Acked-by: George Dunlap <george.dunlap@eu.citrix.com>> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 > --- a/docs/man/xl.cfg.pod.5 > +++ b/docs/man/xl.cfg.pod.5 > @@ -133,6 +133,18 @@ the same time, achieving efficient utili > > =back > > +=item B<minnodes=N> > + > +Tells libxl to place the new domain on at least `N` nodes. This is only > +effective if automatic NUMA placement occurs (i.e., if no `cpus=` option > +is specified). > + > +=item B<maxnodes=M> > + > +Tells libxl to place the new domain on at most `M` nodes. This is only > +effective if automatic NUMA placement occurs (i.e., if no `cpus=` option > +is specified). > + > =head3 CPU Scheduling > > =over 4 > diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c > --- a/tools/libxl/libxl_create.c > +++ b/tools/libxl/libxl_create.c > @@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault( > > libxl_defbool_setdefault(&b_info->numa_placement, true); > > + if (!b_info->min_nodes) > + b_info->min_nodes = 0; > + if (!b_info->max_nodes) > + b_info->max_nodes = 0; > + > if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT) > b_info->max_memkb = 32 * 1024; > if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT) > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -174,7 +174,8 @@ static int numa_place_domain(libxl__gc * > /* Find the best candidate with enough free memory and at least > * as much pcpus as the domain has vcpus. 
*/ > rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus, > - 0, 0, &cpupool_info.cpumap, > + info->min_nodes, info->max_nodes, > + &cpupool_info.cpumap, > numa_cmpf, &candidate, &found); > if (rc) > goto out; > diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl > --- a/tools/libxl/libxl_types.idl > +++ b/tools/libxl/libxl_types.idl > @@ -262,6 +262,8 @@ libxl_domain_build_info = Struct("domain > ("avail_vcpus", libxl_bitmap), > ("cpumap", libxl_bitmap), > ("numa_placement", libxl_defbool), > + ("min_nodes", integer), > + ("max_nodes", integer), > ("tsc_mode", libxl_tsc_mode), > ("max_memkb", MemKB), > ("target_memkb", MemKB), > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -731,6 +731,11 @@ static void parse_config_data(const char > libxl_defbool_set(&b_info->numa_placement, false); > } > > + if (!xlu_cfg_get_long (config, "minnodes", &l, 0)) > + b_info->min_nodes = l; > + if (!xlu_cfg_get_long (config, "maxnodes", &l, 0)) > + b_info->max_nodes = l; > + > if (!xlu_cfg_get_long (config, "memory", &l, 0)) { > b_info->max_memkb = l * 1024; > b_info->target_memkb = b_info->max_memkb; > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
George Dunlap
2012-Oct-18 15:30 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> Making it possible to use something like the following: > * "nodes:0-3": all pCPUs of nodes 0,1,2,3; > * "nodes:0-3,^node:2": all pCPUS of nodes 0,1,3; > * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6; > * ... > > In both domain config file and `xl vcpu-pin''. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>Is it OK if I give this an "I''m OK with the idea but I''m too lazy to figure out if the parsing code is doing the right thing" Ack? :-) -George> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 > --- a/docs/man/xl.cfg.pod.5 > +++ b/docs/man/xl.cfg.pod.5 > @@ -109,7 +109,7 @@ some cpus on its own (see below). A C<CP > > =over 4 > > -=item "all" > +=item "all" (or "nodes:all") > > To allow all the vcpus of the guest to run on all the cpus on the host. > > @@ -117,6 +117,14 @@ To allow all the vcpus of the guest to r > > To allow all the vcpus of the guest to run on cpus 0,2,3,5. > > +=item "nodes:0-3,^node:2" > + > +To allow all the vcpus of the guest to run on the cpus belonging to > +the NUMA nodes 0,1,3 of the host. Notice that it is possible to combine > +this syntax with the one above. That means, something like "1,node:2,^6" > +is possible and means all the vcpus of the guest will run on cpus 1 plus > +on all the cpus of node 2, but never on cpu 6. > + > =item ["2", "3"] (or [2, 3]) > > To ask for specific vcpu mapping. 
That means (in this example), vcpu #0 > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -506,31 +506,58 @@ static void split_string_into_string_lis > > static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) > { > - libxl_bitmap exclude_cpumap; > - uint32_t cpuida, cpuidb; > + libxl_bitmap nodemap, cpu_nodemap; > + libxl_bitmap exclude_cpumap, exclude_nodemap; > + uint32_t ida, idb; > char *endptr, *toka, *tokb, *saveptr = NULL; > - int i, rc = 0, rmcpu; > - > - if (!strcmp(cpu, "all")) { > + int i, rc = 0, isnot, isnode; > + > + if (!strcmp(cpu, "all") || !strcmp(cpu, "nodes:all")) { > libxl_bitmap_set_any(cpumap); > return 0; > } > > - if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) { > + libxl_bitmap_init(&cpu_nodemap); > + libxl_bitmap_init(&nodemap); > + libxl_bitmap_init(&exclude_nodemap); > + libxl_bitmap_init(&exclude_nodemap); > + > + rc = libxl_node_bitmap_alloc(ctx, &cpu_nodemap, 0); > + if (rc) { > + fprintf(stderr, "Error: Failed to allocate nodemap.\n"); > + goto vcpp_out; > + } > + rc = libxl_node_bitmap_alloc(ctx, &nodemap, 0); > + if (rc) { > + fprintf(stderr, "Error: Failed to allocate nodemap.\n"); > + goto vcpp_out; > + } > + rc = libxl_node_bitmap_alloc(ctx, &exclude_nodemap, 0); > + if (rc) { > + fprintf(stderr, "Error: Failed to allocate nodemap.\n"); > + goto vcpp_out; > + } > + rc = libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0); > + if (rc) { > fprintf(stderr, "Error: Failed to allocate cpumap.\n"); > - return ENOMEM; > + goto vcpp_out; > } > > for (toka = strtok_r(cpu, ",", &saveptr); toka; > toka = strtok_r(NULL, ",", &saveptr)) { > - rmcpu = 0; > + isnot = 0; isnode = 0; > if (*toka == ''^'') { > - /* This (These) Cpu(s) will be removed from the map */ > + /* This (These) Cpu(s)/Node(s) will be removed from the map */ > toka++; > - rmcpu = 1; > - } > - /* Extract a valid (range of) cpu(s) */ > - cpuida = cpuidb = strtoul(toka, &endptr, 10); > 
+ isnot = 1; > + } > + /* Check if we''re dealing with a full node */ > + if (strstr(toka, "node:") == toka || strstr(toka, "nodes:") == toka) { > + toka += 5 + (toka[4] == ''s''); > + isnode = 1; > + } > + /* Extract a valid (range of) cpu(s) or node(s) */ > + ida = idb = strtoul(toka, &endptr, 10); > if (endptr == toka) { > fprintf(stderr, "Error: Invalid argument.\n"); > rc = EINVAL; > @@ -538,27 +565,48 @@ static int vcpupin_parse(char *cpu, libx > } > if (*endptr == ''-'') { > tokb = endptr + 1; > - cpuidb = strtoul(tokb, &endptr, 10); > - if (endptr == tokb || cpuida > cpuidb) { > + idb = strtoul(tokb, &endptr, 10); > + if (endptr == tokb || ida > idb) { > fprintf(stderr, "Error: Invalid argument.\n"); > rc = EINVAL; > goto vcpp_out; > } > } > - while (cpuida <= cpuidb) { > - rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) : > - libxl_bitmap_set(&exclude_cpumap, cpuida); > - cpuida++; > - } > - } > - > - /* Clear all the cpus from the removal list */ > + while (ida <= idb) { > + if (!isnode) > + isnot == 0 ? libxl_bitmap_set(cpumap, ida) : > + libxl_bitmap_set(&exclude_cpumap, ida); > + else > + isnot == 0 ? 
libxl_bitmap_set(&nodemap, ida) : > + libxl_bitmap_set(&exclude_nodemap, ida); > + ida++; > + } > + } > + > + /* Add the cpus that have been specified via "node:" items */ > + rc = libxl_nodemap_to_cpumap(ctx, &nodemap, &cpu_nodemap); > + if (rc) > + goto vcpp_out; > + libxl_for_each_set_bit(i, cpu_nodemap) { > + libxl_bitmap_set(cpumap, i); > + } > + > + /* Clear all the cpus from the removal cpu and node lists */ > libxl_for_each_set_bit(i, exclude_cpumap) { > libxl_bitmap_reset(cpumap, i); > } > + rc = libxl_nodemap_to_cpumap(ctx, &exclude_nodemap, &cpu_nodemap); > + if (rc) > + goto vcpp_out; > + libxl_for_each_set_bit(i, cpu_nodemap) { > + libxl_bitmap_reset(cpumap, i); > + } > > vcpp_out: > libxl_bitmap_dispose(&exclude_cpumap); > + libxl_bitmap_dispose(&exclude_nodemap); > + libxl_bitmap_dispose(&nodemap); > + libxl_bitmap_dispose(&cpu_nodemap); > > return rc; > } > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Dario Faggioli
2012-Oct-18 22:35 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Thu, 2012-10-18 at 16:30 +0100, George Dunlap wrote:
> On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > Making it possible to use something like the following:
> >  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
> >  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
> >  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
> >  * ...
> >
> > In both the domain config file and `xl vcpu-pin'.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Is it OK if I give this an "I'm OK with the idea but I'm too lazy to
> figure out if the parsing code is doing the right thing" Ack? :-)

Well, it's certainly ok with me... :-)

Actually, the new parsing code is nothing particularly sexy, and the only relevant bit is the one IanC commented about, i.e., checking for a "node:" substring at the beginning of each ','-separated element of the original string. That's it.

The reason why this requires such a lot of '+' lines is that I need a couple of nodemaps in which to store the data, and declaring, initializing, allocating and freeing them requires that. :-(

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-18 23:20 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
> On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > In fact, among placement candidates with the same number of nodes, the
> > closer the various nodes are to each other, the better the performance
> > for a domain placed there.
>
> Looks good overall -- my only worry is the N^2 nature of the
> algorithm.  We're already doing some big combinatorial thing to
> generate the candidates, right?
>
It is, with N being the number of nodes, which we discussed thoroughly
already a couple of months ago, reaching consensus on the fact that N
will stay below 8 for the next 5 (but probably even more) years. :-)

In any case, if something really unexpected happens and N jumps to
anything bigger than 16, the placement algorithm won't even start, and
we'll never reach this point!

Moreover, given the numbers we're playing with, I don't think this
specific patch adds much complexity: we already have the function that
counts the number of vCPUs (as it was for xend) bound to a candidate,
which is Ndoms*Nvcpus, and we're very likely going to have many more
domains than nodes. :-)

> And now we're doing N^2 for each
> candidate?
>
Again, yes, but that turns it from Ndoms*Nvcpus into
Ndoms*Nvcpus+Nnodes^2, which is still dominated by the first term.
IIRC, Andre tried to start >50 domains with 2 vCPUs each on an 8-node
system, which means 50*2 vs 8*8.

> Suppose we get an ARM system with 4096 cores and 128 NUMA
> nodes?  If Xen 4.4 doesn't come out until March 2014, there will still
> be distros using 4.3 through mid-2015.
>
Right, but I really don't think that monster would actually be made of
4096 cores arranged in 128 _NUMA_ nodes on which you run the same
instance of the hypervisor.
I also recall hearing those numbers and the use of the word "node", but
I really think they were referring to a cluster architecture where "a
node" means something more like "a server", each one running its own
copy of Xen (although they'll all be packed together in the same rack,
talking via some super-fast interconnect). I'm pretty sure I remember
Stefano speculating about the need to use some orchestration layer
(like {Cloud,Open}Stack) _within_ those big irons to deal exactly with
that... Stefano, am I talking nonsense? :-D

Finally, allow me to say that the whole placement algorithm already
interacts quite nicely with cpupools. Thus, even in the unlikely event
of an actual 128-node NUMA machine, you can have, say, 16 cpupools with
8 nodes each (or vice versa), and the algorithm will be back to dealing
with _no_more_than_ 8 (or 16) nodes. Yes, right now this would require
someone to manually set up the pools and decide which domain to put
where. However, it would be very, very easy to add, at that point,
something doing this pooling and more coarse placing automatically (and
quickly). In fact, we can even think about having it for 4.3, if you
really believe it's necessary.

> I seem to remember having a discussion about this issue already, but I
> can't remember what the outcome was...
>
Yep, we did, and the outcome was exactly what I tried to summarize
above. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2012-Oct-19 10:03 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
Dario Faggioli writes ("Re: [Xen-devel] [PATCH 1 of 3] libxl: take node distances into account during NUMA placement"):
> On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
> > Looks good overall -- my only worry is the N^2 nature of the
> > algorithm.  We're already doing some big combinatorial thing to
> > generate the candidates, right?
> >
> It is, with N being the number of nodes, which we discussed thoroughly
> already a couple of months ago, and reached consensus on the fact that N
> will stay less than 8 for the next 5 (but probably even more) years. :-)

No, I don't think we did reach consensus about that. It was asserted
but I dissented. I don't think this is a reasonable assumption.

> In any case, if something really unexpected happens, and N jumps to
> anything bigger than 16, the placement algorithm won't even start, and
> we'll never reach this point!

And that is the safety catch you put in because I didn't think the
assumption about numbers of cpus was reasonable.

Ian.
Stefano Stabellini
2012-Oct-19 10:35 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 19 Oct 2012, Dario Faggioli wrote:
> > Suppose we get an ARM system with 4096 cores and 128 NUMA
> > nodes?  If Xen 4.4 doesn't come out until March 2014, there will still
> > be distros using 4.3 through mid-2015.
> >
> Right, but I really don't think that monster is actually made out of
> 4096 cores arranged in 128 _NUMA_ nodes on which you run the same
> instance of the hypervisor.

In fact it does not. It would be 4096 cores arranged in 1024 nodes,
each node completely independent from the others.

> I also recall hearing the numbers and the use of the word "node", but I
> really think they were referring to a cluster architecture where "a
> node" means something more like "a server", each one running their copy
> of Xen (although they'll be packed all together in the same rack,
> talking via some super-fast interconnect).

Exactly right.

> I'm pretty sure I remember Stefano speculating about the need to use
> some orchestration layer (like {Cloud,Open}Stack) _within_ those big
> irons to deal exactly with that... Stefano, am I talking nonsense? :-D

Nope, all true.
Stefano Stabellini
2012-Oct-19 10:39 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 19 Oct 2012, Ian Jackson wrote:
> Dario Faggioli writes ("Re: [Xen-devel] [PATCH 1 of 3] libxl: take node distances into account during NUMA placement"):
> > It is, with N being the number of nodes, which we discussed thoroughly
> > already a couple of months ago, and reached consensus on the fact that N
> > will stay less than 8 for the next 5 (but probably even more) years. :-)
>
> No, I don't think we did reach consensus about that. It was asserted
> but I dissented. I don't think this is a reasonable assumption.

When Dario speaks about consensus in terms of hardware that is going to
reach the market, he really means what Intel and AMD have told us.
Sorry, but your opinion doesn't count that much in the matter of cpu
architectures being produced in the near future, unless you have
already started a secret Californian cpu startup that nobody knows
about yet ;-)
George Dunlap
2012-Oct-19 10:50 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On 19/10/12 00:20, Dario Faggioli wrote:
> On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
>> Looks good overall -- my only worry is the N^2 nature of the
>> algorithm.  We're already doing some big combinatorial thing to
>> generate the candidates, right?
>>
> It is, with N being the number of nodes, which we discussed thoroughly
> already a couple of months ago, and reached consensus on the fact that N
> will stay less than 8 for the next 5 (but probably even more) years. :-)
>
> In any case, if something really unexpected happens, and N jumps to
> anything bigger than 16, the placement algorithm won't even start, and
> we'll never reach this point!

Ah, right -- yes, I think the "safety valve" was the thing I wanted to
have in place. In that case:

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Dario Faggioli
2012-Oct-19 10:56 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 11:39 +0100, Stefano Stabellini wrote:
> On Fri, 19 Oct 2012, Ian Jackson wrote:
> > No, I don't think we did reach consensus about that. It was asserted
> > but I dissented. I don't think this is a reasonable assumption.
>
> When Dario speaks about consensus in terms of hardware that is going to
> reach the market, he really means what Intel and AMD have told us.
>
Indeed. That being said, I really wanted to avoid re-starting that
discussion so, while trying to summarize it as quickly as possible,
I've no problem admitting that "reached consensus" may not have been
the perfect choice of words. My bad.

The whole point is that the solution we have (and that I'm trying to
keep improving with these patches) is both viable on current and
near-future hardware and harmless in the face of any crazy, unexpected
breakthrough in CPU design, thanks to the safety catch IanJ suggested.

Moreover, it is easily amendable, for example by taking more advantage
of cpupools (with which the placement algorithm is already integrated
to some extent), in case the latter happens. This is so true that, as I
already said, I've no problem even starting to think about how to put
it together.
Maybe not starting tomorrow (I'm quite busy with other stuff :-D), but
definitely before 4.3, if we think it's something we couldn't live
without.

> Sorry, but your opinion doesn't count that much in the matter of cpu
> architectures being produced in the near future, unless you have
> already started a secret Californian cpu startup that nobody knows
> about yet ;-)
>
:-D

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-19 11:00 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 11:50 +0100, George Dunlap wrote:
> > In any case, if something really unexpected happens, and N jumps to
> > anything bigger than 16, the placement algorithm won't even start, and
> > we'll never reach this point!
>
> Ah, right -- yes, I think the "safety valve" was the thing I wanted to
> have in place.
>
Both you and Ian Jackson, yes. :-)

> In that case:
>
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
>
Cool, I'll add your Acks and respin. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2012-Oct-19 14:57 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
Dario Faggioli writes ("Re: [Xen-devel] [PATCH 1 of 3] libxl: take node distances into account during NUMA placement"):
> On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
> > And now we're doing N^2 for each
> > candidate?
>
> Again, yes, but that turns it from Ndoms*Nvcpus into
> Ndoms*Nvcpus+Nnodes^2, which is still dominated by the first term.
> IIRC, Andre tried to start >50 domains with 2 vCPUs each on an 8-node
> system, which means 50*2 vs 8*8.

Are you sure about this calculation?  It seems to me that doing
Nnodes^2 for each candidate multiplies the cost by the number of
candidates, rather than adding Nnodes^2.

There are in the worst case nearly 2^Nnodes candidates (combinations
of nodes).  So the cost is
  <= 2^Nnodes * Nnodes^2

Your algorithm runs with up to 16 nodes.  Your change here increases
the worst-case cost from
  2^16 = 65536
to
  2^16 * 16^2 = 16777216

I don't think that's a good idea.

Ian.
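Ian's worst-case figures are easy to reproduce (a quick sanity check in Python, not part of the patch):

```python
from math import comb

N = 16  # the node-count cap above which the placement algorithm refuses to run

# Candidates are the non-empty subsets of the N nodes: 2^N - 1 of them.
candidates = sum(comb(N, k) for k in range(1, N + 1))

# If every candidate paid a full N^2 distance scan, the worst case
# Ian describes would be:
worst_case = 2 ** N * N ** 2
```

This gives 65535 candidates (Ian's "nearly 2^Nnodes") and a worst case of 16777216.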
Dario Faggioli
2012-Oct-19 18:02 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 15:57 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("Re: [Xen-devel] [PATCH 1 of 3] libxl: take node distances into account during NUMA placement"):
> > On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
> > > And now we're doing N^2 for each
> > > candidate?
> >
> > Again, yes, but that turns it from Ndoms*Nvcpus into
> > Ndoms*Nvcpus+Nnodes^2, which is still dominated by the first term.
> > IIRC, Andre tried to start >50 domains with 2 vCPUs each on an 8-node
> > system, which means 50*2 vs 8*8.
>
> Are you sure about this calculation?
>
I am. In fact, I think those figures answer George's question, which
was "by how much are we increasing the complexity of each step?" rather
than "by how much are we increasing the overall complexity?".

That of course doesn't mean I don't care about the overall complexity,
just that I don't think this is making things worse than what we have
(which is, BTW, the result of the long discussion we had pre-4.2
release).

> It seems to me that
> doing Nnodes^2 for each candidate multiplies the cost by the number of
> candidates, rather than adding Nnodes^2.
>
That is definitely true. Again, the key is the difference between
talking about "each candidate", as we were, and the total.

> There are in the worst case nearly 2^Nnodes candidates (combinations
> of nodes).  So the cost is
>   <= 2^Nnodes * Nnodes^2
>
> Your algorithm runs with up to 16 nodes.  Your change here increases
> the worst-case cost from
>   2^16 = 65536
> to
>   2^16 * 16^2 = 16777216
>
Ok, I'm fine with the "let's throw numbers" game, so let me throw
mine[*] before proposing a couple of possible strategies.
:-D

So, the current exact calculation for the number of candidates that are
generated (in the worst case, i.e., when the only suitable candidate
for the domain being created is the very last one evaluated):

  combs=0; N=16;
  for k=1:N
    combs=combs+bincoeff(N,k);
  endfor;
  combs
  combs = 65535

Now, the change about distances adds to each step something that can be
upper bounded by this:

  steps=0; N=16;
  for i=0:N
    for j=1:N-i
      steps++;
    endfor;
  endfor;
  steps
  steps = 136

Hence:

  combs*steps
  ans = 8912760

But this is of course not tight, and a more accurate calculation of the
worst-case overall number of "steps" you're interested in should look
more like this:

  combs=0; N=16;
  for k=1:N
    steps=0;
    for i=0:k
      for j=1:k-i
        steps++;
      endfor;
    endfor;
    combs=combs+bincoeff(N,k)*steps;
  endfor;
  combs
  combs = 2490368

So not exactly 16777216 although, don't get me wrong, I agree it is
huge.

So, now that the math has been taken care of, here come the proposals I
was talking about before.

 1. For this specific issue, I can try to change the way I use the
    distance matrix and the way I define and calculate a measure of how
    distant the nodes in a candidate are from each other, trying to
    make it at least linear in computation time. It's not immediate,
    mainly because it's a matrix after all, but maybe I can save some
    intermediate results during the process and amortize the complexity
    (I've some ideas, but nothing precise enough to share yet). Does
    that make sense to you?

 2. Independently of this specific issue, I think we need something to
    determine where a reasonable limit lies, and whether what we have
    and the changes we will make result in a decent domain creation
    time. That's something very hard to do, but I was thinking about
    writing some sort of 'simulator', i.e., a standalone program that
    calls the placement function and intercepts the libxl_ calls it
    makes, to create and run it in an artificial scenario of choice.
Something like that has already been suggested during one of the last
rounds of review (although for different purposes), and I thought it
was a really good idea... Now I think it is something we definitely
need. :-)

I know it's not going to reach usec-level precision (hypercalls being
simulated by function calls, and stuff like that), but I think it could
be useful... What do you think?

Thanks again for looking into this and Regards,
Dario

[*] copying and pasting the lines into octave should work.

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
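Dario's Octave figures can be cross-checked in Python (the same worst-case accounting as the snippets in his mail, nothing new):

```python
from math import comb

N = 16  # node-count cap for the placement algorithm


def steps(k):
    # Cost of the distance measure on a k-node candidate: the double
    # loop visits k + (k-1) + ... + 1 = k*(k+1)/2 node pairs.
    return k * (k + 1) // 2


# Loose bound: every candidate charged the full 16-node cost (136 steps).
loose = sum(comb(N, k) for k in range(1, N + 1)) * steps(N)

# Tighter bound: each k-node candidate charged only its own cost.
tight = sum(comb(N, k) * steps(k) for k in range(1, N + 1))
```

This reproduces the 8912760 (loose) and 2490368 (tight) figures from the mail, both well below Ian's 16777216 upper bound.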
Dario Faggioli
2012-Oct-21 07:35 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 20:02 +0200, Dario Faggioli wrote:
> I am. In fact, I think those figures answer George's question, which
> was "by how much are we increasing the complexity of each step?" rather
> than "by how much are we increasing the overall complexity?".
>
> That of course doesn't mean I don't care about the overall complexity,
> just that I don't think this is making things worse than what we have
> (which is, BTW, the result of the long discussion we had pre-4.2
> release).
>
> > It seems to me that
> > doing Nnodes^2 for each candidate multiplies the cost by the number of
> > candidates, rather than adding Nnodes^2.
> >
> That is definitely true. Again, the key is the difference between
> talking about "each candidate", as we were, and the total.
>
Which, BTW, makes me think that you don't like the fact that, for each
candidate, we go through the list of domains to check which vcpus are
bound to that candidate. That was requested during review of the
automatic placement series, reviewed itself, tested on an 8-node box
with 50-60 domains, and became cs 4165d71479f9.

Anyway, if you want, I think I can change that too. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)