Dario Faggioli
2012-Oct-16 17:26 UTC
[PATCH 0 of 3] Some small NUMA placement improvements
Hi everyone,

This series implements some small improvements in how NUMA is dealt with at the libxl and xl level, which I had pending in one of my queues. More specifically, it enables actually using node distance information during automatic placement, and it allows users to ask for a minimum and a maximum number of NUMA nodes they want the automatic placement to use. A little less related to _automatic_ placement, it also enhances the syntax of the "cpus=" config switch (as well as that of the `xl vcpu-pin' command), so that full nodes can be specified instead of just single CPUs.

This is all about, on the one hand, continuing to improve the performance of the NUMA placement mechanism and, on the other, starting to give users some knobs to control it.

Comments appreciated, as usual. :-)

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-16 17:26 UTC
[PATCH 1 of 3] libxl: take node distances into account during NUMA placement
In fact, among placement candidates with the same number of nodes, the closer the various nodes are to each other, the better the performance for a domain placed there.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -105,6 +105,9 @@ out:
  * - the number of vcpus runnable on the candidates is considered, and
  *   candidates with fewer of them are preferred. If two candidates have
  *   the same number of runnable vcpus,
+ * - the sum of the node distances in the candidates is considered, and
+ *   candidates with smaller total distance are preferred. If total
+ *   distance is the same for the two candidates,
  * - the amount of free memory in the candidates is considered, and the
  *   candidate with greater amount of it is preferred.
  *
@@ -114,6 +117,10 @@ out:
  * overloading large (from a memory POV) nodes. That's right the effect
  * that counting the vcpus able to run on the nodes tries to prevent.
  *
+ * The relative distance within the nodes in the candidates is considered
+ * as well, as the closer the nodes, the better for the domain ending up
+ * on the candidate.
+ *
  * Note that this completely ignores the number of nodes each candidate
  * spans, as the fact that fewer nodes is better is already accounted for
  * in the algorithm.
@@ -124,6 +131,9 @@ static int numa_cmpf(const libxl__numa_c
     if (c1->nr_vcpus != c2->nr_vcpus)
         return c1->nr_vcpus - c2->nr_vcpus;

+    if (c1->dists_sum != c2->dists_sum)
+        return c1->dists_sum - c2->dists_sum;
+
     return c2->free_memkb - c1->free_memkb;
 }

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2732,6 +2732,7 @@ static inline void libxl__ctx_unlock(lib
 typedef struct {
     int nr_cpus, nr_nodes;
     int nr_vcpus;
+    int dists_sum;
     uint32_t free_memkb;
     libxl_bitmap nodemap;
 } libxl__numa_candidate;
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -218,6 +218,40 @@ static int nodemap_to_nr_vcpus(libxl__gc
     return nr_vcpus;
 }

+/* Sum the relative distances of nodes in the nodemap to help finding
+ * out which candidate is the "tightest" one. */
+static int nodemap_to_dists_sum(libxl_numainfo *ninfo, libxl_bitmap *nodemap)
+{
+    int tot_dist = 0;
+    int i, j, a = 0, b;
+
+    for (i = 0; i < libxl_bitmap_count_set(nodemap); i++) {
+        while (!libxl_bitmap_test(nodemap, a))
+            a++;
+
+        /* As it is usually non-zero, we do take the latency of
+         * a node to itself into account. */
+        b = a;
+        for (j = 0; j < libxl_bitmap_count_set(nodemap) - i; j++) {
+            while (!libxl_bitmap_test(nodemap, b))
+                b++;
+
+            /*
+             * In most architectures, going from node A to node B costs
+             * exactly as much as going from B to A does. However, let's
+             * not rely on this and consider both contributions, just to
+             * be ready for whatever the future might have in store for us.
+             */
+            tot_dist += ninfo[a].dists[b];
+            tot_dist += ninfo[b].dists[a];
+            b++;
+        }
+        a++;
+    }
+
+    return tot_dist;
+}
+
 /*
  * This function tries to figure out if the host has a consistent number
  * of cpus along all its NUMA nodes. In fact, if that is the case, we can
@@ -415,6 +449,7 @@ int libxl__get_numa_candidate(libxl__gc
      */
     libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
     new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
+    new_cndt.dists_sum = nodemap_to_dists_sum(ninfo, &nodemap);
     new_cndt.free_memkb = nodes_free_memkb;
     new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
     new_cndt.nr_cpus = nodes_cpus;
@@ -430,12 +465,14 @@ int libxl__get_numa_candidate(libxl__gc

     LOG(DEBUG, "New best NUMA placement candidate found: "
                "nr_nodes=%d, nr_cpus=%d, nr_vcpus=%d, "
-               "free_memkb=%"PRIu32"", new_cndt.nr_nodes,
-               new_cndt.nr_cpus, new_cndt.nr_vcpus,
+               "dists_sum=%d, free_memkb=%"PRIu32"",
+               new_cndt.nr_nodes, new_cndt.nr_cpus,
+               new_cndt.nr_vcpus, new_cndt.dists_sum,
                new_cndt.free_memkb / 1024);

     libxl__numa_candidate_put_nodemap(gc, cndt_out, &nodemap);
     cndt_out->nr_vcpus = new_cndt.nr_vcpus;
+    cndt_out->dists_sum = new_cndt.dists_sum;
     cndt_out->free_memkb = new_cndt.free_memkb;
     cndt_out->nr_nodes = new_cndt.nr_nodes;
     cndt_out->nr_cpus = new_cndt.nr_cpus;
Dario Faggioli
2012-Oct-16 17:26 UTC
[PATCH 2 of 3] libxl, xl: user can ask for min and max nodes to use during placement
And the placement algorithm will stick to that, or fail. This is done by adding support for "minnodes=" and "maxnodes=" to the domain config file parser.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -133,6 +133,18 @@ the same time, achieving efficient utili

 =back

+=item B<minnodes=N>
+
+Tells libxl to place the new domain on at least `N` nodes. This is only
+effective if automatic NUMA placement occurs (i.e., if no `cpus=` option
+is specified).
+
+=item B<maxnodes=M>
+
+Tells libxl to place the new domain on at most `M` nodes. This is only
+effective if automatic NUMA placement occurs (i.e., if no `cpus=` option
+is specified).
+
 =head3 CPU Scheduling

 =over 4
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault(

     libxl_defbool_setdefault(&b_info->numa_placement, true);

+    if (!b_info->min_nodes)
+        b_info->min_nodes = 0;
+    if (!b_info->max_nodes)
+        b_info->max_nodes = 0;
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -174,7 +174,8 @@ static int numa_place_domain(libxl__gc *
     /* Find the best candidate with enough free memory and at least
      * as many pcpus as the domain has vcpus.
      */
     rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus,
-                                   0, 0, &cpupool_info.cpumap,
+                                   info->min_nodes, info->max_nodes,
+                                   &cpupool_info.cpumap,
                                    numa_cmpf, &candidate, &found);
     if (rc)
         goto out;
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -262,6 +262,8 @@ libxl_domain_build_info = Struct("domain
     ("avail_vcpus", libxl_bitmap),
     ("cpumap", libxl_bitmap),
     ("numa_placement", libxl_defbool),
+    ("min_nodes", integer),
+    ("max_nodes", integer),
     ("tsc_mode", libxl_tsc_mode),
     ("max_memkb", MemKB),
     ("target_memkb", MemKB),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -731,6 +731,11 @@ static void parse_config_data(const char
         libxl_defbool_set(&b_info->numa_placement, false);
     }

+    if (!xlu_cfg_get_long (config, "minnodes", &l, 0))
+        b_info->min_nodes = l;
+    if (!xlu_cfg_get_long (config, "maxnodes", &l, 0))
+        b_info->max_nodes = l;
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
Dario Faggioli
2012-Oct-16 17:26 UTC
[PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...

In both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -109,7 +109,7 @@ some cpus on its own (see below). A C<CP

 =over 4

-=item "all"
+=item "all" (or "nodes:all")

 To allow all the vcpus of the guest to run on all the cpus on the host.

@@ -117,6 +117,14 @@ To allow all the vcpus of the guest to r

 To allow all the vcpus of the guest to run on cpus 0,2,3,5.

+=item "nodes:0-3,^node:2"
+
+To allow all the vcpus of the guest to run on the cpus belonging to
+NUMA nodes 0,1,3 of the host. Notice that it is possible to combine
+this syntax with the one above: something like "1,node:2,^6" is
+possible, and means all the vcpus of the guest will run on cpu 1 plus
+all the cpus of node 2, but never on cpu 6.
+
 =item ["2", "3"] (or [2, 3])

 To ask for specific vcpu mapping. That means (in this example), vcpu #0
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -506,31 +506,58 @@ static void split_string_into_string_lis

 static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
 {
-    libxl_bitmap exclude_cpumap;
-    uint32_t cpuida, cpuidb;
+    libxl_bitmap nodemap, cpu_nodemap;
+    libxl_bitmap exclude_cpumap, exclude_nodemap;
+    uint32_t ida, idb;
     char *endptr, *toka, *tokb, *saveptr = NULL;
-    int i, rc = 0, rmcpu;
-
-    if (!strcmp(cpu, "all")) {
+    int i, rc = 0, isnot, isnode;
+
+    if (!strcmp(cpu, "all") || !strcmp(cpu, "nodes:all")) {
         libxl_bitmap_set_any(cpumap);
         return 0;
     }

-    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) {
+    libxl_bitmap_init(&cpu_nodemap);
+    libxl_bitmap_init(&nodemap);
+    libxl_bitmap_init(&exclude_nodemap);
+    libxl_bitmap_init(&exclude_cpumap);
+
+    rc = libxl_node_bitmap_alloc(ctx, &cpu_nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_node_bitmap_alloc(ctx, &nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_node_bitmap_alloc(ctx, &exclude_nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0);
+    if (rc) {
         fprintf(stderr, "Error: Failed to allocate cpumap.\n");
-        return ENOMEM;
+        goto vcpp_out;
     }

     for (toka = strtok_r(cpu, ",", &saveptr); toka;
          toka = strtok_r(NULL, ",", &saveptr)) {
-        rmcpu = 0;
+        isnot = 0; isnode = 0;
         if (*toka == '^') {
-            /* This (These) Cpu(s) will be removed from the map */
+            /* This (These) Cpu(s)/Node(s) will be removed from the map */
             toka++;
-            rmcpu = 1;
-        }
-        /* Extract a valid (range of) cpu(s) */
-        cpuida = cpuidb = strtoul(toka, &endptr, 10);
+            isnot = 1;
+        }
+        /* Check if we're dealing with a full node */
+        if (strstr(toka, "node:") == toka || strstr(toka, "nodes:") == toka) {
+            toka += 5 + (toka[4] == 's');
+            isnode = 1;
+        }
+        /* Extract a valid (range of) cpu(s) or node(s) */
+        ida = idb = strtoul(toka, &endptr, 10);
         if (endptr == toka) {
             fprintf(stderr, "Error: Invalid argument.\n");
             rc = EINVAL;
@@ -538,27 +565,48 @@ static int vcpupin_parse(char *cpu, libx
         }
         if (*endptr == '-') {
             tokb = endptr + 1;
-            cpuidb = strtoul(tokb, &endptr, 10);
-            if (endptr == tokb || cpuida > cpuidb) {
+            idb = strtoul(tokb, &endptr, 10);
+            if (endptr == tokb || ida > idb) {
                 fprintf(stderr, "Error: Invalid argument.\n");
                 rc = EINVAL;
                 goto vcpp_out;
             }
         }
-        while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
-                         libxl_bitmap_set(&exclude_cpumap, cpuida);
-            cpuida++;
-        }
-    }
-
-    /* Clear all the cpus from the removal list */
+        while (ida <= idb) {
+            if (!isnode)
+                isnot == 0 ? libxl_bitmap_set(cpumap, ida) :
+                             libxl_bitmap_set(&exclude_cpumap, ida);
+            else
+                isnot == 0 ? libxl_bitmap_set(&nodemap, ida) :
+                             libxl_bitmap_set(&exclude_nodemap, ida);
+            ida++;
+        }
+    }
+
+    /* Add the cpus that have been specified via "node:" items */
+    rc = libxl_nodemap_to_cpumap(ctx, &nodemap, &cpu_nodemap);
+    if (rc)
+        goto vcpp_out;
+    libxl_for_each_set_bit(i, cpu_nodemap) {
+        libxl_bitmap_set(cpumap, i);
+    }
+
+    /* Clear all the cpus from the removal cpu and node lists */
     libxl_for_each_set_bit(i, exclude_cpumap) {
         libxl_bitmap_reset(cpumap, i);
     }
+    rc = libxl_nodemap_to_cpumap(ctx, &exclude_nodemap, &cpu_nodemap);
+    if (rc)
+        goto vcpp_out;
+    libxl_for_each_set_bit(i, cpu_nodemap) {
+        libxl_bitmap_reset(cpumap, i);
+    }

 vcpp_out:
     libxl_bitmap_dispose(&exclude_cpumap);
+    libxl_bitmap_dispose(&exclude_nodemap);
+    libxl_bitmap_dispose(&nodemap);
+    libxl_bitmap_dispose(&cpu_nodemap);

     return rc;
 }
Ian Campbell
2012-Oct-18 11:23 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Tue, 2012-10-16 at 18:26 +0100, Dario Faggioli wrote:
> +            isnot = 1;
> +        }
> +        /* Check if we're dealing with a full node */
> +        if (strstr(toka, "node:") == toka || strstr(toka, "nodes:") == toka) {

Isn't strstr(FOO, "thing") == FOO just an expensive way of writing strncmp(FOO, "thing", sizeof(...))?

Ian
Dario Faggioli
2012-Oct-18 13:11 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Thu, 2012-10-18 at 12:23 +0100, Ian Campbell wrote:
> On Tue, 2012-10-16 at 18:26 +0100, Dario Faggioli wrote:
>
> > +            isnot = 1;
> > +        }
> > +        /* Check if we're dealing with a full node */
> > +        if (strstr(toka, "node:") == toka || strstr(toka, "nodes:") == toka) {
>
> Isn't strstr(FOO, "thing") == FOO just an expensive way of writing
> strncmp(FOO, "thing", sizeof(...))?

Mmm... what I wanted was to make sure that "thing" not only is part of FOO, but also sits at the very beginning of that same string. That's why I exploited the fact that strstr() returns a pointer to the beginning of the substring "thing" within FOO.

Now that you mention it, I think I can actually achieve the same with strncmp, limiting the comparison to strlen("thing"). If that is what you meant, the answer is yes, I can do that (and I agree it's probably better).

Thanks for looking into this,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Ian Campbell
2012-Oct-18 13:15 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Thu, 2012-10-18 at 14:11 +0100, Dario Faggioli wrote:
> On Thu, 2012-10-18 at 12:23 +0100, Ian Campbell wrote:
> > Isn't strstr(FOO, "thing") == FOO just an expensive way of writing
> > strncmp(FOO, "thing", sizeof(...))?
>
> Mmm... what I wanted was to make sure that "thing" not only is part of
> FOO, but also sits at the very beginning of that same string. That's why
> I exploited the fact that strstr() returns a pointer to the beginning of
> the substring "thing" within FOO.
>
> Now that you mention it, I think I can actually achieve the same
> with strncmp, limiting the comparison to strlen("thing"). If that is
> what you meant, the answer is yes, I can do that (and I agree it's
> probably better).

Right. With the strstr method, if there are 1000 characters in the input before the "thing", you've wasted a bunch of time scanning for it just to throw it away, whereas with str*cmp you would give up after the first non-matching char.

Ian.
Dario Faggioli
2012-Oct-18 13:18 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Thu, 2012-10-18 at 14:15 +0100, Ian Campbell wrote:
> Right. With the strstr method, if there are 1000 characters in the input
> before the "thing", you've wasted a bunch of time scanning for it just
> to throw it away, whereas with str*cmp you would give up after the first
> non-matching char.

Ok, I'll go for it.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2012-Oct-18 15:17 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> In fact, among placement candidates with the same number of nodes, the > closer the various nodes are to each others, the better the performances > for a domain placed there.Looks good overall -- my only worry is the N^2 nature of the algorithm. We''re already doing some big combinatorial thing to generate the candidates, right? And now we''re doing N^2 for each candidate? Suppose we get an ARM system with 4096 cores and 128 NUMA nodes? If Xen 4.4 doesn''t come out until March 2014, there will still be distros using 4.3 through mid-2015. I seem to remember having a discussion about this issue already, but I can''t remember what the outcome was... -George> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -105,6 +105,9 @@ out: > * - the number of vcpus runnable on the candidates is considered, and > * candidates with fewer of them are preferred. If two candidate have > * the same number of runnable vcpus, > + * - the sum of the node distances in the candidates is considered, and > + * candidates with smaller total distance are preferred. If total > + * distance is the same for the two candidatess, > * - the amount of free memory in the candidates is considered, and the > * candidate with greater amount of it is preferred. > * > @@ -114,6 +117,10 @@ out: > * overloading large (from a memory POV) nodes. That''s right the effect > * that counting the vcpus able to run on the nodes tries to prevent. > * > + * The relative distance within the nodes in the candidates is considered > + * as the closer the nodes, the better for the domain ending up on the > + * candidate. > + * > * Note that this completely ignore the number of nodes each candidate span, > * as the fact that fewer nodes is better is already accounted for in the > * algorithm. 
> @@ -124,6 +131,9 @@ static int numa_cmpf(const libxl__numa_c > if (c1->nr_vcpus != c2->nr_vcpus) > return c1->nr_vcpus - c2->nr_vcpus; > > + if (c1->dists_sum != c2->dists_sum) > + return c1->dists_sum - c2->dists_sum; > + > return c2->free_memkb - c1->free_memkb; > } > > diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h > --- a/tools/libxl/libxl_internal.h > +++ b/tools/libxl/libxl_internal.h > @@ -2732,6 +2732,7 @@ static inline void libxl__ctx_unlock(lib > typedef struct { > int nr_cpus, nr_nodes; > int nr_vcpus; > + int dists_sum; > uint32_t free_memkb; > libxl_bitmap nodemap; > } libxl__numa_candidate; > diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c > --- a/tools/libxl/libxl_numa.c > +++ b/tools/libxl/libxl_numa.c > @@ -218,6 +218,40 @@ static int nodemap_to_nr_vcpus(libxl__gc > return nr_vcpus; > } > > +/* Sum the relative distances of nodes in the nodemap to help finding > + * out which candidate is the "tightest" one. */ > +static int nodemap_to_dists_sum(libxl_numainfo *ninfo, libxl_bitmap *nodemap) > +{ > + int tot_dist = 0; > + int i, j, a = 0, b; > + > + for (i = 0; i < libxl_bitmap_count_set(nodemap); i++) { > + while (!libxl_bitmap_test(nodemap, a)) > + a++; > + > + /* As it is usually non-zero, we do take the latency of > + * of a node to itself into account. */ > + b = a; > + for (j = 0; j < libxl_bitmap_count_set(nodemap) - i; j++) { > + while (!libxl_bitmap_test(nodemap, b)) > + b++; > + > + /* > + * In most architectures, going from node A to node B costs > + * exactly as much as going from B to A does. However, let''s > + * not rely on this and consider both contributions, just to > + * be ready for everything future might reserve for us. > + */ > + tot_dist += ninfo[a].dists[b]; > + tot_dist += ninfo[b].dists[a]; > + b++; > + } > + a++; > + } > + > + return tot_dist; > +} > + > /* > * This function tries to figure out if the host has a consistent number > * of cpus along all its NUMA nodes. 
In fact, if that is the case, we can > @@ -415,6 +449,7 @@ int libxl__get_numa_candidate(libxl__gc > */ > libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap); > new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap); > + new_cndt.dists_sum = nodemap_to_dists_sum(ninfo, &nodemap); > new_cndt.free_memkb = nodes_free_memkb; > new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap); > new_cndt.nr_cpus = nodes_cpus; > @@ -430,12 +465,14 @@ int libxl__get_numa_candidate(libxl__gc > > LOG(DEBUG, "New best NUMA placement candidate found: " > "nr_nodes=%d, nr_cpus=%d, nr_vcpus=%d, " > - "free_memkb=%"PRIu32"", new_cndt.nr_nodes, > - new_cndt.nr_cpus, new_cndt.nr_vcpus, > + "dists_sum=%d, free_memkb=%"PRIu32"", > + new_cndt.nr_nodes, new_cndt.nr_cpus, > + new_cndt.nr_vcpus, new_cndt.dists_sum, > new_cndt.free_memkb / 1024); > > libxl__numa_candidate_put_nodemap(gc, cndt_out, &nodemap); > cndt_out->nr_vcpus = new_cndt.nr_vcpus; > + cndt_out->dists_sum = new_cndt.dists_sum; > cndt_out->free_memkb = new_cndt.free_memkb; > cndt_out->nr_nodes = new_cndt.nr_nodes; > cndt_out->nr_cpus = new_cndt.nr_cpus; > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
George Dunlap
2012-Oct-18 15:21 UTC
Re: [PATCH 2 of 3] libxl, xl: user can ask for min and max nodes to use during placement
On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> And the placement algorithm will stick to that, or fail. This happens > adding support for "minnodes=" and "maxnodes=" in the domain config > file parser. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>Acked-by: George Dunlap <george.dunlap@eu.citrix.com>> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 > --- a/docs/man/xl.cfg.pod.5 > +++ b/docs/man/xl.cfg.pod.5 > @@ -133,6 +133,18 @@ the same time, achieving efficient utili > > =back > > +=item B<minnodes=N> > + > +Tells libxl to place the new domain on at least `N` nodes. This is only > +effective if automatic NUMA placement occurs (i.e., if no `cpus=` option > +is specified). > + > +=item B<maxnodes=M> > + > +Tells libxl to place the new domain on at most `M` nodes. This is only > +effective if automatic NUMA placement occurs (i.e., if no `cpus=` option > +is specified). > + > =head3 CPU Scheduling > > =over 4 > diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c > --- a/tools/libxl/libxl_create.c > +++ b/tools/libxl/libxl_create.c > @@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault( > > libxl_defbool_setdefault(&b_info->numa_placement, true); > > + if (!b_info->min_nodes) > + b_info->min_nodes = 0; > + if (!b_info->max_nodes) > + b_info->max_nodes = 0; > + > if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT) > b_info->max_memkb = 32 * 1024; > if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT) > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -174,7 +174,8 @@ static int numa_place_domain(libxl__gc * > /* Find the best candidate with enough free memory and at least > * as much pcpus as the domain has vcpus. 
*/ > rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus, > - 0, 0, &cpupool_info.cpumap, > + info->min_nodes, info->max_nodes, > + &cpupool_info.cpumap, > numa_cmpf, &candidate, &found); > if (rc) > goto out; > diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl > --- a/tools/libxl/libxl_types.idl > +++ b/tools/libxl/libxl_types.idl > @@ -262,6 +262,8 @@ libxl_domain_build_info = Struct("domain > ("avail_vcpus", libxl_bitmap), > ("cpumap", libxl_bitmap), > ("numa_placement", libxl_defbool), > + ("min_nodes", integer), > + ("max_nodes", integer), > ("tsc_mode", libxl_tsc_mode), > ("max_memkb", MemKB), > ("target_memkb", MemKB), > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -731,6 +731,11 @@ static void parse_config_data(const char > libxl_defbool_set(&b_info->numa_placement, false); > } > > + if (!xlu_cfg_get_long (config, "minnodes", &l, 0)) > + b_info->min_nodes = l; > + if (!xlu_cfg_get_long (config, "maxnodes", &l, 0)) > + b_info->max_nodes = l; > + > if (!xlu_cfg_get_long (config, "memory", &l, 0)) { > b_info->max_memkb = l * 1024; > b_info->target_memkb = b_info->max_memkb; > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
George Dunlap
2012-Oct-18 15:30 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> Making it possible to use something like the following: > * "nodes:0-3": all pCPUs of nodes 0,1,2,3; > * "nodes:0-3,^node:2": all pCPUS of nodes 0,1,3; > * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6; > * ... > > In both domain config file and `xl vcpu-pin''. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>Is it OK if I give this an "I''m OK with the idea but I''m too lazy to figure out if the parsing code is doing the right thing" Ack? :-) -George> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 > --- a/docs/man/xl.cfg.pod.5 > +++ b/docs/man/xl.cfg.pod.5 > @@ -109,7 +109,7 @@ some cpus on its own (see below). A C<CP > > =over 4 > > -=item "all" > +=item "all" (or "nodes:all") > > To allow all the vcpus of the guest to run on all the cpus on the host. > > @@ -117,6 +117,14 @@ To allow all the vcpus of the guest to r > > To allow all the vcpus of the guest to run on cpus 0,2,3,5. > > +=item "nodes:0-3,^node:2" > + > +To allow all the vcpus of the guest to run on the cpus belonging to > +the NUMA nodes 0,1,3 of the host. Notice that it is possible to combine > +this syntax with the one above. That means, something like "1,node:2,^6" > +is possible and means all the vcpus of the guest will run on cpus 1 plus > +on all the cpus of node 2, but never on cpu 6. > + > =item ["2", "3"] (or [2, 3]) > > To ask for specific vcpu mapping. 
That means (in this example), vcpu #0 > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -506,31 +506,58 @@ static void split_string_into_string_lis > > static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) > { > - libxl_bitmap exclude_cpumap; > - uint32_t cpuida, cpuidb; > + libxl_bitmap nodemap, cpu_nodemap; > + libxl_bitmap exclude_cpumap, exclude_nodemap; > + uint32_t ida, idb; > char *endptr, *toka, *tokb, *saveptr = NULL; > - int i, rc = 0, rmcpu; > - > - if (!strcmp(cpu, "all")) { > + int i, rc = 0, isnot, isnode; > + > + if (!strcmp(cpu, "all") || !strcmp(cpu, "nodes:all")) { > libxl_bitmap_set_any(cpumap); > return 0; > } > > - if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) { > + libxl_bitmap_init(&cpu_nodemap); > + libxl_bitmap_init(&nodemap); > + libxl_bitmap_init(&exclude_nodemap); > + libxl_bitmap_init(&exclude_nodemap); > + > + rc = libxl_node_bitmap_alloc(ctx, &cpu_nodemap, 0); > + if (rc) { > + fprintf(stderr, "Error: Failed to allocate nodemap.\n"); > + goto vcpp_out; > + } > + rc = libxl_node_bitmap_alloc(ctx, &nodemap, 0); > + if (rc) { > + fprintf(stderr, "Error: Failed to allocate nodemap.\n"); > + goto vcpp_out; > + } > + rc = libxl_node_bitmap_alloc(ctx, &exclude_nodemap, 0); > + if (rc) { > + fprintf(stderr, "Error: Failed to allocate nodemap.\n"); > + goto vcpp_out; > + } > + rc = libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0); > + if (rc) { > fprintf(stderr, "Error: Failed to allocate cpumap.\n"); > - return ENOMEM; > + goto vcpp_out; > } > > for (toka = strtok_r(cpu, ",", &saveptr); toka; > toka = strtok_r(NULL, ",", &saveptr)) { > - rmcpu = 0; > + isnot = 0; isnode = 0; > if (*toka == ''^'') { > - /* This (These) Cpu(s) will be removed from the map */ > + /* This (These) Cpu(s)/Node(s) will be removed from the map */ > toka++; > - rmcpu = 1; > - } > - /* Extract a valid (range of) cpu(s) */ > - cpuida = cpuidb = strtoul(toka, &endptr, 10); > 
+ isnot = 1; > + } > + /* Check if we''re dealing with a full node */ > + if (strstr(toka, "node:") == toka || strstr(toka, "nodes:") == toka) { > + toka += 5 + (toka[4] == ''s''); > + isnode = 1; > + } > + /* Extract a valid (range of) cpu(s) or node(s) */ > + ida = idb = strtoul(toka, &endptr, 10); > if (endptr == toka) { > fprintf(stderr, "Error: Invalid argument.\n"); > rc = EINVAL; > @@ -538,27 +565,48 @@ static int vcpupin_parse(char *cpu, libx > } > if (*endptr == ''-'') { > tokb = endptr + 1; > - cpuidb = strtoul(tokb, &endptr, 10); > - if (endptr == tokb || cpuida > cpuidb) { > + idb = strtoul(tokb, &endptr, 10); > + if (endptr == tokb || ida > idb) { > fprintf(stderr, "Error: Invalid argument.\n"); > rc = EINVAL; > goto vcpp_out; > } > } > - while (cpuida <= cpuidb) { > - rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) : > - libxl_bitmap_set(&exclude_cpumap, cpuida); > - cpuida++; > - } > - } > - > - /* Clear all the cpus from the removal list */ > + while (ida <= idb) { > + if (!isnode) > + isnot == 0 ? libxl_bitmap_set(cpumap, ida) : > + libxl_bitmap_set(&exclude_cpumap, ida); > + else > + isnot == 0 ? 
libxl_bitmap_set(&nodemap, ida) : > + libxl_bitmap_set(&exclude_nodemap, ida); > + ida++; > + } > + } > + > + /* Add the cpus that have been specified via "node:" items */ > + rc = libxl_nodemap_to_cpumap(ctx, &nodemap, &cpu_nodemap); > + if (rc) > + goto vcpp_out; > + libxl_for_each_set_bit(i, cpu_nodemap) { > + libxl_bitmap_set(cpumap, i); > + } > + > + /* Clear all the cpus from the removal cpu and node lists */ > libxl_for_each_set_bit(i, exclude_cpumap) { > libxl_bitmap_reset(cpumap, i); > } > + rc = libxl_nodemap_to_cpumap(ctx, &exclude_nodemap, &cpu_nodemap); > + if (rc) > + goto vcpp_out; > + libxl_for_each_set_bit(i, cpu_nodemap) { > + libxl_bitmap_reset(cpumap, i); > + } > > vcpp_out: > libxl_bitmap_dispose(&exclude_cpumap); > + libxl_bitmap_dispose(&exclude_nodemap); > + libxl_bitmap_dispose(&nodemap); > + libxl_bitmap_dispose(&cpu_nodemap); > > return rc; > } > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
Dario Faggioli
2012-Oct-18 22:35 UTC
Re: [PATCH 3 of 3] xl: allow for node-wise specification of vcpu pinning
On Thu, 2012-10-18 at 16:30 +0100, George Dunlap wrote:
> On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > Making it possible to use something like the following:
> >  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
> >  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
> >  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
> >  * ...
> >
> > In both the domain config file and `xl vcpu-pin'.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Is it OK if I give this an "I'm OK with the idea but I'm too lazy to
> figure out if the parsing code is doing the right thing" Ack? :-)

Well, it's certainly ok with me... :-)

Actually, the new parsing code is nothing particularly sexy, and the only relevant bit is the one IanC commented about, i.e., checking for a "node:" substring at the beginning of each ','-separated element of the original string. That's it.

The reason why this requires such a lot of '+' lines is that I need a couple of nodemaps in which to store the data, and declaring, initializing, allocating and freeing them requires that. :-(

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-18 23:20 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
> On Tue, Oct 16, 2012 at 6:26 PM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > In fact, among placement candidates with the same number of nodes, the
> > closer the various nodes are to each other, the better the performance
> > for a domain placed there.
>
> Looks good overall -- my only worry is the N^2 nature of the
> algorithm.  We're already doing some big combinatorial thing to
> generate the candidates, right?
>
It is, with N being the number of nodes, which we discussed thoroughly
already a couple of months ago, reaching consensus on the fact that N
will stay below 8 for the next 5 (but probably even more) years. :-)

In any case, if something really unexpected happens and N jumps to
anything bigger than 16, the placement algorithm won't even start, and
we'll never reach this point!

Moreover, given the numbers we're playing with, I don't think this
specific patch adds much complexity: we already have the function that
counts the number of vCPUs (as it was for xend) bound to a candidate,
which is Ndoms*Nvcpus, and we're very likely going to have many more
domains than nodes. :-)

> And now we're doing N^2 for each
> candidate?
>
Again, yes, but that turns it from Ndoms*Nvcpus into
Ndoms*Nvcpus+Nnodes^2, which is still dominated by the first term.
IIRC, Andre tried to start >50 domains with 2 vCPUs each on an 8-node
system, which means 50*2 vs 8*8.

> Suppose we get an ARM system with 4096 cores and 128 NUMA
> nodes?  If Xen 4.4 doesn't come out until March 2014, there will still
> be distros using 4.3 through mid-2015.
>
Right, but I really don't think that monster would actually be made of
4096 cores arranged in 128 _NUMA_ nodes on which you run the same
instance of the hypervisor.
I also recall hearing those numbers and the use of the word "node", but
I really think they were referring to a cluster architecture where "a
node" means something more like "a server", each one running its own
copy of Xen (although they'll all be packed together in the same rack,
talking via some super-fast interconnect). I'm pretty sure I remember
Stefano speculating about the need to use some orchestration layer
(like {Cloud,Open}Stack) _within_ those big irons to deal exactly with
that... Stefano, am I talking nonsense? :-D

Finally, allow me to say that the whole placement algorithm already
interacts quite nicely with cpupools. Thus, even in the unlikely event
of an actual 128-node NUMA machine, you can have, say, 16 cpupools with
8 nodes each (or vice versa), and the algorithm will be back to dealing
with _no_more_than_ 8 (or 16) nodes. Yes, right now this would require
someone to manually set up the pools and decide which domain to put
where. However, it would be very, very easy to add, at that point,
something doing this pooling and more coarse placing automatically (and
quickly). In fact, we can even think about having it for 4.3, if you
really believe it's necessary.

> I seem to remember having a discussion about this issue already, but I
> can't remember what the outcome was...
>
Yep, we did, and the outcome was exactly what I tried to summarize
above. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2012-Oct-19 10:03 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
Dario Faggioli writes ("Re: [Xen-devel] [PATCH 1 of 3] libxl: take node distances into account during NUMA placement"):
> On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
> > Looks good overall -- my only worry is the N^2 nature of the
> > algorithm.  We're already doing some big combinatorial thing to
> > generate the candidates, right?
> >
> It is, with N being the number of nodes, which we discussed thoroughly
> already a couple of months ago, and reached consensus on the fact that N
> will stay less than 8 for the next 5 (but probably even more) years. :-)

No, I don't think we did reach consensus about that. It was asserted
but I dissented. I don't think this is a reasonable assumption.

> In any case, if something really unexpected happens, and N jumps to
> anything bigger than 16, the placement algorithm won't even start, and
> we'll never reach this point!

And that is the safety catch you put in because I didn't think the
assumption about numbers of cpus was reasonable.

Ian.
Stefano Stabellini
2012-Oct-19 10:35 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 19 Oct 2012, Dario Faggioli wrote:
> > Suppose we get an ARM system with 4096 cores and 128 NUMA
> > nodes?  If Xen 4.4 doesn't come out until March 2014, there will still
> > be distros using 4.3 through mid-2015.
> >
> Right, but I really don't think that monster is actually made out of
> 4096 cores arranged in 128 _NUMA_ nodes on which you run the same
> instance of the hypervisor.

In fact it does not. It would be 4096 cores arranged in 1024 nodes,
each node completely independent from the others.

> I also recall hearing the numbers and the use of the word "node", but I
> really think they were referring to a cluster architecture where "a
> node" means something more like "a server", each one running their copy
> of Xen (although they'll be packed all together in the same rack,
> talking via some super-fast interconnect).

Exactly right.

> I'm pretty sure I remember Stefano speculating about the need to use
> some orchestration layer (like {Cloud,Open}Stack) _within_ those big
> irons to deal exactly with that... Stefano, am I talking nonsense? :-D

Nope, all true.
Stefano Stabellini
2012-Oct-19 10:39 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 19 Oct 2012, Ian Jackson wrote:
> Dario Faggioli writes ("Re: [Xen-devel] [PATCH 1 of 3] libxl: take node distances into account during NUMA placement"):
> > It is, with N being the number of nodes, which we discussed thoroughly
> > already a couple of months ago, and reached consensus on the fact that N
> > will stay less than 8 for the next 5 (but probably even more) years. :-)
>
> No, I don't think we did reach consensus about that. It was asserted
> but I dissented. I don't think this is a reasonable assumption.

When Dario speaks about consensus in terms of hardware that is going to
reach the market, he really means what Intel and AMD have told us.
Sorry, but your opinion doesn't count that much in the matter of cpu
architectures being produced in the near future, unless you have
already started a secret Californian cpu startup that nobody knows
about yet ;-)
George Dunlap
2012-Oct-19 10:50 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On 19/10/12 00:20, Dario Faggioli wrote:
> On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
>> Looks good overall -- my only worry is the N^2 nature of the
>> algorithm.  We're already doing some big combinatorial thing to
>> generate the candidates, right?
>>
> It is, with N being the number of nodes, which we discussed thoroughly
> already a couple of months ago, and reached consensus on the fact that N
> will stay less than 8 for the next 5 (but probably even more) years. :-)
>
> In any case, if something really unexpected happens, and N jumps to
> anything bigger than 16, the placement algorithm won't even start, and
> we'll never reach this point!

Ah, right -- yes, I think the "safety valve" was the thing I wanted to
have in place. In that case:

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Dario Faggioli
2012-Oct-19 10:56 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 11:39 +0100, Stefano Stabellini wrote:
> On Fri, 19 Oct 2012, Ian Jackson wrote:
> > No, I don't think we did reach consensus about that. It was asserted
> > but I dissented. I don't think this is a reasonable assumption.
>
> When Dario speaks about consensus in terms of hardware that is going to
> reach the market, he really means what Intel and AMD have told us.
>
Indeed. That being said, I really wanted to avoid re-starting that
discussion so, while trying to summarize it as quickly as possible,
I've no problem admitting that "reached consensus" may not have been
the perfect choice of words. My bad.

The whole point is that the solution we have (and that I'm trying to
keep improving with these patches) is both viable on current and
near-future hardware and harmless in the face of any crazy, unexpected
breakthrough in CPU design, thanks to the safety catch IanJ suggested.

Moreover, it is easily amendable, for example by taking more advantage
of cpupools (with which the placement algorithm is already integrated
to some extent), in case the latter happens. This is so true that, as I
already said, I've no problem even starting to think about how to put
it together.
Maybe not starting tomorrow (I'm quite busy with other stuff :-D), but
definitely before 4.3, if we think it's something we couldn't live
without.

> Sorry, but your opinion doesn't count that much in the matter of cpu
> architectures being produced in the near future, unless you have
> already started a secret Californian cpu startup that nobody knows
> about yet ;-)
>
:-D

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-19 11:00 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 11:50 +0100, George Dunlap wrote:
> > In any case, if something really unexpected happens, and N jumps to
> > anything bigger than 16, the placement algorithm won't even start, and
> > we'll never reach this point!
>
> Ah, right -- yes, I think the "safety valve" was the thing I wanted to
> have in place.
>
Both you and Ian Jackson, yes. :-)

> In that case:
>
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
>
Cool, I'll add your Acks and respin. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2012-Oct-19 14:57 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
Dario Faggioli writes ("Re: [Xen-devel] [PATCH 1 of 3] libxl: take node distances into account during NUMA placement"):
> On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
> > And now we're doing N^2 for each
> > candidate?
>
> Again, yes, but that turns it from Ndoms*Nvcpus into
> Ndoms*Nvcpus+Nnodes^2, which is still dominated by the first term.
> IIRC, Andre tried to start >50 domains with 2 vCPUs each on an 8-node
> system, which means 50*2 vs 8*8.

Are you sure about this calculation?  It seems to me that doing
Nnodes^2 for each candidate multiplies the cost by the number of
candidates, rather than adding Nnodes^2.

There are in the worst case nearly 2^Nnodes candidates (combinations
of nodes).  So the cost is
  <= 2^Nnodes * Nnodes^2

Your algorithm runs with up to 16 nodes.  Your change here increases
the worst-case cost from
  2^16 = 65536
to
  2^16 * 16^2 = 16777216

I don't think that's a good idea.

Ian.
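Ian's worst-case figures are easy to reproduce (a quick sanity check in Python, not part of the patch):

```python
from math import comb

N = 16  # the node-count cap above which the placement algorithm refuses to run

# Candidates are the non-empty subsets of the N nodes: 2^N - 1 of them.
candidates = sum(comb(N, k) for k in range(1, N + 1))

# If every candidate paid a full N^2 distance scan, the worst case
# Ian describes would be:
worst_case = 2 ** N * N ** 2
```

This gives 65535 candidates (Ian's "nearly 2^Nnodes") and a worst case of 16777216.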
Dario Faggioli
2012-Oct-19 18:02 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 15:57 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("Re: [Xen-devel] [PATCH 1 of 3] libxl: take node distances into account during NUMA placement"):
> > On Thu, 2012-10-18 at 16:17 +0100, George Dunlap wrote:
> > > And now we're doing N^2 for each
> > > candidate?
> >
> > Again, yes, but that turns it from Ndoms*Nvcpus into
> > Ndoms*Nvcpus+Nnodes^2, which is still dominated by the first term.
> > IIRC, Andre tried to start >50 domains with 2 vCPUs each on an 8-node
> > system, which means 50*2 vs 8*8.
>
> Are you sure about this calculation?
>
I am. In fact, I think those figures answer George's question, which
was "by how much are we increasing the complexity of each step?" rather
than "by how much are we increasing the overall complexity?".

That of course doesn't mean I don't care about the overall complexity,
just that I don't think this is making things worse than what we have
(which is, BTW, the result of the long discussion we had pre-4.2
release).

> It seems to me that
> doing Nnodes^2 for each candidate multiplies the cost by the number of
> candidates, rather than adding Nnodes^2.
>
That is definitely true. Again, the key is the difference between
talking about "each candidate", as we were, and the total.

> There are in the worst case nearly 2^Nnodes candidates (combinations
> of nodes).  So the cost is
>   <= 2^Nnodes * Nnodes^2
>
> Your algorithm runs with up to 16 nodes.  Your change here increases
> the worst-case cost from
>   2^16 = 65536
> to
>   2^16 * 16^2 = 16777216
>
Ok, I'm fine with the "let's throw numbers" game, so let me throw
mine[*] before proposing a couple of possible strategies.
:-D

So, the current exact calculation for the number of candidates that are
generated (in the worst case, i.e., when the only suitable candidate
for the domain being created is the very last one evaluated):

  combs=0; N=16;
  for k=1:N
    combs=combs+bincoeff(N,k);
  endfor;
  combs
  combs = 65535

Now, the change about distances adds to each step something that can be
upper bounded by this:

  steps=0; N=16;
  for i=0:N
    for j=1:N-i
      steps++;
    endfor;
  endfor;
  steps
  steps = 136

Hence:

  combs*steps
  ans = 8912760

But this is of course not tight, and a more accurate calculation of the
worst-case overall number of "steps" you're interested in should look
more like this:

  combs=0; N=16;
  for k=1:N
    steps=0;
    for i=0:k
      for j=1:k-i
        steps++;
      endfor;
    endfor;
    combs=combs+bincoeff(N,k)*steps;
  endfor;
  combs
  combs = 2490368

So not exactly 16777216 although, don't get me wrong, I agree it is
huge.

So, now that the math has been taken care of, here come the proposals I
was talking about before.

 1. For this specific issue, I can try to change the way I use the
    distance matrix and the way I define and calculate a measure of how
    distant the nodes in a candidate are from each other, trying to
    make it at least linear in computation time. It's not immediate,
    mainly because it's a matrix after all, but maybe I can save some
    intermediate results during the process and amortize the complexity
    (I've some ideas, but nothing precise enough to share yet). Does
    that make sense to you?

 2. Independently of this specific issue, I think we need something to
    determine where a reasonable limit lies, and whether what we have
    and the changes we will make result in a decent domain creation
    time. That's something very hard to do, but I was thinking about
    writing some sort of 'simulator', i.e., a standalone program that
    calls the placement function and intercepts the libxl_ calls it
    makes, to create and run it in an artificial scenario of choice.
Something like that has already been suggested during one of the last
rounds of review (although for different purposes), and I thought it
was a really good idea... Now I think it is something we definitely
need. :-)

I know it's not going to reach usec-level precision (hypercalls being
simulated by function calls, and stuff like that), but I think it could
be useful... What do you think?

Thanks again for looking into this and Regards,
Dario

[*] copying and pasting the lines into octave should work.

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
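Dario's Octave figures can be cross-checked in Python (the same worst-case accounting as the snippets in his mail, nothing new):

```python
from math import comb

N = 16  # node-count cap for the placement algorithm


def steps(k):
    # Cost of the distance measure on a k-node candidate: the double
    # loop visits k + (k-1) + ... + 1 = k*(k+1)/2 node pairs.
    return k * (k + 1) // 2


# Loose bound: every candidate charged the full 16-node cost (136 steps).
loose = sum(comb(N, k) for k in range(1, N + 1)) * steps(N)

# Tighter bound: each k-node candidate charged only its own cost.
tight = sum(comb(N, k) * steps(k) for k in range(1, N + 1))
```

This reproduces the 8912760 (loose) and 2490368 (tight) figures from the mail, both well below Ian's 16777216 upper bound.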
Dario Faggioli
2012-Oct-21 07:35 UTC
Re: [PATCH 1 of 3] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 20:02 +0200, Dario Faggioli wrote:
> I am. In fact, I think those figures answer George's question, which
> was "by how much are we increasing the complexity of each step?" rather
> than "by how much are we increasing the overall complexity?".
>
> That of course doesn't mean I don't care about the overall complexity,
> just that I don't think this is making things worse than what we have
> (which is, BTW, the result of the long discussion we had pre-4.2
> release).
>
> > It seems to me that
> > doing Nnodes^2 for each candidate multiplies the cost by the number of
> > candidates, rather than adding Nnodes^2.
> >
> That is definitely true. Again, the key is the difference between
> talking about "each candidate", as we were, and the total.
>
Which, BTW, makes me think that you don't like the fact that, for each
candidate, we go through the list of domains to check which vcpus are
bound to that candidate. That was requested during review of the
automatic placement series, reviewed itself, tested on an 8-node box
with 50-60 domains, and became cs 4165d71479f9.

Anyway, if you want, I think I can change that too. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)