Dario Faggioli
2012-Oct-19 14:34 UTC
[PATCH 0 of 3 v2] Some small NUMA placement improvements
Hi again,

Take 2 of this series, with the only comment about strstr()-vs-strncmp() in
patch 2 addressed, and all three patches already Acked-by George.

Just as a quick reminder, the series is about:
 - use node distances information during automatic placement (patch 1);
 - let the user specify a minimum and a maximum number of NUMA nodes they
   want the automatic placement to use (patch 2);
 - enhance the syntax of the "cpus=" config switch so that full nodes can
   be specified instead of just single CPUs (patch 3).

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-19 14:34 UTC
[PATCH 1 of 3 v2] libxl: take node distances into account during NUMA placement
In fact, among placement candidates with the same number of nodes, the
closer the various nodes are to each other, the better the performance
for a domain placed there.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -105,6 +105,9 @@ out:
  * - the number of vcpus runnable on the candidates is considered, and
  *   candidates with fewer of them are preferred. If two candidate have
  *   the same number of runnable vcpus,
+ * - the sum of the node distances in the candidates is considered, and
+ *   candidates with smaller total distance are preferred. If total
+ *   distance is the same for the two candidates,
  * - the amount of free memory in the candidates is considered, and the
  *   candidate with greater amount of it is preferred.
  *
@@ -114,6 +117,10 @@ out:
  * overloading large (from a memory POV) nodes. That's right the effect
  * that counting the vcpus able to run on the nodes tries to prevent.
  *
+ * The relative distance between the nodes in a candidate is considered,
+ * as the closer the nodes, the better for the domain ending up on that
+ * candidate.
+ *
  * Note that this completely ignore the number of nodes each candidate span,
  * as the fact that fewer nodes is better is already accounted for in the
  * algorithm.
@@ -124,6 +131,9 @@ static int numa_cmpf(const libxl__numa_c
     if (c1->nr_vcpus != c2->nr_vcpus)
         return c1->nr_vcpus - c2->nr_vcpus;
 
+    if (c1->dists_sum != c2->dists_sum)
+        return c1->dists_sum - c2->dists_sum;
+
     return c2->free_memkb - c1->free_memkb;
 }
 
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -2735,6 +2735,7 @@ static inline void libxl__ctx_unlock(lib
 typedef struct {
     int nr_cpus, nr_nodes;
     int nr_vcpus;
+    int dists_sum;
     uint32_t free_memkb;
     libxl_bitmap nodemap;
 } libxl__numa_candidate;
 
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -218,6 +218,40 @@ static int nodemap_to_nr_vcpus(libxl__gc
     return nr_vcpus;
 }
 
+/* Sum the relative distances of nodes in the nodemap to help find
+ * out which candidate is the "tightest" one. */
+static int nodemap_to_dists_sum(libxl_numainfo *ninfo, libxl_bitmap *nodemap)
+{
+    int tot_dist = 0;
+    int i, j, a = 0, b;
+
+    for (i = 0; i < libxl_bitmap_count_set(nodemap); i++) {
+        while (!libxl_bitmap_test(nodemap, a))
+            a++;
+
+        /* As it is usually non-zero, we do take the latency of
+         * a node to itself into account. */
+        b = a;
+        for (j = 0; j < libxl_bitmap_count_set(nodemap) - i; j++) {
+            while (!libxl_bitmap_test(nodemap, b))
+                b++;
+
+            /*
+             * In most architectures, going from node A to node B costs
+             * exactly as much as going from B to A does. However, let's
+             * not rely on this and consider both contributions, just to
+             * be ready for whatever the future might reserve for us.
+             */
+            tot_dist += ninfo[a].dists[b];
+            tot_dist += ninfo[b].dists[a];
+            b++;
+        }
+        a++;
+    }
+
+    return tot_dist;
+}
+
 /*
  * This function tries to figure out if the host has a consistent number
  * of cpus along all its NUMA nodes. In fact, if that is the case, we can
 
@@ -415,6 +449,7 @@ int libxl__get_numa_candidate(libxl__gc
          */
         libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
         new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
+        new_cndt.dists_sum = nodemap_to_dists_sum(ninfo, &nodemap);
         new_cndt.free_memkb = nodes_free_memkb;
         new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
         new_cndt.nr_cpus = nodes_cpus;
@@ -430,12 +465,14 @@ int libxl__get_numa_candidate(libxl__gc
 
             LOG(DEBUG, "New best NUMA placement candidate found: "
                        "nr_nodes=%d, nr_cpus=%d, nr_vcpus=%d, "
-                       "free_memkb=%"PRIu32"", new_cndt.nr_nodes,
-                       new_cndt.nr_cpus, new_cndt.nr_vcpus,
+                       "dists_sum=%d, free_memkb=%"PRIu32"",
+                       new_cndt.nr_nodes, new_cndt.nr_cpus,
+                       new_cndt.nr_vcpus, new_cndt.dists_sum,
                        new_cndt.free_memkb / 1024);
 
             libxl__numa_candidate_put_nodemap(gc, cndt_out, &nodemap);
             cndt_out->nr_vcpus = new_cndt.nr_vcpus;
+            cndt_out->dists_sum = new_cndt.dists_sum;
             cndt_out->free_memkb = new_cndt.free_memkb;
             cndt_out->nr_nodes = new_cndt.nr_nodes;
             cndt_out->nr_cpus = new_cndt.nr_cpus;
Dario Faggioli
2012-Oct-19 14:34 UTC
[PATCH 2 of 3 v2] libxl, xl: user can ask for min and max nodes to use during placement
And the placement algorithm will stick to that, or fail. This happens by
adding support for "minnodes=" and "maxnodes=" in the domain config file
parser.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -133,6 +133,18 @@ the same time, achieving efficient utili
 
 =back
 
+=item B<minnodes=N>
+
+Tells libxl to place the new domain on at least `N` nodes. This is only
+effective if automatic NUMA placement occurs (i.e., if no `cpus=` option
+is specified).
+
+=item B<maxnodes=M>
+
+Tells libxl to place the new domain on at most `M` nodes. This is only
+effective if automatic NUMA placement occurs (i.e., if no `cpus=` option
+is specified).
+
 =head3 CPU Scheduling
 
 =over 4
 
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault(
 
     libxl_defbool_setdefault(&b_info->numa_placement, true);
 
+    if (!b_info->min_nodes)
+        b_info->min_nodes = 0;
+    if (!b_info->max_nodes)
+        b_info->max_nodes = 0;
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -174,7 +174,8 @@ static int numa_place_domain(libxl__gc *
     /* Find the best candidate with enough free memory and at least
      * as much pcpus as the domain has vcpus.
      */
     rc = libxl__get_numa_candidate(gc, memkb, info->max_vcpus,
-                                   0, 0, &cpupool_info.cpumap,
+                                   info->min_nodes, info->max_nodes,
+                                   &cpupool_info.cpumap,
                                    numa_cmpf, &candidate, &found);
     if (rc)
         goto out;
 
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -262,6 +262,8 @@ libxl_domain_build_info = Struct("domain
     ("avail_vcpus", libxl_bitmap),
     ("cpumap", libxl_bitmap),
     ("numa_placement", libxl_defbool),
+    ("min_nodes", integer),
+    ("max_nodes", integer),
     ("tsc_mode", libxl_tsc_mode),
     ("max_memkb", MemKB),
     ("target_memkb", MemKB),
 
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -731,6 +731,11 @@ static void parse_config_data(const char
         libxl_defbool_set(&b_info->numa_placement, false);
     }
 
+    if (!xlu_cfg_get_long (config, "minnodes", &l, 0))
+        b_info->min_nodes = l;
+    if (!xlu_cfg_get_long (config, "maxnodes", &l, 0))
+        b_info->max_nodes = l;
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
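With the patch applied, bounding the number of nodes the placer may use would look something like the following guest config fragment (hypothetical example; names and values are illustrative). Note that `cpus=` must be absent, since specifying it disables automatic placement and hence these options.

```
# Hypothetical xl guest config: let automatic NUMA placement
# spread the domain over at least 1 and at most 2 nodes.
name     = "numa-test"
memory   = 4096
vcpus    = 8
minnodes = 1
maxnodes = 2
```

Leaving both options at their default of 0 keeps the current behaviour, i.e. no constraint on the candidate size.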
Dario Faggioli
2012-Oct-19 14:34 UTC
[PATCH 3 of 3 v2] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...
In both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

---
Changes since v1:
 * strstr() replaced with strncmp().

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -109,7 +109,7 @@ some cpus on its own (see below). A C<CP
 
 =over 4
 
-=item "all"
+=item "all" (or "nodes:all")
 
 To allow all the vcpus of the guest to run on all the cpus on the host.
 
@@ -117,6 +117,14 @@ To allow all the vcpus of the guest to r
 
 To allow all the vcpus of the guest to run on cpus 0,2,3,5.
 
+=item "nodes:0-3,^node:2"
+
+To allow all the vcpus of the guest to run on the cpus belonging to
+the NUMA nodes 0,1,3 of the host. Notice that it is possible to combine
+this syntax with the one above. That is, something like "1,node:2,^6"
+is possible and means all the vcpus of the guest will run on cpu 1 plus
+on all the cpus of node 2, but never on cpu 6.
+
 =item ["2", "3"] (or [2, 3])
 
 To ask for specific vcpu mapping. That means (in this example), vcpu #0
 
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -506,31 +506,58 @@ static void split_string_into_string_lis
 
 static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
 {
-    libxl_bitmap exclude_cpumap;
-    uint32_t cpuida, cpuidb;
+    libxl_bitmap nodemap, cpu_nodemap;
+    libxl_bitmap exclude_cpumap, exclude_nodemap;
+    uint32_t ida, idb;
     char *endptr, *toka, *tokb, *saveptr = NULL;
-    int i, rc = 0, rmcpu;
-
-    if (!strcmp(cpu, "all")) {
+    int i, rc = 0, isnot, isnode;
+
+    if (!strcmp(cpu, "all") || !strcmp(cpu, "nodes:all")) {
         libxl_bitmap_set_any(cpumap);
         return 0;
     }
 
-    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) {
+    libxl_bitmap_init(&cpu_nodemap);
+    libxl_bitmap_init(&nodemap);
+    libxl_bitmap_init(&exclude_nodemap);
+
+    rc = libxl_node_bitmap_alloc(ctx, &cpu_nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_node_bitmap_alloc(ctx, &nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_node_bitmap_alloc(ctx, &exclude_nodemap, 0);
+    if (rc) {
+        fprintf(stderr, "Error: Failed to allocate nodemap.\n");
+        goto vcpp_out;
+    }
+    rc = libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0);
+    if (rc) {
         fprintf(stderr, "Error: Failed to allocate cpumap.\n");
-        return ENOMEM;
+        goto vcpp_out;
     }
 
     for (toka = strtok_r(cpu, ",", &saveptr); toka;
          toka = strtok_r(NULL, ",", &saveptr)) {
-        rmcpu = 0;
+        isnot = 0; isnode = 0;
         if (*toka == '^') {
-            /* This (These) Cpu(s) will be removed from the map */
+            /* This (These) Cpu(s)/Node(s) will be removed from the map */
             toka++;
-            rmcpu = 1;
-        }
-        /* Extract a valid (range of) cpu(s) */
-        cpuida = cpuidb = strtoul(toka, &endptr, 10);
+            isnot = 1;
+        }
+        /* Check if we're dealing with a full node */
+        if (!strncmp(toka, "node:", 5) || !strncmp(toka, "nodes:", 6)) {
+            toka += 5 + (toka[4] == 's');
+            isnode = 1;
+        }
+        /* Extract a valid (range of) cpu(s) or node(s) */
+        ida = idb = strtoul(toka, &endptr, 10);
         if (endptr == toka) {
             fprintf(stderr, "Error: Invalid argument.\n");
             rc = EINVAL;
@@ -538,27 +565,48 @@ static int vcpupin_parse(char *cpu, libx
         }
         if (*endptr == '-') {
             tokb = endptr + 1;
-            cpuidb = strtoul(tokb, &endptr, 10);
-            if (endptr == tokb || cpuida > cpuidb) {
+            idb = strtoul(tokb, &endptr, 10);
+            if (endptr == tokb || ida > idb) {
                 fprintf(stderr, "Error: Invalid argument.\n");
                 rc = EINVAL;
                 goto vcpp_out;
             }
         }
-        while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
-                         libxl_bitmap_set(&exclude_cpumap, cpuida);
-            cpuida++;
-        }
-    }
-
-    /* Clear all the cpus from the removal list */
+        while (ida <= idb) {
+            if (!isnode)
+                isnot == 0 ? libxl_bitmap_set(cpumap, ida) :
+                             libxl_bitmap_set(&exclude_cpumap, ida);
+            else
+                isnot == 0 ? libxl_bitmap_set(&nodemap, ida) :
+                             libxl_bitmap_set(&exclude_nodemap, ida);
+            ida++;
+        }
+    }
+
+    /* Add the cpus that have been specified via "node:" items */
+    rc = libxl_nodemap_to_cpumap(ctx, &nodemap, &cpu_nodemap);
+    if (rc)
+        goto vcpp_out;
+    libxl_for_each_set_bit(i, cpu_nodemap) {
+        libxl_bitmap_set(cpumap, i);
+    }
+
+    /* Clear all the cpus from the removal cpu and node lists */
     libxl_for_each_set_bit(i, exclude_cpumap) {
         libxl_bitmap_reset(cpumap, i);
     }
+    rc = libxl_nodemap_to_cpumap(ctx, &exclude_nodemap, &cpu_nodemap);
+    if (rc)
+        goto vcpp_out;
+    libxl_for_each_set_bit(i, cpu_nodemap) {
+        libxl_bitmap_reset(cpumap, i);
+    }
 
  vcpp_out:
     libxl_bitmap_dispose(&exclude_cpumap);
+    libxl_bitmap_dispose(&exclude_nodemap);
+    libxl_bitmap_dispose(&nodemap);
+    libxl_bitmap_dispose(&cpu_nodemap);
 
     return rc;
 }
Dario Faggioli
2012-Oct-19 17:00 UTC
Re: [PATCH 1 of 3 v2] libxl: take node distances into account during NUMA placement
On Fri, 2012-10-19 at 16:34 +0200, Dario Faggioli wrote:
> In fact, among placement candidates with the same number of nodes, the
> closer the various nodes are to each others, the better the performances
> for a domain placed there.
>
So, it seems we need to discuss a bit more about this patch... Please,
ignore it for now.

OTOH, patches 2 and 3 can be considered for further review/ack/commit,
even without this one, as they are pretty much unrelated to it (and to
each other). I'd expect them to apply cleanly, but if I need to rebase
and resend, let me know.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Ian Campbell
2012-Nov-12 16:26 UTC
Re: [PATCH 2 of 3 v2] libxl, xl: user can ask for min and max nodes to use during placement
On Fri, 2012-10-19 at 15:34 +0100, Dario Faggioli wrote:
> @@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault(
> 
>     libxl_defbool_setdefault(&b_info->numa_placement, true);
> 
> +    if (!b_info->min_nodes)
> +        b_info->min_nodes = 0;
> +    if (!b_info->max_nodes)
> +        b_info->max_nodes = 0;

Aren't both of these if statements tautologous?

Ian.
Dario Faggioli
2012-Nov-12 16:34 UTC
Re: [PATCH 2 of 3 v2] libxl, xl: user can ask for min and max nodes to use during placement
On Mon, 2012-11-12 at 16:26 +0000, Ian Campbell wrote:
> On Fri, 2012-10-19 at 15:34 +0100, Dario Faggioli wrote:
> > @@ -219,6 +219,11 @@ int libxl__domain_build_info_setdefault(
> > 
> >     libxl_defbool_setdefault(&b_info->numa_placement, true);
> > 
> > +    if (!b_info->min_nodes)
> > +        b_info->min_nodes = 0;
> > +    if (!b_info->max_nodes)
> > +        b_info->max_nodes = 0;
> 
> Aren't both of these if statements tautologous?
>
Looks like they are. I'll double check this before reposting.

I think what I wanted to do was making sure to zero those new fields,
if not otherwise initialized by someone else... and, of course, I got
that wrong! :-P

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)