Dario Faggioli
2013-Oct-03 17:45 UTC
[PATCH 00/15] Implement per-vcpu NUMA node-affinity for credit1
Hi everyone,

So, this series introduces the concept of per-vcpu NUMA node-affinity. Up to now, node-affinity has only been "per-domain". That means it was the domain that had a node-affinity, and:

 - that node-affinity was used to decide where to allocate the memory for the domain;
 - that node-affinity was used to decide on what nodes _all_ the vcpus of the domain prefer to be scheduled.

After this series, this changes as follows:

 - each vcpu of a domain has (well, may have) its own node-affinity, and that is what is used to determine (if the credit1 scheduler is used) where each specific vcpu prefers to run;
 - the node-affinity of the whole domain is the _union_ of all the node-affinities of the domain's vcpus;
 - the memory is still allocated according to what the node-affinity of the whole domain (so, the union of vcpu node-affinities, as said above) says.

In practice, it's not such a big change: I'm just extending to the per-vcpu level what we already had at the domain level. This also makes node-affinity a lot more similar to vcpu-pinning, both in terms of functioning and user interface. As a side effect, it simplifies the scheduling code (at least the NUMA-aware part) by quite a bit.

Finally, and most important, this is something that will become really important when we start to support virtual NUMA topologies since, at that point, having the same node-affinity for all the vcpus in a domain won't be enough any longer (we'll want the vcpus from a particular vnode to have node-affinity with a particular pnode).

A more detailed description of the mechanism and of the implementation choices is provided in the changelogs and in the documentation (docs/misc and manpages).

One last thing: this series relies on some other patches and series that I have already sent to xen-devel but that have not been applied yet. I'm re-sending them here, as part of this series, so feel free to pick them up from here, if you want to apply them, or comment on them in this thread, if you want me to change them. In particular, patches 01 and 03 I already sent as single patches, and patches 04-07 I already sent as a series. Sorry if that is a bit clumsy, but I couldn't find a better way to do it. :-)

In the detailed list of patches below, 'x' means previously submitted, '*' means already acked/reviewed-by.

Finally, Elena, this is not super important, but perhaps, in the next release of your vNUMA series, you could try to integrate it with this (and, of course, ask if you need anything while trying to do that). Matt, if/when you eventually release your HVM vNUMA series, even as an RFC or something like that, we can try to figure out how to integrate it with this, so as to use node-affinity instead of pinning.

The series is also available at the following git coordinates:

 git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v1
 http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/numa/per-vcpu-affinity-v1

Let me know what you think about all this.

Regards,
Dario

PS. Some of you probably received part of this series as a direct message (i.e., with your address in 'To', rather than in 'Cc'). I'm sincerely sorry for that, I messed up with `stg mail'... Won't happen again, I promise! :-P
---

Dario Faggioli (15):
  x *  xl: update the manpage about "cpus=" and NUMA node-affinity
       xl: fix a typo in main_vcpulist()
  x *  xen: numa-sched: leave node-affinity alone if not in "auto" mode
  x *  libxl: introduce libxl_node_to_cpumap
  x    xl: allow for node-wise specification of vcpu pinning
  x *  xl: implement and enable dryrun mode for `xl vcpu-pin'
  x    xl: test script for the cpumap parser (for vCPU pinning)
       xen: numa-sched: make space for per-vcpu node-affinity
       xen: numa-sched: domain node-affinity always comes from vcpu node-affinity
       xen: numa-sched: use per-vcpu node-affinity for actual scheduling
       xen: numa-sched: enable getting/specifying per-vcpu node-affinity
       libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
       libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
       xl: numa-sched: enable getting/specifying per-vcpu node-affinity
       xl: numa-sched: enable specifying node-affinity in VM config file

 docs/man/xl.cfg.pod.5                           |   88 ++++
 docs/man/xl.pod.1                               |   25 +
 docs/misc/xl-numa-placement.markdown            |  124 ++++--
 tools/libxc/xc_domain.c                         |   90 ++++-
 tools/libxc/xenctrl.h                           |   19 +
 tools/libxl/check-xl-vcpupin-parse              |  294 +++++++++++++++
 tools/libxl/check-xl-vcpupin-parse.data-example |   53 +++
 tools/libxl/libxl.c                             |   28 +
 tools/libxl/libxl.h                             |   11 +
 tools/libxl/libxl_dom.c                         |   18 +
 tools/libxl/libxl_numa.c                        |   14 -
 tools/libxl/libxl_types.idl                     |    1
 tools/libxl/libxl_utils.c                       |   22 +
 tools/libxl/libxl_utils.h                       |   15 +
 tools/libxl/xl.h                                |    1
 tools/libxl/xl_cmdimpl.c                        |  458 +++++++++++++++++++----
 tools/libxl/xl_cmdtable.c                       |   11 -
 xen/common/domain.c                             |   97 ++---
 xen/common/domctl.c                             |   47 ++
 xen/common/keyhandler.c                         |    6
 xen/common/sched_credit.c                       |   63 ---
 xen/common/schedule.c                           |   55 +++
 xen/include/public/domctl.h                     |    8
 xen/include/xen/sched-if.h                      |    2
 xen/include/xen/sched.h                         |   13 +
 xen/xsm/flask/hooks.c                           |    2
 26 files changed, 1282 insertions(+), 283 deletions(-)
 create mode 100755 tools/libxl/check-xl-vcpupin-parse
 create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
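Before getting to the individual patches, the core semantic change (the domain node-affinity being the union of the per-vcpu node-affinities) can be summarised with the small standalone model below. This is not hypervisor code: cpumasks and nodemasks are plain integers, and a made-up topology of 2 NUMA nodes with 4 cpus each is assumed; it only illustrates the derivation described above.

#include <stdio.h>

#define NR_NODES       2
#define CPUS_PER_NODE  4

static int cpu_to_node(int cpu) { return cpu / CPUS_PER_NODE; }

int main(void)
{
    /* Per-vcpu node-affinity, kept (as in the series) as a mask of cpus:
     * vcpu0 prefers the cpus of node 0 (0-3), vcpu1 those of node 1 (4-7). */
    unsigned vcpu_node_affinity[2] = { 0x0f, 0xf0 };
    unsigned cpumask = 0;              /* union of all the vcpus' masks    */
    unsigned domain_node_affinity = 0; /* resulting nodemask of the domain */
    int i, cpu;

    for (i = 0; i < 2; i++)
        cpumask |= vcpu_node_affinity[i];

    /* The domain's node-affinity (i.e., where its memory comes from) is
     * just the set of nodes spanned by that union. */
    for (cpu = 0; cpu < NR_NODES * CPUS_PER_NODE; cpu++)
        if (cpumask & (1u << cpu))
            domain_node_affinity |= 1u << cpu_to_node(cpu);

    printf("domain node-affinity = 0x%x (i.e., both nodes)\n",
           domain_node_affinity);
    return 0;
}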
Dario Faggioli
2013-Oct-03 17:45 UTC
[PATCH 01/15] xl: update the manpage about "cpus=" and NUMA node-affinity
Since d06b1bf169a01a9c7b0947d7825e58cb455a0ba5 ('libxl: automatic placement deals with node-affinity') it is no longer true that, if no "cpus=" option is specified, xl picks up some pCPUs by default and pins the domain there. In fact, it is the NUMA node-affinity that is affected by automatic placement, not vCPU to pCPU pinning.

Update the xl config file documentation accordingly, as it seems to have been forgotten at that time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning") as the changes in that and in this series would otherwise conflict. If this is considered fine, feel free to apply it from here and skip the corresponding e-mail in the original submission.
---
 docs/man/xl.cfg.pod.5 |   22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index d57cd4d..cd96969 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -104,8 +104,8 @@ created online and the remainder will be offline.
 
 =item B<cpus="CPU-LIST">
 
-List of which cpus the guest is allowed to use. By default xl will pick
-some cpus on its own (see below). A C<CPU-LIST> may be specified as follows:
+List of which cpus the guest is allowed to use. Default is no pinning at
+all (more on this below). A C<CPU-LIST> may be specified as follows:
 
 =over 4
 
@@ -125,11 +125,19 @@ run on cpu #3 of the host.
 
 =back
 
-If this option is not specified, libxl automatically tries to place the new
-domain on the host's NUMA nodes (provided the host has more than one NUMA
-node) by pinning it to the cpus of those nodes. A heuristic approach is
-utilized with the goals of maximizing performance for the domain and, at
-the same time, achieving efficient utilization of the host's CPUs and RAM.
+If this option is not specified, no vcpu to cpu pinning is established,
+and the vcpus of the guest can run on all the cpus of the host.
+
+If we are on a NUMA machine (i.e., if the host has more than one NUMA
+node) and this option is not specified, libxl automatically tries to
+place the guest on the least possible number of nodes. That, however,
+will not affect vcpu pinning, so the guest will still be able to run on
+all the cpus, it will just prefer the ones from the node it has been
+placed on. A heuristic approach is used for choosing the best node (or
+set of nodes), with the goals of maximizing performance for the guest
+and, at the same time, achieving efficient utilization of host cpus
+and memory. See F<docs/misc/xl-numa-placement.markdown> for more
+details.
 
 =back
Dario Faggioli
2013-Oct-03 17:45 UTC
[PATCH 02/15] xl: fix a typo in main_vcpulist()
which was preventing `xl vcpu-list -h' from working.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 tools/libxl/xl_cmdimpl.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 642b130..1fe6caa 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4465,7 +4465,7 @@ int main_vcpulist(int argc, char **argv)
 {
     int opt;
 
-    SWITCH_FOREACH_OPT(opt, "", NULL, "cpu-list", 0) {
+    SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-list", 0) {
        /* No options */
    }
Dario Faggioli
2013-Oct-03 17:45 UTC
[PATCH 03/15] xen: numa-sched: leave node-affinity alone if not in "auto" mode
If the domain''s NUMA node-affinity is being specified by the user/toolstack (instead of being automatically computed by Xen), we really should stick to that. This means domain_update_node_affinity() is wrong when it filters out some stuff from there even in "!auto" mode. This commit fixes that. Of course, this does not mean node-affinity is always honoured (e.g., a vcpu won''t run on a pcpu of a different cpupool) but the necessary logic for taking into account all the possible situations lives in the scheduler code, where it belongs. What could happen without this change is that, under certain circumstances, the node-affinity of a domain may change when the user modifies the vcpu-affinity of the domain''s vcpus. This, even if probably not a real bug, is at least something the user does not expect, so let''s avoid it. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> --- This has been submitted already as a single patch on its own. Since this series needs the change done here, just include it in here, instead of pinging the original submission and deferring posting this series. --- xen/common/domain.c | 28 +++++++++------------------- 1 file changed, 9 insertions(+), 19 deletions(-) diff --git a/xen/common/domain.c b/xen/common/domain.c index 5999779..af31ab4 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -352,7 +352,6 @@ void domain_update_node_affinity(struct domain *d) cpumask_var_t cpumask; cpumask_var_t online_affinity; const cpumask_t *online; - nodemask_t nodemask = NODE_MASK_NONE; struct vcpu *v; unsigned int node; @@ -374,28 +373,19 @@ void domain_update_node_affinity(struct domain *d) cpumask_or(cpumask, cpumask, online_affinity); } + /* + * If d->auto_node_affinity is true, the domain''s node-affinity mask + * (d->node_affinity) is automaically computed from all the domain''s + * vcpus'' vcpu-affinity masks (the union of which we have just built + * above in cpumask). OTOH, if d->auto_node_affinity is false, we + * must leave the node-affinity of the domain alone. + */ if ( d->auto_node_affinity ) { - /* Node-affinity is automaically computed from all vcpu-affinities */ + nodes_clear(d->node_affinity); for_each_online_node ( node ) if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) - node_set(node, nodemask); - - d->node_affinity = nodemask; - } - else - { - /* Node-affinity is provided by someone else, just filter out cpus - * that are either offline or not in the affinity of any vcpus. */ - nodemask = d->node_affinity; - for_each_node_mask ( node, d->node_affinity ) - if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) ) - node_clear(node, nodemask);//d->node_affinity); - - /* Avoid loosing track of node-affinity because of a bad - * vcpu-affinity has been specified. */ - if ( !nodes_empty(nodemask) ) - d->node_affinity = nodemask; + node_set(node, d->node_affinity); } sched_set_node_affinity(d, &d->node_affinity);
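In other words, after this patch the update logic boils down to the following. This is only a standalone model (plain C, masks as integers, an invented 2-node, 4-cpus-per-node topology), not the domain_update_node_affinity() code above:

#include <stdio.h>

#define NR_NODES       2
#define CPUS_PER_NODE  4

static unsigned node_to_cpumask(int node)
{
    return 0xfu << (node * CPUS_PER_NODE);
}

/* 'cpumask' is the union of the (online) vcpu-affinities of all the
 * domain's vcpus; 'node_affinity' is the current domain nodemask. */
static unsigned update_node_affinity(int auto_affinity,
                                     unsigned node_affinity, unsigned cpumask)
{
    if (auto_affinity) {
        /* "auto" mode: recompute the nodemask from the vcpu-affinities... */
        unsigned nodemask = 0;
        int node;

        for (node = 0; node < NR_NODES; node++)
            if (node_to_cpumask(node) & cpumask)
                nodemask |= 1u << node;
        return nodemask;
    }

    /* ...while in "!auto" mode whatever the user/toolstack asked for is
     * left completely alone: no more filtering against vcpu-affinity. */
    return node_affinity;
}

int main(void)
{
    /* All vcpus pinned to cpus 0-3 (node 0); manual node-affinity = node 1. */
    printf("auto:   0x%x\n", update_node_affinity(1, 0x3, 0x0f)); /* 0x1 */
    printf("manual: 0x%x\n", update_node_affinity(0, 0x2, 0x0f)); /* 0x2 */
    return 0;
}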
Dario Faggioli
2013-Oct-03 17:45 UTC
[PATCH 04/15] libxl: introduce libxl_node_to_cpumap
As a helper for the special case (of libxl_nodemap_to_cpumap) where one wants the cpumap for just one node.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning") as the changes in that and in this series would otherwise conflict. If this is considered fine, feel free to apply it from here and skip the corresponding e-mail in the original submission.
---
 tools/libxl/libxl_utils.c |   22 ++++++++++++++++++++++
 tools/libxl/libxl_utils.h |    3 +++
 2 files changed, 25 insertions(+)

diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
index 4309e5e..5c125e6 100644
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -668,6 +668,28 @@ int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
     return rc;
 }
 
+int libxl_node_to_cpumap(libxl_ctx *ctx, int node,
+                         libxl_bitmap *cpumap)
+{
+    libxl_bitmap nodemap;
+    int rc = 0;
+
+    libxl_bitmap_init(&nodemap);
+
+    rc = libxl_node_bitmap_alloc(ctx, &nodemap, 0);
+    if (rc)
+        goto out;
+
+    libxl_bitmap_set_none(&nodemap);
+    libxl_bitmap_set(&nodemap, node);
+
+    rc = libxl_nodemap_to_cpumap(ctx, &nodemap, cpumap);
+
+ out:
+    libxl_bitmap_dispose(&nodemap);
+    return rc;
+}
+
 int libxl_cpumap_to_nodemap(libxl_ctx *ctx,
                             const libxl_bitmap *cpumap,
                             libxl_bitmap *nodemap)
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
index a430362..7b84e6a 100644
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -129,6 +129,9 @@ static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx,
 int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
                             const libxl_bitmap *nodemap,
                             libxl_bitmap *cpumap);
+/* Populate cpumap with the cpus spanned by node */
+int libxl_node_to_cpumap(libxl_ctx *ctx, int node,
+                         libxl_bitmap *cpumap);
 /* Populate nodemap with the nodes of the cpus in cpumap */
 int libxl_cpumap_to_nodemap(libxl_ctx *ctx,
                             const libxl_bitmap *cpuemap,
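For completeness, this is how a caller could use the new helper. It is only a sketch: print_node_cpus() is a hypothetical function name, 'ctx' is assumed to be an already initialized libxl context, and most error handling is omitted.

#include <stdio.h>

#include <libxl.h>
#include <libxl_utils.h>

/* Print the pcpus spanned by one NUMA node. */
static void print_node_cpus(libxl_ctx *ctx, int node)
{
    libxl_bitmap cpumap;
    int i;

    libxl_bitmap_init(&cpumap);
    if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) {
        fprintf(stderr, "failed to allocate the cpumap\n");
        return;
    }

    if (!libxl_node_to_cpumap(ctx, node, &cpumap)) {
        printf("node %d:", node);
        libxl_for_each_set_bit(i, cpumap)
            printf(" %d", i);
        printf("\n");
    }

    libxl_bitmap_dispose(&cpumap);
}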
Dario Faggioli
2013-Oct-03 17:46 UTC
[PATCH 05/15] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following: * "nodes:0-3": all pCPUs of nodes 0,1,2,3; * "nodes:0-3,^node:2": all pCPUS of nodes 0,1,3; * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6; * ... In both domain config file and `xl vcpu-pin''. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning") as the changes in that and in this series would otherwise be conflicting. If this is considered fine, Feel free to apply it from here and skip the corresponding e-mail in the original submission. --- Changes from v2: * turned a ''return'' into ''goto out'', consistently with the most of exit patterns; * harmonized error handling: now parse_range() return a libxl error code, as requested during review; * dealing with "all" moved inside update_cpumap_range(). It''s tricky to move it in parse_range() (as requested during review), since we need the cpumap being modified handy when dealing with it. However, having it in update_cpumap_range() simplifies the code just as much as that; * explicitly checking for junk after a valid value or range in parse_range(), as requested during review; * xl exits on parsing failing, so no need to reset the cpumap to something sensible in vcpupin_parse(), as suggested during review; Changes from v1: * code rearranged in order to look more simple to follow and understand, as requested during review; * improved docs in xl.cfg.pod.5, as requested during review; * strtoul() now returns into unsigned long, and the case where it returns ULONG_MAX is now taken into account, as requested during review; * stuff like "all,^7" now works, as requested during review. Specifying just "^7" does not work either before or after this change * killed some magic (i.e., `ptr += 5 + (ptr[4] == ''s''`) by introducing STR_SKIP_PREFIX() macro, as requested during review. --- docs/man/xl.cfg.pod.5 | 20 ++++++ tools/libxl/xl_cmdimpl.c | 145 ++++++++++++++++++++++++++++++++-------------- 2 files changed, 121 insertions(+), 44 deletions(-) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index cd96969..e63ee04 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -115,7 +115,25 @@ To allow all the vcpus of the guest to run on all the cpus on the host. =item "0-3,5,^1" -To allow all the vcpus of the guest to run on cpus 0,2,3,5. +To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining +this with "all" is possible, meaning "all,^7" results in all the vcpus +of the guest running on all the cpus on the host except cpu 7. + +=item "nodes:0-3,node:^2" + +To allow all the vcpus of the guest to run on the cpus from NUMA nodes +0,1,3 of the host. So, if cpus 0-3 belongs to node 0, cpus 4-7 belongs +to node 1 and cpus 8-11 to node 3, the above would mean all the vcpus +of the guest will run on cpus 0-3,8-11. + +Combining this notation with the one above is possible. For instance, +"1,node:2,^6", means all the vcpus of the guest will run on cpu 1 and +on all the cpus of NUMA node 2, but not on cpu 6. Following the same +example as above, that would be cpus 1,4,5,7. + +Combining this with "all" is also possible, meaning "all,^nodes:1" +results in all the vcpus of the guest running on all the cpus on the +host, except for the cpus belonging to the host NUMA node 1. 
=item ["2", "3"] (or [2, 3]) diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 1fe6caa..5ece4af 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -59,6 +59,11 @@ } \ }) +#define STR_HAS_PREFIX( a, b ) \ + ( strncmp(a, b, strlen(b)) == 0 ) +#define STR_SKIP_PREFIX( a, b ) \ + ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL ) + int logfile = 2; @@ -481,61 +486,115 @@ static void split_string_into_string_list(const char *str, free(s); } -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) +static int parse_range(const char *str, unsigned long *a, unsigned long *b) +{ + char *nstr, *endptr; + + *a = *b = strtoul(str, &endptr, 10); + if (endptr == str || *a == ULONG_MAX) + return ERROR_INVAL; + + if (*endptr == ''-'') { + nstr = endptr + 1; + + *b = strtoul(nstr, &endptr, 10); + if (endptr == nstr || *b == ULONG_MAX || *b < *a) + return ERROR_INVAL; + } + + /* Valid value or range so far, but we also don''t want junk after that */ + if (*endptr != ''\0'') + return ERROR_INVAL; + + return 0; +} + +/* + * Add or removes a specific set of cpus (specified in str, either as + * single cpus or as entire NUMA nodes) to/from cpumap. + */ +static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) { - libxl_bitmap exclude_cpumap; - uint32_t cpuida, cpuidb; - char *endptr, *toka, *tokb, *saveptr = NULL; - int i, rc = 0, rmcpu; + unsigned long ida, idb; + libxl_bitmap node_cpumap; + bool is_not = false, is_nodes = false; + int rc = 0; + + libxl_bitmap_init(&node_cpumap); - if (!strcmp(cpu, "all")) { + rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0); + if (rc) { + fprintf(stderr, "libxl_node_bitmap_alloc failed.\n"); + goto out; + } + + /* Are we adding or removing cpus/nodes? */ + if (STR_SKIP_PREFIX(str, "^")) { + is_not = true; + } + + /* Are we dealing with cpus or full nodes? */ + if (STR_SKIP_PREFIX(str, "nodes:")) { + is_nodes = true; + } + + if (STR_HAS_PREFIX(str, "all")) { libxl_bitmap_set_any(cpumap); - return 0; + goto out; } - if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) { - fprintf(stderr, "Error: Failed to allocate cpumap.\n"); - return ENOMEM; + rc = parse_range(str, &ida, &idb); + if (rc) { + fprintf(stderr, "Invalid pcpu range: %s.\n", str); + goto out; } - for (toka = strtok_r(cpu, ",", &saveptr); toka; - toka = strtok_r(NULL, ",", &saveptr)) { - rmcpu = 0; - if (*toka == ''^'') { - /* This (These) Cpu(s) will be removed from the map */ - toka++; - rmcpu = 1; - } - /* Extract a valid (range of) cpu(s) */ - cpuida = cpuidb = strtoul(toka, &endptr, 10); - if (endptr == toka) { - fprintf(stderr, "Error: Invalid argument.\n"); - rc = EINVAL; - goto vcpp_out; - } - if (*endptr == ''-'') { - tokb = endptr + 1; - cpuidb = strtoul(tokb, &endptr, 10); - if (endptr == tokb || cpuida > cpuidb) { - fprintf(stderr, "Error: Invalid argument.\n"); - rc = EINVAL; - goto vcpp_out; + /* Add or remove the specified cpus in the range */ + while (ida <= idb) { + if (is_nodes) { + /* Add/Remove all the cpus of a NUMA node */ + int i; + + rc = libxl_node_to_cpumap(ctx, ida, &node_cpumap); + if (rc) { + fprintf(stderr, "libxl_node_to_cpumap failed.\n"); + goto out; } + + /* Add/Remove all the cpus in the node cpumap */ + libxl_for_each_set_bit(i, node_cpumap) { + is_not ? libxl_bitmap_reset(cpumap, i) : + libxl_bitmap_set(cpumap, i); + } + } else { + /* Add/Remove this cpu */ + is_not ? libxl_bitmap_reset(cpumap, ida) : + libxl_bitmap_set(cpumap, ida); } - while (cpuida <= cpuidb) { - rmcpu == 0 ? 
libxl_bitmap_set(cpumap, cpuida) : - libxl_bitmap_set(&exclude_cpumap, cpuida); - cpuida++; - } + ida++; } - /* Clear all the cpus from the removal list */ - libxl_for_each_set_bit(i, exclude_cpumap) { - libxl_bitmap_reset(cpumap, i); - } + out: + libxl_bitmap_dispose(&node_cpumap); + return rc; +} + +/* + * Takes a string representing a set of cpus (specified either as + * single cpus or as eintire NUMA nodes) and turns it into the + * corresponding libxl_bitmap (in cpumap). + */ +static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) +{ + char *ptr, *saveptr = NULL; + int rc = 0; -vcpp_out: - libxl_bitmap_dispose(&exclude_cpumap); + for (ptr = strtok_r(cpu, ",", &saveptr); ptr; + ptr = strtok_r(NULL, ",", &saveptr)) { + rc = update_cpumap_range(ptr, cpumap); + if (rc) + break; + } return rc; }
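To make the accepted syntax a bit more tangible, here is a small standalone program that mimics, in a simplified way, how a string like "1,nodes:1-2,^6" is turned into a cpumap. It is not the xl code above (no libxl bitmaps, no error handling), and it assumes a made-up host with 3 nodes of 4 cpus each:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CPUS_PER_NODE 4  /* invented topology: 3 nodes, 4 cpus each */

static unsigned node_cpus(unsigned long node)
{
    return 0xfu << (node * CPUS_PER_NODE);
}

int main(void)
{
    char str[] = "1,nodes:1-2,^6";
    unsigned cpumap = 0;
    char *tok, *save = NULL;

    for (tok = strtok_r(str, ",", &save); tok;
         tok = strtok_r(NULL, ",", &save)) {
        int is_not = 0, is_nodes = 0;
        unsigned long a, b;
        char *end;

        /* prefixes are looked for in this order: "^" first, then "nodes:" */
        if (*tok == '^') { is_not = 1; tok++; }
        if (!strncmp(tok, "nodes:", 6)) { is_nodes = 1; tok += 6; }

        /* a single value or an "a-b" range */
        a = b = strtoul(tok, &end, 10);
        if (*end == '-')
            b = strtoul(end + 1, &end, 10);

        /* add (or remove) the cpus, or the whole nodes, in the range */
        for (; a <= b; a++) {
            unsigned bits = is_nodes ? node_cpus(a) : 1u << a;
            cpumap = is_not ? cpumap & ~bits : cpumap | bits;
        }
    }

    /* "1" -> cpu 1; "nodes:1-2" -> cpus 4-11; "^6" -> drop cpu 6 */
    printf("cpumap = 0x%03x (cpus 1,4,5,7,8,9,10,11)\n", cpumap);
    return 0;
}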
Dario Faggioli
2013-Oct-03 17:46 UTC
[PATCH 06/15] xl: implement and enable dryrun mode for `xl vcpu-pin''
As it can be useful to see if the outcome of some complex vCPU pinning bitmap specification looks as expected. This also allow for the introduction of some automatic testing and verification for the bitmap parsing code, as it happens already in check-xl-disk-parse and check-xl-vif-parse. In particular, to make the above possible, this commit also changes the implementation of the vcpu-pin command so that, instead of always returning 0, it returns an error if the parsing fails. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> --- Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning") as the changes in that and in this series would otherwise be conflicting. If this is considered fine, Feel free to apply it from here and skip the corresponding e-mail in the original submission. --- Changes since v2: * fixed a typo in the changelog --- tools/libxl/xl_cmdimpl.c | 49 +++++++++++++++++++++++++++++++++------------ tools/libxl/xl_cmdtable.c | 2 +- 2 files changed, 37 insertions(+), 14 deletions(-) diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 5ece4af..33c0d0d 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -4532,30 +4532,53 @@ int main_vcpulist(int argc, char **argv) return 0; } -static void vcpupin(uint32_t domid, const char *vcpu, char *cpu) +static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) { libxl_vcpuinfo *vcpuinfo; libxl_bitmap cpumap; uint32_t vcpuid; char *endptr; - int i, nb_vcpu; + int i = 0, nb_vcpu, rc = -1; + + libxl_bitmap_init(&cpumap); vcpuid = strtoul(vcpu, &endptr, 10); if (vcpu == endptr) { if (strcmp(vcpu, "all")) { fprintf(stderr, "Error: Invalid argument.\n"); - return; + goto out; } vcpuid = -1; } - if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) { - goto vcpupin_out; - } + if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) + goto out; if (vcpupin_parse(cpu, &cpumap)) - goto vcpupin_out1; + goto out; + + if (dryrun_only) { + libxl_cputopology *info = libxl_get_cpu_topology(ctx, &i); + + if (!info) { + fprintf(stderr, "libxl_get_cpu_topology failed.\n"); + goto out; + } + libxl_cputopology_list_free(info, i); + + fprintf(stdout, "cpumap: "); + print_cpumap(cpumap.map, i, stdout); + fprintf(stdout, "\n"); + + if (ferror(stdout) || fflush(stdout)) { + perror("stdout"); + exit(-1); + } + + rc = 0; + goto out; + } if (vcpuid != -1) { if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap) == -1) { @@ -4565,7 +4588,7 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu) else { if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) { fprintf(stderr, "libxl_list_vcpu failed.\n"); - goto vcpupin_out1; + goto out; } for (i = 0; i < nb_vcpu; i++) { if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid, @@ -4576,10 +4599,11 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu) } libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu); } - vcpupin_out1: + + rc = 0; + out: libxl_bitmap_dispose(&cpumap); - vcpupin_out: - ; + return rc; } int main_vcpupin(int argc, char **argv) @@ -4590,8 +4614,7 @@ int main_vcpupin(int argc, char **argv) /* No options */ } - vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); - return 0; + return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); } static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host) diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c index 326a660..d3dcbf0 100644 --- 
a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -211,7 +211,7 @@ struct cmd_spec cmd_table[] = { "[Domain, ...]", }, { "vcpu-pin", - &main_vcpupin, 0, 1, + &main_vcpupin, 1, 1, "Set which CPUs a VCPU can use", "<Domain> <VCPU|all> <CPUs|all>", },
Dario Faggioli
2013-Oct-03 17:46 UTC
[PATCH 07/15] xl: test script for the cpumap parser (for vCPU pinning)
This commit introduces "check-xl-vcpupin-parse" for helping verifying and debugging the (v)CPU bitmap parsing code in xl. The script runs "xl -N vcpu-pin 0 all <some strings>" repeatedly, with various input strings, and checks that the output is as expected. This is what the script can do: # ./check-xl-vcpupin-parse -h usage: ./check-xl-vcpupin-parse [options] Tests various vcpu-pinning strings. If run without arguments acts as follows: - generates some test data and saves them in check-xl-vcpupin-parse.data; - tests all the generated configurations (reading them back from check-xl-vcpupin-parse.data). An example of a test vector file is provided in check-xl-vcpupin-parse.data-example. Options: -h prints this message -r seed uses seed for initializing the rundom number generator (default: the script PID) -s string tries using string as a vcpu pinning configuration and reports whether that succeeds or not -o ofile save the test data in ofile (default: check-xl-vcpupin-parse.data) -i ifile read test data from ifile An example test data file (generated on a 2 NUMA nodes, 16 CPUs host) is being provided in check-xl-vcpupin-parse.data-example. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning") as the changes in that and in this series would otherwise be conflicting. If this is considered fine, Feel free to apply it from here and skip the corresponding e-mail in the original submission. --- Changes from v2: * killed the `sleep 1'', as requested during review; * allow for passing a custom randon seed, and report what is the actual random seed used, as requested during review; * allow for testing for specific pinning configuration strings, as suggested during review; * stores the test data in a file, after generating them, and read them back from there for actual testing, as suggested during review; * allow for reading the test data from an existing test file instead than always generating new ones. Changes from v1: * this was not there in v1, and adding it has been requested during review. --- tools/libxl/check-xl-vcpupin-parse | 294 +++++++++++++++++++++++ tools/libxl/check-xl-vcpupin-parse.data-example | 53 ++++ 2 files changed, 347 insertions(+) create mode 100755 tools/libxl/check-xl-vcpupin-parse create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example diff --git a/tools/libxl/check-xl-vcpupin-parse b/tools/libxl/check-xl-vcpupin-parse new file mode 100755 index 0000000..fb4689c --- /dev/null +++ b/tools/libxl/check-xl-vcpupin-parse @@ -0,0 +1,294 @@ +#!/bin/bash + +set -e + +if [ -x ./xl ] ; then + export LD_LIBRARY_PATH=.:../libxc:../xenstore: + XL=./xl +else + XL=xl +fi + +fprefix=tmp.check-xl-vcpupin-parse +outfile=check-xl-vcpupin-parse.data + +usage () { +cat <<END +usage: $0 [options] + +Tests various vcpu-pinning strings. If run without arguments acts +as follows: + - generates some test data and saves them in $outfile; + - tests all the generated configurations (reading them back from + $outfile). + +An example of a test vector file is provided in ${outfile}-example. 
+ +Options: + -h prints this message + -r seed uses seed for initializing the rundom number generator + (default: the script PID) + -s string tries using string as a vcpu pinning configuration and + reports whether that succeeds or not + -o ofile save the test data in ofile (default: $outfile) + -i ifile read test data from ifile +END +} + +expected () { + cat >$fprefix.expected +} + +# by default, re-seed with our PID +seed=$$ +failures=0 + +# Execute one test and check the result against the provided +# rc value and output +one () { + expected_rc=$1; shift + printf "test case %s...\n" "$*" + set +e + ${XL} -N vcpu-pin 0 all "$@" </dev/null >$fprefix.actual 2>/dev/null + actual_rc=$? + if [ $actual_rc != $expected_rc ]; then + diff -u $fprefix.expected $fprefix.actual + echo >&2 "test case \`$*'' failed ($actual_rc $diff_rc)" + failures=$(( $failures + 1 )) + fi + set -e +} + +# Write an entry in the test vector file. Format is as follows: +# test-string*expected-rc*expected-output +write () { + printf "$1*$2*$3\n" >> $outfile +} + +complete () { + if [ "$failures" = 0 ]; then + echo all ok.; exit 0 + else + echo "$failures tests failed."; exit 1 + fi +} + +# Test a specific pinning string +string () { + expected_rc=$1; shift + printf "test case %s...\n" "$*" + set +e + ${XL} -N vcpu-pin 0 all "$@" &> /dev/null + actual_rc=$? + set -e + + if [ $actual_rc != $expected_rc ]; then + echo >&2 "test case \`$*'' failed ($actual_rc)" + else + echo >&2 "test case \`$*'' succeeded" + fi + + exit 0 +} + +# Read a test vector file (provided as $1) line by line and +# test all the entries it contains +run () +{ + while read line + do + if [ ${line:0:1} != ''#'' ]; then + test_string="`echo $line | cut -f1 -d''*''`" + exp_rc="`echo $line | cut -f2 -d''*''`" + exp_output="`echo $line | cut -f3 -d''*''`" + + expected <<END +$exp_output +END + one $exp_rc "$test_string" + fi + done < $1 + + complete + + exit 0 +} + +while getopts "hr:s:o:i:" option +do + case $option in + h) + usage + exit 0 + ;; + r) + seed=$OPTARG + ;; + s) + string 0 "$OPTARG" + ;; + o) + outfile=$OPTARG + ;; + i) + run $OPTARG + ;; + esac +done + +#---------- test data ---------- +# +nr_cpus=`xl info | grep nr_cpus | cut -f2 -d'':''` +nr_nodes=`xl info | grep nr_nodes | cut -f2 -d'':''` +nr_cpus_per_node=`xl info -n | sed ''/cpu:/,/numa_info/!d'' | head -n -1 | \ + awk ''{print $4}'' | uniq -c | tail -1 | awk ''{print $1}''` +cat >$outfile <<END +# WARNING: some of these tests are topology based tests. +# Expect failures if the topology is not detected correctly +# detected topology: $nr_cpus CPUs, $nr_nodes nodes, $nr_cpus_per_node CPUs per node. +# +# seed used for random number generation: seed=${seed}. 
+# +# Format is as follows: +# test-string*expected-return-code*expected-output +# +END + +# Re-seed the random number generator +RANDOM=$seed + +echo "# Testing a wring configuration" >> $outfile +write foo 255 "" + +echo "# Testing the ''all'' syntax" >> $outfile +write "all" 0 "cpumap: any cpu" +write "nodes:all" 0 "cpumap: any cpu" +write "all,nodes:all" 0 "cpumap: any cpu" +write "all,^nodes:0,all" 0 "cpumap: any cpu" + +echo "# Testing the empty cpumap case" >> $outfile +write "^0" 0 "cpumap: none" + +echo "# A few attempts of pinning to just one random cpu" >> $outfile +if [ $nr_cpus -gt 1 ]; then + for i in `seq 0 3`; do + cpu=$(($RANDOM % nr_cpus)) + write "$cpu" 0 "cpumap: $cpu" + done +fi + +echo "# A few attempts of pinning to all but one random cpu" >> $outfile +if [ $nr_cpus -gt 2 ]; then + for i in `seq 0 3`; do + cpu=$(($RANDOM % nr_cpus)) + if [ $cpu -eq 0 ]; then + expected_range="1-$((nr_cpus - 1))" + elif [ $cpu -eq 1 ]; then + expected_range="0,2-$((nr_cpus - 1))" + elif [ $cpu -eq $((nr_cpus - 2)) ]; then + expected_range="0-$((cpu - 1)),$((nr_cpus - 1))" + elif [ $cpu -eq $((nr_cpus - 1)) ]; then + expected_range="0-$((nr_cpus - 2))" + else + expected_range="0-$((cpu - 1)),$((cpu + 1))-$((nr_cpus - 1))" + fi + write "all,^$cpu" 0 "cpumap: $expected_range" + done +fi + +echo "# A few attempts of pinning to a random range of cpus" >> $outfile +if [ $nr_cpus -gt 2 ]; then + for i in `seq 0 3`; do + cpua=$(($RANDOM % nr_cpus)) + range=$((nr_cpus - cpua)) + cpub=$(($RANDOM % range)) + cpubb=$((cpua + cpub)) + if [ $cpua -eq $cpubb ]; then + expected_range="$cpua" + else + expected_range="$cpua-$cpubb" + fi + write "$expected_range" 0 "cpumap: $expected_range" + done +fi + +echo "# A few attempts of pinning to just one random node" >> $outfile +if [ $nr_nodes -gt 1 ]; then + for i in `seq 0 3`; do + node=$(($RANDOM % nr_nodes)) + # this assumes that the first $nr_cpus_per_node (from cpu + # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node + # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node + # to 2*$nr_cpus_per_node-1) are assigned to the second node (node + # 1), etc. Expect failures if that is not the case. + write "nodes:$node" 0 "cpumap: $((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1)-1))" + done +fi + +echo "# A few attempts of pinning to all but one random node" >> $outfile +if [ $nr_nodes -gt 1 ]; then + for i in `seq 0 3`; do + node=$(($RANDOM % nr_nodes)) + # this assumes that the first $nr_cpus_per_node (from cpu + # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node + # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node + # to 2*$nr_cpus_per_node-1) are assigned to the second node (node + # 1), etc. Expect failures if that is not the case. 
+ if [ $node -eq 0 ]; then + expected_range="$nr_cpus_per_node-$((nr_cpus - 1))" + elif [ $node -eq $((nr_nodes - 1)) ]; then + expected_range="0-$((nr_cpus - nr_cpus_per_node - 1))" + else + expected_range="0-$((nr_cpus_per_node*node-1)),$((nr_cpus_per_node*(node+1)))-$nr_cpus" + fi + write "all,^nodes:$node" 0 "cpumap: $expected_range" + done +fi + +echo "# A few attempts of pinning to a random range of nodes" >> $outfile +if [ $nr_nodes -gt 1 ]; then + for i in `seq 0 3`; do + nodea=$(($RANDOM % nr_nodes)) + range=$((nr_nodes - nodea)) + nodeb=$(($RANDOM % range)) + nodebb=$((nodea + nodeb)) + # this assumes that the first $nr_cpus_per_node (from cpu + # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node + # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node + # to 2*$nr_cpus_per_node-1) are assigned to the second node (node + # 1), etc. Expect failures if that is not the case. + if [ $nodea -eq 0 ] && [ $nodebb -eq $((nr_nodes - 1)) ]; then + expected_range="any cpu" + else + expected_range="$((nr_cpus_per_node*nodea))-$((nr_cpus_per_node*(nodebb+1) - 1))" + fi + write "nodes:$nodea-$nodebb" 0 "cpumap: $expected_range" + done +fi + +echo "# A few attempts of pinning to a node but excluding one random cpu" >> $outfile +if [ $nr_nodes -gt 1 ]; then + for i in `seq 0 3`; do + node=$(($RANDOM % nr_nodes)) + # this assumes that the first $nr_cpus_per_node (from cpu + # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node + # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node + # to 2*$nr_cpus_per_node-1) are assigned to the second node (node + # 1), etc. Expect failures if that is not the case. + cpu=$(($RANDOM % nr_cpus_per_node + nr_cpus_per_node*node)) + if [ $cpu -eq $((nr_cpus_per_node*node)) ]; then + expected_range="$((nr_cpus_per_node*node + 1))-$((nr_cpus_per_node*(node+1) - 1))" + elif [ $cpu -eq $((nr_cpus_per_node*node + 1)) ]; then + expected_range="$((nr_cpus_per_node*node)),$((nr_cpus_per_node*node + 2))-$((nr_cpus_per_node*(node+1) - 1))" + elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 2)) ]; then + expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 3)),$((nr_cpus_per_node*(node+1) - 1))" + elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 1)) ]; then + expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 2))" + else + expected_range="$((nr_cpus_per_node*node))-$((cpu - 1)),$((cpu + 1))-$((nr_cpus_per_node*(node+1) - 1))" + fi + write "nodes:$node,^$cpu" 0 "cpumap: $expected_range" + done +fi + +run $outfile diff --git a/tools/libxl/check-xl-vcpupin-parse.data-example b/tools/libxl/check-xl-vcpupin-parse.data-example new file mode 100644 index 0000000..c9cf660 --- /dev/null +++ b/tools/libxl/check-xl-vcpupin-parse.data-example @@ -0,0 +1,53 @@ +# WARNING: some of these tests are topology based tests. +# Expect failures if the topology is not detected correctly +# detected topology: 16 CPUs, 2 nodes, 8 CPUs per node. +# +# seed used for random number generation: seed=13328. 
+# +# Format is as follows: +# test-string*expected-return-code*expected-output +# +# Testing a wring configuration +foo*255* +# Testing the ''all'' syntax +all*0*cpumap: any cpu +nodes:all*0*cpumap: any cpu +all,nodes:all*0*cpumap: any cpu +all,^nodes:0,all*0*cpumap: any cpu +# Testing the empty cpumap case +^0*0*cpumap: none +# A few attempts of pinning to just one random cpu +0*0*cpumap: 0 +9*0*cpumap: 9 +6*0*cpumap: 6 +0*0*cpumap: 0 +# A few attempts of pinning to all but one random cpu +all,^12*0*cpumap: 0-11,13-15 +all,^6*0*cpumap: 0-5,7-15 +all,^3*0*cpumap: 0-2,4-15 +all,^7*0*cpumap: 0-6,8-15 +# A few attempts of pinning to a random range of cpus +13-15*0*cpumap: 13-15 +7*0*cpumap: 7 +3-5*0*cpumap: 3-5 +8-11*0*cpumap: 8-11 +# A few attempts of pinning to just one random node +nodes:1*0*cpumap: 8-15 +nodes:0*0*cpumap: 0-7 +nodes:0*0*cpumap: 0-7 +nodes:0*0*cpumap: 0-7 +# A few attempts of pinning to all but one random node +all,^nodes:0*0*cpumap: 8-15 +all,^nodes:1*0*cpumap: 0-7 +all,^nodes:1*0*cpumap: 0-7 +all,^nodes:0*0*cpumap: 8-15 +# A few attempts of pinning to a random range of nodes +nodes:1-1*0*cpumap: 8-15 +nodes:1-1*0*cpumap: 8-15 +nodes:0-1*0*cpumap: any cpu +nodes:0-0*0*cpumap: 0-7 +# A few attempts of pinning to a node but excluding one random cpu +nodes:1,^8*0*cpumap: 9-15 +nodes:0,^6*0*cpumap: 0-5,7 +nodes:1,^9*0*cpumap: 8,10-15 +nodes:0,^5*0*cpumap: 0-4,6-7
Dario Faggioli
2013-Oct-03 17:46 UTC
[PATCH 08/15] xen: numa-sched: make space for per-vcpu node-affinity
Before this change, each vcpu had its own vcpu-affinity (also called pinning), while the whole domain had a NUMA node-affinity. Of course, as the (credit) scheduler schedules vcpus and not whole domains, this means that all the vcpus of a domain had the same NUMA node-affinity. This change is the first step toward overcoming such limitation. It adds the data structures for storing the node-affinity on a per-vcpu basis (along with allocating and initializing it). As far as this change only is concerned, there is no specific way to change the node-affinity of a vcpu to something which is not automatically computed (basing on its vcpu-affinity). Such logic is being introduced in subsequent commits. Also, now that each vcpu has its own node-affinity, and in case the domain''s node-affinity is set to ''automatically computed'', we build it up as the union of all the node-affinities of all the vcpus of the domain. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- xen/common/domain.c | 39 ++++++++++++++++++++++++++++++++----- xen/common/keyhandler.c | 6 +++++- xen/common/schedule.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++ xen/include/xen/sched.h | 10 +++++++++ 4 files changed, 99 insertions(+), 6 deletions(-) diff --git a/xen/common/domain.c b/xen/common/domain.c index af31ab4..8d2ff49 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -128,6 +128,7 @@ struct vcpu *alloc_vcpu( if ( !zalloc_cpumask_var(&v->cpu_affinity) || !zalloc_cpumask_var(&v->cpu_affinity_tmp) || !zalloc_cpumask_var(&v->cpu_affinity_saved) || + !zalloc_cpumask_var(&v->node_affinity) || !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) ) goto fail_free; @@ -159,6 +160,7 @@ struct vcpu *alloc_vcpu( free_cpumask_var(v->cpu_affinity); free_cpumask_var(v->cpu_affinity_tmp); free_cpumask_var(v->cpu_affinity_saved); + free_cpumask_var(v->node_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); return NULL; @@ -353,7 +355,7 @@ void domain_update_node_affinity(struct domain *d) cpumask_var_t online_affinity; const cpumask_t *online; struct vcpu *v; - unsigned int node; + unsigned int cpu; if ( !zalloc_cpumask_var(&cpumask) ) return; @@ -367,9 +369,36 @@ void domain_update_node_affinity(struct domain *d) spin_lock(&d->node_affinity_lock); + /* + * Let''s prepare the cpumask that will be used below to actually update + * the node-affinity of the whole domain. Each vcpu has a vcpu-affinity + * and a numa-affinity. What gets built in cpumask (and used below) is + * the union of all the (online) cpus in all the vcpu''s numa-affinity + * masks. + * + * On its turn, the numa-affinity mask of the i-eth vcpu (say, ''v'') is + * either derived directly from the vcpu''s vcpu-affinity mask (in case + * v->auto_node_affinity is true) or has its own value, (potentially) + * completely independent from v->cpu_affinity. In the former case, it + * is here that we make sure the two affinity masks matches (since this + * function gets called in correspondence of each modification to + * v->cpu_affinity happening in vcpu_set_affinity() ); in the latter + * case, we just leave v->node_affinity alone. 
+ */ for_each_vcpu ( d, v ) { - cpumask_and(online_affinity, v->cpu_affinity, online); + if ( v->auto_node_affinity ) + { + cpumask_clear(v->node_affinity); + for_each_cpu ( cpu, v->cpu_affinity ) + cpumask_or(v->node_affinity, v->node_affinity, + &node_to_cpumask(cpu_to_node(cpu))); + + cpumask_and(online_affinity, v->node_affinity, online); + } + else + cpumask_copy(online_affinity, v->node_affinity); + cpumask_or(cpumask, cpumask, online_affinity); } @@ -383,9 +412,8 @@ void domain_update_node_affinity(struct domain *d) if ( d->auto_node_affinity ) { nodes_clear(d->node_affinity); - for_each_online_node ( node ) - if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) - node_set(node, d->node_affinity); + for_each_cpu ( cpu, cpumask ) + node_set(cpu_to_node(cpu), d->node_affinity); } sched_set_node_affinity(d, &d->node_affinity); @@ -734,6 +762,7 @@ static void complete_domain_destroy(struct rcu_head *head) { free_cpumask_var(v->cpu_affinity); free_cpumask_var(v->cpu_affinity_tmp); + free_cpumask_var(v->node_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); } diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c index b9ad1b5..8f844bc 100644 --- a/xen/common/keyhandler.c +++ b/xen/common/keyhandler.c @@ -297,7 +297,11 @@ static void dump_domains(unsigned char key) cpuset_print(tmpstr, sizeof(tmpstr), v->vcpu_dirty_cpumask); printk("dirty_cpus=%s ", tmpstr); cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_affinity); - printk("cpu_affinity=%s\n", tmpstr); + printk("cpu_affinity=%s ", tmpstr); + cpuset_print(tmpstr, sizeof(tmpstr), v->node_affinity); + printk("node_affinity=%s%s\n", + v->auto_node_affinity ? "(auto)" : "(manual)", + tmpstr); printk(" pause_count=%d pause_flags=%lx\n", atomic_read(&v->pause_count), v->pause_flags); arch_dump_vcpu_info(v); diff --git a/xen/common/schedule.c b/xen/common/schedule.c index a8398bd..04f0a38 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -200,6 +200,9 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor) else cpumask_setall(v->cpu_affinity); + v->auto_node_affinity = 1; + cpumask_copy(v->node_affinity, v->cpu_affinity); + /* Initialise the per-vcpu timers. */ init_timer(&v->periodic_timer, vcpu_periodic_timer_fn, v, v->processor); @@ -676,6 +679,53 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) return 0; } +int vcpu_set_node_affinity(struct vcpu *v, const nodemask_t *nodes) +{ + nodemask_t online_nodes; + int node; + + nodes_and(online_nodes, node_online_map, *nodes); + + /* Having no affinity at all is just wrong */ + if ( nodes_empty(online_nodes) ) + return -EINVAL; + + spin_lock(&v->domain->node_affinity_lock); + + /* + * Explicitly saying "all nodes" is not particularly useful here. + * Let''s use it as the `reset numa-affinity to auto'' command. + */ + if ( nodes_full(*nodes) ) + { + v->auto_node_affinity = 1; + goto out; + } + + /* + * When someone asks for a specific numa-affinity for a vcpu we need to + * clear auto_node_affinity, convert the nodemask in online_nodes + * into a cpumask_t and store it in node_affinity. + */ + v->auto_node_affinity = 0; + + cpumask_clear(v->node_affinity); + for_each_node_mask( node, online_nodes ) + cpumask_or(v->node_affinity, v->node_affinity, + &node_to_cpumask(node)); + +out: + spin_unlock(&v->domain->node_affinity_lock); + + /* + * Changing the numa-affinity of a vcpu calls for an update + * of the node-affinity of the whole domain. 
+ */ + domain_update_node_affinity(v->domain); + + return 0; +} + /* Block the currently-executing domain until a pertinent event occurs. */ void vcpu_block(void) { diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 0013a8d..c46487e 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -156,6 +156,8 @@ struct vcpu /* VCPU need affinity restored */ bool_t affinity_broken; + /* is node_affinity (below) automatically computed from vcpu-affinity? */ + bool_t auto_node_affinity; /* * > 0: a single port is being polled; @@ -181,6 +183,13 @@ struct vcpu /* Used to restore affinity across S3. */ cpumask_var_t cpu_affinity_saved; + /* + * Bitmask of CPUs on which this VCPU prefers to run. For both this + * and auto_node_affinity access is serialized against + * v->domain->node_affinity_lock. + */ + cpumask_var_t node_affinity; + /* Bitmask of CPUs which are holding onto this VCPU''s state. */ cpumask_var_t vcpu_dirty_cpumask; @@ -708,6 +717,7 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c); void vcpu_force_reschedule(struct vcpu *v); int cpu_disable_scheduler(unsigned int cpu); int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity); +int vcpu_set_node_affinity(struct vcpu *v, const nodemask_t *nodes); void restore_vcpu_affinity(struct domain *d); void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
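Summing up what the hunks above do, a vcpu's node-affinity mask is obtained in one of two ways, depending on auto_node_affinity. The following standalone sketch models both paths; it is not the hypervisor code (plain integer masks, an invented 2-node, 4-cpus-per-node topology):

#include <stdio.h>

#define NR_NODES       2
#define CPUS_PER_NODE  4

static int cpu_to_node(int cpu)           { return cpu / CPUS_PER_NODE; }
static unsigned node_to_cpumask(int node) { return 0xfu << (node * CPUS_PER_NODE); }

/* auto_node_affinity == 1: the node-affinity mask follows the vcpu-affinity,
 * widened to full nodes (every node containing at least one cpu the vcpu
 * can run on contributes all of its cpus). */
static unsigned auto_node_affinity(unsigned cpu_affinity)
{
    unsigned mask = 0;
    int cpu;

    for (cpu = 0; cpu < NR_NODES * CPUS_PER_NODE; cpu++)
        if (cpu_affinity & (1u << cpu))
            mask |= node_to_cpumask(cpu_to_node(cpu));
    return mask;
}

/* auto_node_affinity == 0: the node-affinity mask is the nodemask the
 * user asked for, converted to a mask of cpus. */
static unsigned manual_node_affinity(unsigned nodemask)
{
    unsigned mask = 0;
    int node;

    for (node = 0; node < NR_NODES; node++)
        if (nodemask & (1u << node))
            mask |= node_to_cpumask(node);
    return mask;
}

int main(void)
{
    printf("auto, pinned to cpus 2-3: 0x%02x\n", auto_node_affinity(0x0c));  /* 0x0f */
    printf("manual, nodes:1         : 0x%02x\n", manual_node_affinity(0x2)); /* 0xf0 */
    return 0;
}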
Dario Faggioli
2013-Oct-03 17:46 UTC
[PATCH 09/15] xen: numa-sched: domain node-affinity always comes from vcpu node-affinity
Now that we have per-vcpu node-affinity we can do the following: * always consider the domain''s node-affinity as ''automaically computed''; * always construct it out of the domain''s vcpus'' own node-affinity. That means, if someone wants to change the node-affinity of a domain, it is the node-affinities of all the domain''s vcpus that need to be modified. This change modifies domain_set_node_affinity() in such a way that it does right the above, i.e., it goes through all the domain''s vcpus and set their node-affinity to some specified mask. This means that, "seen from the outside", nothing changes: you call domain_set_node_affinity(), passing a nodemask_t to it, and you get, (1) that mask to be the node-affinity for the domain (which basically means on what NUMA nodes memory is allocated), and (2) all the vcpus of the domains prefer to run on the pcpus from the nodes in that mask, exactly as it was before this commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- xen/common/domain.c | 48 ++++++++++++++------------------------------- xen/common/sched_credit.c | 3 +-- xen/include/xen/sched.h | 2 -- 3 files changed, 16 insertions(+), 37 deletions(-) diff --git a/xen/common/domain.c b/xen/common/domain.c index 8d2ff49..366d9b9 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -228,7 +228,6 @@ struct domain *domain_create( spin_lock_init(&d->node_affinity_lock); d->node_affinity = NODE_MASK_ALL; - d->auto_node_affinity = 1; spin_lock_init(&d->shutdown_lock); d->shutdown_code = -1; @@ -403,18 +402,13 @@ void domain_update_node_affinity(struct domain *d) } /* - * If d->auto_node_affinity is true, the domain''s node-affinity mask - * (d->node_affinity) is automaically computed from all the domain''s - * vcpus'' vcpu-affinity masks (the union of which we have just built - * above in cpumask). OTOH, if d->auto_node_affinity is false, we - * must leave the node-affinity of the domain alone. + * A domain''s node-affinity is just the union of all the domain''s vcpus'' + * numa-affinity masks, which is exactly what we have in cpumask + * (although, we need to convert it from cpumask to nodemask, of course). */ - if ( d->auto_node_affinity ) - { - nodes_clear(d->node_affinity); - for_each_cpu ( cpu, cpumask ) - node_set(cpu_to_node(cpu), d->node_affinity); - } + nodes_clear(d->node_affinity); + for_each_cpu ( cpu, cpumask ) + node_set(cpu_to_node(cpu), d->node_affinity); sched_set_node_affinity(d, &d->node_affinity); @@ -425,33 +419,21 @@ void domain_update_node_affinity(struct domain *d) } +/* Sets the numa-affinity (via vcpu_set_node_affinity() ) for all + * the vcpus of the domain. */ int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity) { - /* Being affine with no nodes is just wrong */ - if ( nodes_empty(*affinity) ) - return -EINVAL; - - spin_lock(&d->node_affinity_lock); + struct vcpu *v; + int rc = 0; - /* - * Being/becoming explicitly affine to all nodes is not particularly - * useful. Let''s take it as the `reset node affinity` command. 
- */ - if ( nodes_full(*affinity) ) + for_each_vcpu ( d, v ) { - d->auto_node_affinity = 1; - goto out; + rc = vcpu_set_node_affinity(v, affinity); + if ( rc ) + break; } - d->auto_node_affinity = 0; - d->node_affinity = *affinity; - -out: - spin_unlock(&d->node_affinity_lock); - - domain_update_node_affinity(d); - - return 0; + return rc; } diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index 3d6ea7a..c133ca3 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -311,8 +311,7 @@ static inline int __vcpu_has_node_affinity(const struct vcpu *vc, const struct domain *d = vc->domain; const struct csched_dom *sdom = CSCHED_DOM(d); - if ( d->auto_node_affinity - || cpumask_full(sdom->node_affinity_cpumask) + if (cpumask_full(sdom->node_affinity_cpumask) || !cpumask_intersects(sdom->node_affinity_cpumask, mask) ) return 0; diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index c46487e..33b8902 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -300,8 +300,6 @@ struct domain /* Does this guest need iommu mappings? */ bool_t need_iommu; #endif - /* is node-affinity automatically computed? */ - bool_t auto_node_affinity; /* Is this guest fully privileged (aka dom0)? */ bool_t is_privileged; /* Which guest this guest has privileges on */
Dario Faggioli
2013-Oct-03 17:46 UTC
[PATCH 10/15] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
instead of relying the domain-wide node-affinity. To achieve that, we use the specific vcpu''s node-affinity when computing the NUMA load balancing mask in csched_balance_cpumask(). As a side effects, this simplifies the code quit a bit. In fact, prior to this change, we needed to cache the translation of d->node_affinity (which is a nodemask_t) to a cpumask_t, since that is what is needed during the actual scheduling (we used to keep it in node_affinity_cpumask). Since the per-vcpu node-affinity is maintained in a cpumask_t field (v->node_affinity) already, we don''t need that complicated updating logic in place any longer, and this change hence can remove stuff like sched_set_node_affinity(), csched_set_node_affinity() and, of course, node_affinity_cpumask from csched_dom. The high level description of NUMA placement and scheduling in docs/misc/xl-numa-placement.markdown is updated too, to match the new behavior. While at it, an attempt is made to make such document as few ambiguous as possible, with respect to the concepts of vCPU pinning, domain node-affinity and vCPU node-affinity. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/misc/xl-numa-placement.markdown | 124 +++++++++++++++++++++++----------- xen/common/domain.c | 2 - xen/common/sched_credit.c | 62 ++--------------- xen/common/schedule.c | 5 - xen/include/xen/sched-if.h | 2 - xen/include/xen/sched.h | 1 6 files changed, 94 insertions(+), 102 deletions(-) diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown index caa3fec..890b856 100644 --- a/docs/misc/xl-numa-placement.markdown +++ b/docs/misc/xl-numa-placement.markdown @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually defined as a set of processor cores (typically a physical CPU package) and the memory directly attached to the set of cores. -The Xen hypervisor deals with NUMA machines by assigning to each domain -a "node affinity", i.e., a set of NUMA nodes of the host from which they -get their memory allocated. Also, even if the node affinity of a domain -is allowed to change on-line, it is very important to "place" the domain -correctly when it is fist created, as the most of its memory is allocated -at that time and can not (for now) be moved easily. - NUMA awareness becomes very important as soon as many domains start running memory-intensive workloads on a shared host. In fact, the cost of accessing non node-local memory locations is very high, and the @@ -27,13 +20,37 @@ performance degradation is likely to be noticeable. For more information, have a look at the [Xen NUMA Introduction][numa_intro] page on the Wiki. +## Xen and NUMA machines: the concept of _node-affinity_ ## + +The Xen hypervisor deals with NUMA machines throughout the concept of +_node-affinity_. When talking about node-affinity, it is important to +distinguish two different situations. The node-affinity *of a domain* is +the set of NUMA nodes of the host where the memory for the domain is +being allocated (mostly, at domain creation time). The node-affinity *of +a vCPU* is the set of NUMA nodes of the host where the vCPU prefers to +run on, and the (credit1 only, for now) scheduler will try to accomplish +that, whenever it is possible. + +Of course, despite the fact that they belong to and affect different +subsystems domain and vCPUs node-affinities are related. In fact, the +node-affinity of a domain is the union of the node-affinities of all the +domain''s vCPUs. 
+ +The above means that, when changing the vCPU node-affinity, the domain +node-affinity also changes. Well, although this is allowed to happen +on-line (i.e., when a domain is already running), that will not result +in the memory that has already been allocated being moved to a different +host NUMA node. This is why it is very important to "place" the domain +correctly when it is fist created, as the most of its memory is allocated +at that time and can not (for now) be moved easily. + ### Placing via pinning and cpupools ### The simplest way of placing a domain on a NUMA node is statically pinning -the domain''s vCPUs to the pCPUs of the node. This goes under the name of -CPU affinity and can be set through the "cpus=" option in the config file -(more about this below). Another option is to pool together the pCPUs -spanning the node and put the domain in such a cpupool with the "pool=" +the domain''s vCPUs to the pCPUs of the node. This also goes under the name +of vCPU-affinity and can be set through the "cpus=" option in the config +file (more about this below). Another option is to pool together the pCPUs +spanning the node and put the domain in such a _cpupool_ with the "pool=" config option (as documented in our [Wiki][cpupools_howto]). In both the above cases, the domain will not be able to execute outside @@ -45,24 +62,45 @@ may come at he cost of some load imbalances. ### NUMA aware scheduling ### -If the credit scheduler is in use, the concept of node affinity defined -above does not only apply to memory. In fact, starting from Xen 4.3, the -scheduler always tries to run the domain''s vCPUs on one of the nodes in -its node affinity. Only if that turns out to be impossible, it will just -pick any free pCPU. +If using credit scheduler, and starting from Xen 4.3, the scheduler always +tries to run the domain''s vCPUs on one of the nodes in their node-affinity. +Only if that turns out to be impossible, it will just pick any free pCPU. +Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity, +potentially different from the ones of all the other vCPUs of the domain. -This is, therefore, something more flexible than CPU affinity, as a domain -can still run everywhere, it just prefers some nodes rather than others. +This is, therefore, something more flexible than vCPU pinning, as vCPUs +can still run everywhere, they just prefer some nodes rather than others. Locality of access is less guaranteed than in the pinning case, but that comes along with better chances to exploit all the host resources (e.g., the pCPUs). -In fact, if all the pCPUs in a domain''s node affinity are busy, it is -possible for the domain to run outside of there, but it is very likely that -slower execution (due to remote memory accesses) is still better than no -execution at all, as it would happen with pinning. For this reason, NUMA -aware scheduling has the potential of bringing substantial performances -benefits, although this will depend on the workload. +In fact, if all the pCPUs in a VCPU''s node-affinity are busy, it is +possible for the domain to run outside from there. The idea is that +slower execution (due to remote memory accesses) is still better than +no execution at all (as it would happen with pinning). For this reason, +NUMA aware scheduling has the potential of bringing substantial +performances benefits, although this will depend on the workload. 
+ +Notice that, for each vCPU, the following three scenarios are possbile: + + * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU + node-affinity. In this case, the vCPU is always scheduled on one + of the pCPUs to which it is pinned, without any specific peference + on which one of them. Internally, the vCPU''s node-affinity is just + automatically computed from the vCPU pinning, and the scheduler + just ignores it; + * a vCPU *has* its own vCPU node-affinity and *is not* pinned to + any particular pCPU. In this case, the vCPU can run on every pCPU. + Nevertheless, the scheduler will try to have it running on one of + the pCPUs of the node(s) to which it has node-affinity with; + * a vCPU *has* its own vCPU node-affinity and *is also* pinned to + some pCPUs. In this case, the vCPU is always scheduled on one of the + pCPUs to which it is pinned, with, among them, a preference for the + ones that are from the node(s) it has node-affinity with. In case + pinning and node-affinity form two disjoint sets of pCPUs, pinning + "wins", and the node-affinity, although it is still used to derive + the domain''s node-affinity (for memory allocation), is, from the + scheduler''s perspective, just ignored. ## Guest placement in xl ## @@ -71,25 +109,23 @@ both manual or automatic placement of them across the host''s NUMA nodes. Note that xm/xend does a very similar thing, the only differences being the details of the heuristics adopted for automatic placement (see below), -and the lack of support (in both xm/xend and the Xen versions where that\ +and the lack of support (in both xm/xend and the Xen versions where that was the default toolstack) for NUMA aware scheduling. ### Placing the guest manually ### Thanks to the "cpus=" option, it is possible to specify where a domain should be created and scheduled on, directly in its config file. This -affects NUMA placement and memory accesses as the hypervisor constructs -the node affinity of a VM basing right on its CPU affinity when it is -created. +affects NUMA placement and memory accesses as, in this case, the hypervisor +constructs the node-affinity of a VM basing right on its vCPU pinning +when it is created. This is very simple and effective, but requires the user/system -administrator to explicitly specify affinities for each and every domain, +administrator to explicitly specify the pinning of each and every domain, or Xen won''t be able to guarantee the locality for their memory accesses. -Notice that this also pins the domain''s vCPUs to the specified set of -pCPUs, so it not only sets the domain''s node affinity (its memory will -come from the nodes to which the pCPUs belong), but at the same time -forces the vCPUs of the domain to be scheduled on those same pCPUs. +That, of course, alsos mean the vCPUs of the domain will only be able to +execute on those same pCPUs. ### Placing the guest automatically ### @@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries to figure out on its own on which node(s) the domain could fit best. If it finds one (some), the domain''s node affinity get set to there, and both memory allocations and NUMA aware scheduling (for the credit -scheduler and starting from Xen 4.3) will comply with it. +scheduler and starting from Xen 4.3) will comply with it. Starting from +Xen 4.4, this just means all the vCPUs of the domain will have the same +vCPU node-affinity, that is the outcome of such automatic "fitting" +procedure. 
It is worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of an host is an incarnation of the Bin Packing Problem. In fact, @@ -143,22 +182,29 @@ any placement from happening: libxl_defbool_set(&domain_build_info->numa_placement, false); Also, if `numa_placement` is set to `true`, the domain must not -have any CPU affinity (i.e., `domain_build_info->cpumap` must -have all its bits set, as it is by default), or domain creation -will fail returning `ERROR_INVAL`. +have any vCPU pinning (i.e., `domain_build_info->cpumap` must have +all its bits set, as it is by default), or domain creation will fail +with an `ERROR_INVAL`. Starting from Xen 4.3, in case automatic placement happens (and is successful), it will affect the domain''s node affinity and _not_ its -CPU affinity. Namely, the domain''s vCPUs will not be pinned to any +vCPU pinning. Namely, the domain''s vCPUs will not be pinned to any pCPU on the host, but the memory from the domain will come from the selected node(s) and the NUMA aware scheduling (if the credit scheduler -is in use) will try to keep the domain there as much as possible. +is in use) will try to keep the domain''s vCPUs there as much as possible. Besides than that, looking and/or tweaking the placement algorithm search "Automatic NUMA placement" in libxl\_internal.h. Note this may change in future versions of Xen/libxl. +## Xen < 4.4 ## + +The concept of per-vCPU node-affinity has been introduced for the first +time in Xen 4.4. In Xen versions earlier than that, the node-affinity is +the same for the whole domain, that is to say the same for all the vCPUs +of the domain. + ## Xen < 4.3 ## As NUMA aware scheduling is a new feature of Xen 4.3, things are a little diff --git a/xen/common/domain.c b/xen/common/domain.c index 366d9b9..ae29945 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d) for_each_cpu ( cpu, cpumask ) node_set(cpu_to_node(cpu), d->node_affinity); - sched_set_node_affinity(d, &d->node_affinity); - spin_unlock(&d->node_affinity_lock); free_cpumask_var(online_affinity); diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index c133ca3..ab9e5df 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -178,9 +178,6 @@ struct csched_dom { struct list_head active_vcpu; struct list_head active_sdom_elem; struct domain *dom; - /* cpumask translated from the domain''s node-affinity. - * Basically, the CPUs we prefer to be scheduled on. */ - cpumask_var_t node_affinity_cpumask; uint16_t active_vcpu_count; uint16_t weight; uint16_t cap; @@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc) list_del_init(&svc->runq_elem); } -/* - * Translates node-affinity mask into a cpumask, so that we can use it during - * actual scheduling. That of course will contain all the cpus from all the - * set nodes in the original node-affinity mask. - * - * Note that any serialization needed to access mask safely is complete - * responsibility of the caller of this function/hook. 
- */ -static void csched_set_node_affinity( - const struct scheduler *ops, - struct domain *d, - nodemask_t *mask) -{ - struct csched_dom *sdom; - int node; - - /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ - if ( unlikely(is_idle_domain(d)) ) - return; - - sdom = CSCHED_DOM(d); - cpumask_clear(sdom->node_affinity_cpumask); - for_each_node_mask( node, *mask ) - cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, - &node_to_cpumask(node)); -} #define for_each_csched_balance_step(step) \ for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ ) @@ -294,12 +265,12 @@ static void csched_set_node_affinity( /* * vcpu-affinity balancing is always necessary and must never be skipped. - * OTOH, if a domain''s node-affinity is said to be automatically computed - * (or if it just spans all the nodes), we can safely avoid dealing with - * node-affinity entirely. + * OTOH, if the vcpu''s numa-affinity is being automatically computed out of + * the vcpu''s vcpu-affinity, or if it just spans all the nodes, we can + * safely avoid dealing with numa-affinity entirely. * - * Node-affinity is also deemed meaningless in case it has empty - * intersection with mask, to cover the cases where using the node-affinity + * A vcpu''s numa-affinity is also deemed meaningless in case it has empty + * intersection with mask, to cover the cases where using the numa-affinity * mask seems legit, but would instead led to trying to schedule the vcpu * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu''s * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus @@ -308,11 +279,9 @@ static void csched_set_node_affinity( static inline int __vcpu_has_node_affinity(const struct vcpu *vc, const cpumask_t *mask) { - const struct domain *d = vc->domain; - const struct csched_dom *sdom = CSCHED_DOM(d); - - if (cpumask_full(sdom->node_affinity_cpumask) - || !cpumask_intersects(sdom->node_affinity_cpumask, mask) ) + if ( vc->auto_node_affinity == 1 + || cpumask_full(vc->node_affinity) + || !cpumask_intersects(vc->node_affinity, mask) ) return 0; return 1; @@ -330,8 +299,7 @@ csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) { if ( step == CSCHED_BALANCE_NODE_AFFINITY ) { - cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask, - vc->cpu_affinity); + cpumask_and(mask, vc->node_affinity, vc->cpu_affinity); if ( unlikely(cpumask_empty(mask)) ) cpumask_copy(mask, vc->cpu_affinity); @@ -1104,13 +1072,6 @@ csched_alloc_domdata(const struct scheduler *ops, struct domain *dom) if ( sdom == NULL ) return NULL; - if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) - { - xfree(sdom); - return NULL; - } - cpumask_setall(sdom->node_affinity_cpumask); - /* Initialize credit and weight */ INIT_LIST_HEAD(&sdom->active_vcpu); INIT_LIST_HEAD(&sdom->active_sdom_elem); @@ -1140,9 +1101,6 @@ csched_dom_init(const struct scheduler *ops, struct domain *dom) static void csched_free_domdata(const struct scheduler *ops, void *data) { - struct csched_dom *sdom = data; - - free_cpumask_var(sdom->node_affinity_cpumask); xfree(data); } @@ -1965,8 +1923,6 @@ const struct scheduler sched_credit_def = { .adjust = csched_dom_cntl, .adjust_global = csched_sys_cntl, - .set_node_affinity = csched_set_node_affinity, - .pick_cpu = csched_cpu_pick, .do_schedule = csched_schedule, diff --git a/xen/common/schedule.c b/xen/common/schedule.c index 04f0a38..583c11b 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -641,11 +641,6 @@ int 
cpu_disable_scheduler(unsigned int cpu) return ret; } -void sched_set_node_affinity(struct domain *d, nodemask_t *mask) -{ - SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); -} - int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) { cpumask_t online_affinity; diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h index ad4d80c..84bb13f 100644 --- a/xen/include/xen/sched-if.h +++ b/xen/include/xen/sched-if.h @@ -184,8 +184,6 @@ struct scheduler { struct xen_domctl_scheduler_op *); int (*adjust_global) (const struct scheduler *, struct xen_sysctl_scheduler_op *); - void (*set_node_affinity) (const struct scheduler *, - struct domain *, nodemask_t *); void (*dump_settings) (const struct scheduler *); void (*dump_cpu_state) (const struct scheduler *, int); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 33b8902..8a82a83 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -563,7 +563,6 @@ void sched_destroy_domain(struct domain *d); int sched_move_domain(struct domain *d, struct cpupool *c); long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *); long sched_adjust_global(struct xen_sysctl_scheduler_op *); -void sched_set_node_affinity(struct domain *, nodemask_t *); int sched_id(void); void sched_tick_suspend(void); void sched_tick_resume(void);
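
For illustration only (not part of this series): the balancing rule that the
hunk above implements in csched_balance_cpumask() can be captured in a few
lines. On the node-affinity step, the node-affinity cpumask of the vCPU is
ANDed with its pinning mask, and if the intersection is empty the code falls
back to plain pinning. Below is a minimal, self-contained C sketch, with a
made-up 8-pCPU/2-node layout and plain unsigned long bitmasks standing in
for cpumask_t.

    #include <stdio.h>

    /* Toy layout: pCPUs 0-3 on node 0, pCPUs 4-7 on node 1. */
    #define NODE0_CPUS 0x0fUL
    #define NODE1_CPUS 0xf0UL

    /* Same rule as the node-affinity step above: prefer
     * node-affinity & pinning, fall back to pinning if empty. */
    static unsigned long balance_mask(unsigned long node_aff, unsigned long pin)
    {
        unsigned long m = node_aff & pin;
        return m ? m : pin;
    }

    int main(void)
    {
        unsigned long pin = 0x3c; /* pinned to pCPUs 2-5 */

        /* affinity with node 1, pinned to 2-5 -> prefer pCPUs 4,5 */
        printf("%#lx\n", balance_mask(NODE1_CPUS, pin));
        /* affinity with node 0, pinned to 2-5 -> prefer pCPUs 2,3 */
        printf("%#lx\n", balance_mask(NODE0_CPUS, pin));
        /* disjoint sets: affinity with node 1, pinned to 0-1 -> pinning wins */
        printf("%#lx\n", balance_mask(NODE1_CPUS, 0x03));
        return 0;
    }

Running it prints 0x30, 0xc and 0x3, matching the three scenarios described
in the documentation hunk: node-affinity narrows the choice within the
pinning mask, and when the two sets are disjoint, pinning wins.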
Dario Faggioli
2013-Oct-03 17:47 UTC
[PATCH 11/15] xen: numa-sched: enable getting/specifying per-vcpu node-affinity
via two new DOMCTLs: getvcpunodeaffinity and setvcpunodeaffinity. They''re very similar to XEN_DOMCTL_{get,set}vcpuaffinity (with the only exception that they take a nodemap instead than a cpumap). Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxc/xc_domain.c | 8 ++++--- xen/common/domctl.c | 47 ++++++++++++++++++++++++++++++++++++++----- xen/include/public/domctl.h | 8 +++++-- xen/xsm/flask/hooks.c | 2 ++ 4 files changed, 54 insertions(+), 11 deletions(-) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 81316d3..b36c2ad 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -219,9 +219,9 @@ int xc_vcpu_setaffinity(xc_interface *xch, memcpy(local, cpumap, cpusize); - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); - domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8; + domctl.u.vcpuaffinity.map.nr_bits = cpusize * 8; ret = do_domctl(xch, &domctl); @@ -260,8 +260,8 @@ int xc_vcpu_getaffinity(xc_interface *xch, domctl.domain = (domid_t)domid; domctl.u.vcpuaffinity.vcpu = vcpu; - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); - domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8; + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + domctl.u.vcpuaffinity.map.nr_bits = cpusize * 8; ret = do_domctl(xch, &domctl); diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 9760d50..7770d30 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -584,7 +584,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) break; case XEN_DOMCTL_setvcpuaffinity: - case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_setvcpunodeaffinity: { struct vcpu *v; @@ -600,8 +600,8 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) { cpumask_var_t new_affinity; - ret = xenctl_bitmap_to_cpumask( - &new_affinity, &op->u.vcpuaffinity.cpumap); + ret = xenctl_bitmap_to_cpumask(&new_affinity, + &op->u.vcpuaffinity.map); if ( !ret ) { ret = vcpu_set_affinity(v, new_affinity); @@ -610,8 +610,45 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) } else { - ret = cpumask_to_xenctl_bitmap( - &op->u.vcpuaffinity.cpumap, v->cpu_affinity); + nodemask_t new_affinity; + + ret = xenctl_bitmap_to_nodemask(&new_affinity, + &op->u.vcpuaffinity.map); + if ( !ret ) + ret = vcpu_set_node_affinity(v, &new_affinity); + } + } + break; + + case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_getvcpunodeaffinity: + { + struct vcpu *v; + + ret = -EINVAL; + if ( op->u.vcpuaffinity.vcpu >= d->max_vcpus ) + break; + + ret = -ESRCH; + if ( (v = d->vcpu[op->u.vcpuaffinity.vcpu]) == NULL ) + break; + + if ( op->cmd == XEN_DOMCTL_getvcpuaffinity ) + { + ret = cpumask_to_xenctl_bitmap(&op->u.vcpuaffinity.map, + v->cpu_affinity); + } + else + { + nodemask_t affinity; + int cpu; + + nodes_clear(affinity); + for_each_cpu ( cpu, v->node_affinity ) + node_set(cpu_to_node(cpu), affinity); + + ret = nodemask_to_xenctl_bitmap(&op->u.vcpuaffinity.map, + &affinity); } } break; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index 4c5b2bb..07d43f2 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -290,12 +290,14 @@ typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); -/* Get/set which physical cpus a vcpu can execute on. */ +/* Get/set which physical cpus a vcpu can or prefer execute on. 
*/ /* XEN_DOMCTL_setvcpuaffinity */ /* XEN_DOMCTL_getvcpuaffinity */ +/* XEN_DOMCTL_setvcpunodeaffinity */ +/* XEN_DOMCTL_getvcpunodeaffinity */ struct xen_domctl_vcpuaffinity { uint32_t vcpu; /* IN */ - struct xenctl_bitmap cpumap; /* IN/OUT */ + struct xenctl_bitmap map; /* IN/OUT */ }; typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t); @@ -920,6 +922,8 @@ struct xen_domctl { #define XEN_DOMCTL_set_broken_page_p2m 67 #define XEN_DOMCTL_setnodeaffinity 68 #define XEN_DOMCTL_getnodeaffinity 69 +#define XEN_DOMCTL_setvcpunodeaffinity 70 +#define XEN_DOMCTL_getvcpunodeaffinity 71 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c index fa0589a..d7cfeaf 100644 --- a/xen/xsm/flask/hooks.c +++ b/xen/xsm/flask/hooks.c @@ -616,10 +616,12 @@ static int flask_domctl(struct domain *d, int cmd) return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__UNPAUSE); case XEN_DOMCTL_setvcpuaffinity: + case XEN_DOMCTL_setvcpunodeaffinity: case XEN_DOMCTL_setnodeaffinity: return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__SETAFFINITY); case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_getvcpunodeaffinity: case XEN_DOMCTL_getnodeaffinity: return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__GETAFFINITY);
Dario Faggioli
2013-Oct-03 17:47 UTC
[PATCH 12/15] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
by providing the proper get/set interfaces and wiring them to the new domctl-s from the previous commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxc/xc_domain.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 19 +++++++++++ 2 files changed, 101 insertions(+) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index b36c2ad..b1be3ee 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -272,6 +272,88 @@ out: return ret; } +int xc_vcpu_setnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_cpumap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if ( local == NULL ) + { + PERROR("Could not allocate memory for setvcpunodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_setvcpunodeaffinity; + domctl.domain = (domid_t)domid; + domctl.u.vcpuaffinity.vcpu = vcpu; + + memcpy(local, nodemap, nodesize); + + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + + domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + xc_hypercall_buffer_free(xch, local); + + out: + return ret; +} + +int xc_vcpu_getnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if (local == NULL) + { + PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity; + domctl.domain = (domid_t)domid; + domctl.u.vcpuaffinity.vcpu = vcpu; + + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + memcpy(nodemap, local, nodesize); + + xc_hypercall_buffer_free(xch, local); +out: + return ret; +} + int xc_domain_get_guest_width(xc_interface *xch, uint32_t domid, unsigned int *guest_width) { diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 58d51f3..8ba260e 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch, uint32_t domind, xc_nodemap_t nodemap); +/** + * These functions set and retrieves the NUMA node-affinity + * of a specific vcpu. + * + * @parm xch a handle to an open hypervisor interface. + * @parm domid the domain id one is interested in. + * @parm vcpu the vcpu one wants to set/get the affinity of. + * @parm nodemap the map of the affine nodes. + * @return 0 on success, -1 on failure. + */ +int xc_vcpu_setnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap); +int xc_vcpu_getnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap); + int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu,
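
For illustration only (not part of this series): a minimal sketch of how a
caller might use the new libxc function to give vCPU 2 of an example domain
a node-affinity with NUMA node 1. It assumes this patch is applied, and that
xc_nodemap_alloc() returns a zeroed byte array with one bit per node, as the
other libxc map allocators do; domain id, vCPU and node numbers are
arbitrary examples.

    #include <stdio.h>
    #include <stdlib.h>
    #include <xenctrl.h>

    int main(void)
    {
        xc_interface *xch;
        xc_nodemap_t nodemap;
        uint32_t domid = 1;             /* example domain id */
        int vcpu = 2, node = 1, rc = 1;

        xch = xc_interface_open(NULL, NULL, 0);
        if ( !xch )
            return 1;

        nodemap = xc_nodemap_alloc(xch);
        if ( !nodemap )
            goto out;

        /* the map is a plain bitmap: set the bit for node 1 */
        nodemap[node / 8] |= 1 << (node % 8);

        rc = xc_vcpu_setnodeaffinity(xch, domid, vcpu, nodemap);
        if ( rc )
            fprintf(stderr, "xc_vcpu_setnodeaffinity failed: %d\n", rc);

        free(nodemap);
     out:
        xc_interface_close(xch);
        return rc ? 1 : 0;
    }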
Dario Faggioli
2013-Oct-03 17:47 UTC
[PATCH 13/15] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
by providing the proper get/set interfaces and wiring them to the new libxc calls from the previous commit. For the ''get'' part, the node-affinity of all the vcpus of a domain is also reported via libxl_list_vcpu() (exactly as it is happening for their vcpu affinity already), adding a specific field to libxl_vcpuinfo. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxl/libxl.c | 28 ++++++++++++++++++++++++++++ tools/libxl/libxl.h | 11 +++++++++++ tools/libxl/libxl_types.idl | 1 + 3 files changed, 40 insertions(+) diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index ca24ca3..bc2403b 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4187,6 +4187,10 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap"); return NULL; } + if (libxl_node_bitmap_alloc(ctx, &ptr->nodemap, 0)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating nodemap"); + return NULL; + } if (xc_vcpu_getinfo(ctx->xch, domid, *nb_vcpu, &vcpuinfo) == -1) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu info"); return NULL; @@ -4195,6 +4199,10 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu affinity"); return NULL; } + if (xc_vcpu_getnodeaffinity(ctx->xch, domid, *nb_vcpu, ptr->nodemap.map) == -1) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node-affinity"); + return NULL; + } ptr->vcpuid = *nb_vcpu; ptr->cpu = vcpuinfo.cpu; ptr->online = !!vcpuinfo.online; @@ -4230,6 +4238,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, return rc; } +int libxl_set_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + const libxl_bitmap *nodemap) +{ + if (xc_vcpu_setnodeaffinity(ctx->xch, domid, vcpuid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting vcpu node-affinity"); + return ERROR_FAIL; + } + return 0; +} + +int libxl_get_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + libxl_bitmap *nodemap) +{ + if (xc_vcpu_getnodeaffinity(ctx->xch, domid, vcpuid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu node-affinity"); + return ERROR_FAIL; + } + return 0; +} + int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap) { diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index 4cab294..5f3cd7b 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -82,6 +82,13 @@ #define LIBXL_HAVE_DOMAIN_NODEAFFINITY 1 /* + * LIBXL_HAVE_VCPUINFO_NODEAFFINITY indicates that a ''nodemap'' field + * (of libxl_bitmap type) is present in libxl_vcpuinfo, containing the + * node-affinity for the vcpu. + */ +#define LIBXL_HAVE_VCPUINFO_NODEAFFINITY + +/* * LIBXL_HAVE_BUILDINFO_HVM_VENDOR_DEVICE indicates that the * libxl_vendor_device field is present in the hvm sections of * libxl_domain_build_info. 
This field tells libxl which @@ -949,6 +956,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, libxl_bitmap *cpumap); int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, unsigned int max_vcpus, libxl_bitmap *cpumap); +int libxl_set_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + const libxl_bitmap *nodemap); +int libxl_get_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + libxl_bitmap *nodemap); int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap); int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index 049dbb5..8f52a5e 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -487,6 +487,7 @@ libxl_vcpuinfo = Struct("vcpuinfo", [ ("running", bool), ("vcpu_time", uint64), # total vcpu time ran (ns) ("cpumap", libxl_bitmap), # current cpu''s affinities + ("nodemap", libxl_bitmap), # current node-affinity ], dir=DIR_OUT) libxl_physinfo = Struct("physinfo", [
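
For illustration only (not part of this series): a minimal sketch of how an
application linked against libxl might use the new call to make vCPU 0 of an
example domain prefer NUMA node 0, assuming this patch is applied. The
logger setup mirrors what xl itself does; domain id, vCPU and node numbers
are arbitrary examples.

    #include <stdio.h>
    #include <libxl.h>
    #include <libxl_utils.h>
    #include <xentoollog.h>

    int main(void)
    {
        xentoollog_logger_stdiostream *lg;
        libxl_ctx *ctx = NULL;
        libxl_bitmap nodemap;
        uint32_t domid = 1;             /* example domain id */
        int rc = 1;

        lg = xtl_createlogger_stdiostream(stderr, XTL_ERROR, 0);
        if (!lg)
            return 1;

        if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0, (xentoollog_logger *)lg))
            goto out;

        libxl_bitmap_init(&nodemap);
        if (libxl_node_bitmap_alloc(ctx, &nodemap, 0))
            goto out;

        libxl_bitmap_set_none(&nodemap);
        libxl_bitmap_set(&nodemap, 0);  /* vCPU 0 prefers node 0 */

        rc = libxl_set_vcpunodeaffinity(ctx, domid, 0 /* vcpuid */, &nodemap);

        libxl_bitmap_dispose(&nodemap);
     out:
        if (ctx)
            libxl_ctx_free(ctx);
        xtl_logger_destroy((xentoollog_logger *)lg);
        return rc ? 1 : 0;
    }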
Dario Faggioli
2013-Oct-03 17:47 UTC
[PATCH 14/15] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
by showing it, upon request (''-n'' switch), as a part of the output of `xl vcpu-list''. Such output looks like this: # xl vcpu-list -n Name ID VCPU CPU State Time(s) CPU Affinity / NUMA Affinity Domain-0 0 0 8 -b- 4.4 any cpu / any node Domain-0 0 1 0 -b- 1.2 any cpu / any node vm-test 1 0 8 -b- 2.4 any cpu / 1 vm-test 1 1 13 -b- 3.3 any cpu / 1 The "vcpu-pinning / vcpu-affinity" part may indeed look weird, as it is different from the other fields. Unfortunately, it''s very hard to properly format it in columns, since there is no such things as minimal or maximal lengths for the bitmaps being printed. Specifying a particular node-affinity happens via a new xl command, `xl vcpu-node-affinity''. Introducing a "numa mode", to `xl vcpu-pin'' was the other option, but doing that would probably fuel the confusion between vcpu pinning and NUMA node-affinity, and that is certainly something we do not want. While at it, the implementation of `xl vcpu-list'' is reworked a little bit, making it more similar to the one of `xl list'' and more compliant with the libxl programming paradigm (e.g., regarding allocating at the beginning and freeing on exit). xl manual page is updated accordingly. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/man/xl.pod.1 | 25 +++++ tools/libxl/xl.h | 1 tools/libxl/xl_cmdimpl.c | 210 ++++++++++++++++++++++++++++++++++++--------- tools/libxl/xl_cmdtable.c | 9 ++ 4 files changed, 201 insertions(+), 44 deletions(-) diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1 index 5975d7b..a7bb18c 100644 --- a/docs/man/xl.pod.1 +++ b/docs/man/xl.pod.1 @@ -614,11 +614,21 @@ quietly ignored. Some guests may need to actually bring the newly added CPU online after B<vcpu-set>, go to B<SEE ALSO> section for information. -=item B<vcpu-list> [I<domain-id>] +=item B<vcpu-list> [I<OPTIONS>] [I<domain-id>, ...] Lists VCPU information for a specific domain. If no domain is specified, VCPU information for all domains will be provided. +B<OPTIONS> + +=over 4 + +=item B<-n>, B<--numa> + +Print the NUMA node-affinity of each VCPU. + +=back + =item B<vcpu-pin> I<domain-id> I<vcpu> I<cpus> Pins the VCPU to only run on the specific CPUs. The keyword @@ -630,6 +640,19 @@ different run state is appropriate. Pinning can be used to restrict this, by ensuring certain VCPUs can only run on certain physical CPUs. +=item B<vcpu-node-affinity> I<domain-id> I<vcpu> I<nodes> + +Specifies the I<nodes> on which the I<vcpu> prefers to run. The +keyword B<all> can be used to set this to I<nodes> for all the +VCPUs in the domain. + +Normally VCPUs can float within the set of CPUs that they are pinned +to (see B<vcpu-pin> above). Specifying a node-affinity does not change +that, but, if using the credit scheduler, the VCPUs will strongly prefer +running on the NUMA nodes they have node-affinity with. If using any +scheduler other than credit, node-affinity is just ignored (credit2 +support is planned but not ready yet). + =item B<vm-list> Prints information about guests. 
This list excludes information about diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h index e72a7d2..6ea2ce0 100644 --- a/tools/libxl/xl.h +++ b/tools/libxl/xl.h @@ -59,6 +59,7 @@ int main_create(int argc, char **argv); int main_config_update(int argc, char **argv); int main_button_press(int argc, char **argv); int main_vcpupin(int argc, char **argv); +int main_vcpuaff(int argc, char **argv); int main_vcpuset(int argc, char **argv); int main_memmax(int argc, char **argv); int main_memset(int argc, char **argv); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 33c0d0d..e358157 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -510,10 +510,32 @@ static int parse_range(const char *str, unsigned long *a, unsigned long *b) } /* - * Add or removes a specific set of cpus (specified in str, either as - * single cpus or as entire NUMA nodes) to/from cpumap. + * Set or reset some bits from bitmap, as specified in str. + * + * If str * is "all", bitmap gets fully set. If str contains a specific + * value or range (e.g., "2" or "4-6"), the corresponding bits are set (or + * unset, if the value/range is prefixed with a "^"). In this case, we don''t + * really care whether the value range is supposed to represent cpus or nodes + * (that is to say whether bitmap is supposed to be a nodemap or a cpumap), + * we just set and unset bits. + * + * In case str is prefixed by "nodes:" (or "^nodes:") we assume that the + * caller is dealing with a cpumap and wants to specify what bits to (re)set + * NUMA node-wise. We therefore act as follows: + * - consider the value/range a node or a set of nodes; + * - obtain the cpumap corresponding to such node/set of nodes; + * - set (or reset, in the "^" case) in bitmap what ended up + * in such cpumap. + * + * This means that, for instance, if calling this with str="3-6", bits + * 3-6 in bitmap will be set, and it''s up to the caller to know whether + * that meant cpus 3,4,5,6 or NUMA nodes 3,4,5,6. On the other hand, if + * calling this with str="nodes:2", bits corresponding to all the cpus + * of NUMA node 2 will be the ones that are set in bitmap. + * + * This may look tricky, but it avoids quite a bit of code duplication. */ -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap) { unsigned long ida, idb; libxl_bitmap node_cpumap; @@ -539,7 +561,7 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) } if (STR_HAS_PREFIX(str, "all")) { - libxl_bitmap_set_any(cpumap); + libxl_bitmap_set_any(bitmap); goto out; } @@ -563,13 +585,13 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) /* Add/Remove all the cpus in the node cpumap */ libxl_for_each_set_bit(i, node_cpumap) { - is_not ? libxl_bitmap_reset(cpumap, i) : - libxl_bitmap_set(cpumap, i); + is_not ? libxl_bitmap_reset(bitmap, i) : + libxl_bitmap_set(bitmap, i); } } else { /* Add/Remove this cpu */ - is_not ? libxl_bitmap_reset(cpumap, ida) : - libxl_bitmap_set(cpumap, ida); + is_not ? libxl_bitmap_reset(bitmap, ida) : + libxl_bitmap_set(bitmap, ida); } ida++; } @@ -580,18 +602,25 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) } /* - * Takes a string representing a set of cpus (specified either as - * single cpus or as eintire NUMA nodes) and turns it into the - * corresponding libxl_bitmap (in cpumap). 
+ * Takes a string representing one or more values or range of values + * separated by '','' (e.g., something like "3,2-6,8"), split it in substrings + * and, for each one of them, update bitmap to reflect that. This may result + * in bits in bitmap being both set or reset, since it is possible to append + * a not prefix ("^") to both values and ranges. + * + * It is also possible for the caller to indicate that a specific value or + * range should be treated specially, i.e., resulting in not just one bit + * (or not just the specified range of bits) to be set or reset. For details + * on that, see update_bitmap_range() above. */ -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) +static int parse_bitmap_range(char *cpu, libxl_bitmap *bitmap) { char *ptr, *saveptr = NULL; int rc = 0; for (ptr = strtok_r(cpu, ",", &saveptr); ptr; ptr = strtok_r(NULL, ",", &saveptr)) { - rc = update_cpumap_range(ptr, cpumap); + rc = update_bitmap_range(ptr, bitmap); if (rc) break; } @@ -778,7 +807,7 @@ static void parse_config_data(const char *config_source, } libxl_bitmap_set_none(&b_info->cpumap); - if (vcpupin_parse(buf2, &b_info->cpumap)) + if (parse_bitmap_range(buf2, &b_info->cpumap)) exit(1); free(buf2); @@ -4443,7 +4472,9 @@ int main_button_press(int argc, char **argv) static void print_vcpuinfo(uint32_t tdomid, const libxl_vcpuinfo *vcpuinfo, - uint32_t nr_cpus) + uint32_t nr_cpus, + uint32_t nr_nodes, + int numa) { char *domname; @@ -4465,10 +4496,18 @@ static void print_vcpuinfo(uint32_t tdomid, printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9)); /* CPU AFFINITY */ print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout); + /* NUMA Affinity*/ + if (numa) { + printf(" / "); + print_nodemap(vcpuinfo->nodemap.map, nr_nodes, stdout); + } printf("\n"); } -static void print_domain_vcpuinfo(uint32_t domid, uint32_t nr_cpus) +static void print_domain_vcpuinfo(uint32_t domid, + uint32_t nr_cpus, + uint32_t nr_nodes, + int numa) { libxl_vcpuinfo *vcpuinfo; int i, nb_vcpu, nrcpus; @@ -4481,55 +4520,67 @@ static void print_domain_vcpuinfo(uint32_t domid, uint32_t nr_cpus) } for (i = 0; i < nb_vcpu; i++) { - print_vcpuinfo(domid, &vcpuinfo[i], nr_cpus); + print_vcpuinfo(domid, &vcpuinfo[i], nr_cpus, nr_nodes, numa); } libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu); } -static void vcpulist(int argc, char **argv) +int main_vcpulist(int argc, char **argv) { libxl_dominfo *dominfo; libxl_physinfo physinfo; - int i, nb_domain; + int opt; + int numa = 0; + static struct option opts[] = { + {"numa", 0, 0, ''n''}, + COMMON_LONG_OPTS, + {0, 0, 0, 0} + }; + int rc = -1; + + SWITCH_FOREACH_OPT(opt, "n", opts, "vcpu-list", 0) { + case ''n'': + numa = 1; + break; + } if (libxl_get_physinfo(ctx, &physinfo) != 0) { fprintf(stderr, "libxl_physinfo failed.\n"); - goto vcpulist_out; + goto out; } - printf("%-32s %5s %5s %5s %5s %9s %s\n", + printf("%-32s %5s %5s %5s %5s %9s %s", "Name", "ID", "VCPU", "CPU", "State", "Time(s)", "CPU Affinity"); - if (!argc) { + if (numa) + printf(" / NUMA Affinity"); + printf("\n"); + + if (optind >= argc) { + int i, nb_domain; + if (!(dominfo = libxl_list_domain(ctx, &nb_domain))) { fprintf(stderr, "libxl_list_domain failed.\n"); - goto vcpulist_out; + goto out; } - for (i = 0; i<nb_domain; i++) - print_domain_vcpuinfo(dominfo[i].domid, physinfo.nr_cpus); + for (i = 0; i < nb_domain; i++) + print_domain_vcpuinfo(dominfo[i].domid, physinfo.nr_cpus, + physinfo.nr_nodes, numa); libxl_dominfo_list_free(dominfo, nb_domain); } else { - for (; argc > 0; ++argv, --argc) { - uint32_t domid = 
find_domain(*argv); - print_domain_vcpuinfo(domid, physinfo.nr_cpus); + for (; argc > optind; ++optind) { + uint32_t domid = find_domain(argv[optind]); + print_domain_vcpuinfo(domid, physinfo.nr_cpus, + physinfo.nr_nodes, numa); } } - vcpulist_out: - libxl_physinfo_dispose(&physinfo); -} - -int main_vcpulist(int argc, char **argv) -{ - int opt; - - SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-list", 0) { - /* No options */ - } - vcpulist(argc - optind, argv + optind); - return 0; + rc = 0; + out: + libxl_physinfo_dispose(&physinfo); + return rc; } static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) @@ -4555,7 +4606,7 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) goto out; - if (vcpupin_parse(cpu, &cpumap)) + if (parse_bitmap_range(cpu, &cpumap)) goto out; if (dryrun_only) { @@ -4606,6 +4657,69 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) return rc; } +static int vcpuaff(uint32_t domid, const char *vcpu, char *node) +{ + libxl_bitmap nodemap; + uint32_t vcpuid; + char *endptr; + int rc = -1; + + libxl_bitmap_init(&nodemap); + + vcpuid = strtoul(vcpu, &endptr, 10); + if (vcpu == endptr) { + if (strcmp(vcpu, "all")) { + fprintf(stderr, "Error: Invalid argument\n"); + goto out; + } + vcpuid = -1; + } + + if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) + goto out; + + if (parse_bitmap_range(node, &nodemap)) + goto out; + + if (dryrun_only) { + int nr_nodes = 0; + libxl_numainfo *info = libxl_get_numainfo(ctx, &nr_nodes); + + if (!info) { + fprintf(stderr, "libxl_get_numainfo failed\n"); + goto out; + } + libxl_numainfo_list_free(info, nr_nodes); + + fprintf(stdout, "nodemap: "); + print_nodemap(nodemap.map, nr_nodes, stdout); + fprintf(stdout, "\n"); + + if (ferror(stdout) || fflush(stdout)) { + perror("stdout"); + exit(-1); + } + + rc = 0; + goto out; + } + + if (vcpuid != -1) { + if (libxl_set_vcpunodeaffinity(ctx, domid, vcpuid, &nodemap)) + fprintf(stderr, "Could not set node-affinity for vcpu `%u''\n", + vcpuid); + goto out; + } + + if (libxl_domain_set_nodeaffinity(ctx, domid, &nodemap)) + fprintf(stderr, "libxl_domain_set_nodeaffinity failed\n"); + + rc = 0; + out: + libxl_bitmap_dispose(&nodemap); + return rc; +} + int main_vcpupin(int argc, char **argv) { int opt; @@ -4617,6 +4731,18 @@ int main_vcpupin(int argc, char **argv) return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); } +int main_vcpuaff(int argc, char **argv) +{ + int opt; + + SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-node-affinity", 3) { + /* No options */ + } + + vcpuaff(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); + return 0; +} + static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host) { char *endptr; diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c index d3dcbf0..d3dd7e6 100644 --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -208,13 +208,20 @@ struct cmd_spec cmd_table[] = { { "vcpu-list", &main_vcpulist, 0, 0, "List the VCPUs for all/some domains", - "[Domain, ...]", + "[option] [Domain, ...]", + "-n, --numa Show NUMA node-affinity", + }, { "vcpu-pin", &main_vcpupin, 1, 1, "Set which CPUs a VCPU can use", "<Domain> <VCPU|all> <CPUs|all>", }, + { "vcpu-node-affinity", + &main_vcpuaff, 1, 1, + "Set on which CPUs a VCPU prefers to run", + "<Domain> <VCPU|all> <CPUs|all>", + }, { "vcpu-set", &main_vcpuset, 0, 1, "Set the number of active VCPUs allowed for the domain",
Dario Faggioli
2013-Oct-03 17:47 UTC
[PATCH 15/15] xl: numa-sched: enable specifying node-affinity in VM config file
in a similar way to how it is possible to specify vcpu-affinity. Manual page is updated accordingly. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/man/xl.cfg.pod.5 | 70 +++++++++++++++++++++++++++++++++------ tools/libxl/libxl_dom.c | 18 ++++++---- tools/libxl/libxl_numa.c | 14 +------- tools/libxl/libxl_utils.h | 12 ++++++- tools/libxl/xl_cmdimpl.c | 80 ++++++++++++++++++++++++++++++++++++++++++++- 5 files changed, 159 insertions(+), 35 deletions(-) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index e63ee04..a3a2af1 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -144,18 +144,64 @@ run on cpu #3 of the host. =back If this option is not specified, no vcpu to cpu pinning is established, -and the vcpus of the guest can run on all the cpus of the host. - -If we are on a NUMA machine (i.e., if the host has more than one NUMA -node) and this option is not specified, libxl automatically tries to -place the guest on the least possible number of nodes. That, however, -will not affect vcpu pinning, so the guest will still be able to run on -all the cpus, it will just prefer the ones from the node it has been -placed on. A heuristic approach is used for choosing the best node (or -set of nodes), with the goals of maximizing performance for the guest -and, at the same time, achieving efficient utilization of host cpus -and memory. See F<docs/misc/xl-numa-placement.markdown> for more -details. +and the vcpus of the guest can run on all the cpus of the host. If this +option is specified, and no B<nodes=> option present the vcpu pinning +mask of each vcpu is utilized to compute its vcpu node-affinity, and the +union of all the vcpus node-affinities is what constitutes the domain +node-affinity (which drives memory allocations). + +=back + +=item B<nodes="NODE-LIST"> + +List of on which NUMA nodes the memory for the guest is allocated. This +also means (starting from Xen 4.3 and if the credit scheduler is used) +the vcpus of the domain prefers to run on the those NUMA nodes. Default is +xl (via libxl) guesses. A C<NODE-LIST> may be specified as follows: + +=item "all" + +To specify no particular preference and avoid xl to automatically pick +a (set of) NUMA ndoe(s). In practice, using "all", this domain will have +no NUMA node-affinity and it''s memory will be spread on all the host''s +NUMA nodes. + +=item "0-3,5,^1" + +To specify a NUMA node-affinity with the host NUMA nodes 0,2,3,5. +Combining this with "all" is possible, meaning "all,^7" results in the +memory being allocated on all the host NUMA nodes except node 7, as well +as trying to avoid running the domain''s vcpu on the pcpus that belong to +node 7. + +=item ["1", "4"] (or [1, 4]) + +To ask for specific vcpu to NUMA node mapping. That means (in this example), +memory will be allocated on host NUMA nodes 1 and 4 but, at the same time, +vcpu #0 of the guest prefers to run on the pcpus of host NUMA node 1, while +vcpu #1 on the pcpus of host NUMA node 4. + +=back + +If this option is not specified, xl picks up a NUMA node (or a set of NUMA +nodes), according to some heuristics, and use that as the NUMA node-affinity +for the guest. + +If we are on a NUMA machine (i.e., if the host has more than one NUMA node) +and this option is not specified, libxl automatically tries to place the +guest on the least possible number of nodes. 
A heuristic approach is used +for choosing the best node (or set of nodes), with the goals of maximizing +performance for the guest and, at the same time, achieving efficient +utilization of host cpus and memory. In this case, all the vcpus of the +guest will have the same vcpu node-affinity. + +Notice that, independently from whether the node-affinity is specified +via this parameter, or automatically decided by libxl, that does not affect +vcpu pinning, so the guest will still be able to run on all the cpus to +which its vcpus are pinned, or all the cpus, if no B<cpus=> option is +provided. + +See F<docs/misc/xl-numa-placement.markdown> for more details. =back diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 6e2252a..ffa8552 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -212,19 +212,21 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus); /* - * Check if the domain has any CPU affinity. If not, try to build - * up one. In case numa_place_domain() find at least a suitable - * candidate, it will affect info->nodemap accordingly; if it - * does not, it just leaves it as it is. This means (unless - * some weird error manifests) the subsequent call to - * libxl_domain_set_nodeaffinity() will do the actual placement, + * Check if the domain has any pinning or node-affinity and, if not, try + * to build up one. + * + * In case numa_place_domain() find at least a suitable candidate, it will + * affect info->nodemap accordingly; if it does not, it just leaves it as + * it is. This means (unless some weird error manifests) the subsequent + * call to libxl_domain_set_nodeaffinity() will do the actual placement, * whatever that turns out to be. */ if (libxl_defbool_val(info->numa_placement)) { - if (!libxl_bitmap_is_full(&info->cpumap)) { + if (!libxl_bitmap_is_full(&info->cpumap) || + !libxl_bitmap_is_full(&info->nodemap)) { LOG(ERROR, "Can run NUMA placement only if no vcpu " - "affinity is specified"); + "pinning or node-affinity is specified"); return ERROR_INVAL; } diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c index 20c99ac..1026579 100644 --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, int vcpus_on_node[]) { libxl_dominfo *dinfo = NULL; - libxl_bitmap dom_nodemap, nodes_counted; + libxl_bitmap nodes_counted; int nr_doms, nr_cpus; int i, j, k; @@ -197,12 +197,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, return ERROR_FAIL; } - if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { - libxl_bitmap_dispose(&nodes_counted); - libxl_dominfo_list_free(dinfo, nr_doms); - return ERROR_FAIL; - } - for (i = 0; i < nr_doms; i++) { libxl_vcpuinfo *vinfo; int nr_dom_vcpus; @@ -211,9 +205,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, if (vinfo == NULL) continue; - /* Retrieve the domain''s node-affinity map */ - libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap); - for (j = 0; j < nr_dom_vcpus; j++) { /* * For each vcpu of each domain, it must have both vcpu-affinity @@ -225,7 +216,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, int node = tinfo[k].node; if (libxl_bitmap_test(suitable_cpumap, k) && - libxl_bitmap_test(&dom_nodemap, node) && + libxl_bitmap_test(&vinfo[j].nodemap, node) && !libxl_bitmap_test(&nodes_counted, node)) { libxl_bitmap_set(&nodes_counted, node); 
vcpus_on_node[node]++; @@ -236,7 +227,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus); } - libxl_bitmap_dispose(&dom_nodemap); libxl_bitmap_dispose(&nodes_counted); libxl_dominfo_list_free(dinfo, nr_doms); return 0; diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h index 7b84e6a..cac057c 100644 --- a/tools/libxl/libxl_utils.h +++ b/tools/libxl/libxl_utils.h @@ -90,7 +90,7 @@ static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap) { memset(bitmap->map, 0, bitmap->size); } -static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit) +static inline int libxl_bitmap_valid(libxl_bitmap *bitmap, int bit) { return bit >= 0 && bit < (bitmap->size * 8); } @@ -125,6 +125,16 @@ static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx, return libxl_bitmap_alloc(ctx, nodemap, max_nodes); } +static inline int libxl_bitmap_cpu_valid(libxl_bitmap *cpumap, int cpu) +{ + return libxl_bitmap_valid(cpumap, cpu); +} + +static inline int libxl_bitmap_node_valid(libxl_bitmap *nodemap, int node) +{ + return libxl_bitmap_valid(nodemap, node); +} + /* Populate cpumap with the cpus spanned by the nodes in nodemap */ int libxl_nodemap_to_cpumap(libxl_ctx *ctx, const libxl_bitmap *nodemap, diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index e358157..65a34d3 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -76,8 +76,9 @@ xlchild children[child_max]; static const char *common_domname; static int fd_lock = -1; -/* Stash for specific vcpu to pcpu mappping */ +/* Stash for specific vcpu to pcpu and vcpu to node mappping */ static int *vcpu_to_pcpu; +static int *vcpu_to_node; static const char savefileheader_magic[32] "Xen saved domain, xl format\n \0 \r"; @@ -638,7 +639,7 @@ static void parse_config_data(const char *config_source, const char *buf; long l; XLU_Config *config; - XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms; + XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms; XLU_ConfigList *ioports, *irqs, *iomem; int num_ioports, num_irqs, num_iomem; int pci_power_mgmt = 0; @@ -814,6 +815,53 @@ static void parse_config_data(const char *config_source, libxl_defbool_set(&b_info->numa_placement, false); } + if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 1)) { + int n_cpus = 0; + + if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) { + fprintf(stderr, "Unable to allocate nodemap\n"); + exit(1); + } + + /* + * As above, use a temporary storage for the single vcpus'' + * node-affinities. 
+ */ + vcpu_to_node = xmalloc(sizeof(int) * b_info->max_vcpus); + memset(vcpu_to_node, -1, sizeof(int) * b_info->max_vcpus); + + libxl_bitmap_set_none(&b_info->nodemap); + while ((buf = xlu_cfg_get_listitem(nodes, n_cpus)) != NULL) { + i = atoi(buf); + if (!libxl_bitmap_node_valid(&b_info->nodemap, i)) { + fprintf(stderr, "node %d illegal\n", i); + exit(1); + } + libxl_bitmap_set(&b_info->nodemap, i); + if (n_cpus < b_info->max_vcpus) + vcpu_to_node[n_cpus] = i; + n_cpus++; + } + + /* We have a nodemap, disable automatic placement */ + libxl_defbool_set(&b_info->numa_placement, false); + } + else if (!xlu_cfg_get_string (config, "nodes", &buf, 0)) { + char *buf2 = strdup(buf); + + if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) { + fprintf(stderr, "Unable to allocate nodemap\n"); + exit(1); + } + + libxl_bitmap_set_none(&b_info->nodemap); + if (parse_bitmap_range(buf2, &b_info->nodemap)) + exit(1); + free(buf2); + + libxl_defbool_set(&b_info->numa_placement, false); + } + if (!xlu_cfg_get_long (config, "memory", &l, 0)) { b_info->max_memkb = l * 1024; b_info->target_memkb = b_info->max_memkb; @@ -2163,6 +2211,34 @@ start: free(vcpu_to_pcpu); vcpu_to_pcpu = NULL; } + /* And do the same for single vcpu to node-affinity mapping */ + if (vcpu_to_node) { + libxl_bitmap vcpu_nodemap; + + ret = libxl_node_bitmap_alloc(ctx, &vcpu_nodemap, 0); + if (ret) + goto error_out; + for (i = 0; i < d_config.b_info.max_vcpus; i++) { + + if (vcpu_to_node[i] != -1) { + libxl_bitmap_set_none(&vcpu_nodemap); + libxl_bitmap_set(&vcpu_nodemap, vcpu_to_node[i]); + } else { + libxl_bitmap_set_any(&vcpu_nodemap); + } + if (libxl_set_vcpunodeaffinity(ctx, domid, i, &vcpu_nodemap)) { + fprintf(stderr, "setting node-affinity failed" + " on vcpu `%d''.\n", i); + libxl_bitmap_dispose(&vcpu_nodemap); + free(vcpu_to_node); + ret = ERROR_FAIL; + goto error_out; + } + } + libxl_bitmap_dispose(&vcpu_nodemap); + free(vcpu_to_node); vcpu_to_node = NULL; + } + ret = libxl_userdata_store(ctx, domid, "xl", config_data, config_len); if (ret) {
On Thu, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote:
> which was preventing `xl vcpu-list -h'' to work.

Preventing as in it didn''t work? Or did you mean "causing it to print
the wrong command name"?

> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

> ---
>  tools/libxl/xl_cmdimpl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 642b130..1fe6caa 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -4465,7 +4465,7 @@ int main_vcpulist(int argc, char **argv)
> {
>     int opt;
>
> -    SWITCH_FOREACH_OPT(opt, "", NULL, "cpu-list", 0) {
> +    SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-list", 0) {
>         /* No options */
>     }
>
Ian Campbell
2013-Oct-10 10:32 UTC
Re: [PATCH 07/15] xl: test script for the cpumap parser (for vCPU pinning)
On Thu, 2013-10-03 at 19:46 +0200, Dario Faggioli wrote:
> This commit introduces "check-xl-vcpupin-parse" for helping
> verifying and debugging the (v)CPU bitmap parsing code in xl.
>
> The script runs "xl -N vcpu-pin 0 all <some strings>"
> repeatedly, with various input strings, and checks that the
> output is as expected.
>
> This is what the script can do:
>
> # ./check-xl-vcpupin-parse -h
> usage: ./check-xl-vcpupin-parse [options]
>
> Tests various vcpu-pinning strings. If run without arguments acts
> as follows:
>  - generates some test data and saves them in
>    check-xl-vcpupin-parse.data;
>  - tests all the generated configurations (reading them back from
>    check-xl-vcpupin-parse.data).
>
> An example of a test vector file is provided in
> check-xl-vcpupin-parse.data-example.
>
> Options:
>  -h         prints this message
>  -r seed    uses seed for initializing the rundom number generator
>             (default: the script PID)
>  -s string  tries using string as a vcpu pinning configuration and
>             reports whether that succeeds or not
>  -o ofile   save the test data in ofile
>             (default: check-xl-vcpupin-parse.data)
>  -i ifile   read test data from ifile
>
> An example test data file (generated on a 2 NUMA nodes, 16 CPUs
> host) is being provided in check-xl-vcpupin-parse.data-example.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

(I haven''t actually reviewed this in much detail, any test case is
better than none IMHO)

[...]

> +#
> +# Testing a wring configuration

There''s a certain irony in spelling wrong incorrectly I think ;-)
Ian Campbell
2013-Oct-10 11:43 UTC
Re: [PATCH 01/15] xl: update the manpage about "cpus=" and NUMA node-affinity
On Thu, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote:
> Since d06b1bf169a01a9c7b0947d7825e58cb455a0ba5 (''libxl: automatic placement
> deals with node-affinity'') it is no longer true that, if no "cpus=" option
> is specified, xl picks up some pCPUs by default and pin the domain there.
>
> In fact, it is the NUMA node-affinity that is affected by automatic
> placement, not vCPU to pCPU pinning.
>
> Update the xl config file documenation accordingly, as it seems to have
> been forgotten at that time.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

applied since it seems pretty independent.

> ---
> Picking this up from a previously submitted series ("xl:
> allow for node-wise specification of vcpu pinning") as the
> changes in that and in this series would otherwise be
> conflicting. If this is considered fine, Feel free to apply
> it from here and skip the corresponding e-mail in the
> original submission.
> ---
>  docs/man/xl.cfg.pod.5 | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> index d57cd4d..cd96969 100644
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -104,8 +104,8 @@ created online and the remainder will be offline.
>
> =item B<cpus="CPU-LIST">
>
> -List of which cpus the guest is allowed to use. By default xl will pick
> -some cpus on its own (see below). A C<CPU-LIST> may be specified as follows:
> +List of which cpus the guest is allowed to use. Default is no pinning at
> +all (more on this below). A C<CPU-LIST> may be specified as follows:
>
> =over 4
>
> @@ -125,11 +125,19 @@ run on cpu #3 of the host.
>
> =back
>
> -If this option is not specified, libxl automatically tries to place the new
> -domain on the host''s NUMA nodes (provided the host has more than one NUMA
> -node) by pinning it to the cpus of those nodes. A heuristic approach is
> -utilized with the goals of maximizing performance for the domain and, at
> -the same time, achieving efficient utilization of the host''s CPUs and RAM.
> +If this option is not specified, no vcpu to cpu pinning is established,
> +and the vcpus of the guest can run on all the cpus of the host.
> +
> +If we are on a NUMA machine (i.e., if the host has more than one NUMA
> +node) and this option is not specified, libxl automatically tries to
> +place the guest on the least possible number of nodes. That, however,
> +will not affect vcpu pinning, so the guest will still be able to run on
> +all the cpus, it will just prefer the ones from the node it has been
> +placed on. A heuristic approach is used for choosing the best node (or
> +set of nodes), with the goals of maximizing performance for the guest
> +and, at the same time, achieving efficient utilization of host cpus
> +and memory. See F<docs/misc/xl-numa-placement.markdown> for more
> +details.
>
> =back
>
On Thu, 2013-10-10 at 11:29 +0100, Ian Campbell wrote:
> On Thu, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote:
> > which was preventing `xl vcpu-list -h'' to work.
>
> Preventing as in it didn''t work? Or did you mean "causing it to print
> the wrong command name"?
>
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>

and applied regardless of the above.

> > ---
> >  tools/libxl/xl_cmdimpl.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> > index 642b130..1fe6caa 100644
> > --- a/tools/libxl/xl_cmdimpl.c
> > +++ b/tools/libxl/xl_cmdimpl.c
> > @@ -4465,7 +4465,7 @@ int main_vcpulist(int argc, char **argv)
> > {
> >     int opt;
> >
> > -    SWITCH_FOREACH_OPT(opt, "", NULL, "cpu-list", 0) {
> > +    SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-list", 0) {
> >         /* No options */
> >     }
> >
Ian Campbell
2013-Oct-10 11:44 UTC
Re: [PATCH 04/15] libxl: introduce libxl_node_to_cpumap
On Thu, 2013-10-03 at 19:46 +0200, Dario Faggioli wrote:
> As an helper for the special case (of libxl_nodemap_to_cpumap) when
> one wants the cpumap for just one node.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

applied.
On gio, 2013-10-10 at 11:29 +0100, Ian Campbell wrote:
> On Thu, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote:
> > which was preventing `xl vcpu-list -h'' to work.
>
> Preventing as in it didn''t work? Or did you mean "causing it to print
> the wrong command name"?
>
Here''s what I get:

 # xl vcpu-list -h
 command "cpu-list" not implemented

So, yes, "preventing" as it does not work right now.

> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
>
Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Oct-10 13:55 UTC
Re: [PATCH 01/15] xl: update the manpage about "cpus=" and NUMA node-affinity
On gio, 2013-10-10 at 12:43 +0100, Ian Campbell wrote:
> On Thu, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote:
> > Since d06b1bf169a01a9c7b0947d7825e58cb455a0ba5 (''libxl: automatic placement
> > deals with node-affinity'') it is no longer true that, if no "cpus=" option
> > is specified, xl picks up some pCPUs by default and pin the domain there.
> >
> > In fact, it is the NUMA node-affinity that is affected by automatic
> > placement, not vCPU to pCPU pinning.
> >
> > Update the xl config file documenation accordingly, as it seems to have
> > been forgotten at that time.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
>
> applied since it seems pretty independent.
>
It is indeed, thanks.

Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
On Thu, 2013-10-10 at 15:54 +0200, Dario Faggioli wrote:
> On gio, 2013-10-10 at 11:29 +0100, Ian Campbell wrote:
> > On Thu, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote:
> > > which was preventing `xl vcpu-list -h'' to work.
> >
> > Preventing as in it didn''t work? Or did you mean "causing it to print
> > the wrong command name"?
>
> Here''s what I get:
>
> # xl vcpu-list -h
> command "cpu-list" not implemented
>
> So, yes, "preventing" as it does not work right now.

Ah, because it looks up the help text by name. Makes sense, sorry for
the noise.

>
> > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > > Acked-by: Ian Campbell <ian.campbell@citrix.com>
> >
> Thanks,
> Dario
>
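For readers following along, the failure mode above boils down to a name-keyed lookup: the help text is found by searching a command table for the string handed to the option-parsing helper, so "cpu-list" simply misses. The snippet below is a simplified stand-in with invented names (cmd_entry, find_cmd, print_help), not xl's real command-table code.

    /*
     * Simplified stand-in for a name-keyed command table, to illustrate why
     * passing "cpu-list" instead of "vcpu-list" to the help lookup fails.
     * All names and structures here are invented for the example.
     */
    #include <stdio.h>
    #include <string.h>

    struct cmd_entry {
        const char *name;
        const char *help;
    };

    static const struct cmd_entry cmd_table[] = {
        { "vcpu-list", "List the VCPUs for all/some domains." },
        { "vcpu-pin",  "Set which CPUs a VCPU can use."       },
    };

    static const struct cmd_entry *find_cmd(const char *name)
    {
        for (size_t i = 0; i < sizeof(cmd_table) / sizeof(cmd_table[0]); i++)
            if (!strcmp(cmd_table[i].name, name))
                return &cmd_table[i];
        return NULL;
    }

    static void print_help(const char *name)
    {
        const struct cmd_entry *c = find_cmd(name);

        if (!c)
            printf("command \"%s\" not implemented\n", name);
        else
            printf("%s: %s\n", c->name, c->help);
    }

    int main(void)
    {
        print_help("cpu-list");   /* the typo: the lookup misses */
        print_help("vcpu-list");  /* the fix: the lookup succeeds */
        return 0;
    }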
Dario Faggioli
2013-Oct-10 14:35 UTC
Re: [PATCH 07/15] xl: test script for the cpumap parser (for vCPU pinning)
On gio, 2013-10-10 at 11:32 +0100, Ian Campbell wrote:
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Acked-by: Ian Campbell <ian.campbell@citrix.com>
>
Ok.

> (I haven''t actually reviewed this in much detail, any test case is
> better than none IMHO)
> [...]
>
:-)

> > +#
> > +# Testing a wring configuration
> There''s a certain irony in spelling wrong incorrectly I think ;-)
>
Ah, you noticed it! Yeah, because, of course, I did this on purpose,
what were you thinking? :-P :-P

Anyway, note taken. If it goes in like this, I''ll fix it with a follow
up patch as soon as I get the chance. If I end up reposting, I''ll fix it
there.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
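For context on what the test script exercises: the parser being tested turns CPU-LIST strings such as "0-3,^1,6" into a bitmap of allowed pCPUs. The toy parser below only illustrates that idea; it is not libxl's parser (no "all", no node-wise syntax, only minimal error handling), and everything in it is made up for the example.

    /*
     * Toy CPU-LIST parser, only to illustrate the kind of strings the test
     * script feeds to xl's real parser.  Supports "N", "N-M" and "^N"
     * elements separated by commas; nothing else.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_CPUS 64

    static int parse_cpu_list(const char *s, unsigned long long *map)
    {
        char *copy = strdup(s), *tok, *save = NULL;

        *map = 0;
        for (tok = strtok_r(copy, ",", &save); tok;
             tok = strtok_r(NULL, ",", &save)) {
            int neg = (*tok == '^');
            int lo, hi, n;

            if (neg)
                tok++;
            n = sscanf(tok, "%d-%d", &lo, &hi);
            if (n < 1) {
                free(copy);
                return -1;
            }
            if (n == 1)
                hi = lo;
            if (lo < 0 || hi >= MAX_CPUS || lo > hi) {
                free(copy);
                return -1;
            }
            for (int c = lo; c <= hi; c++)
                if (neg)
                    *map &= ~(1ULL << c);   /* "^N": remove cpu N */
                else
                    *map |= 1ULL << c;      /* "N" or "N-M": add cpus */
        }
        free(copy);
        return 0;
    }

    int main(void)
    {
        unsigned long long map;

        if (parse_cpu_list("0-3,^1,6", &map) == 0)
            printf("cpumap = 0x%llx\n", map);   /* 0x4d: cpus 0,2,3,6 */
        return 0;
    }

A data-driven test script like the one in this patch mostly consists of pairs of such input strings and the expected resulting maps, including deliberately malformed ("wrong") inputs that the parser must reject.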
Ian Jackson
2013-Oct-14 16:42 UTC
Re: [PATCH 07/15] xl: test script for the cpumap parser (for vCPU pinning)
Dario Faggioli writes ("Re: [PATCH 07/15] xl: test script for the cpumap parser (for vCPU pinning)"):
> Anyway, note taken. If it goes in like this, I''ll fix it with a follow
> up patch as soon as I get the chance. If I end up reposting, I''ll fix it
> there.

Please do repost, with Ian C''s ack.

Ian.
Dario Faggioli
2013-Oct-14 17:00 UTC
Re: [PATCH 07/15] xl: test script for the cpumap parser (for vCPU pinning)
On lun, 2013-10-14 at 17:42 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("Re: [PATCH 07/15] xl: test script for the cpumap parser (for vCPU pinning)"):
> > Anyway, note taken. If it goes in like this, I''ll fix it with a follow
> > up patch as soon as I get the chance. If I end up reposting, I''ll fix it
> > there.
>
> Please do repost, with Ian C''s ack.
>
Ok, I will. However, I think I''ll wait a bit more, to gather some
feedback on the core changes of this series. For example (and I don''t
want to push, this is just a reminder :-)), what do you think of this
patch, as well as of patches 5, 12, 13, 14 and 15 (you already acked
1, 4 and 6) in this series?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-05 11:29 UTC
Re: [PATCH 00/15] Implement per-vcpu NUMA node-affinity for credit1
On gio, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote:> Hi everyone, >Ping? Despite being a rather big series, it shouldn''t be too controversial. For that reason, I was hoping hat it could go in before code freeze, but I need some review ASAP. :-P Some minor and independent patches have been already applied by IanC... I''m happy to resend the series, if that would be helpful, just let me know. Thanks and Regards, Dario> So, this series introduces the concept of per-vcpu NUMA node-affinity. In fact, > up to now, node-affinity has only been "per-domain". That means it was the > domain that had a node-affinity and: > - that node-affinity was used to decide where to allocate the memory for the > domain; > - that node-affinity was used to decide on what nodes _all_ the vcpus of the > domain prefer to be scheduled. > > After this series this changes like this: > - each vcpu of a domain has (well, may have) its own node-affinity, and that > is what is used to determine (if the credit1 scheduler is used) where each > specific vcpu prefers to run; > - the node-affinity of the whole domain is the _union_ of all the > node-affinities of the domain''s vcpus; > - the memory is still allocated following what the node-affinity of the whole > domain (so, the union of vcpu node-affinities, as said above) says. > > In practise, it''s not such a big change, I''m just extending at the per-vcpu > level what we already had at the domain level. This is also making > node-affinity a lot more similar to vcpu-pinning, both in terms of functioning > and user interface. As a side efect, that simplify the scheduling code (at > least the NUMA-aware part) by quite a bit. Finally, and most important, this is > something that will become really important when we will start to support > virtual NUMA topologies, as, a that point, having the same node-affinity for > all the vcpus in a domain won''t be enough any longer (we''ll want the vcpus from > a particular vnode to have node-afinity with a particular pnode). > > More detailed description of the mechanism and of the implementation choices > are provided in the changelogs and in the documentation (docs/misc and > manpages). > > One last thing is that this series relies on some other patches and series that > I sent on xen-devel already, but have not been applied yet. I''m re-sending > them here, as a part of this series, so feel free to pick them up from here, if > wanting to apply them, or comment on them in this thread, if you want me to > change them. In particular, patches 01 and 03, I already sent as single > patches, patches 04-07, I already sent them as a series. Sorry if that is a bit > clumsy, but I couldn''t find a better way to do it. :-) > > In the detailed list of patches below, ''x'' means previously submitted, ''*'' > means already acked/reviewed-by. > > Finally, Elena, that is not super important, but perhaps, in the next release > of your vNUMA series, you could try to integrate it with this (and of course, > ask if you need anything while trying to do that). > > Matt, if/when you eventually get to release, even as RFC or something like > that, your HVM vNUMA series, we can try to figure out how to integrate that > with this, so to use node-affinity instead than pinning. 
> > The series is also available at the following git coordinates: > > git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v1 > http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/numa/per-vcpu-affinity-v1 > > Let me know what you think about all this. > > Regards, > Dario > > PS. Someone of you probably received part of this series as a direct message > (i.e., with your address in ''To'', rather than in ''Cc''). I''m sincerely sorry for > that, messed up with `stg mail''... Won''t happen again, I promise! :-P > > --- > > Dario Faggioli (15): > x * xl: update the manpage about "cpus=" and NUMA node-affinity > xl: fix a typo in main_vcpulist() > x * xen: numa-sched: leave node-affinity alone if not in "auto" mode > x * libxl: introduce libxl_node_to_cpumap > x xl: allow for node-wise specification of vcpu pinning > x * xl: implement and enable dryrun mode for `xl vcpu-pin'' > x xl: test script for the cpumap parser (for vCPU pinning) > xen: numa-sched: make space for per-vcpu node-affinity > xen: numa-sched: domain node-affinity always comes from vcpu node-affinity > xen: numa-sched: use per-vcpu node-affinity for actual scheduling > xen: numa-sched: enable getting/specifying per-vcpu node-affinity > libxc: numa-sched: enable getting/specifying per-vcpu node-affinity > libxl: numa-sched: enable getting/specifying per-vcpu node-affinity > xl: numa-sched: enable getting/specifying per-vcpu node-affinity > xl: numa-sched: enable specifying node-affinity in VM config file > > > docs/man/xl.cfg.pod.5 | 88 ++++ > docs/man/xl.pod.1 | 25 + > docs/misc/xl-numa-placement.markdown | 124 ++++-- > tools/libxc/xc_domain.c | 90 ++++- > tools/libxc/xenctrl.h | 19 + > tools/libxl/check-xl-vcpupin-parse | 294 +++++++++++++++ > tools/libxl/check-xl-vcpupin-parse.data-example | 53 +++ > tools/libxl/libxl.c | 28 + > tools/libxl/libxl.h | 11 + > tools/libxl/libxl_dom.c | 18 + > tools/libxl/libxl_numa.c | 14 - > tools/libxl/libxl_types.idl | 1 > tools/libxl/libxl_utils.c | 22 + > tools/libxl/libxl_utils.h | 15 + > tools/libxl/xl.h | 1 > tools/libxl/xl_cmdimpl.c | 458 +++++++++++++++++++---- > tools/libxl/xl_cmdtable.c | 11 - > xen/common/domain.c | 97 ++--- > xen/common/domctl.c | 47 ++ > xen/common/keyhandler.c | 6 > xen/common/sched_credit.c | 63 --- > xen/common/schedule.c | 55 +++ > xen/include/public/domctl.h | 8 > xen/include/xen/sched-if.h | 2 > xen/include/xen/sched.h | 13 + > xen/xsm/flask/hooks.c | 2 > 26 files changed, 1282 insertions(+), 283 deletions(-) > create mode 100755 tools/libxl/check-xl-vcpupin-parse > create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example >
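As a purely conceptual illustration of the rule restated in the quoted overview (each vCPU gets its own node-affinity, and the domain-wide node-affinity used for memory allocation is the union of those), here is a tiny sketch with node-affinities modelled as plain bitmasks. It is just the set arithmetic, not the hypervisor implementation, and the topology in it is invented.

    /*
     * Conceptual sketch only: domain node-affinity as the union (bitwise OR)
     * of the per-vCPU node-affinities.  Nodes are modelled as bits in an
     * unsigned long; none of this is the actual Xen code.
     */
    #include <stdio.h>

    #define NODE(n) (1UL << (n))

    int main(void)
    {
        /* A made-up 4-vCPU guest on a 4-node host. */
        unsigned long vcpu_node_affinity[4] = {
            NODE(0),            /* vCPU 0 prefers node 0        */
            NODE(0),            /* vCPU 1 prefers node 0        */
            NODE(2) | NODE(3),  /* vCPU 2 prefers nodes 2 and 3 */
            NODE(3),            /* vCPU 3 prefers node 3        */
        };
        unsigned long domain_node_affinity = 0;

        /* The domain-wide mask, used for memory allocation, is the union. */
        for (int v = 0; v < 4; v++)
            domain_node_affinity |= vcpu_node_affinity[v];

        printf("domain node-affinity mask: 0x%lx\n", domain_node_affinity);
        /* prints 0xd, i.e. nodes {0, 2, 3} */
        return 0;
    }

The scheduler side of the series then looks at each vCPU's own mask when deciding where that vCPU prefers to run, while memory keeps being allocated according to the union.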
George Dunlap
2013-Nov-05 11:32 UTC
Re: [PATCH 00/15] Implement per-vcpu NUMA node-affinity for credit1
On 11/05/2013 11:29 AM, Dario Faggioli wrote:> On gio, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote: >> Hi everyone, >> > Ping? > > Despite being a rather big series, it shouldn''t be too controversial. > For that reason, I was hoping hat it could go in before code freeze, but > I need some review ASAP. :-P > > Some minor and independent patches have been already applied by IanC... > I''m happy to resend the series, if that would be helpful, just let me > know.This was basically next on my list. :-) Have you rebased locally -- does this still apply fairly cleanly to tip? If not, a re-send would be nice; otherwise I can just pull your git tree and rebase myself. -George> > Thanks and Regards, > Dario > >> So, this series introduces the concept of per-vcpu NUMA node-affinity. In fact, >> up to now, node-affinity has only been "per-domain". That means it was the >> domain that had a node-affinity and: >> - that node-affinity was used to decide where to allocate the memory for the >> domain; >> - that node-affinity was used to decide on what nodes _all_ the vcpus of the >> domain prefer to be scheduled. >> >> After this series this changes like this: >> - each vcpu of a domain has (well, may have) its own node-affinity, and that >> is what is used to determine (if the credit1 scheduler is used) where each >> specific vcpu prefers to run; >> - the node-affinity of the whole domain is the _union_ of all the >> node-affinities of the domain''s vcpus; >> - the memory is still allocated following what the node-affinity of the whole >> domain (so, the union of vcpu node-affinities, as said above) says. >> >> In practise, it''s not such a big change, I''m just extending at the per-vcpu >> level what we already had at the domain level. This is also making >> node-affinity a lot more similar to vcpu-pinning, both in terms of functioning >> and user interface. As a side efect, that simplify the scheduling code (at >> least the NUMA-aware part) by quite a bit. Finally, and most important, this is >> something that will become really important when we will start to support >> virtual NUMA topologies, as, a that point, having the same node-affinity for >> all the vcpus in a domain won''t be enough any longer (we''ll want the vcpus from >> a particular vnode to have node-afinity with a particular pnode). >> >> More detailed description of the mechanism and of the implementation choices >> are provided in the changelogs and in the documentation (docs/misc and >> manpages). >> >> One last thing is that this series relies on some other patches and series that >> I sent on xen-devel already, but have not been applied yet. I''m re-sending >> them here, as a part of this series, so feel free to pick them up from here, if >> wanting to apply them, or comment on them in this thread, if you want me to >> change them. In particular, patches 01 and 03, I already sent as single >> patches, patches 04-07, I already sent them as a series. Sorry if that is a bit >> clumsy, but I couldn''t find a better way to do it. :-) >> >> In the detailed list of patches below, ''x'' means previously submitted, ''*'' >> means already acked/reviewed-by. >> >> Finally, Elena, that is not super important, but perhaps, in the next release >> of your vNUMA series, you could try to integrate it with this (and of course, >> ask if you need anything while trying to do that). 
>> >> Matt, if/when you eventually get to release, even as RFC or something like >> that, your HVM vNUMA series, we can try to figure out how to integrate that >> with this, so to use node-affinity instead than pinning. >> >> The series is also available at the following git coordinates: >> >> git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v1 >> http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/numa/per-vcpu-affinity-v1 >> >> Let me know what you think about all this. >> >> Regards, >> Dario >> >> PS. Someone of you probably received part of this series as a direct message >> (i.e., with your address in ''To'', rather than in ''Cc''). I''m sincerely sorry for >> that, messed up with `stg mail''... Won''t happen again, I promise! :-P >> >> --- >> >> Dario Faggioli (15): >> x * xl: update the manpage about "cpus=" and NUMA node-affinity >> xl: fix a typo in main_vcpulist() >> x * xen: numa-sched: leave node-affinity alone if not in "auto" mode >> x * libxl: introduce libxl_node_to_cpumap >> x xl: allow for node-wise specification of vcpu pinning >> x * xl: implement and enable dryrun mode for `xl vcpu-pin'' >> x xl: test script for the cpumap parser (for vCPU pinning) >> xen: numa-sched: make space for per-vcpu node-affinity >> xen: numa-sched: domain node-affinity always comes from vcpu node-affinity >> xen: numa-sched: use per-vcpu node-affinity for actual scheduling >> xen: numa-sched: enable getting/specifying per-vcpu node-affinity >> libxc: numa-sched: enable getting/specifying per-vcpu node-affinity >> libxl: numa-sched: enable getting/specifying per-vcpu node-affinity >> xl: numa-sched: enable getting/specifying per-vcpu node-affinity >> xl: numa-sched: enable specifying node-affinity in VM config file >> >> >> docs/man/xl.cfg.pod.5 | 88 ++++ >> docs/man/xl.pod.1 | 25 + >> docs/misc/xl-numa-placement.markdown | 124 ++++-- >> tools/libxc/xc_domain.c | 90 ++++- >> tools/libxc/xenctrl.h | 19 + >> tools/libxl/check-xl-vcpupin-parse | 294 +++++++++++++++ >> tools/libxl/check-xl-vcpupin-parse.data-example | 53 +++ >> tools/libxl/libxl.c | 28 + >> tools/libxl/libxl.h | 11 + >> tools/libxl/libxl_dom.c | 18 + >> tools/libxl/libxl_numa.c | 14 - >> tools/libxl/libxl_types.idl | 1 >> tools/libxl/libxl_utils.c | 22 + >> tools/libxl/libxl_utils.h | 15 + >> tools/libxl/xl.h | 1 >> tools/libxl/xl_cmdimpl.c | 458 +++++++++++++++++++---- >> tools/libxl/xl_cmdtable.c | 11 - >> xen/common/domain.c | 97 ++--- >> xen/common/domctl.c | 47 ++ >> xen/common/keyhandler.c | 6 >> xen/common/sched_credit.c | 63 --- >> xen/common/schedule.c | 55 +++ >> xen/include/public/domctl.h | 8 >> xen/include/xen/sched-if.h | 2 >> xen/include/xen/sched.h | 13 + >> xen/xsm/flask/hooks.c | 2 >> 26 files changed, 1282 insertions(+), 283 deletions(-) >> create mode 100755 tools/libxl/check-xl-vcpupin-parse >> create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example >> > >
Dario Faggioli
2013-Nov-05 14:05 UTC
Re: [PATCH 00/15] Implement per-vcpu NUMA node-affinity for credit1
On mar, 2013-11-05 at 11:32 +0000, George Dunlap wrote:
> On 11/05/2013 11:29 AM, Dario Faggioli wrote:
> > On gio, 2013-10-03 at 19:45 +0200, Dario Faggioli wrote:
> >> Hi everyone,
> >>
> > Ping?
> >
> > Despite being a rather big series, it shouldn''t be too controversial.
> > For that reason, I was hoping hat it could go in before code freeze, but
> > I need some review ASAP. :-P
> >
> > Some minor and independent patches have been already applied by IanC...
> > I''m happy to resend the series, if that would be helpful, just let me
> > know.
>
> This was basically next on my list. :-)
>
Ok.

> Have you rebased locally -- does this still apply fairly cleanly to tip?
> If not, a re-send would be nice; otherwise I can just pull your git
> tree and rebase myself.
>
It mostly does still apply, but a couple of the patches have already
been applied by Ian, since they are rather independent from the rest of
the series. I guess it''s best if I resend it; that would hopefully make
things easier for everyone. :-)

Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
George Dunlap
2013-Nov-05 14:21 UTC
Re: [PATCH 03/15] xen: numa-sched: leave node-affinity alone if not in "auto" mode
On 10/03/2013 06:45 PM, Dario Faggioli wrote:> If the domain''s NUMA node-affinity is being specified by the > user/toolstack (instead of being automatically computed by Xen), > we really should stick to that. This means domain_update_node_affinity() > is wrong when it filters out some stuff from there even in "!auto" > mode. > > This commit fixes that. Of course, this does not mean node-affinity > is always honoured (e.g., a vcpu won''t run on a pcpu of a different > cpupool) but the necessary logic for taking into account all the > possible situations lives in the scheduler code, where it belongs. > > What could happen without this change is that, under certain > circumstances, the node-affinity of a domain may change when the > user modifies the vcpu-affinity of the domain''s vcpus. This, even > if probably not a real bug, is at least something the user does > not expect, so let''s avoid it. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> > --- > This has been submitted already as a single patch on its own. > Since this series needs the change done here, just include it > in here, instead of pinging the original submission and deferring > posting this series.Kier / Jan: And as this is actually independent of the rest of the series, and already has my reviewed-by, it can be applied right now without waiting for the rest of the series to be reviewed. -George
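To make the intended behaviour of patch 03 concrete, here is a minimal sketch of the logic being argued for, using invented names (toy_domain, auto_node_affinity and the helper below) rather than Xen's actual structures and functions: the node-affinity is recomputed from the vCPUs' affinity only while it is in "auto" mode, and an explicitly set node-affinity is left alone, even if vCPU pinning changes.

    /*
     * Sketch of the behaviour discussed for patch 03 (names invented for
     * the example): recompute the node-affinity from the vCPU affinities
     * only when it is in "auto" mode; a node-affinity explicitly set by
     * the user/toolstack is neither filtered nor overridden.
     */
    #include <stdio.h>
    #include <stdbool.h>

    struct toy_domain {
        bool auto_node_affinity;     /* true: Xen derives it itself    */
        unsigned long node_affinity; /* bitmask of NUMA nodes          */
    };

    /* Pretend mapping from the vCPUs' current CPU affinity to a node mask. */
    static unsigned long nodes_spanned_by_vcpu_affinity(void)
    {
        return 0x3;                  /* say the pinned CPUs span nodes 0,1 */
    }

    static void update_node_affinity(struct toy_domain *d)
    {
        if (d->auto_node_affinity)
            d->node_affinity = nodes_spanned_by_vcpu_affinity();
        /* else: the user/toolstack chose it, leave it alone */
    }

    int main(void)
    {
        struct toy_domain d = { .auto_node_affinity = false,
                                .node_affinity = 0x4 /* user said node 2 */ };

        update_node_affinity(&d);
        printf("node-affinity stays 0x%lx in !auto mode\n", d.node_affinity);

        d.auto_node_affinity = true;
        update_node_affinity(&d);
        printf("node-affinity becomes 0x%lx in auto mode\n", d.node_affinity);
        return 0;
    }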
Jan Beulich
2013-Nov-05 14:37 UTC
Re: [PATCH 03/15] xen: numa-sched: leave node-affinity alone if not in "auto" mode
>>> On 05.11.13 at 15:21, George Dunlap <george.dunlap@eu.citrix.com> wrote: > On 10/03/2013 06:45 PM, Dario Faggioli wrote: >> If the domain''s NUMA node-affinity is being specified by the >> user/toolstack (instead of being automatically computed by Xen), >> we really should stick to that. This means domain_update_node_affinity() >> is wrong when it filters out some stuff from there even in "!auto" >> mode. >> >> This commit fixes that. Of course, this does not mean node-affinity >> is always honoured (e.g., a vcpu won''t run on a pcpu of a different >> cpupool) but the necessary logic for taking into account all the >> possible situations lives in the scheduler code, where it belongs. >> >> What could happen without this change is that, under certain >> circumstances, the node-affinity of a domain may change when the >> user modifies the vcpu-affinity of the domain''s vcpus. This, even >> if probably not a real bug, is at least something the user does >> not expect, so let''s avoid it. >> >> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> >> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> >> --- >> This has been submitted already as a single patch on its own. >> Since this series needs the change done here, just include it >> in here, instead of pinging the original submission and deferring >> posting this series. > > Kier / Jan: And as this is actually independent of the rest of the > series, and already has my reviewed-by, it can be applied right now > without waiting for the rest of the series to be reviewed.Right. It needs Keir''s ack in any case. But then again this was patch 3 in a series posted over a month ago (and being a mishmash of hypervisor and tools patches), so I would have - considered it stale by now, expecting a re-submission - expected to be told that it can be applied on its own Jan
George Dunlap
2013-Nov-05 14:45 UTC
Re: [PATCH 03/15] xen: numa-sched: leave node-affinity alone if not in "auto" mode
On 11/05/2013 02:37 PM, Jan Beulich wrote:>>>> On 05.11.13 at 15:21, George Dunlap <george.dunlap@eu.citrix.com> wrote: >> On 10/03/2013 06:45 PM, Dario Faggioli wrote: >>> If the domain''s NUMA node-affinity is being specified by the >>> user/toolstack (instead of being automatically computed by Xen), >>> we really should stick to that. This means domain_update_node_affinity() >>> is wrong when it filters out some stuff from there even in "!auto" >>> mode. >>> >>> This commit fixes that. Of course, this does not mean node-affinity >>> is always honoured (e.g., a vcpu won''t run on a pcpu of a different >>> cpupool) but the necessary logic for taking into account all the >>> possible situations lives in the scheduler code, where it belongs. >>> >>> What could happen without this change is that, under certain >>> circumstances, the node-affinity of a domain may change when the >>> user modifies the vcpu-affinity of the domain''s vcpus. This, even >>> if probably not a real bug, is at least something the user does >>> not expect, so let''s avoid it. >>> >>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> >>> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> >>> --- >>> This has been submitted already as a single patch on its own. >>> Since this series needs the change done here, just include it >>> in here, instead of pinging the original submission and deferring >>> posting this series. >> >> Kier / Jan: And as this is actually independent of the rest of the >> series, and already has my reviewed-by, it can be applied right now >> without waiting for the rest of the series to be reviewed. > > Right. It needs Keir''s ack in any case. > > But then again this was patch 3 in a series posted over a month > ago (and being a mishmash of hypervisor and tools patches), so > I would have > - considered it stale by now, expecting a re-submission > - expected to be told that it can be applied on its ownYes, that''s what I was doing. :-) -George
Dario Faggioli
2013-Nov-05 14:49 UTC
Re: [PATCH 03/15] xen: numa-sched: leave node-affinity alone if not in "auto" mode
On mar, 2013-11-05 at 14:37 +0000, Jan Beulich wrote:
> >>> On 05.11.13 at 15:21, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > Kier / Jan: And as this is actually independent of the rest of the
> > series, and already has my reviewed-by, it can be applied right now
> > without waiting for the rest of the series to be reviewed.
>
> Right. It needs Keir''s ack in any case.
>
Ok.

> But then again this was patch 3 in a series posted over a month
> ago (and being a mishmash of hypervisor and tools patches), so
> I would have
> - considered it stale by now, expecting a re-submission
>
Which I just did. :-)

> - expected to be told that it can be applied on its own
>
Yeah, that comes from my having included the patch in the series (to
which it is relevant, even if not strictly related). I tried to say so
in the changelog, but it probably was not explicit enough. I had
submitted the patch on its own before and, when sending out the series,
I embedded it there to minimize the chance of conflicts wasting
people''s time, whichever of the two (the standalone patch or the
series) got applied first. Sorry... I was just trying to make life as
easy as possible for everyone. :-)

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel