Dario Faggioli
2013-Nov-05 14:33 UTC
[PATCH RESEND 00/12] Implement per-vcpu NUMA node-affinity for credit1
Hi,

This is basically a resend of http://lists.xen.org/archives/html/xen-devel/2013-10/msg00164.html, as someone requested, for easier review. Nothing has changed, apart from the fact that I removed from the series the patches that have already been applied. The git branch is now this one:

 git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-RESEND

The original cover letter also follows.

Thanks and Regards,
Dario

---

Hi everyone,

This series introduces the concept of per-vcpu NUMA node-affinity. Up to now, node-affinity has only been "per-domain". That means it was the domain that had a node-affinity, and:
 - that node-affinity was used to decide where to allocate the memory for the domain;
 - that node-affinity was used to decide on what nodes _all_ the vcpus of the domain prefer to be scheduled.

After this series, things change as follows:
 - each vcpu of a domain has (well, may have) its own node-affinity, and that is what is used to determine (if the credit1 scheduler is used) where each specific vcpu prefers to run;
 - the node-affinity of the whole domain is the _union_ of all the node-affinities of the domain's vcpus;
 - the memory is still allocated according to the node-affinity of the whole domain (i.e., the union of the vcpu node-affinities, as said above).

In practice, it's not such a big change: I'm just extending to the per-vcpu level what we already had at the domain level. This also makes node-affinity a lot more similar to vcpu-pinning, both in terms of behaviour and user interface. As a side effect, it simplifies the scheduling code (at least the NUMA-aware part) by quite a bit. Finally, and most importantly, this is something that will become really important when we start to support virtual NUMA topologies, as, at that point, having the same node-affinity for all the vcpus in a domain won't be enough any longer (we'll want the vcpus from a particular vnode to have node-affinity with a particular pnode).
More detailed descriptions of the mechanism and of the implementation choices are provided in the changelogs and in the documentation (docs/misc and manpages).

One last thing: this series relies on some other patches and series that I have already sent to xen-devel, but that have not been applied yet. I'm re-sending them here, as part of this series, so feel free to pick them up from here, if you want to apply them, or comment on them in this thread, if you want me to change them. In particular, patches 01 and 03 were already sent as single patches, and patches 04-07 were already sent as a series. Sorry if that is a bit clumsy, but I couldn't find a better way to do it. :-) In the detailed list of patches below, 'x' means previously submitted, '*' means already acked/reviewed-by.

Finally, Elena, this is not super important, but perhaps, in the next release of your vNUMA series, you could try to integrate it with this (and, of course, ask if you need anything while trying to do that). Matt, if/when you eventually release, even as an RFC or something like that, your HVM vNUMA series, we can try to figure out how to integrate that with this, so as to use node-affinity instead of pinning.

The series is also available at the following git coordinates:

 git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v1
 http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/numa/per-vcpu-affinity-v1

Let me know what you think about all this.
Regards,
Dario

---

Dario Faggioli (12):
 x    xen: numa-sched: leave node-affinity alone if not in "auto" mode
      xl: allow for node-wise specification of vcpu pinning
 x    xl: implement and enable dryrun mode for `xl vcpu-pin'
 x    xl: test script for the cpumap parser (for vCPU pinning)
      xen: numa-sched: make space for per-vcpu node-affinity
      xen: numa-sched: domain node-affinity always comes from vcpu node-affinity
      xen: numa-sched: use per-vcpu node-affinity for actual scheduling
      xen: numa-sched: enable getting/specifying per-vcpu node-affinity
      libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
      libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
      xl: numa-sched: enable getting/specifying per-vcpu node-affinity
      xl: numa-sched: enable specifying node-affinity in VM config file

 docs/man/xl.cfg.pod.5                           |   90 ++++-
 docs/man/xl.pod.1                               |   25 +
 docs/misc/xl-numa-placement.markdown            |  124 ++++--
 tools/libxc/xc_domain.c                         |   90 ++++-
 tools/libxc/xenctrl.h                           |   19 +
 tools/libxl/check-xl-vcpupin-parse              |  294 +++++++++++++++
 tools/libxl/check-xl-vcpupin-parse.data-example |   53 +++
 tools/libxl/libxl.c                             |   28 +
 tools/libxl/libxl.h                             |   11 +
 tools/libxl/libxl_dom.c                         |   18 +
 tools/libxl/libxl_numa.c                        |   14 -
 tools/libxl/libxl_types.idl                     |    1
 tools/libxl/libxl_utils.h                       |   12 +
 tools/libxl/xl.h                                |    1
 tools/libxl/xl_cmdimpl.c                        |  458 +++++++++++++++++++----
 tools/libxl/xl_cmdtable.c                       |   11 -
 xen/common/domain.c                             |   97 ++---
 xen/common/domctl.c                             |   47 ++
 xen/common/keyhandler.c                         |    6
 xen/common/sched_credit.c                       |   63 ---
 xen/common/schedule.c                           |   55 +++
 xen/include/public/domctl.h                     |    8
 xen/include/xen/sched-if.h                      |    2
 xen/include/xen/sched.h                         |   13 +
 xen/xsm/flask/hooks.c                           |    2
 25 files changed, 1254 insertions(+), 288 deletions(-)
 create mode 100755 tools/libxl/check-xl-vcpupin-parse
 create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example
Dario Faggioli
2013-Nov-05 14:34 UTC
[PATCH RESEND 01/12] xen: numa-sched: leave node-affinity alone if not in "auto" mode
If the domain's NUMA node-affinity is being specified by the user/toolstack (instead of being automatically computed by Xen), we really should stick to that. This means domain_update_node_affinity() is wrong when it filters out some stuff from there even in "!auto" mode. This commit fixes that. Of course, this does not mean node-affinity is always honoured (e.g., a vcpu won't run on a pcpu of a different cpupool), but the necessary logic for taking all the possible situations into account lives in the scheduler code, where it belongs.

What could happen without this change is that, under certain circumstances, the node-affinity of a domain may change when the user modifies the vcpu-affinity of the domain's vcpus. This, even if probably not a real bug, is at least something the user does not expect, so let's avoid it.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
This has been submitted already as a single patch on its own. Since this series needs the change done here, it is included here too, instead of pinging the original submission and deferring posting this series.
---
 xen/common/domain.c |   28 +++++++++-------------------
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 5999779..af31ab4 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -352,7 +352,6 @@ void domain_update_node_affinity(struct domain *d)
     cpumask_var_t cpumask;
     cpumask_var_t online_affinity;
     const cpumask_t *online;
-    nodemask_t nodemask = NODE_MASK_NONE;
     struct vcpu *v;
     unsigned int node;
@@ -374,28 +373,19 @@ void domain_update_node_affinity(struct domain *d)
         cpumask_or(cpumask, cpumask, online_affinity);
     }
 
+    /*
+     * If d->auto_node_affinity is true, the domain's node-affinity mask
+     * (d->node_affinity) is automatically computed from all the domain's
+     * vcpus' vcpu-affinity masks (the union of which we have just built
+     * above in cpumask). OTOH, if d->auto_node_affinity is false, we
+     * must leave the node-affinity of the domain alone.
+     */
     if ( d->auto_node_affinity )
     {
-        /* Node-affinity is automaically computed from all vcpu-affinities */
+        nodes_clear(d->node_affinity);
         for_each_online_node ( node )
             if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
-                node_set(node, nodemask);
-
-        d->node_affinity = nodemask;
-    }
-    else
-    {
-        /* Node-affinity is provided by someone else, just filter out cpus
-         * that are either offline or not in the affinity of any vcpus. */
-        nodemask = d->node_affinity;
-        for_each_node_mask ( node, d->node_affinity )
-            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
-                node_clear(node, nodemask);//d->node_affinity);
-
-        /* Avoid loosing track of node-affinity because of a bad
-         * vcpu-affinity has been specified. */
-        if ( !nodes_empty(nodemask) )
-            d->node_affinity = nodemask;
+                node_set(node, d->node_affinity);
     }
 
     sched_set_node_affinity(d, &d->node_affinity);
Dario Faggioli
2013-Nov-05 14:34 UTC
[PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...
in both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning"), as the changes in that series and in this one would otherwise conflict. If this is considered fine, feel free to apply it from here and skip the corresponding e-mail in the original submission.
---
Changes from v2:
 * turned a 'return' into 'goto out', consistently with most of the exit patterns;
 * harmonized error handling: parse_range() now returns a libxl error code, as requested during review;
 * dealing with "all" moved inside update_cpumap_range(). It's tricky to move it into parse_range() (as requested during review), since we need the cpumap being modified handy when dealing with it. However, having it in update_cpumap_range() simplifies the code just as much;
 * explicitly checking for junk after a valid value or range in parse_range(), as requested during review;
 * xl exits on parsing failures, so there is no need to reset the cpumap to something sensible in vcpupin_parse(), as suggested during review.

Changes from v1:
 * code rearranged in order to look simpler to follow and understand, as requested during review;
 * improved docs in xl.cfg.pod.5, as requested during review;
 * strtoul() now returns into an unsigned long, and the case where it returns ULONG_MAX is now taken into account, as requested during review;
 * stuff like "all,^7" now works, as requested during review. Specifying just "^7" does not work either before or after this change;
 * killed some magic (i.e., `ptr += 5 + (ptr[4] == 's')`) by introducing the STR_SKIP_PREFIX() macro, as requested during review.
---
 docs/man/xl.cfg.pod.5    |   20 ++++++
 tools/libxl/xl_cmdimpl.c |  145 ++++++++++++++++++++++++++++++++--------------
 2 files changed, 121 insertions(+), 44 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index d2d8921..1c98cb4 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -115,7 +115,25 @@ To allow all the vcpus of the guest to run on all the cpus on the host.
 
 =item "0-3,5,^1"
 
-To allow all the vcpus of the guest to run on cpus 0,2,3,5.
+To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining
+this with "all" is possible, meaning "all,^7" results in all the vcpus
+of the guest running on all the cpus on the host except cpu 7.
+
+=item "nodes:0-3,node:^2"
+
+To allow all the vcpus of the guest to run on the cpus from NUMA nodes
+0,1,3 of the host. So, if cpus 0-3 belong to node 0, cpus 4-7 belong
+to node 1 and cpus 8-11 to node 3, the above would mean all the vcpus
+of the guest will run on cpus 0-3,8-11.
+
+Combining this notation with the one above is possible. For instance,
+"1,node:2,^6", means all the vcpus of the guest will run on cpu 1 and
+on all the cpus of NUMA node 2, but not on cpu 6. Following the same
+example as above, that would be cpus 1,4,5,7.
+
+Combining this with "all" is also possible, meaning "all,^nodes:1"
+results in all the vcpus of the guest running on all the cpus on the
+host, except for the cpus belonging to the host NUMA node 1.
 
 =item ["2", "3"] (or [2, 3])
 
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 40feb7d..b8755b9 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -59,6 +59,11 @@
     } \
 })
 
+#define STR_HAS_PREFIX( a, b ) \
+    ( strncmp(a, b, strlen(b)) == 0 )
+#define STR_SKIP_PREFIX( a, b ) \
+    ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL )
+
 int logfile = 2;
 
@@ -513,61 +518,115 @@ static void split_string_into_string_list(const char *str,
     free(s);
 }
 
-static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
+static int parse_range(const char *str, unsigned long *a, unsigned long *b)
+{
+    char *nstr, *endptr;
+
+    *a = *b = strtoul(str, &endptr, 10);
+    if (endptr == str || *a == ULONG_MAX)
+        return ERROR_INVAL;
+
+    if (*endptr == '-') {
+        nstr = endptr + 1;
+
+        *b = strtoul(nstr, &endptr, 10);
+        if (endptr == nstr || *b == ULONG_MAX || *b < *a)
+            return ERROR_INVAL;
+    }
+
+    /* Valid value or range so far, but we also don't want junk after that */
+    if (*endptr != '\0')
+        return ERROR_INVAL;
+
+    return 0;
+}
+
+/*
+ * Adds or removes a specific set of cpus (specified in str, either as
+ * single cpus or as entire NUMA nodes) to/from cpumap.
+ */
+static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
 {
-    libxl_bitmap exclude_cpumap;
-    uint32_t cpuida, cpuidb;
-    char *endptr, *toka, *tokb, *saveptr = NULL;
-    int i, rc = 0, rmcpu;
+    unsigned long ida, idb;
+    libxl_bitmap node_cpumap;
+    bool is_not = false, is_nodes = false;
+    int rc = 0;
+
+    libxl_bitmap_init(&node_cpumap);
 
-    if (!strcmp(cpu, "all")) {
+    rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0);
+    if (rc) {
+        fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
+        goto out;
+    }
+
+    /* Are we adding or removing cpus/nodes? */
+    if (STR_SKIP_PREFIX(str, "^")) {
+        is_not = true;
+    }
+
+    /* Are we dealing with cpus or full nodes? */
+    if (STR_SKIP_PREFIX(str, "nodes:")) {
+        is_nodes = true;
+    }
+
+    if (STR_HAS_PREFIX(str, "all")) {
         libxl_bitmap_set_any(cpumap);
-        return 0;
+        goto out;
     }
 
-    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) {
-        fprintf(stderr, "Error: Failed to allocate cpumap.\n");
-        return ENOMEM;
+    rc = parse_range(str, &ida, &idb);
+    if (rc) {
+        fprintf(stderr, "Invalid pcpu range: %s.\n", str);
+        goto out;
     }
 
-    for (toka = strtok_r(cpu, ",", &saveptr); toka;
-         toka = strtok_r(NULL, ",", &saveptr)) {
-        rmcpu = 0;
-        if (*toka == '^') {
-            /* This (These) Cpu(s) will be removed from the map */
-            toka++;
-            rmcpu = 1;
-        }
-        /* Extract a valid (range of) cpu(s) */
-        cpuida = cpuidb = strtoul(toka, &endptr, 10);
-        if (endptr == toka) {
-            fprintf(stderr, "Error: Invalid argument.\n");
-            rc = EINVAL;
-            goto vcpp_out;
-        }
-        if (*endptr == '-') {
-            tokb = endptr + 1;
-            cpuidb = strtoul(tokb, &endptr, 10);
-            if (endptr == tokb || cpuida > cpuidb) {
-                fprintf(stderr, "Error: Invalid argument.\n");
-                rc = EINVAL;
-                goto vcpp_out;
+    /* Add or remove the specified cpus in the range */
+    while (ida <= idb) {
+        if (is_nodes) {
+            /* Add/Remove all the cpus of a NUMA node */
+            int i;
+
+            rc = libxl_node_to_cpumap(ctx, ida, &node_cpumap);
+            if (rc) {
+                fprintf(stderr, "libxl_node_to_cpumap failed.\n");
+                goto out;
             }
+
+            /* Add/Remove all the cpus in the node cpumap */
+            libxl_for_each_set_bit(i, node_cpumap) {
+                is_not ? libxl_bitmap_reset(cpumap, i) :
+                         libxl_bitmap_set(cpumap, i);
+            }
+        } else {
+            /* Add/Remove this cpu */
+            is_not ? libxl_bitmap_reset(cpumap, ida) :
+                     libxl_bitmap_set(cpumap, ida);
         }
-        while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
-                         libxl_bitmap_set(&exclude_cpumap, cpuida);
-            cpuida++;
-        }
+        ida++;
     }
 
-    /* Clear all the cpus from the removal list */
-    libxl_for_each_set_bit(i, exclude_cpumap) {
-        libxl_bitmap_reset(cpumap, i);
-    }
+ out:
+    libxl_bitmap_dispose(&node_cpumap);
+    return rc;
+}
+
+/*
+ * Takes a string representing a set of cpus (specified either as
+ * single cpus or as entire NUMA nodes) and turns it into the
+ * corresponding libxl_bitmap (in cpumap).
+ */
+static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
+{
+    char *ptr, *saveptr = NULL;
+    int rc = 0;
 
-vcpp_out:
-    libxl_bitmap_dispose(&exclude_cpumap);
+    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
+         ptr = strtok_r(NULL, ",", &saveptr)) {
+        rc = update_cpumap_range(ptr, cpumap);
+        if (rc)
+            break;
+    }
 
     return rc;
 }
Dario Faggioli
2013-Nov-05 14:34 UTC
[PATCH RESEND 03/12] xl: implement and enable dryrun mode for `xl vcpu-pin'
As it can be useful to see whether the outcome of some complex vCPU pinning bitmap specification looks as expected. This also allows for the introduction of some automatic testing and verification for the bitmap parsing code, as happens already in check-xl-disk-parse and check-xl-vif-parse.

In particular, to make the above possible, this commit also changes the implementation of the vcpu-pin command so that, instead of always returning 0, it returns an error if the parsing fails.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning"), as the changes in that series and in this one would otherwise conflict. If this is considered fine, feel free to apply it from here and skip the corresponding e-mail in the original submission.
---
Changes since v2:
 * fixed a typo in the changelog
---
 tools/libxl/xl_cmdimpl.c  |   49 +++++++++++++++++++++++++++++++------------
 tools/libxl/xl_cmdtable.c |    2 +-
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index b8755b9..f6239d5 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4572,30 +4572,53 @@ int main_vcpulist(int argc, char **argv)
     return 0;
 }
 
-static void vcpupin(uint32_t domid, const char *vcpu, char *cpu)
+static int vcpupin(uint32_t domid, const char *vcpu, char *cpu)
 {
     libxl_vcpuinfo *vcpuinfo;
     libxl_bitmap cpumap;
     uint32_t vcpuid;
     char *endptr;
-    int i, nb_vcpu;
+    int i = 0, nb_vcpu, rc = -1;
+
+    libxl_bitmap_init(&cpumap);
 
     vcpuid = strtoul(vcpu, &endptr, 10);
     if (vcpu == endptr) {
         if (strcmp(vcpu, "all")) {
             fprintf(stderr, "Error: Invalid argument.\n");
-            return;
+            goto out;
         }
         vcpuid = -1;
     }
 
-    if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) {
-        goto vcpupin_out;
-    }
+    if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0))
+        goto out;
 
     if (vcpupin_parse(cpu, &cpumap))
-        goto vcpupin_out1;
+        goto out;
+
+    if (dryrun_only) {
+        libxl_cputopology *info = libxl_get_cpu_topology(ctx, &i);
+
+        if (!info) {
+            fprintf(stderr, "libxl_get_cpu_topology failed.\n");
+            goto out;
+        }
+        libxl_cputopology_list_free(info, i);
+
+        fprintf(stdout, "cpumap: ");
+        print_cpumap(cpumap.map, i, stdout);
+        fprintf(stdout, "\n");
+
+        if (ferror(stdout) || fflush(stdout)) {
+            perror("stdout");
+            exit(-1);
+        }
+
+        rc = 0;
+        goto out;
+    }
 
     if (vcpuid != -1) {
         if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap) == -1) {
@@ -4605,7 +4628,7 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu)
     else {
         if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) {
             fprintf(stderr, "libxl_list_vcpu failed.\n");
-            goto vcpupin_out1;
+            goto out;
         }
         for (i = 0; i < nb_vcpu; i++) {
             if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid,
@@ -4616,10 +4639,11 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu)
         }
         libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu);
     }
-  vcpupin_out1:
+
+    rc = 0;
+ out:
     libxl_bitmap_dispose(&cpumap);
-  vcpupin_out:
-    ;
+    return rc;
 }
 
 int main_vcpupin(int argc, char **argv)
@@ -4630,8 +4654,7 @@ int main_vcpupin(int argc, char **argv)
         /* No options */
     }
 
-    vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]);
-    return 0;
+    return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]);
 }
 
 static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host)
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 326a660..d3dcbf0 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -211,7 +211,7 @@ struct cmd_spec cmd_table[] = {
       "[Domain, ...]",
     },
     { "vcpu-pin",
-      &main_vcpupin, 0, 1,
+      &main_vcpupin, 1, 1,
       "Set which CPUs a VCPU can use",
       "<Domain> <VCPU|all> <CPUs|all>",
     },
Dario Faggioli
2013-Nov-05 14:34 UTC
[PATCH RESEND 04/12] xl: test script for the cpumap parser (for vCPU pinning)
This commit introduces "check-xl-vcpupin-parse", for helping verify and debug the (v)CPU bitmap parsing code in xl. The script runs `xl -N vcpu-pin 0 all <some strings>` repeatedly, with various input strings, and checks that the output is as expected.

This is what the script can do:

 # ./check-xl-vcpupin-parse -h
 usage: ./check-xl-vcpupin-parse [options]

 Tests various vcpu-pinning strings. If run without arguments acts
 as follows:
  - generates some test data and saves them in
    check-xl-vcpupin-parse.data;
  - tests all the generated configurations (reading them back from
    check-xl-vcpupin-parse.data).

 An example of a test vector file is provided in
 check-xl-vcpupin-parse.data-example.

 Options:
  -h         prints this message
  -r seed    uses seed for initializing the random number generator
             (default: the script PID)
  -s string  tries using string as a vcpu pinning configuration and
             reports whether that succeeds or not
  -o ofile   save the test data in ofile
             (default: check-xl-vcpupin-parse.data)
  -i ifile   read test data from ifile

An example test data file (generated on a host with 2 NUMA nodes and 16 CPUs) is provided in check-xl-vcpupin-parse.data-example.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning"), as the changes in that series and in this one would otherwise conflict. If this is considered fine, feel free to apply it from here and skip the corresponding e-mail in the original submission.
---
Changes from v2:
 * killed the `sleep 1', as requested during review;
 * allow for passing a custom random seed, and report the actual random seed used, as requested during review;
 * allow for testing specific pinning configuration strings, as suggested during review;
 * store the test data in a file, after generating them, and read them back from there for actual testing, as suggested during review;
 * allow for reading the test data from an existing test file instead of always generating new ones.

Changes from v1:
 * this was not there in v1; adding it was requested during review.
---
 tools/libxl/check-xl-vcpupin-parse              |  294 +++++++++++++++++++++++
 tools/libxl/check-xl-vcpupin-parse.data-example |   53 ++++
 2 files changed, 347 insertions(+)
 create mode 100755 tools/libxl/check-xl-vcpupin-parse
 create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example

diff --git a/tools/libxl/check-xl-vcpupin-parse b/tools/libxl/check-xl-vcpupin-parse
new file mode 100755
index 0000000..6189660
--- /dev/null
+++ b/tools/libxl/check-xl-vcpupin-parse
@@ -0,0 +1,294 @@
+#!/bin/bash
+
+set -e
+
+if [ -x ./xl ] ; then
+    export LD_LIBRARY_PATH=.:../libxc:../xenstore:
+    XL=./xl
+else
+    XL=xl
+fi
+
+fprefix=tmp.check-xl-vcpupin-parse
+outfile=check-xl-vcpupin-parse.data
+
+usage () {
+cat <<END
+usage: $0 [options]
+
+Tests various vcpu-pinning strings. If run without arguments acts
+as follows:
+ - generates some test data and saves them in $outfile;
+ - tests all the generated configurations (reading them back from
+   $outfile).
+
+An example of a test vector file is provided in ${outfile}-example.
+
+Options:
+ -h         prints this message
+ -r seed    uses seed for initializing the random number generator
+            (default: the script PID)
+ -s string  tries using string as a vcpu pinning configuration and
+            reports whether that succeeds or not
+ -o ofile   save the test data in ofile (default: $outfile)
+ -i ifile   read test data from ifile
+END
+}
+
+expected () {
+    cat >$fprefix.expected
+}
+
+# by default, re-seed with our PID
+seed=$$
+failures=0
+
+# Execute one test and check the result against the provided
+# rc value and output
+one () {
+    expected_rc=$1; shift
+    printf "test case %s...\n" "$*"
+    set +e
+    ${XL} -N vcpu-pin 0 all "$@" </dev/null >$fprefix.actual 2>/dev/null
+    actual_rc=$?
+    if [ $actual_rc != $expected_rc ]; then
+        diff -u $fprefix.expected $fprefix.actual
+        echo >&2 "test case \`$*' failed ($actual_rc $diff_rc)"
+        failures=$(( $failures + 1 ))
+    fi
+    set -e
+}
+
+# Write an entry in the test vector file. Format is as follows:
+#  test-string*expected-rc*expected-output
+write () {
+    printf "$1*$2*$3\n" >> $outfile
+}
+
+complete () {
+    if [ "$failures" = 0 ]; then
+        echo all ok.; exit 0
+    else
+        echo "$failures tests failed."; exit 1
+    fi
+}
+
+# Test a specific pinning string
+string () {
+    expected_rc=$1; shift
+    printf "test case %s...\n" "$*"
+    set +e
+    ${XL} -N vcpu-pin 0 all "$@" &> /dev/null
+    actual_rc=$?
+    set -e
+
+    if [ $actual_rc != $expected_rc ]; then
+        echo >&2 "test case \`$*' failed ($actual_rc)"
+    else
+        echo >&2 "test case \`$*' succeeded"
+    fi
+
+    exit 0
+}
+
+# Read a test vector file (provided as $1) line by line and
+# test all the entries it contains
+run () {
+    while read line
+    do
+        if [ ${line:0:1} != '#' ]; then
+            test_string="`echo $line | cut -f1 -d'*'`"
+            exp_rc="`echo $line | cut -f2 -d'*'`"
+            exp_output="`echo $line | cut -f3 -d'*'`"
+
+            expected <<END
+$exp_output
+END
+            one $exp_rc "$test_string"
+        fi
+    done < $1
+
+    complete
+
+    exit 0
+}
+
+while getopts "hr:s:o:i:" option
+do
+    case $option in
+    h)
+        usage
+        exit 0
+        ;;
+    r)
+        seed=$OPTARG
+        ;;
+    s)
+        string 0 "$OPTARG"
+        ;;
+    o)
+        outfile=$OPTARG
+        ;;
+    i)
+        run $OPTARG
+        ;;
+    esac
+done
+
+#---------- test data ----------
+#
+nr_cpus=`xl info | grep nr_cpus | cut -f2 -d':'`
+nr_nodes=`xl info | grep nr_nodes | cut -f2 -d':'`
+nr_cpus_per_node=`xl info -n | sed '/cpu:/,/numa_info/!d' | head -n -1 | \
+    awk '{print $4}' | uniq -c | tail -1 | awk '{print $1}'`
+cat >$outfile <<END
+# WARNING: some of these tests are topology based tests.
+# Expect failures if the topology is not detected correctly
+# detected topology: $nr_cpus CPUs, $nr_nodes nodes, $nr_cpus_per_node CPUs per node.
+#
+# seed used for random number generation: seed=${seed}.
+#
+# Format is as follows:
+#  test-string*expected-return-code*expected-output
+#
+END
+
+# Re-seed the random number generator
+RANDOM=$seed
+
+echo "# Testing a wrong configuration" >> $outfile
+write foo 255 ""
+
+echo "# Testing the 'all' syntax" >> $outfile
+write "all" 0 "cpumap: any cpu"
+write "nodes:all" 0 "cpumap: any cpu"
+write "all,nodes:all" 0 "cpumap: any cpu"
+write "all,^nodes:0,all" 0 "cpumap: any cpu"
+
+echo "# Testing the empty cpumap case" >> $outfile
+write "^0" 0 "cpumap: none"
+
+echo "# A few attempts of pinning to just one random cpu" >> $outfile
+if [ $nr_cpus -gt 1 ]; then
+    for i in `seq 0 3`; do
+        cpu=$(($RANDOM % nr_cpus))
+        write "$cpu" 0 "cpumap: $cpu"
+    done
+fi
+
+echo "# A few attempts of pinning to all but one random cpu" >> $outfile
+if [ $nr_cpus -gt 2 ]; then
+    for i in `seq 0 3`; do
+        cpu=$(($RANDOM % nr_cpus))
+        if [ $cpu -eq 0 ]; then
+            expected_range="1-$((nr_cpus - 1))"
+        elif [ $cpu -eq 1 ]; then
+            expected_range="0,2-$((nr_cpus - 1))"
+        elif [ $cpu -eq $((nr_cpus - 2)) ]; then
+            expected_range="0-$((cpu - 1)),$((nr_cpus - 1))"
+        elif [ $cpu -eq $((nr_cpus - 1)) ]; then
+            expected_range="0-$((nr_cpus - 2))"
+        else
+            expected_range="0-$((cpu - 1)),$((cpu + 1))-$((nr_cpus - 1))"
+        fi
+        write "all,^$cpu" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a random range of cpus" >> $outfile
+if [ $nr_cpus -gt 2 ]; then
+    for i in `seq 0 3`; do
+        cpua=$(($RANDOM % nr_cpus))
+        range=$((nr_cpus - cpua))
+        cpub=$(($RANDOM % range))
+        cpubb=$((cpua + cpub))
+        if [ $cpua -eq $cpubb ]; then
+            expected_range="$cpua"
+        else
+            expected_range="$cpua-$cpubb"
+        fi
+        write "$expected_range" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to just one random node" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        write "nodes:$node" 0 "cpumap: $((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1)-1))"
+    done
+fi
+
+echo "# A few attempts of pinning to all but one random node" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        if [ $node -eq 0 ]; then
+            expected_range="$nr_cpus_per_node-$((nr_cpus - 1))"
+        elif [ $node -eq $((nr_nodes - 1)) ]; then
+            expected_range="0-$((nr_cpus - nr_cpus_per_node - 1))"
+        else
+            expected_range="0-$((nr_cpus_per_node*node-1)),$((nr_cpus_per_node*(node+1)))-$nr_cpus"
+        fi
+        write "all,^nodes:$node" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a random range of nodes" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        nodea=$(($RANDOM % nr_nodes))
+        range=$((nr_nodes - nodea))
+        nodeb=$(($RANDOM % range))
+        nodebb=$((nodea + nodeb))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        if [ $nodea -eq 0 ] && [ $nodebb -eq $((nr_nodes - 1)) ]; then
+            expected_range="any cpu"
+        else
+            expected_range="$((nr_cpus_per_node*nodea))-$((nr_cpus_per_node*(nodebb+1) - 1))"
+        fi
+        write "nodes:$nodea-$nodebb" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a node but excluding one random cpu" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        cpu=$(($RANDOM % nr_cpus_per_node + nr_cpus_per_node*node))
+        if [ $cpu -eq $((nr_cpus_per_node*node)) ]; then
+            expected_range="$((nr_cpus_per_node*node + 1))-$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*node + 1)) ]; then
+            expected_range="$((nr_cpus_per_node*node)),$((nr_cpus_per_node*node + 2))-$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 2)) ]; then
+            expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 3)),$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 1)) ]; then
+            expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 2))"
+        else
+            expected_range="$((nr_cpus_per_node*node))-$((cpu - 1)),$((cpu + 1))-$((nr_cpus_per_node*(node+1) - 1))"
+        fi
+        write "nodes:$node,^$cpu" 0 "cpumap: $expected_range"
+    done
+fi
+
+run $outfile
diff --git a/tools/libxl/check-xl-vcpupin-parse.data-example b/tools/libxl/check-xl-vcpupin-parse.data-example
new file mode 100644
index 0000000..c9cf660
--- /dev/null
+++ b/tools/libxl/check-xl-vcpupin-parse.data-example
@@ -0,0 +1,53 @@
+# WARNING: some of these tests are topology based tests.
+# Expect failures if the topology is not detected correctly
+# detected topology: 16 CPUs, 2 nodes, 8 CPUs per node.
+#
+# seed used for random number generation: seed=13328.
+#
+# Format is as follows:
+#  test-string*expected-return-code*expected-output
+#
+# Testing a wrong configuration
+foo*255*
+# Testing the 'all' syntax
+all*0*cpumap: any cpu
+nodes:all*0*cpumap: any cpu
+all,nodes:all*0*cpumap: any cpu
+all,^nodes:0,all*0*cpumap: any cpu
+# Testing the empty cpumap case
+^0*0*cpumap: none
+# A few attempts of pinning to just one random cpu
+0*0*cpumap: 0
+9*0*cpumap: 9
+6*0*cpumap: 6
+0*0*cpumap: 0
+# A few attempts of pinning to all but one random cpu
+all,^12*0*cpumap: 0-11,13-15
+all,^6*0*cpumap: 0-5,7-15
+all,^3*0*cpumap: 0-2,4-15
+all,^7*0*cpumap: 0-6,8-15
+# A few attempts of pinning to a random range of cpus
+13-15*0*cpumap: 13-15
+7*0*cpumap: 7
+3-5*0*cpumap: 3-5
+8-11*0*cpumap: 8-11
+# A few attempts of pinning to just one random node
+nodes:1*0*cpumap: 8-15
+nodes:0*0*cpumap: 0-7
+nodes:0*0*cpumap: 0-7
+nodes:0*0*cpumap: 0-7
+# A few attempts of pinning to all but one random node
+all,^nodes:0*0*cpumap: 8-15
+all,^nodes:1*0*cpumap: 0-7
+all,^nodes:1*0*cpumap: 0-7
+all,^nodes:0*0*cpumap: 8-15
+# A few attempts of pinning to a random range of nodes
+nodes:1-1*0*cpumap: 8-15
+nodes:1-1*0*cpumap: 8-15
+nodes:0-1*0*cpumap: any cpu
+nodes:0-0*0*cpumap: 0-7
+# A few attempts of pinning to a node but excluding one random cpu
+nodes:1,^8*0*cpumap: 9-15
+nodes:0,^6*0*cpumap: 0-5,7
+nodes:1,^9*0*cpumap: 8,10-15
+nodes:0,^5*0*cpumap: 0-4,6-7
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
Before this change, each vcpu had its own vcpu-affinity (also called pinning), while the whole domain had a NUMA node-affinity. Of course, as the (credit) scheduler schedules vcpus and not whole domains, this means that all the vcpus of a domain had the same NUMA node-affinity. This change is the first step toward overcoming this limitation. It adds the data structures for storing the node-affinity on a per-vcpu basis (along with allocating and initializing them). As far as this change alone is concerned, there is no specific way to change the node-affinity of a vcpu to something which is not automatically computed (based on its vcpu-affinity). Such logic is introduced in subsequent commits. Also, now that each vcpu has its own node-affinity, in case the domain's node-affinity is set to 'automatically computed', we build it up as the union of all the node-affinities of all the vcpus of the domain. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- xen/common/domain.c | 39 ++++++++++++++++++++++++++++++++----- xen/common/keyhandler.c | 6 +++++- xen/common/schedule.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++ xen/include/xen/sched.h | 10 +++++++++ 4 files changed, 99 insertions(+), 6 deletions(-) diff --git a/xen/common/domain.c b/xen/common/domain.c index af31ab4..8d2ff49 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -128,6 +128,7 @@ struct vcpu *alloc_vcpu( if ( !zalloc_cpumask_var(&v->cpu_affinity) || !zalloc_cpumask_var(&v->cpu_affinity_tmp) || !zalloc_cpumask_var(&v->cpu_affinity_saved) || + !zalloc_cpumask_var(&v->node_affinity) || !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) ) goto fail_free; @@ -159,6 +160,7 @@ struct vcpu *alloc_vcpu( free_cpumask_var(v->cpu_affinity); free_cpumask_var(v->cpu_affinity_tmp); free_cpumask_var(v->cpu_affinity_saved); + free_cpumask_var(v->node_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); return NULL; @@ -353,7 +355,7 @@ void
domain_update_node_affinity(struct domain *d) cpumask_var_t online_affinity; const cpumask_t *online; struct vcpu *v; - unsigned int node; + unsigned int cpu; if ( !zalloc_cpumask_var(&cpumask) ) return; @@ -367,9 +369,36 @@ void domain_update_node_affinity(struct domain *d) spin_lock(&d->node_affinity_lock); + /* + * Let's prepare the cpumask that will be used below to actually update + * the node-affinity of the whole domain. Each vcpu has a vcpu-affinity + * and a numa-affinity. What gets built in cpumask (and used below) is + * the union of all the (online) cpus in all the vcpus' numa-affinity + * masks. + * + * In turn, the numa-affinity mask of the i-th vcpu (say, 'v') is + * either derived directly from the vcpu's vcpu-affinity mask (in case + * v->auto_node_affinity is true) or has its own value, (potentially) + * completely independent from v->cpu_affinity. In the former case, it + * is here that we make sure the two affinity masks match (since this + * function gets called upon each modification of v->cpu_affinity made + * in vcpu_set_affinity()); in the latter + * case, we just leave v->node_affinity alone.
+ */ for_each_vcpu ( d, v ) { - cpumask_and(online_affinity, v->cpu_affinity, online); + if ( v->auto_node_affinity ) + { + cpumask_clear(v->node_affinity); + for_each_cpu ( cpu, v->cpu_affinity ) + cpumask_or(v->node_affinity, v->node_affinity, + &node_to_cpumask(cpu_to_node(cpu))); + + cpumask_and(online_affinity, v->node_affinity, online); + } + else + cpumask_copy(online_affinity, v->node_affinity); + cpumask_or(cpumask, cpumask, online_affinity); } @@ -383,9 +412,8 @@ void domain_update_node_affinity(struct domain *d) if ( d->auto_node_affinity ) { nodes_clear(d->node_affinity); - for_each_online_node ( node ) - if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) - node_set(node, d->node_affinity); + for_each_cpu ( cpu, cpumask ) + node_set(cpu_to_node(cpu), d->node_affinity); } sched_set_node_affinity(d, &d->node_affinity); @@ -734,6 +762,7 @@ static void complete_domain_destroy(struct rcu_head *head) { free_cpumask_var(v->cpu_affinity); free_cpumask_var(v->cpu_affinity_tmp); + free_cpumask_var(v->node_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); } diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c index 8e4b3f8..8d5e8b2 100644 --- a/xen/common/keyhandler.c +++ b/xen/common/keyhandler.c @@ -297,7 +297,11 @@ static void dump_domains(unsigned char key) cpuset_print(tmpstr, sizeof(tmpstr), v->vcpu_dirty_cpumask); printk("dirty_cpus=%s ", tmpstr); cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_affinity); - printk("cpu_affinity=%s\n", tmpstr); + printk("cpu_affinity=%s ", tmpstr); + cpuset_print(tmpstr, sizeof(tmpstr), v->node_affinity); + printk("node_affinity=%s%s\n", + v->auto_node_affinity ? 
"(auto)" : "(manual)", + tmpstr); printk(" pause_count=%d pause_flags=%lx\n", atomic_read(&v->pause_count), v->pause_flags); arch_dump_vcpu_info(v); diff --git a/xen/common/schedule.c b/xen/common/schedule.c index 0f45f07..b3966ad 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -198,6 +198,9 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor) else cpumask_setall(v->cpu_affinity); + v->auto_node_affinity = 1; + cpumask_copy(v->node_affinity, v->cpu_affinity); + /* Initialise the per-vcpu timers. */ init_timer(&v->periodic_timer, vcpu_periodic_timer_fn, v, v->processor); @@ -684,6 +687,53 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) return 0; } +int vcpu_set_node_affinity(struct vcpu *v, const nodemask_t *nodes) +{ + nodemask_t online_nodes; + int node; + + nodes_and(online_nodes, node_online_map, *nodes); + + /* Having no affinity at all is just wrong */ + if ( nodes_empty(online_nodes) ) + return -EINVAL; + + spin_lock(&v->domain->node_affinity_lock); + + /* + * Explicitly saying "all nodes" is not particularly useful here. + * Let''s use it as the `reset numa-affinity to auto'' command. + */ + if ( nodes_full(*nodes) ) + { + v->auto_node_affinity = 1; + goto out; + } + + /* + * When someone asks for a specific numa-affinity for a vcpu we need to + * clear auto_node_affinity, convert the nodemask in online_nodes + * into a cpumask_t and store it in node_affinity. + */ + v->auto_node_affinity = 0; + + cpumask_clear(v->node_affinity); + for_each_node_mask( node, online_nodes ) + cpumask_or(v->node_affinity, v->node_affinity, + &node_to_cpumask(node)); + +out: + spin_unlock(&v->domain->node_affinity_lock); + + /* + * Changing the numa-affinity of a vcpu calls for an update + * of the node-affinity of the whole domain. + */ + domain_update_node_affinity(v->domain); + + return 0; +} + /* Block the currently-executing domain until a pertinent event occurs. 
*/ void vcpu_block(void) { diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 25bf637..732d6b6 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -172,6 +172,8 @@ struct vcpu /* VCPU need affinity restored */ bool_t affinity_broken; + /* is node_affinity (below) automatically computed from vcpu-affinity? */ + bool_t auto_node_affinity; /* * > 0: a single port is being polled; @@ -197,6 +199,13 @@ struct vcpu /* Used to restore affinity across S3. */ cpumask_var_t cpu_affinity_saved; + /* + * Bitmask of CPUs on which this VCPU prefers to run. For both this + * and auto_node_affinity access is serialized against + * v->domain->node_affinity_lock. + */ + cpumask_var_t node_affinity; + /* Bitmask of CPUs which are holding onto this VCPU''s state. */ cpumask_var_t vcpu_dirty_cpumask; @@ -740,6 +749,7 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c); void vcpu_force_reschedule(struct vcpu *v); int cpu_disable_scheduler(unsigned int cpu); int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity); +int vcpu_set_node_affinity(struct vcpu *v, const nodemask_t *nodes); void restore_vcpu_affinity(struct domain *d); void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 06/12] xen: numa-sched: domain node-affinity always comes from vcpu node-affinity
Now that we have per-vcpu node-affinity, we can do the following: * always consider the domain's node-affinity as 'automatically computed'; * always construct it out of the domain's vcpus' own node-affinities. That means that, if someone wants to change the node-affinity of a domain, it is the node-affinities of all the domain's vcpus that need to be modified. This change modifies domain_set_node_affinity() in such a way that it does exactly that, i.e., it goes through all the domain's vcpus and sets their node-affinity to the specified mask. This means that, "seen from the outside", nothing changes: you call domain_set_node_affinity(), passing a nodemask_t to it, and you get (1) that mask as the node-affinity for the domain (which basically means on what NUMA nodes memory is allocated), and (2) all the vcpus of the domain preferring to run on the pcpus from the nodes in that mask, exactly as it was before this commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- xen/common/domain.c | 48 ++++++++++++++------------------------------- xen/common/sched_credit.c | 3 +-- xen/include/xen/sched.h | 2 -- 3 files changed, 16 insertions(+), 37 deletions(-) diff --git a/xen/common/domain.c b/xen/common/domain.c index 8d2ff49..366d9b9 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -228,7 +228,6 @@ struct domain *domain_create( spin_lock_init(&d->node_affinity_lock); d->node_affinity = NODE_MASK_ALL; - d->auto_node_affinity = 1; spin_lock_init(&d->shutdown_lock); d->shutdown_code = -1; @@ -403,18 +402,13 @@ void domain_update_node_affinity(struct domain *d) } /* - * If d->auto_node_affinity is true, the domain's node-affinity mask - * (d->node_affinity) is automaically computed from all the domain's - * vcpus' vcpu-affinity masks (the union of which we have just built - * above in cpumask). OTOH, if d->auto_node_affinity is false, we - * must leave the node-affinity of the domain alone.
+ * A domain''s node-affinity is just the union of all the domain''s vcpus'' + * numa-affinity masks, which is exactly what we have in cpumask + * (although, we need to convert it from cpumask to nodemask, of course). */ - if ( d->auto_node_affinity ) - { - nodes_clear(d->node_affinity); - for_each_cpu ( cpu, cpumask ) - node_set(cpu_to_node(cpu), d->node_affinity); - } + nodes_clear(d->node_affinity); + for_each_cpu ( cpu, cpumask ) + node_set(cpu_to_node(cpu), d->node_affinity); sched_set_node_affinity(d, &d->node_affinity); @@ -425,33 +419,21 @@ void domain_update_node_affinity(struct domain *d) } +/* Sets the numa-affinity (via vcpu_set_node_affinity() ) for all + * the vcpus of the domain. */ int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity) { - /* Being affine with no nodes is just wrong */ - if ( nodes_empty(*affinity) ) - return -EINVAL; - - spin_lock(&d->node_affinity_lock); + struct vcpu *v; + int rc = 0; - /* - * Being/becoming explicitly affine to all nodes is not particularly - * useful. Let''s take it as the `reset node affinity` command. 
- */ - if ( nodes_full(*affinity) ) + for_each_vcpu ( d, v ) { - d->auto_node_affinity = 1; - goto out; + rc = vcpu_set_node_affinity(v, affinity); + if ( rc ) + break; } - d->auto_node_affinity = 0; - d->node_affinity = *affinity; - -out: - spin_unlock(&d->node_affinity_lock); - - domain_update_node_affinity(d); - - return 0; + return rc; } diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index 28dafcf..c53a36b 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -311,8 +311,7 @@ static inline int __vcpu_has_node_affinity(const struct vcpu *vc, const struct domain *d = vc->domain; const struct csched_dom *sdom = CSCHED_DOM(d); - if ( d->auto_node_affinity - || cpumask_full(sdom->node_affinity_cpumask) + if (cpumask_full(sdom->node_affinity_cpumask) || !cpumask_intersects(sdom->node_affinity_cpumask, mask) ) return 0; diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 732d6b6..d8e4735 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -325,8 +325,6 @@ struct domain /* Does this guest need iommu mappings? */ bool_t need_iommu; #endif - /* is node-affinity automatically computed? */ - bool_t auto_node_affinity; /* Is this guest fully privileged (aka dom0)? */ bool_t is_privileged; /* Which guest this guest has privileges on */
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
instead of relying on the domain-wide node-affinity. To achieve that, we use the specific vcpu's node-affinity when computing the NUMA load balancing mask in csched_balance_cpumask(). As a side effect, this simplifies the code quite a bit. In fact, prior to this change, we needed to cache the translation of d->node_affinity (which is a nodemask_t) to a cpumask_t, since that is what is needed during the actual scheduling (we used to keep it in node_affinity_cpumask). Since the per-vcpu node-affinity is maintained in a cpumask_t field (v->node_affinity) already, we don't need that complicated updating logic in place any longer, and this change can hence remove stuff like sched_set_node_affinity(), csched_set_node_affinity() and, of course, node_affinity_cpumask from csched_dom. The high level description of NUMA placement and scheduling in docs/misc/xl-numa-placement.markdown is updated too, to match the new behavior. While at it, an attempt is made to make that document as unambiguous as possible with respect to the concepts of vCPU pinning, domain node-affinity and vCPU node-affinity. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/misc/xl-numa-placement.markdown | 124 +++++++++++++++++++++++----------- xen/common/domain.c | 2 - xen/common/sched_credit.c | 62 ++--------------- xen/common/schedule.c | 5 - xen/include/xen/sched-if.h | 2 - xen/include/xen/sched.h | 1 6 files changed, 94 insertions(+), 102 deletions(-) diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown index caa3fec..890b856 100644 --- a/docs/misc/xl-numa-placement.markdown +++ b/docs/misc/xl-numa-placement.markdown @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually defined as a set of processor cores (typically a physical CPU package) and the memory directly attached to the set of cores.
-The Xen hypervisor deals with NUMA machines by assigning to each domain -a "node affinity", i.e., a set of NUMA nodes of the host from which they -get their memory allocated. Also, even if the node affinity of a domain -is allowed to change on-line, it is very important to "place" the domain -correctly when it is fist created, as the most of its memory is allocated -at that time and can not (for now) be moved easily. - NUMA awareness becomes very important as soon as many domains start running memory-intensive workloads on a shared host. In fact, the cost of accessing non node-local memory locations is very high, and the @@ -27,13 +20,37 @@ performance degradation is likely to be noticeable. For more information, have a look at the [Xen NUMA Introduction][numa_intro] page on the Wiki. +## Xen and NUMA machines: the concept of _node-affinity_ ## + +The Xen hypervisor deals with NUMA machines through the concept of +_node-affinity_. When talking about node-affinity, it is important to +distinguish two different situations. The node-affinity *of a domain* is +the set of NUMA nodes of the host where the memory for the domain is +being allocated (mostly, at domain creation time). The node-affinity *of +a vCPU* is the set of NUMA nodes of the host where the vCPU prefers to +run, and the (credit1 only, for now) scheduler will try to accomplish +that, whenever it is possible. + +Of course, despite the fact that they belong to and affect different +subsystems, domain and vCPU node-affinities are related. In fact, the +node-affinity of a domain is the union of the node-affinities of all the +domain's vCPUs. + +The above means that, when changing the vCPU node-affinity, the domain +node-affinity also changes. Although this is allowed to happen +on-line (i.e., when a domain is already running), it will not result +in the memory that has already been allocated being moved to a different +host NUMA node.
This is why it is very important to "place" the domain +correctly when it is first created, as most of its memory is allocated +at that time and cannot (for now) be moved easily. + ### Placing via pinning and cpupools ### The simplest way of placing a domain on a NUMA node is statically pinning -the domain's vCPUs to the pCPUs of the node. This goes under the name of -CPU affinity and can be set through the "cpus=" option in the config file -(more about this below). Another option is to pool together the pCPUs -spanning the node and put the domain in such a cpupool with the "pool=" +the domain's vCPUs to the pCPUs of the node. This also goes under the name +of vCPU-affinity and can be set through the "cpus=" option in the config +file (more about this below). Another option is to pool together the pCPUs +spanning the node and put the domain in such a _cpupool_ with the "pool=" config option (as documented in our [Wiki][cpupools_howto]). In both the above cases, the domain will not be able to execute outside @@ -45,24 +62,45 @@ may come at the cost of some load imbalances. ### NUMA aware scheduling ### -If the credit scheduler is in use, the concept of node affinity defined -above does not only apply to memory. In fact, starting from Xen 4.3, the -scheduler always tries to run the domain's vCPUs on one of the nodes in -its node affinity. Only if that turns out to be impossible, it will just -pick any free pCPU. +If the credit scheduler is in use, and starting from Xen 4.3, the scheduler +always tries to run the domain's vCPUs on one of the nodes in their +node-affinity. Only if that turns out to be impossible will it pick any +free pCPU. +Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity, +potentially different from the ones of all the other vCPUs of the domain. -This is, therefore, something more flexible than CPU affinity, as a domain -can still run everywhere, it just prefers some nodes rather than others.
+This is, therefore, something more flexible than vCPU pinning, as vCPUs +can still run everywhere; they just prefer some nodes rather than others. Locality of access is less guaranteed than in the pinning case, but that comes along with better chances to exploit all the host resources (e.g., the pCPUs). -In fact, if all the pCPUs in a domain's node affinity are busy, it is -possible for the domain to run outside of there, but it is very likely that -slower execution (due to remote memory accesses) is still better than no -execution at all, as it would happen with pinning. For this reason, NUMA -aware scheduling has the potential of bringing substantial performances -benefits, although this will depend on the workload. +In fact, if all the pCPUs in a vCPU's node-affinity are busy, it is +possible for the vCPU to run outside of there. The idea is that +slower execution (due to remote memory accesses) is still better than +no execution at all (as would happen with pinning). For this reason, +NUMA aware scheduling has the potential of bringing substantial +performance benefits, although this will depend on the workload. + +Notice that, for each vCPU, the following three scenarios are possible: + + * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU + node-affinity. In this case, the vCPU is always scheduled on one + of the pCPUs to which it is pinned, without any specific preference + among them. Internally, the vCPU's node-affinity is just + automatically computed from the vCPU pinning, and the scheduler + just ignores it; + * a vCPU *has* its own vCPU node-affinity and *is not* pinned to + any particular pCPU. In this case, the vCPU can run on every pCPU. + Nevertheless, the scheduler will try to have it running on one of + the pCPUs of the node(s) it has node-affinity with; + * a vCPU *has* its own vCPU node-affinity and *is also* pinned to + some pCPUs.
In this case, the vCPU is always scheduled on one of the + pCPUs to which it is pinned, with, among them, a preference for the + ones that are from the node(s) it has node-affinity with. In case + pinning and node-affinity form two disjoint sets of pCPUs, pinning + "wins", and the node-affinity, although it is still used to derive + the domain's node-affinity (for memory allocation), is, from the + scheduler's perspective, just ignored. ## Guest placement in xl ## @@ -71,25 +109,23 @@ both manual or automatic placement of them across the host's NUMA nodes. Note that xm/xend does a very similar thing, the only differences being the details of the heuristics adopted for automatic placement (see below), -and the lack of support (in both xm/xend and the Xen versions where that\ +and the lack of support (in both xm/xend and the Xen versions where that was the default toolstack) for NUMA aware scheduling. ### Placing the guest manually ### Thanks to the "cpus=" option, it is possible to specify where a domain should be created and scheduled on, directly in its config file. This -affects NUMA placement and memory accesses as the hypervisor constructs -the node affinity of a VM basing right on its CPU affinity when it is -created. +affects NUMA placement and memory accesses as, in this case, the hypervisor +constructs the node-affinity of a VM based directly on its vCPU pinning +when it is created. This is very simple and effective, but requires the user/system -administrator to explicitly specify affinities for each and every domain, +administrator to explicitly specify the pinning of each and every domain, or Xen won't be able to guarantee the locality for their memory accesses. -Notice that this also pins the domain's vCPUs to the specified set of -pCPUs, so it not only sets the domain's node affinity (its memory will -come from the nodes to which the pCPUs belong), but at the same time -forces the vCPUs of the domain to be scheduled on those same pCPUs.
+That, of course, also means the vCPUs of the domain will only be able to +execute on those same pCPUs. ### Placing the guest automatically ### @@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries to figure out on its own on which node(s) the domain could fit best. If it finds one (some), the domain's node affinity gets set to there, and both memory allocations and NUMA aware scheduling (for the credit -scheduler and starting from Xen 4.3) will comply with it. +scheduler and starting from Xen 4.3) will comply with it. Starting from +Xen 4.4, this just means all the vCPUs of the domain will have the same +vCPU node-affinity, which is the outcome of such automatic "fitting" +procedure. It is worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of a host is an incarnation of the Bin Packing Problem. In fact, @@ -143,22 +182,29 @@ any placement from happening: libxl_defbool_set(&domain_build_info->numa_placement, false); Also, if `numa_placement` is set to `true`, the domain must not -have any CPU affinity (i.e., `domain_build_info->cpumap` must -have all its bits set, as it is by default), or domain creation -will fail returning `ERROR_INVAL`. +have any vCPU pinning (i.e., `domain_build_info->cpumap` must have +all its bits set, as it is by default), or domain creation will fail +with an `ERROR_INVAL`. Starting from Xen 4.3, in case automatic placement happens (and is successful), it will affect the domain's node affinity and _not_ its -CPU affinity. Namely, the domain's vCPUs will not be pinned to any +vCPU pinning. Namely, the domain's vCPUs will not be pinned to any pCPU on the host, but the memory from the domain will come from the selected node(s) and the NUMA aware scheduling (if the credit scheduler -is in use) will try to keep the domain there as much as possible. +is in use) will try to keep the domain's vCPUs there as much as possible.
Apart from that, for looking at and/or tweaking the placement algorithm, search "Automatic NUMA placement" in libxl\_internal.h. Note this may change in future versions of Xen/libxl. +## Xen < 4.4 ## + +The concept of per-vCPU node-affinity has been introduced for the first +time in Xen 4.4. In Xen versions earlier than that, the node-affinity is +the same for the whole domain, that is to say the same for all the vCPUs +of the domain. + ## Xen < 4.3 ## As NUMA aware scheduling is a new feature of Xen 4.3, things are a little diff --git a/xen/common/domain.c b/xen/common/domain.c index 366d9b9..ae29945 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d) for_each_cpu ( cpu, cpumask ) node_set(cpu_to_node(cpu), d->node_affinity); - sched_set_node_affinity(d, &d->node_affinity); - spin_unlock(&d->node_affinity_lock); free_cpumask_var(online_affinity); diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index c53a36b..d228127 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -178,9 +178,6 @@ struct csched_dom { struct list_head active_vcpu; struct list_head active_sdom_elem; struct domain *dom; - /* cpumask translated from the domain's node-affinity. - * Basically, the CPUs we prefer to be scheduled on. */ - cpumask_var_t node_affinity_cpumask; uint16_t active_vcpu_count; uint16_t weight; uint16_t cap; @@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc) list_del_init(&svc->runq_elem); } -/* - * Translates node-affinity mask into a cpumask, so that we can use it during - * actual scheduling. That of course will contain all the cpus from all the - * set nodes in the original node-affinity mask. - * - * Note that any serialization needed to access mask safely is complete - * responsibility of the caller of this function/hook.
- */ -static void csched_set_node_affinity( - const struct scheduler *ops, - struct domain *d, - nodemask_t *mask) -{ - struct csched_dom *sdom; - int node; - - /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ - if ( unlikely(is_idle_domain(d)) ) - return; - - sdom = CSCHED_DOM(d); - cpumask_clear(sdom->node_affinity_cpumask); - for_each_node_mask( node, *mask ) - cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, - &node_to_cpumask(node)); -} #define for_each_csched_balance_step(step) \ for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ ) @@ -294,12 +265,12 @@ static void csched_set_node_affinity( /* * vcpu-affinity balancing is always necessary and must never be skipped. - * OTOH, if a domain''s node-affinity is said to be automatically computed - * (or if it just spans all the nodes), we can safely avoid dealing with - * node-affinity entirely. + * OTOH, if the vcpu''s numa-affinity is being automatically computed out of + * the vcpu''s vcpu-affinity, or if it just spans all the nodes, we can + * safely avoid dealing with numa-affinity entirely. * - * Node-affinity is also deemed meaningless in case it has empty - * intersection with mask, to cover the cases where using the node-affinity + * A vcpu''s numa-affinity is also deemed meaningless in case it has empty + * intersection with mask, to cover the cases where using the numa-affinity * mask seems legit, but would instead led to trying to schedule the vcpu * on _no_ pcpu! 
Typical use cases are for mask to be equal to the vcpu''s * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus @@ -308,11 +279,9 @@ static void csched_set_node_affinity( static inline int __vcpu_has_node_affinity(const struct vcpu *vc, const cpumask_t *mask) { - const struct domain *d = vc->domain; - const struct csched_dom *sdom = CSCHED_DOM(d); - - if (cpumask_full(sdom->node_affinity_cpumask) - || !cpumask_intersects(sdom->node_affinity_cpumask, mask) ) + if ( vc->auto_node_affinity == 1 + || cpumask_full(vc->node_affinity) + || !cpumask_intersects(vc->node_affinity, mask) ) return 0; return 1; @@ -330,8 +299,7 @@ csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) { if ( step == CSCHED_BALANCE_NODE_AFFINITY ) { - cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask, - vc->cpu_affinity); + cpumask_and(mask, vc->node_affinity, vc->cpu_affinity); if ( unlikely(cpumask_empty(mask)) ) cpumask_copy(mask, vc->cpu_affinity); @@ -1110,13 +1078,6 @@ csched_alloc_domdata(const struct scheduler *ops, struct domain *dom) if ( sdom == NULL ) return NULL; - if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) - { - xfree(sdom); - return NULL; - } - cpumask_setall(sdom->node_affinity_cpumask); - /* Initialize credit and weight */ INIT_LIST_HEAD(&sdom->active_vcpu); INIT_LIST_HEAD(&sdom->active_sdom_elem); @@ -1146,9 +1107,6 @@ csched_dom_init(const struct scheduler *ops, struct domain *dom) static void csched_free_domdata(const struct scheduler *ops, void *data) { - struct csched_dom *sdom = data; - - free_cpumask_var(sdom->node_affinity_cpumask); xfree(data); } @@ -1975,8 +1933,6 @@ const struct scheduler sched_credit_def = { .adjust = csched_dom_cntl, .adjust_global = csched_sys_cntl, - .set_node_affinity = csched_set_node_affinity, - .pick_cpu = csched_cpu_pick, .do_schedule = csched_schedule, diff --git a/xen/common/schedule.c b/xen/common/schedule.c index b3966ad..454f27d 100644 --- a/xen/common/schedule.c +++ 
b/xen/common/schedule.c @@ -648,11 +648,6 @@ int cpu_disable_scheduler(unsigned int cpu) return ret; } -void sched_set_node_affinity(struct domain *d, nodemask_t *mask) -{ - SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); -} - int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) { cpumask_t online_affinity; diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h index d95e254..4164dff 100644 --- a/xen/include/xen/sched-if.h +++ b/xen/include/xen/sched-if.h @@ -158,8 +158,6 @@ struct scheduler { struct xen_domctl_scheduler_op *); int (*adjust_global) (const struct scheduler *, struct xen_sysctl_scheduler_op *); - void (*set_node_affinity) (const struct scheduler *, - struct domain *, nodemask_t *); void (*dump_settings) (const struct scheduler *); void (*dump_cpu_state) (const struct scheduler *, int); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index d8e4735..83f89c7 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -595,7 +595,6 @@ void sched_destroy_domain(struct domain *d); int sched_move_domain(struct domain *d, struct cpupool *c); long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *); long sched_adjust_global(struct xen_sysctl_scheduler_op *); -void sched_set_node_affinity(struct domain *, nodemask_t *); int sched_id(void); void sched_tick_suspend(void); void sched_tick_resume(void);
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 08/12] xen: numa-sched: enable getting/specifying per-vcpu node-affinity
via two new DOMCTLs: getvcpunodeaffinity and setvcpunodeaffinity. They're very similar to XEN_DOMCTL_{get,set}vcpuaffinity (with the only exception that they take a nodemap instead of a cpumap). Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxc/xc_domain.c | 8 ++++--- xen/common/domctl.c | 47 ++++++++++++++++++++++++++++++++++++++----- xen/include/public/domctl.h | 8 +++++-- xen/xsm/flask/hooks.c | 2 ++ 4 files changed, 54 insertions(+), 11 deletions(-) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 2cea6e3..f8825a3 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -219,9 +219,9 @@ int xc_vcpu_setaffinity(xc_interface *xch, memcpy(local, cpumap, cpusize); - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); - domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8; + domctl.u.vcpuaffinity.map.nr_bits = cpusize * 8; ret = do_domctl(xch, &domctl); @@ -260,8 +260,8 @@ int xc_vcpu_getaffinity(xc_interface *xch, domctl.domain = (domid_t)domid; domctl.u.vcpuaffinity.vcpu = vcpu; - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); - domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8; + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + domctl.u.vcpuaffinity.map.nr_bits = cpusize * 8; ret = do_domctl(xch, &domctl); diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 870eef1..6bc3f26 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -584,7 +584,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) break; case XEN_DOMCTL_setvcpuaffinity: - case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_setvcpunodeaffinity: { struct vcpu *v; @@ -600,8 +600,8 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) { cpumask_var_t new_affinity; - ret = xenctl_bitmap_to_cpumask( - &new_affinity, &op->u.vcpuaffinity.cpumap); + ret = xenctl_bitmap_to_cpumask(&new_affinity, +
&op->u.vcpuaffinity.map); if ( !ret ) { ret = vcpu_set_affinity(v, new_affinity); @@ -610,8 +610,45 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) } else { - ret = cpumask_to_xenctl_bitmap( - &op->u.vcpuaffinity.cpumap, v->cpu_affinity); + nodemask_t new_affinity; + + ret = xenctl_bitmap_to_nodemask(&new_affinity, + &op->u.vcpuaffinity.map); + if ( !ret ) + ret = vcpu_set_node_affinity(v, &new_affinity); + } + } + break; + + case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_getvcpunodeaffinity: + { + struct vcpu *v; + + ret = -EINVAL; + if ( op->u.vcpuaffinity.vcpu >= d->max_vcpus ) + break; + + ret = -ESRCH; + if ( (v = d->vcpu[op->u.vcpuaffinity.vcpu]) == NULL ) + break; + + if ( op->cmd == XEN_DOMCTL_getvcpuaffinity ) + { + ret = cpumask_to_xenctl_bitmap(&op->u.vcpuaffinity.map, + v->cpu_affinity); + } + else + { + nodemask_t affinity; + int cpu; + + nodes_clear(affinity); + for_each_cpu ( cpu, v->node_affinity ) + node_set(cpu_to_node(cpu), affinity); + + ret = nodemask_to_xenctl_bitmap(&op->u.vcpuaffinity.map, + &affinity); } } break; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index d4e479f..78b6d70 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -290,12 +290,14 @@ typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); -/* Get/set which physical cpus a vcpu can execute on. */ +/* Get/set which physical cpus a vcpu can or prefer execute on. 
*/ /* XEN_DOMCTL_setvcpuaffinity */ /* XEN_DOMCTL_getvcpuaffinity */ +/* XEN_DOMCTL_setvcpunodeaffinity */ +/* XEN_DOMCTL_getvcpunodeaffinity */ struct xen_domctl_vcpuaffinity { uint32_t vcpu; /* IN */ - struct xenctl_bitmap cpumap; /* IN/OUT */ + struct xenctl_bitmap map; /* IN/OUT */ }; typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t); @@ -932,6 +934,8 @@ struct xen_domctl { #define XEN_DOMCTL_setnodeaffinity 68 #define XEN_DOMCTL_getnodeaffinity 69 #define XEN_DOMCTL_set_max_evtchn 70 +#define XEN_DOMCTL_setvcpunodeaffinity 71 +#define XEN_DOMCTL_getvcpunodeaffinity 72 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c index b1e2593..a9b2401 100644 --- a/xen/xsm/flask/hooks.c +++ b/xen/xsm/flask/hooks.c @@ -616,10 +616,12 @@ static int flask_domctl(struct domain *d, int cmd) return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__UNPAUSE); case XEN_DOMCTL_setvcpuaffinity: + case XEN_DOMCTL_setvcpunodeaffinity: case XEN_DOMCTL_setnodeaffinity: return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__SETAFFINITY); case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_getvcpunodeaffinity: case XEN_DOMCTL_getnodeaffinity: return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__GETAFFINITY);
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
by providing the proper get/set interfaces and wiring them to the new domctls from the previous commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxc/xc_domain.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 19 +++++++++++ 2 files changed, 101 insertions(+) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index f8825a3..e44ad2e 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -272,6 +272,88 @@ out: return ret; } +int xc_vcpu_setnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if ( local == NULL ) + { + PERROR("Could not allocate memory for setvcpunodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_setvcpunodeaffinity; + domctl.domain = (domid_t)domid; + domctl.u.vcpuaffinity.vcpu = vcpu; + + memcpy(local, nodemap, nodesize); + + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + + domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + xc_hypercall_buffer_free(xch, local); + + out: + return ret; +} + +int xc_vcpu_getnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if (local == NULL) + { + PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity; + domctl.domain = (domid_t)domid; + 
domctl.u.vcpuaffinity.vcpu = vcpu; + + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + memcpy(nodemap, local, nodesize); + + xc_hypercall_buffer_free(xch, local); +out: + return ret; +} + int xc_domain_get_guest_width(xc_interface *xch, uint32_t domid, unsigned int *guest_width) { diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 8cf3f3b..208fa2c 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch, uint32_t domind, xc_nodemap_t nodemap); +/** + * These functions set and retrieves the NUMA node-affinity + * of a specific vcpu. + * + * @parm xch a handle to an open hypervisor interface. + * @parm domid the domain id one is interested in. + * @parm vcpu the vcpu one wants to set/get the affinity of. + * @parm nodemap the map of the affine nodes. + * @return 0 on success, -1 on failure. + */ +int xc_vcpu_setnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap); +int xc_vcpu_getnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap); + int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu,
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
by providing the proper get/set interfaces and wiring them to the new libxc calls from the previous commit. For the ''get'' part, the node-affinity of all the vcpus of a domain is also reported via libxl_list_vcpu() (exactly as it is happening for their vcpu affinity already), adding a specific field to libxl_vcpuinfo. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxl/libxl.c | 28 ++++++++++++++++++++++++++++ tools/libxl/libxl.h | 11 +++++++++++ tools/libxl/libxl_types.idl | 1 + 3 files changed, 40 insertions(+) diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index 0f0f56c..4a36e70 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4208,6 +4208,10 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap"); return NULL; } + if (libxl_node_bitmap_alloc(ctx, &ptr->nodemap, 0)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating nodemap"); + return NULL; + } if (xc_vcpu_getinfo(ctx->xch, domid, *nb_vcpu, &vcpuinfo) == -1) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu info"); return NULL; @@ -4216,6 +4220,10 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu affinity"); return NULL; } + if (xc_vcpu_getnodeaffinity(ctx->xch, domid, *nb_vcpu, ptr->nodemap.map) == -1) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node-affinity"); + return NULL; + } ptr->vcpuid = *nb_vcpu; ptr->cpu = vcpuinfo.cpu; ptr->online = !!vcpuinfo.online; @@ -4251,6 +4259,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, return rc; } +int libxl_set_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + const libxl_bitmap *nodemap) +{ + if (xc_vcpu_setnodeaffinity(ctx->xch, domid, vcpuid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting vcpu node-affinity"); + return ERROR_FAIL; + } + return 0; +} + +int libxl_get_vcpunodeaffinity(libxl_ctx *ctx, 
uint32_t domid, uint32_t vcpuid, + libxl_bitmap *nodemap) +{ + if (xc_vcpu_getnodeaffinity(ctx->xch, domid, vcpuid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu node-affinity"); + return ERROR_FAIL; + } + return 0; +} + int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap) { diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index 1c6675d..d5b8ade 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -82,6 +82,13 @@ #define LIBXL_HAVE_DOMAIN_NODEAFFINITY 1 /* + * LIBXL_HAVE_VCPUINFO_NODEAFFINITY indicates that a ''nodemap'' field + * (of libxl_bitmap type) is present in libxl_vcpuinfo, containing the + * node-affinity for the vcpu. + */ +#define LIBXL_HAVE_VCPUINFO_NODEAFFINITY + +/* * LIBXL_HAVE_BUILDINFO_HVM_VENDOR_DEVICE indicates that the * libxl_vendor_device field is present in the hvm sections of * libxl_domain_build_info. This field tells libxl which @@ -985,6 +992,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, libxl_bitmap *cpumap); int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, unsigned int max_vcpus, libxl_bitmap *cpumap); +int libxl_set_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + const libxl_bitmap *nodemap); +int libxl_get_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + libxl_bitmap *nodemap); int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap); int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index 5c43d6f..5c83f98 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -492,6 +492,7 @@ libxl_vcpuinfo = Struct("vcpuinfo", [ ("running", bool), ("vcpu_time", uint64), # total vcpu time ran (ns) ("cpumap", libxl_bitmap), # current cpu''s affinities + ("nodemap", libxl_bitmap), # current node-affinity ], dir=DIR_OUT) libxl_physinfo = 
Struct("physinfo", [
Dario Faggioli
2013-Nov-05 14:36 UTC
[PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
by showing it, upon request (''-n'' switch), as a part of the output of `xl vcpu-list''. Such output looks like this: # xl vcpu-list -n Name ID VCPU CPU State Time(s) CPU Affinity / NUMA Affinity Domain-0 0 0 8 -b- 4.4 any cpu / any node Domain-0 0 1 0 -b- 1.2 any cpu / any node vm-test 1 0 8 -b- 2.4 any cpu / 1 vm-test 1 1 13 -b- 3.3 any cpu / 1 The "vcpu-pinning / vcpu-affinity" part may indeed look weird, as it is different from the other fields. Unfortunately, it''s very hard to properly format it in columns, since there is no such things as minimal or maximal lengths for the bitmaps being printed. Specifying a particular node-affinity happens via a new xl command, `xl vcpu-node-affinity''. Introducing a "numa mode", to `xl vcpu-pin'' was the other option, but doing that would probably fuel the confusion between vcpu pinning and NUMA node-affinity, and that is certainly something we do not want. While at it, the implementation of `xl vcpu-list'' is reworked a little bit, making it more similar to the one of `xl list'' and more compliant with the libxl programming paradigm (e.g., regarding allocating at the beginning and freeing on exit). xl manual page is updated accordingly. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/man/xl.pod.1 | 25 +++++ tools/libxl/xl.h | 1 tools/libxl/xl_cmdimpl.c | 210 ++++++++++++++++++++++++++++++++++++--------- tools/libxl/xl_cmdtable.c | 9 ++ 4 files changed, 201 insertions(+), 44 deletions(-) diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1 index e7b9de2..e1894b4 100644 --- a/docs/man/xl.pod.1 +++ b/docs/man/xl.pod.1 @@ -614,11 +614,21 @@ quietly ignored. Some guests may need to actually bring the newly added CPU online after B<vcpu-set>, go to B<SEE ALSO> section for information. -=item B<vcpu-list> [I<domain-id>] +=item B<vcpu-list> [I<OPTIONS>] [I<domain-id>, ...] Lists VCPU information for a specific domain. If no domain is specified, VCPU information for all domains will be provided. 
+B<OPTIONS> + +=over 4 + +=item B<-n>, B<--numa> + +Print the NUMA node-affinity of each VCPU. + +=back + =item B<vcpu-pin> I<domain-id> I<vcpu> I<cpus> Pins the VCPU to only run on the specific CPUs. The keyword @@ -630,6 +640,19 @@ different run state is appropriate. Pinning can be used to restrict this, by ensuring certain VCPUs can only run on certain physical CPUs. +=item B<vcpu-node-affinity> I<domain-id> I<vcpu> I<nodes> + +Specifies the I<nodes> on which the I<vcpu> prefers to run. The +keyword B<all> can be used to set this to I<nodes> for all the +VCPUs in the domain. + +Normally VCPUs can float within the set of CPUs that they are pinned +to (see B<vcpu-pin> above). Specifying a node-affinity does not change +that, but, if using the credit scheduler, the VCPUs will strongly prefer +running on the NUMA nodes they have node-affinity with. If using any +scheduler other than credit, node-affinity is just ignored (credit2 +support is planned but not ready yet). + =item B<vm-list> Prints information about guests. This list excludes information about diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h index e005c39..b708199 100644 --- a/tools/libxl/xl.h +++ b/tools/libxl/xl.h @@ -59,6 +59,7 @@ int main_create(int argc, char **argv); int main_config_update(int argc, char **argv); int main_button_press(int argc, char **argv); int main_vcpupin(int argc, char **argv); +int main_vcpuaff(int argc, char **argv); int main_vcpuset(int argc, char **argv); int main_memmax(int argc, char **argv); int main_memset(int argc, char **argv); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index f6239d5..1659259 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -542,10 +542,32 @@ static int parse_range(const char *str, unsigned long *a, unsigned long *b) } /* - * Add or removes a specific set of cpus (specified in str, either as - * single cpus or as entire NUMA nodes) to/from cpumap. + * Set or reset some bits from bitmap, as specified in str. 
+ * + * If str * is "all", bitmap gets fully set. If str contains a specific + * value or range (e.g., "2" or "4-6"), the corresponding bits are set (or + * unset, if the value/range is prefixed with a "^"). In this case, we don''t + * really care whether the value range is supposed to represent cpus or nodes + * (that is to say whether bitmap is supposed to be a nodemap or a cpumap), + * we just set and unset bits. + * + * In case str is prefixed by "nodes:" (or "^nodes:") we assume that the + * caller is dealing with a cpumap and wants to specify what bits to (re)set + * NUMA node-wise. We therefore act as follows: + * - consider the value/range a node or a set of nodes; + * - obtain the cpumap corresponding to such node/set of nodes; + * - set (or reset, in the "^" case) in bitmap what ended up + * in such cpumap. + * + * This means that, for instance, if calling this with str="3-6", bits + * 3-6 in bitmap will be set, and it''s up to the caller to know whether + * that meant cpus 3,4,5,6 or NUMA nodes 3,4,5,6. On the other hand, if + * calling this with str="nodes:2", bits corresponding to all the cpus + * of NUMA node 2 will be the ones that are set in bitmap. + * + * This may look tricky, but it avoids quite a bit of code duplication. */ -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap) { unsigned long ida, idb; libxl_bitmap node_cpumap; @@ -571,7 +593,7 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) } if (STR_HAS_PREFIX(str, "all")) { - libxl_bitmap_set_any(cpumap); + libxl_bitmap_set_any(bitmap); goto out; } @@ -595,13 +617,13 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) /* Add/Remove all the cpus in the node cpumap */ libxl_for_each_set_bit(i, node_cpumap) { - is_not ? libxl_bitmap_reset(cpumap, i) : - libxl_bitmap_set(cpumap, i); + is_not ? 
libxl_bitmap_reset(bitmap, i) : + libxl_bitmap_set(bitmap, i); } } else { /* Add/Remove this cpu */ - is_not ? libxl_bitmap_reset(cpumap, ida) : - libxl_bitmap_set(cpumap, ida); + is_not ? libxl_bitmap_reset(bitmap, ida) : + libxl_bitmap_set(bitmap, ida); } ida++; } @@ -612,18 +634,25 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) } /* - * Takes a string representing a set of cpus (specified either as - * single cpus or as eintire NUMA nodes) and turns it into the - * corresponding libxl_bitmap (in cpumap). + * Takes a string representing one or more values or range of values + * separated by '','' (e.g., something like "3,2-6,8"), split it in substrings + * and, for each one of them, update bitmap to reflect that. This may result + * in bits in bitmap being both set or reset, since it is possible to append + * a not prefix ("^") to both values and ranges. + * + * It is also possible for the caller to indicate that a specific value or + * range should be treated specially, i.e., resulting in not just one bit + * (or not just the specified range of bits) to be set or reset. For details + * on that, see update_bitmap_range() above. 
*/ -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) +static int parse_bitmap_range(char *cpu, libxl_bitmap *bitmap) { char *ptr, *saveptr = NULL; int rc = 0; for (ptr = strtok_r(cpu, ",", &saveptr); ptr; ptr = strtok_r(NULL, ",", &saveptr)) { - rc = update_cpumap_range(ptr, cpumap); + rc = update_bitmap_range(ptr, bitmap); if (rc) break; } @@ -810,7 +839,7 @@ static void parse_config_data(const char *config_source, } libxl_bitmap_set_none(&b_info->cpumap); - if (vcpupin_parse(buf2, &b_info->cpumap)) + if (parse_bitmap_range(buf2, &b_info->cpumap)) exit(1); free(buf2); @@ -4483,7 +4512,9 @@ int main_button_press(int argc, char **argv) static void print_vcpuinfo(uint32_t tdomid, const libxl_vcpuinfo *vcpuinfo, - uint32_t nr_cpus) + uint32_t nr_cpus, + uint32_t nr_nodes, + int numa) { char *domname; @@ -4505,10 +4536,18 @@ static void print_vcpuinfo(uint32_t tdomid, printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9)); /* CPU AFFINITY */ print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout); + /* NUMA Affinity*/ + if (numa) { + printf(" / "); + print_nodemap(vcpuinfo->nodemap.map, nr_nodes, stdout); + } printf("\n"); } -static void print_domain_vcpuinfo(uint32_t domid, uint32_t nr_cpus) +static void print_domain_vcpuinfo(uint32_t domid, + uint32_t nr_cpus, + uint32_t nr_nodes, + int numa) { libxl_vcpuinfo *vcpuinfo; int i, nb_vcpu, nrcpus; @@ -4521,55 +4560,67 @@ static void print_domain_vcpuinfo(uint32_t domid, uint32_t nr_cpus) } for (i = 0; i < nb_vcpu; i++) { - print_vcpuinfo(domid, &vcpuinfo[i], nr_cpus); + print_vcpuinfo(domid, &vcpuinfo[i], nr_cpus, nr_nodes, numa); } libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu); } -static void vcpulist(int argc, char **argv) +int main_vcpulist(int argc, char **argv) { libxl_dominfo *dominfo; libxl_physinfo physinfo; - int i, nb_domain; + int opt; + int numa = 0; + static struct option opts[] = { + {"numa", 0, 0, ''n''}, + COMMON_LONG_OPTS, + {0, 0, 0, 0} + }; + int rc = -1; + + SWITCH_FOREACH_OPT(opt, "n", opts, 
"vcpu-list", 0) { + case ''n'': + numa = 1; + break; + } if (libxl_get_physinfo(ctx, &physinfo) != 0) { fprintf(stderr, "libxl_physinfo failed.\n"); - goto vcpulist_out; + goto out; } - printf("%-32s %5s %5s %5s %5s %9s %s\n", + printf("%-32s %5s %5s %5s %5s %9s %s", "Name", "ID", "VCPU", "CPU", "State", "Time(s)", "CPU Affinity"); - if (!argc) { + if (numa) + printf(" / NUMA Affinity"); + printf("\n"); + + if (optind >= argc) { + int i, nb_domain; + if (!(dominfo = libxl_list_domain(ctx, &nb_domain))) { fprintf(stderr, "libxl_list_domain failed.\n"); - goto vcpulist_out; + goto out; } - for (i = 0; i<nb_domain; i++) - print_domain_vcpuinfo(dominfo[i].domid, physinfo.nr_cpus); + for (i = 0; i < nb_domain; i++) + print_domain_vcpuinfo(dominfo[i].domid, physinfo.nr_cpus, + physinfo.nr_nodes, numa); libxl_dominfo_list_free(dominfo, nb_domain); } else { - for (; argc > 0; ++argv, --argc) { - uint32_t domid = find_domain(*argv); - print_domain_vcpuinfo(domid, physinfo.nr_cpus); + for (; argc > optind; ++optind) { + uint32_t domid = find_domain(argv[optind]); + print_domain_vcpuinfo(domid, physinfo.nr_cpus, + physinfo.nr_nodes, numa); } } - vcpulist_out: - libxl_physinfo_dispose(&physinfo); -} - -int main_vcpulist(int argc, char **argv) -{ - int opt; - - SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-list", 0) { - /* No options */ - } - vcpulist(argc - optind, argv + optind); - return 0; + rc = 0; + out: + libxl_physinfo_dispose(&physinfo); + return rc; } static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) @@ -4595,7 +4646,7 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) goto out; - if (vcpupin_parse(cpu, &cpumap)) + if (parse_bitmap_range(cpu, &cpumap)) goto out; if (dryrun_only) { @@ -4646,6 +4697,69 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) return rc; } +static int vcpuaff(uint32_t domid, const char *vcpu, char *node) +{ + libxl_bitmap nodemap; + uint32_t vcpuid; + char 
*endptr; + int rc = -1; + + libxl_bitmap_init(&nodemap); + + vcpuid = strtoul(vcpu, &endptr, 10); + if (vcpu == endptr) { + if (strcmp(vcpu, "all")) { + fprintf(stderr, "Error: Invalid argument\n"); + goto out; + } + vcpuid = -1; + } + + if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) + goto out; + + if (parse_bitmap_range(node, &nodemap)) + goto out; + + if (dryrun_only) { + int nr_nodes = 0; + libxl_numainfo *info = libxl_get_numainfo(ctx, &nr_nodes); + + if (!info) { + fprintf(stderr, "libxl_get_numainfo failed\n"); + goto out; + } + libxl_numainfo_list_free(info, nr_nodes); + + fprintf(stdout, "nodemap: "); + print_nodemap(nodemap.map, nr_nodes, stdout); + fprintf(stdout, "\n"); + + if (ferror(stdout) || fflush(stdout)) { + perror("stdout"); + exit(-1); + } + + rc = 0; + goto out; + } + + if (vcpuid != -1) { + if (libxl_set_vcpunodeaffinity(ctx, domid, vcpuid, &nodemap)) + fprintf(stderr, "Could not set node-affinity for vcpu `%u''\n", + vcpuid); + goto out; + } + + if (libxl_domain_set_nodeaffinity(ctx, domid, &nodemap)) + fprintf(stderr, "libxl_domain_set_nodeaffinity failed\n"); + + rc = 0; + out: + libxl_bitmap_dispose(&nodemap); + return rc; +} + int main_vcpupin(int argc, char **argv) { int opt; @@ -4657,6 +4771,18 @@ int main_vcpupin(int argc, char **argv) return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); } +int main_vcpuaff(int argc, char **argv) +{ + int opt; + + SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-node-affinity", 3) { + /* No options */ + } + + vcpuaff(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); + return 0; +} + static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host) { char *endptr; diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c index d3dcbf0..d3dd7e6 100644 --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -208,13 +208,20 @@ struct cmd_spec cmd_table[] = { { "vcpu-list", &main_vcpulist, 0, 0, "List the VCPUs for all/some domains", - "[Domain, ...]", + 
"[option] [Domain, ...]", + "-n, --numa Show NUMA node-affinity", + }, { "vcpu-pin", &main_vcpupin, 1, 1, "Set which CPUs a VCPU can use", "<Domain> <VCPU|all> <CPUs|all>", }, + { "vcpu-node-affinity", + &main_vcpuaff, 1, 1, + "Set on which NUMA nodes a VCPU prefers to run", + "<Domain> <VCPU|all> <Nodes|all>", + }, { "vcpu-set", &main_vcpuset, 0, 1, "Set the number of active VCPUs allowed for the domain",
Dario Faggioli
2013-Nov-05 14:36 UTC
[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
in a similar way to how it is possible to specify vcpu-affinity. Manual page is updated accordingly. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/man/xl.cfg.pod.5 | 70 +++++++++++++++++++++++++++++++++------ tools/libxl/libxl_dom.c | 18 ++++++---- tools/libxl/libxl_numa.c | 14 +------- tools/libxl/libxl_utils.h | 12 ++++++- tools/libxl/xl_cmdimpl.c | 80 ++++++++++++++++++++++++++++++++++++++++++++- 5 files changed, 159 insertions(+), 35 deletions(-) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index 1c98cb4..1212426 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -144,18 +144,64 @@ run on cpu #3 of the host. =back If this option is not specified, no vcpu to cpu pinning is established, -and the vcpus of the guest can run on all the cpus of the host. - -If we are on a NUMA machine (i.e., if the host has more than one NUMA -node) and this option is not specified, libxl automatically tries to -place the guest on the least possible number of nodes. That, however, -will not affect vcpu pinning, so the guest will still be able to run on -all the cpus, it will just prefer the ones from the node it has been -placed on. A heuristic approach is used for choosing the best node (or -set of nodes), with the goals of maximizing performance for the guest -and, at the same time, achieving efficient utilization of host cpus -and memory. See F<docs/misc/xl-numa-placement.markdown> for more -details. +and the vcpus of the guest can run on all the cpus of the host. If this +option is specified, and no B<nodes=> option present the vcpu pinning +mask of each vcpu is utilized to compute its vcpu node-affinity, and the +union of all the vcpus node-affinities is what constitutes the domain +node-affinity (which drives memory allocations). + +=back + +=item B<nodes="NODE-LIST"> + +List of on which NUMA nodes the memory for the guest is allocated. 
This +also means (starting from Xen 4.3, and if the credit scheduler is used) +that the vcpus of the domain prefer to run on those NUMA nodes. By default, +xl (via libxl) makes a guess. A C<NODE-LIST> may be specified as follows: + +=item "all" + +To specify no particular preference and prevent xl from automatically picking +a (set of) NUMA node(s). In practice, using "all", this domain will have +no NUMA node-affinity and its memory will be spread on all the host's +NUMA nodes. + +=item "0-3,5,^1" + +To specify a NUMA node-affinity with the host NUMA nodes 0,2,3,5. +Combining this with "all" is possible, meaning "all,^7" results in the +memory being allocated on all the host NUMA nodes except node 7, as well +as trying to avoid running the domain's vcpus on the pcpus that belong to +node 7. + +=item ["1", "4"] (or [1, 4]) + +To ask for a specific vcpu to NUMA node mapping. That means (in this example) +memory will be allocated on host NUMA nodes 1 and 4 but, at the same time, +vcpu #0 of the guest prefers to run on the pcpus of host NUMA node 1, while +vcpu #1 on the pcpus of host NUMA node 4. + +=back + +If this option is not specified, xl picks a NUMA node (or a set of NUMA +nodes), according to some heuristics, and uses that as the NUMA node-affinity +for the guest. + +If we are on a NUMA machine (i.e., if the host has more than one NUMA node) +and this option is not specified, libxl automatically tries to place the +guest on the least possible number of nodes. A heuristic approach is used +for choosing the best node (or set of nodes), with the goals of maximizing +performance for the guest and, at the same time, achieving efficient +utilization of host cpus and memory. In this case, all the vcpus of the +guest will have the same vcpu node-affinity. 
+ +Notice that, independently from whether the node-affinity is specified +via this parameter, or automatically decided by libxl, that does not affect +vcpu pinning, so the guest will still be able to run on all the cpus to +which its vcpus are pinned, or all the cpus, if no B<cpus=> option is +provided. + +See F<docs/misc/xl-numa-placement.markdown> for more details. =back diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 1812bdc..bc4cf9a 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -215,19 +215,21 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, } /* - * Check if the domain has any CPU affinity. If not, try to build - * up one. In case numa_place_domain() find at least a suitable - * candidate, it will affect info->nodemap accordingly; if it - * does not, it just leaves it as it is. This means (unless - * some weird error manifests) the subsequent call to - * libxl_domain_set_nodeaffinity() will do the actual placement, + * Check if the domain has any pinning or node-affinity and, if not, try + * to build up one. + * + * In case numa_place_domain() find at least a suitable candidate, it will + * affect info->nodemap accordingly; if it does not, it just leaves it as + * it is. This means (unless some weird error manifests) the subsequent + * call to libxl_domain_set_nodeaffinity() will do the actual placement, * whatever that turns out to be. 
*/ if (libxl_defbool_val(info->numa_placement)) { - if (!libxl_bitmap_is_full(&info->cpumap)) { + if (!libxl_bitmap_is_full(&info->cpumap) || + !libxl_bitmap_is_full(&info->nodemap)) { LOG(ERROR, "Can run NUMA placement only if no vcpu " - "affinity is specified"); + "pinning or node-affinity is specified"); return ERROR_INVAL; } diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c index 20c99ac..1026579 100644 --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, int vcpus_on_node[]) { libxl_dominfo *dinfo = NULL; - libxl_bitmap dom_nodemap, nodes_counted; + libxl_bitmap nodes_counted; int nr_doms, nr_cpus; int i, j, k; @@ -197,12 +197,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, return ERROR_FAIL; } - if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { - libxl_bitmap_dispose(&nodes_counted); - libxl_dominfo_list_free(dinfo, nr_doms); - return ERROR_FAIL; - } - for (i = 0; i < nr_doms; i++) { libxl_vcpuinfo *vinfo; int nr_dom_vcpus; @@ -211,9 +205,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, if (vinfo == NULL) continue; - /* Retrieve the domain''s node-affinity map */ - libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap); - for (j = 0; j < nr_dom_vcpus; j++) { /* * For each vcpu of each domain, it must have both vcpu-affinity @@ -225,7 +216,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, int node = tinfo[k].node; if (libxl_bitmap_test(suitable_cpumap, k) && - libxl_bitmap_test(&dom_nodemap, node) && + libxl_bitmap_test(&vinfo[j].nodemap, node) && !libxl_bitmap_test(&nodes_counted, node)) { libxl_bitmap_set(&nodes_counted, node); vcpus_on_node[node]++; @@ -236,7 +227,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus); } - libxl_bitmap_dispose(&dom_nodemap); 
libxl_bitmap_dispose(&nodes_counted); libxl_dominfo_list_free(dinfo, nr_doms); return 0; diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h index 7b84e6a..cac057c 100644 --- a/tools/libxl/libxl_utils.h +++ b/tools/libxl/libxl_utils.h @@ -90,7 +90,7 @@ static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap) { memset(bitmap->map, 0, bitmap->size); } -static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit) +static inline int libxl_bitmap_valid(libxl_bitmap *bitmap, int bit) { return bit >= 0 && bit < (bitmap->size * 8); } @@ -125,6 +125,16 @@ static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx, return libxl_bitmap_alloc(ctx, nodemap, max_nodes); } +static inline int libxl_bitmap_cpu_valid(libxl_bitmap *cpumap, int cpu) +{ + return libxl_bitmap_valid(cpumap, cpu); +} + +static inline int libxl_bitmap_node_valid(libxl_bitmap *nodemap, int node) +{ + return libxl_bitmap_valid(nodemap, node); +} + /* Populate cpumap with the cpus spanned by the nodes in nodemap */ int libxl_nodemap_to_cpumap(libxl_ctx *ctx, const libxl_bitmap *nodemap, diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 1659259..a035162 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -76,8 +76,9 @@ xlchild children[child_max]; static const char *common_domname; static int fd_lock = -1; -/* Stash for specific vcpu to pcpu mappping */ +/* Stash for specific vcpu to pcpu and vcpu to node mappping */ static int *vcpu_to_pcpu; +static int *vcpu_to_node; static const char savefileheader_magic[32] "Xen saved domain, xl format\n \0 \r"; @@ -670,7 +671,7 @@ static void parse_config_data(const char *config_source, const char *buf; long l; XLU_Config *config; - XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms; + XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms; XLU_ConfigList *ioports, *irqs, *iomem; int num_ioports, num_irqs, num_iomem; int pci_power_mgmt = 0; @@ -846,6 +847,53 @@ 
static void parse_config_data(const char *config_source,
         libxl_defbool_set(&b_info->numa_placement, false);
     }
+    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 1)) {
+        int n_cpus = 0;
+
+        if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        /*
+         * As above, use a temporary storage for the single vcpus'
+         * node-affinities.
+         */
+        vcpu_to_node = xmalloc(sizeof(int) * b_info->max_vcpus);
+        memset(vcpu_to_node, -1, sizeof(int) * b_info->max_vcpus);
+
+        libxl_bitmap_set_none(&b_info->nodemap);
+        while ((buf = xlu_cfg_get_listitem(nodes, n_cpus)) != NULL) {
+            i = atoi(buf);
+            if (!libxl_bitmap_node_valid(&b_info->nodemap, i)) {
+                fprintf(stderr, "node %d illegal\n", i);
+                exit(1);
+            }
+            libxl_bitmap_set(&b_info->nodemap, i);
+            if (n_cpus < b_info->max_vcpus)
+                vcpu_to_node[n_cpus] = i;
+            n_cpus++;
+        }
+
+        /* We have a nodemap, disable automatic placement */
+        libxl_defbool_set(&b_info->numa_placement, false);
+    }
+    else if (!xlu_cfg_get_string (config, "nodes", &buf, 0)) {
+        char *buf2 = strdup(buf);
+
+        if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        libxl_bitmap_set_none(&b_info->nodemap);
+        if (parse_bitmap_range(buf2, &b_info->nodemap))
+            exit(1);
+        free(buf2);
+
+        libxl_defbool_set(&b_info->numa_placement, false);
+    }
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
@@ -2205,6 +2253,34 @@ start:
         free(vcpu_to_pcpu); vcpu_to_pcpu = NULL;
     }
+    /* And do the same for single vcpu to node-affinity mapping */
+    if (vcpu_to_node) {
+        libxl_bitmap vcpu_nodemap;
+
+        ret = libxl_node_bitmap_alloc(ctx, &vcpu_nodemap, 0);
+        if (ret)
+            goto error_out;
+        for (i = 0; i < d_config.b_info.max_vcpus; i++) {
+
+            if (vcpu_to_node[i] != -1) {
+                libxl_bitmap_set_none(&vcpu_nodemap);
+                libxl_bitmap_set(&vcpu_nodemap, vcpu_to_node[i]);
+            } else {
+                libxl_bitmap_set_any(&vcpu_nodemap);
+            }
+            if (libxl_set_vcpunodeaffinity(ctx, domid, i, &vcpu_nodemap)) {
+                fprintf(stderr, "setting node-affinity failed"
+                        " on vcpu `%d'.\n", i);
+                libxl_bitmap_dispose(&vcpu_nodemap);
+                free(vcpu_to_node);
+                ret = ERROR_FAIL;
+                goto error_out;
+            }
+        }
+        libxl_bitmap_dispose(&vcpu_nodemap);
+        free(vcpu_to_node); vcpu_to_node = NULL;
+    }
+
     ret = libxl_userdata_store(ctx, domid, "xl", config_data, config_len);
     if (ret) {
George Dunlap
2013-Nov-05 14:43 UTC
Re: [PATCH RESEND 01/12] xen: numa-sched: leave node-affinity alone if not in "auto" mode
On 11/05/2013 02:34 PM, Dario Faggioli wrote:
> If the domain's NUMA node-affinity is being specified by the
> user/toolstack (instead of being automatically computed by Xen),
> we really should stick to that. This means domain_update_node_affinity()
> is wrong when it filters out some stuff from there even in "!auto"
> mode.
>
> This commit fixes that. Of course, this does not mean node-affinity
> is always honoured (e.g., a vcpu won't run on a pcpu of a different
> cpupool) but the necessary logic for taking into account all the
> possible situations lives in the scheduler code, where it belongs.
>
> What could happen without this change is that, under certain
> circumstances, the node-affinity of a domain may change when the
> user modifies the vcpu-affinity of the domain's vcpus. This, even
> if probably not a real bug, is at least something the user does
> not expect, so let's avoid it.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
> ---
> This has been submitted already as a single patch on its own.
> Since this series needs the change done here, just include it
> in here, instead of pinging the original submission and deferring
> posting this series.

Just reiterating what I said on the last send... this one is independent
and can be checked in whenever you're ready.
-George

> ---
>  xen/common/domain.c |   28 +++++++++-------------------
>  1 file changed, 9 insertions(+), 19 deletions(-)
>
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 5999779..af31ab4 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -352,7 +352,6 @@ void domain_update_node_affinity(struct domain *d)
>      cpumask_var_t cpumask;
>      cpumask_var_t online_affinity;
>      const cpumask_t *online;
> -    nodemask_t nodemask = NODE_MASK_NONE;
>      struct vcpu *v;
>      unsigned int node;
>
> @@ -374,28 +373,19 @@ void domain_update_node_affinity(struct domain *d)
>          cpumask_or(cpumask, cpumask, online_affinity);
>      }
>
> +    /*
> +     * If d->auto_node_affinity is true, the domain's node-affinity mask
> +     * (d->node_affinity) is automatically computed from all the domain's
> +     * vcpus' vcpu-affinity masks (the union of which we have just built
> +     * above in cpumask). OTOH, if d->auto_node_affinity is false, we
> +     * must leave the node-affinity of the domain alone.
> +     */
>      if ( d->auto_node_affinity )
>      {
> -        /* Node-affinity is automaically computed from all vcpu-affinities */
> +        nodes_clear(d->node_affinity);
>          for_each_online_node ( node )
>              if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -                node_set(node, nodemask);
> -
> -        d->node_affinity = nodemask;
> -    }
> -    else
> -    {
> -        /* Node-affinity is provided by someone else, just filter out cpus
> -         * that are either offline or not in the affinity of any vcpus. */
> -        nodemask = d->node_affinity;
> -        for_each_node_mask ( node, d->node_affinity )
> -            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -                node_clear(node, nodemask);//d->node_affinity);
> -
> -        /* Avoid loosing track of node-affinity because of a bad
> -         * vcpu-affinity has been specified. */
> -        if ( !nodes_empty(nodemask) )
> -            d->node_affinity = nodemask;
> +                node_set(node, d->node_affinity);
>      }
>
>      sched_set_node_affinity(d, &d->node_affinity);
>
George Dunlap
2013-Nov-05 14:50 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
On 11/05/2013 02:34 PM, Dario Faggioli wrote:
> Making it possible to use something like the following:
>  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
>  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
>  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
>    but not pCPU 6;
>  * ...
>
> In both domain config file and `xl vcpu-pin'.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Overall looks like a pretty clean patch; just a few comments.

> ---
> Picking this up from a previously submitted series ("xl:
> allow for node-wise specification of vcpu pinning") as the
> changes in that and in this series would otherwise be
> conflicting. If this is considered fine, feel free to apply
> it from here and skip the corresponding e-mail in the
> original submission.
> ---
> Changes from v2:
>  * turned a 'return' into 'goto out', consistently with
>    most of the exit patterns;
>  * harmonized error handling: now parse_range() returns a
>    libxl error code, as requested during review;
>  * dealing with "all" moved inside update_cpumap_range().
>    It's tricky to move it in parse_range() (as requested
>    during review), since we need the cpumap being modified
>    handy when dealing with it. However, having it in
>    update_cpumap_range() simplifies the code just as much
>    as that;
>  * explicitly checking for junk after a valid value or range
>    in parse_range(), as requested during review;
>  * xl exits on parsing failing, so no need to reset the
>    cpumap to something sensible in vcpupin_parse(), as
>    suggested during review;
>
> Changes from v1:
>  * code rearranged in order to look more simple to follow
>    and understand, as requested during review;
>  * improved docs in xl.cfg.pod.5, as requested during
>    review;
>  * strtoul() now returns into unsigned long, and the
>    case where it returns ULONG_MAX is now taken into
>    account, as requested during review;
>  * stuff like "all,^7" now works, as requested during
>    review.
>    Specifying just "^7" does not work either
>    before or after this change
>  * killed some magic (i.e., `ptr += 5 + (ptr[4] == 's'`)
>    by introducing STR_SKIP_PREFIX() macro, as requested
>    during review.
> ---
>  docs/man/xl.cfg.pod.5    |   20 ++++++
>  tools/libxl/xl_cmdimpl.c |  145 ++++++++++++++++++++++++++++++++--------------
>  2 files changed, 121 insertions(+), 44 deletions(-)
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> index d2d8921..1c98cb4 100644
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -115,7 +115,25 @@ To allow all the vcpus of the guest to run on all the cpus on the host.
>
>  =item "0-3,5,^1"
>
> -To allow all the vcpus of the guest to run on cpus 0,2,3,5.
> +To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining
> +this with "all" is possible, meaning "all,^7" results in all the vcpus
> +of the guest running on all the cpus on the host except cpu 7.
> +
> +=item "nodes:0-3,node:^2"

Here you use both "nodes" and "node", while the code seems to only check
for "nodes".

I was originally going to say we should just check one; but on the other
hand, it's just an extra string compare -- I feel like we might as well
accept either "node" or "nodes". (No need to enforce plurality: "nodes:2"
and "node:1-3" should both be fine with me.)

> +
> +To allow all the vcpus of the guest to run on the cpus from NUMA nodes
> +0,1,3 of the host. So, if cpus 0-3 belong to node 0, cpus 4-7 belong
> +to node 1 and cpus 8-11 to node 3, the above would mean all the vcpus
> +of the guest will run on cpus 0-3,8-11.
> +
> +Combining this notation with the one above is possible. For instance,
> +"1,node:2,^6", means all the vcpus of the guest will run on cpu 1 and
> +on all the cpus of NUMA node 2, but not on cpu 6. Following the same
> +example as above, that would be cpus 1,4,5,7.
> +
> +Combining this with "all" is also possible, meaning "all,^nodes:1"
> +results in all the vcpus of the guest running on all the cpus on the
> +host, except for the cpus belonging to the host NUMA node 1.
>
>  =item ["2", "3"] (or [2, 3])
>
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 40feb7d..b8755b9 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -59,6 +59,11 @@
>      } \
>  })
>
> +#define STR_HAS_PREFIX( a, b ) \
> +    ( strncmp(a, b, strlen(b)) == 0 )
> +#define STR_SKIP_PREFIX( a, b ) \
> +    ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL )
> +
>
>  int logfile = 2;
>
> @@ -513,61 +518,115 @@ static void split_string_into_string_list(const char *str,
>      free(s);
>  }
>
> -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
> +{
> +    char *nstr, *endptr;
> +
> +    *a = *b = strtoul(str, &endptr, 10);
> +    if (endptr == str || *a == ULONG_MAX)
> +        return ERROR_INVAL;
> +
> +    if (*endptr == '-') {
> +        nstr = endptr + 1;
> +
> +        *b = strtoul(nstr, &endptr, 10);
> +        if (endptr == nstr || *b == ULONG_MAX || *b < *a)
> +            return ERROR_INVAL;
> +    }
> +
> +    /* Valid value or range so far, but we also don't want junk after that */
> +    if (*endptr != '\0')
> +        return ERROR_INVAL;
> +
> +    return 0;
> +}
> +
> +/*
> + * Adds or removes a specific set of cpus (specified in str, either as
> + * single cpus or as entire NUMA nodes) to/from cpumap.
> + */
> +static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
>  {
> -    libxl_bitmap exclude_cpumap;
> -    uint32_t cpuida, cpuidb;
> -    char *endptr, *toka, *tokb, *saveptr = NULL;
> -    int i, rc = 0, rmcpu;
> +    unsigned long ida, idb;
> +    libxl_bitmap node_cpumap;
> +    bool is_not = false, is_nodes = false;
> +    int rc = 0;
> +
> +    libxl_bitmap_init(&node_cpumap);
>
> -    if (!strcmp(cpu, "all")) {
> +    rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0);
> +    if (rc) {
> +        fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
> +        goto out;
> +    }
> +
> +    /* Are we adding or removing cpus/nodes? */
> +    if (STR_SKIP_PREFIX(str, "^")) {
> +        is_not = true;
> +    }
> +
> +    /* Are we dealing with cpus or full nodes? */
> +    if (STR_SKIP_PREFIX(str, "nodes:")) {
> +        is_nodes = true;
> +    }
> +
> +    if (STR_HAS_PREFIX(str, "all")) {

Is there any reason not to keep this "strcmp"? As it is, this will
accept any string that *starts* with "all", which isn't exactly what
you want, I don't think.

Other than that, I think it looks good.

 -George
Jan Beulich
2013-Nov-05 14:52 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> @@ -197,6 +199,13 @@ struct vcpu
>      /* Used to restore affinity across S3. */
>      cpumask_var_t cpu_affinity_saved;
>
> +    /*
> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
> +     * and auto_node_affinity access is serialized against
> +     * v->domain->node_affinity_lock.
> +     */
> +    cpumask_var_t node_affinity;

This all looks quite sensible, except for the naming here: We
already have a node_affinity field in struct domain, having a
meaning that one can expect with this name. So you break
both consistency and the rule of least surprise here. How
about just "preferred_cpus"?

Jan
George Dunlap
2013-Nov-05 15:03 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 02:52 PM, Jan Beulich wrote:
>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>> @@ -197,6 +199,13 @@ struct vcpu
>>      /* Used to restore affinity across S3. */
>>      cpumask_var_t cpu_affinity_saved;
>>
>> +    /*
>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
>> +     * and auto_node_affinity access is serialized against
>> +     * v->domain->node_affinity_lock.
>> +     */
>> +    cpumask_var_t node_affinity;
>
> This all looks quite sensible, except for the naming here: We
> already have a node_affinity field in struct domain, having a
> meaning that one can expect with this name. So you break
> both consistency and the rule of least surprise here. How
> about just "preferred_cpus"?

Actually, would it make more sense to remove node_affinity from the
domain struct, and have the tools manually set the node_affinity for the
various vcpus if the user attempts to set the "domain numa affinity"?

-George
Jan Beulich
2013-Nov-05 15:11 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 16:03, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 11/05/2013 02:52 PM, Jan Beulich wrote:
>>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> @@ -197,6 +199,13 @@ struct vcpu
>>>      /* Used to restore affinity across S3. */
>>>      cpumask_var_t cpu_affinity_saved;
>>>
>>> +    /*
>>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
>>> +     * and auto_node_affinity access is serialized against
>>> +     * v->domain->node_affinity_lock.
>>> +     */
>>> +    cpumask_var_t node_affinity;
>>
>> This all looks quite sensible, except for the naming here: We
>> already have a node_affinity field in struct domain, having a
>> meaning that one can expect with this name. So you break
>> both consistency and the rule of least surprise here. How
>> about just "preferred_cpus"?
>
> Actually, would it make more sense to remove node_affinity from the
> domain struct, and have the tools manually set the node_affinity for the
> various vcpus if the user attempts to set the "domain numa affinity"?

Quite likely, yes. But that doesn't mean that the name can be kept
as is.

Jan
George Dunlap
2013-Nov-05 15:11 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 03:03 PM, George Dunlap wrote:
> On 11/05/2013 02:52 PM, Jan Beulich wrote:
>>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com>
>>>>> wrote:
>>> @@ -197,6 +199,13 @@ struct vcpu
>>>      /* Used to restore affinity across S3. */
>>>      cpumask_var_t cpu_affinity_saved;
>>>
>>> +    /*
>>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
>>> +     * and auto_node_affinity access is serialized against
>>> +     * v->domain->node_affinity_lock.
>>> +     */
>>> +    cpumask_var_t node_affinity;
>>
>> This all looks quite sensible, except for the naming here: We
>> already have a node_affinity field in struct domain, having a
>> meaning that one can expect with this name. So you break
>> both consistency and the rule of least surprise here. How
>> about just "preferred_cpus"?
>
> Actually, would it make more sense to remove node_affinity from the
> domain struct, and have the tools manually set the node_affinity for the
> various vcpus if the user attempts to set the "domain numa affinity"?

Sorry, speaking before I had thought it through. Of course we need a
numa affinity for the domain for allocating memory.

How about "cpu_node_affinity"?

Or, we could internally change the names to "cpu_hard_affinity" and
"cpu_soft_affinity", since that's effectively what the scheduler will
do. It's possible someone might want to set soft affinities for some
other reason besides NUMA performance.

-George
Jan Beulich
2013-Nov-05 15:23 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> Or, we could internally change the names to "cpu_hard_affinity" and
> "cpu_soft_affinity", since that's effectively what the scheduler will
> do. It's possible someone might want to set soft affinities for some
> other reason besides NUMA performance.

I like that.

Jan
George Dunlap
2013-Nov-05 15:24 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 03:11 PM, Jan Beulich wrote:
>>>> On 05.11.13 at 16:03, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 11/05/2013 02:52 PM, Jan Beulich wrote:
>>>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>>> @@ -197,6 +199,13 @@ struct vcpu
>>>>      /* Used to restore affinity across S3. */
>>>>      cpumask_var_t cpu_affinity_saved;
>>>>
>>>> +    /*
>>>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
>>>> +     * and auto_node_affinity access is serialized against
>>>> +     * v->domain->node_affinity_lock.
>>>> +     */
>>>> +    cpumask_var_t node_affinity;
>>>
>>> This all looks quite sensible, except for the naming here: We
>>> already have a node_affinity field in struct domain, having a
>>> meaning that one can expect with this name. So you break
>>> both consistency and the rule of least surprise here. How
>>> about just "preferred_cpus"?
>>
>> Actually, would it make more sense to remove node_affinity from the
>> domain struct, and have the tools manually set the node_affinity for the
>> various vcpus if the user attempts to set the "domain numa affinity"?
>
> Quite likely, yes. But that doesn't mean that the name can be kept
> as is.

Right, I see what you're saying -- in the domain struct, node_affinity
is a mask of nodes, but here it's being used as a mask of cpus. Yes,
that needs to change.

-George
George Dunlap
2013-Nov-05 15:39 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 03:23 PM, Jan Beulich wrote:
>>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> Or, we could internally change the names to "cpu_hard_affinity" and
>> "cpu_soft_affinity", since that's effectively what the scheduler will
>> do. It's possible someone might want to set soft affinities for some
>> other reason besides NUMA performance.
>
> I like that.

A potential problem with that is the "auto domain numa" thing. In this
patch, if the domain numa affinity is not set but vcpu numa affinity is,
the domain numa affinity (which will be used to allocate memory for the
domain) will be set based on the vcpu numa affinity. That seems like a
useful feature (though perhaps it's starting to violate the "policy
should be in the tools" principle). If we change this to just "hard
affinity" and "soft affinity", we'll lose the natural logical connection
there. It might have impacts on how we end up doing vNUMA as well.

So I'm a bit torn ATM. Dario, any thoughts?

-George
George Dunlap
2013-Nov-05 16:20 UTC
Re: [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
On 11/05/2013 02:35 PM, Dario Faggioli wrote:
> instead of relying on the domain-wide node-affinity. To achieve
> that, we use the specific vcpu's node-affinity when computing
> the NUMA load balancing mask in csched_balance_cpumask().
>
> As a side effect, this simplifies the code quite a bit. In
> fact, prior to this change, we needed to cache the translation
> of d->node_affinity (which is a nodemask_t) to a cpumask_t,
> since that is what is needed during the actual scheduling
> (we used to keep it in node_affinity_cpumask).
>
> Since the per-vcpu node-affinity is maintained in a cpumask_t
> field (v->node_affinity) already, we don't need that complicated
> updating logic in place any longer, and this change hence can
> remove stuff like sched_set_node_affinity(),
> csched_set_node_affinity() and, of course, node_affinity_cpumask
> from csched_dom.
>
> The high level description of NUMA placement and scheduling in
> docs/misc/xl-numa-placement.markdown is updated too, to match
> the new behavior. While at it, an attempt is made to make such
> document as unambiguous as possible, with respect to the
> concepts of vCPU pinning, domain node-affinity and vCPU
> node-affinity.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
>  docs/misc/xl-numa-placement.markdown |  124 +++++++++++++++++++++++-----------
>  xen/common/domain.c                  |    2 -
>  xen/common/sched_credit.c            |   62 ++---------------
>  xen/common/schedule.c                |    5 -
>  xen/include/xen/sched-if.h           |    2 -
>  xen/include/xen/sched.h              |    1
>  6 files changed, 94 insertions(+), 102 deletions(-)
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> index caa3fec..890b856 100644
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -12,13 +12,6 @@ is quite more complex and slow.
On these machines, a NUMA node is usually
> defined as a set of processor cores (typically a physical CPU package) and
> the memory directly attached to the set of cores.
>
> -The Xen hypervisor deals with NUMA machines by assigning to each domain
> -a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated. Also, even if the node affinity of a domain
> -is allowed to change on-line, it is very important to "place" the domain
> -correctly when it is fist created, as the most of its memory is allocated
> -at that time and can not (for now) be moved easily.
> -
>  NUMA awareness becomes very important as soon as many domains start
>  running memory-intensive workloads on a shared host. In fact, the cost
>  of accessing non node-local memory locations is very high, and the
> @@ -27,13 +20,37 @@ performance degradation is likely to be noticeable.
>  For more information, have a look at the [Xen NUMA Introduction][numa_intro]
>  page on the Wiki.
>
> +## Xen and NUMA machines: the concept of _node-affinity_ ##
> +
> +The Xen hypervisor deals with NUMA machines through the concept of
> +_node-affinity_. When talking about node-affinity, it is important to
> +distinguish two different situations. The node-affinity *of a domain* is
> +the set of NUMA nodes of the host where the memory for the domain is
> +being allocated (mostly, at domain creation time). The node-affinity *of
> +a vCPU* is the set of NUMA nodes of the host where the vCPU prefers to
> +run, and the (credit1 only, for now) scheduler will try to accomplish
> +that, whenever it is possible.
> +
> +Of course, despite the fact that they belong to and affect different
> +subsystems, domain and vCPU node-affinities are related. In fact, the
> +node-affinity of a domain is the union of the node-affinities of all the
> +domain's vCPUs.
> +
> +The above means that, when changing the vCPU node-affinity, the domain
> +node-affinity also changes.
Well, although this is allowed to happen
> +on-line (i.e., when a domain is already running), that will not result
> +in the memory that has already been allocated being moved to a different
> +host NUMA node. This is why it is very important to "place" the domain
> +correctly when it is first created, as most of its memory is allocated
> +at that time and can not (for now) be moved easily.
> +
>  ### Placing via pinning and cpupools ###
>
>  The simplest way of placing a domain on a NUMA node is statically pinning
> -the domain's vCPUs to the pCPUs of the node. This goes under the name of
> -CPU affinity and can be set through the "cpus=" option in the config file
> -(more about this below). Another option is to pool together the pCPUs
> -spanning the node and put the domain in such a cpupool with the "pool="
> +the domain's vCPUs to the pCPUs of the node. This also goes under the name
> +of vCPU-affinity and can be set through the "cpus=" option in the config
> +file (more about this below). Another option is to pool together the pCPUs
> +spanning the node and put the domain in such a _cpupool_ with the "pool="
>  config option (as documented in our [Wiki][cpupools_howto]).
>
>  In both the above cases, the domain will not be able to execute outside
> @@ -45,24 +62,45 @@ may come at he cost of some load imbalances.
>
>  ### NUMA aware scheduling ###
>
> -If the credit scheduler is in use, the concept of node affinity defined
> -above does not only apply to memory. In fact, starting from Xen 4.3, the
> -scheduler always tries to run the domain's vCPUs on one of the nodes in
> -its node affinity. Only if that turns out to be impossible, it will just
> -pick any free pCPU.
> +If using the credit scheduler, and starting from Xen 4.3, the scheduler always
> +tries to run the domain's vCPUs on one of the nodes in their node-affinity.
> +Only if that turns out to be impossible, it will just pick any free pCPU.
> +Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity,
> +potentially different from the ones of all the other vCPUs of the domain.
>
> -This is, therefore, something more flexible than CPU affinity, as a domain
> -can still run everywhere, it just prefers some nodes rather than others.
> +This is, therefore, something more flexible than vCPU pinning, as vCPUs
> +can still run everywhere, they just prefer some nodes rather than others.
>  Locality of access is less guaranteed than in the pinning case, but that
>  comes along with better chances to exploit all the host resources (e.g.,
>  the pCPUs).
>
> -In fact, if all the pCPUs in a domain's node affinity are busy, it is
> -possible for the domain to run outside of there, but it is very likely that
> -slower execution (due to remote memory accesses) is still better than no
> -execution at all, as it would happen with pinning. For this reason, NUMA
> -aware scheduling has the potential of bringing substantial performances
> -benefits, although this will depend on the workload.
> +In fact, if all the pCPUs in a vCPU's node-affinity are busy, it is
> +possible for the domain to run outside of there. The idea is that
> +slower execution (due to remote memory accesses) is still better than
> +no execution at all (as it would happen with pinning). For this reason,
> +NUMA aware scheduling has the potential of bringing substantial
> +performance benefits, although this will depend on the workload.
> +
> +Notice that, for each vCPU, the following three scenarios are possible:
> +
> + * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU
> +   node-affinity. In this case, the vCPU is always scheduled on one
> +   of the pCPUs to which it is pinned, without any specific preference
> +   on which one of them.
Internally, the vCPU's node-affinity is just
> +   automatically computed from the vCPU pinning, and the scheduler
> +   just ignores it;
> + * a vCPU *has* its own vCPU node-affinity and *is not* pinned to
> +   any particular pCPU. In this case, the vCPU can run on every pCPU.
> +   Nevertheless, the scheduler will try to have it running on one of
> +   the pCPUs of the node(s) it has node-affinity with;
> + * a vCPU *has* its own vCPU node-affinity and *is also* pinned to
> +   some pCPUs. In this case, the vCPU is always scheduled on one of the
> +   pCPUs to which it is pinned, with, among them, a preference for the
> +   ones that are from the node(s) it has node-affinity with. In case
> +   pinning and node-affinity form two disjoint sets of pCPUs, pinning
> +   "wins", and the node-affinity, although it is still used to derive
> +   the domain's node-affinity (for memory allocation), is, from the
> +   scheduler's perspective, just ignored.
>
>  ## Guest placement in xl ##
>
> @@ -71,25 +109,23 @@ both manual or automatic placement of them across the host's NUMA nodes.
>
>  Note that xm/xend does a very similar thing, the only differences being
>  the details of the heuristics adopted for automatic placement (see below),
> -and the lack of support (in both xm/xend and the Xen versions where that\
> +and the lack of support (in both xm/xend and the Xen versions where that
>  was the default toolstack) for NUMA aware scheduling.
>
>  ### Placing the guest manually ###
>
>  Thanks to the "cpus=" option, it is possible to specify where a domain
>  should be created and scheduled on, directly in its config file. This
> -affects NUMA placement and memory accesses as the hypervisor constructs
> -the node affinity of a VM basing right on its CPU affinity when it is
> -created.
> +affects NUMA placement and memory accesses as, in this case, the hypervisor
> +constructs the node-affinity of a VM based right on its vCPU pinning
> +when it is created.
>
>  This is very simple and effective, but requires the user/system
> -administrator to explicitly specify affinities for each and every domain,
> +administrator to explicitly specify the pinning of each and every domain,
>  or Xen won't be able to guarantee the locality for their memory accesses.
>
> -Notice that this also pins the domain's vCPUs to the specified set of
> -pCPUs, so it not only sets the domain's node affinity (its memory will
> -come from the nodes to which the pCPUs belong), but at the same time
> -forces the vCPUs of the domain to be scheduled on those same pCPUs.
> +That, of course, also means the vCPUs of the domain will only be able to
> +execute on those same pCPUs.
>
>  ### Placing the guest automatically ###
>
> @@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries
>  to figure out on its own on which node(s) the domain could fit best.
>  If it finds one (some), the domain's node affinity get set to there,
>  and both memory allocations and NUMA aware scheduling (for the credit
> -scheduler and starting from Xen 4.3) will comply with it.
> +scheduler and starting from Xen 4.3) will comply with it. Starting from
> +Xen 4.4, this just means all the vCPUs of the domain will have the same
> +vCPU node-affinity, that is the outcome of such automatic "fitting"
> +procedure.
>
>  It is worthwhile noting that optimally fitting a set of VMs on the NUMA
>  nodes of an host is an incarnation of the Bin Packing Problem. In fact,
> @@ -143,22 +182,29 @@ any placement from happening:
>
>      libxl_defbool_set(&domain_build_info->numa_placement, false);
>
>  Also, if `numa_placement` is set to `true`, the domain must not
> -have any CPU affinity (i.e., `domain_build_info->cpumap` must
> -have all its bits set, as it is by default), or domain creation
> -will fail returning `ERROR_INVAL`.
> +have any vCPU pinning (i.e., `domain_build_info->cpumap` must have
> +all its bits set, as it is by default), or domain creation will fail
> +with an `ERROR_INVAL`.
>
>  Starting from Xen 4.3, in case automatic placement happens (and is
>  successful), it will affect the domain's node affinity and _not_ its
> -CPU affinity. Namely, the domain's vCPUs will not be pinned to any
> +vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
>  pCPU on the host, but the memory from the domain will come from the
>  selected node(s) and the NUMA aware scheduling (if the credit scheduler
> -is in use) will try to keep the domain there as much as possible.
> +is in use) will try to keep the domain's vCPUs there as much as possible.
>
>  Besides than that, looking and/or tweaking the placement algorithm
>  search "Automatic NUMA placement" in libxl\_internal.h.
>
>  Note this may change in future versions of Xen/libxl.
>
> +## Xen < 4.4 ##
> +
> +The concept of per-vCPU node-affinity has been introduced for the first
> +time in Xen 4.4. In Xen versions earlier than that, the node-affinity is
> +the same for the whole domain, that is to say the same for all the vCPUs
> +of the domain.
> +
> ## Xen < 4.3 ##
>
> As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 366d9b9..ae29945 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d)
>      for_each_cpu ( cpu, cpumask )
>          node_set(cpu_to_node(cpu), d->node_affinity);
>
> -    sched_set_node_affinity(d, &d->node_affinity);
> -
>      spin_unlock(&d->node_affinity_lock);
>
>      free_cpumask_var(online_affinity);
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index c53a36b..d228127 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -178,9 +178,6 @@ struct csched_dom {
>      struct list_head active_vcpu;
>      struct list_head active_sdom_elem;
>      struct domain *dom;
> -    /* cpumask translated from the domain's node-affinity.
> -     * Basically, the CPUs we prefer to be scheduled on. */
> -    cpumask_var_t node_affinity_cpumask;
>      uint16_t active_vcpu_count;
>      uint16_t weight;
>      uint16_t cap;
> @@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc)
>      list_del_init(&svc->runq_elem);
>  }
>
> -/*
> - * Translates node-affinity mask into a cpumask, so that we can use it during
> - * actual scheduling. That of course will contain all the cpus from all the
> - * set nodes in the original node-affinity mask.
> - *
> - * Note that any serialization needed to access mask safely is complete
> - * responsibility of the caller of this function/hook.
> - */
> -static void csched_set_node_affinity(
> -    const struct scheduler *ops,
> -    struct domain *d,
> -    nodemask_t *mask)
> -{
> -    struct csched_dom *sdom;
> -    int node;
> -
> -    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
> -    if ( unlikely(is_idle_domain(d)) )
> -        return;
> -
> -    sdom = CSCHED_DOM(d);
> -    cpumask_clear(sdom->node_affinity_cpumask);
> -    for_each_node_mask( node, *mask )
> -        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
> -                   &node_to_cpumask(node));
> -}
>
>  #define for_each_csched_balance_step(step) \
>      for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
> @@ -294,12 +265,12 @@ static void csched_set_node_affinity(
>
>  /*
>   * vcpu-affinity balancing is always necessary and must never be skipped.
> - * OTOH, if a domain's node-affinity is said to be automatically computed
> - * (or if it just spans all the nodes), we can safely avoid dealing with
> - * node-affinity entirely.
> + * OTOH, if the vcpu's numa-affinity is being automatically computed out of
> + * the vcpu's vcpu-affinity, or if it just spans all the nodes, we can
> + * safely avoid dealing with numa-affinity entirely.
>   *
> - * Node-affinity is also deemed meaningless in case it has empty
> - * intersection with mask, to cover the cases where using the node-affinity
> + * A vcpu's numa-affinity is also deemed meaningless in case it has empty
> + * intersection with mask, to cover the cases where using the numa-affinity
>   * mask seems legit, but would instead led to trying to schedule the vcpu
>   * on _no_ pcpu!
Typical use cases are for mask to be equal to the vcpu's
>   * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
> @@ -308,11 +279,9 @@ static void csched_set_node_affinity(
>  static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
>                                             const cpumask_t *mask)
>  {
> -    const struct domain *d = vc->domain;
> -    const struct csched_dom *sdom = CSCHED_DOM(d);
> -
> -    if (cpumask_full(sdom->node_affinity_cpumask)
> -        || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> +    if ( vc->auto_node_affinity == 1
> +         || cpumask_full(vc->node_affinity)
> +         || !cpumask_intersects(vc->node_affinity, mask) )

Isn't cpumask_full(vc->node_affinity) <=> vc->auto_node_affinity at this point?

Other than that, I think this whole patch looks good -- nice when you can add a feature and simplify the code. :-)

 -George
George Dunlap
2013-Nov-05 16:56 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 03:39 PM, George Dunlap wrote:
> On 11/05/2013 03:23 PM, Jan Beulich wrote:
>>>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com>
>>>>> wrote:
>>> Or, we could internally change the names to "cpu_hard_affinity" and
>>> "cpu_soft_affinity", since that's effectively what the scheduler will
>>> do. It's possible someone might want to set soft affinities for some
>>> other reason besides NUMA performance.
>>
>> I like that.
>
> A potential problem with that is the "auto domain numa" thing. In this
> patch, if the domain numa affinity is not set but vcpu numa affinity is,
> the domain numa affinity (which will be used to allocate memory for the
> domain) will be set based on the vcpu numa affinity. That seems like a
> useful feature (though perhaps it's starting to violate the "policy
> should be in the tools" principle). If we change this to just "hard
> affinity" and "soft affinity", we'll lose the natural logical connection
> there. It might have impacts on how we end up doing vNUMA as well. So
> I'm a bit torn ATM.
>
> Dario, any thoughts?

[Coming back after going through the whole series]

This is basically the main architectural question that needs to be sorted out with the series: Do we bake in that the "soft affinity" is specifically for NUMA-ness, or not?

The patch the way it is does make this connection, and that has several implications:
* There is no more concept of a separate "domain numa affinity" (Patch 06); the domain numa affinity is just a pre-calculated union of the vcpu affinities.
* The interface to this "soft affinity" is a bitmask of numa nodes, not a bitmask of cpus.

If we're OK with that direction, then I think this patch series looks pretty good.

Release-wise, I think as long as we're OK with libxl providing a "set_vcpu_numa_affinity", then we can always come back and change the implementation later if we want to maintain that distinction internally.

 -George
George Dunlap
2013-Nov-05 17:16 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 04:56 PM, George Dunlap wrote:
> On 11/05/2013 03:39 PM, George Dunlap wrote:
>> On 11/05/2013 03:23 PM, Jan Beulich wrote:
>>>>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com>
>>>>>> wrote:
>>>> Or, we could internally change the names to "cpu_hard_affinity" and
>>>> "cpu_soft_affinity", since that's effectively what the scheduler will
>>>> do. It's possible someone might want to set soft affinities for some
>>>> other reason besides NUMA performance.
>>>
>>> I like that.
>>
>> A potential problem with that is the "auto domain numa" thing. In this
>> patch, if the domain numa affinity is not set but vcpu numa affinity is,
>> the domain numa affinity (which will be used to allocate memory for the
>> domain) will be set based on the vcpu numa affinity. That seems like a
>> useful feature (though perhaps it's starting to violate the "policy
>> should be in the tools" principle). If we change this to just "hard
>> affinity" and "soft affinity", we'll lose the natural logical connection
>> there. It might have impacts on how we end up doing vNUMA as well. So
>> I'm a bit torn ATM.
>>
>> Dario, any thoughts?
>
> [Coming back after going through the whole series]
>
> This is basically the main architectural question that needs to be
> sorted out with the series: Do we bake in that the "soft affinity" is
> specifically for NUMA-ness, or not?
>
> The patch the way it is does make this connection, and that has several
> implications:
> * There is no more concept of a separate "domain numa affinity" (Patch
> 06); the domain numa affinity is just a pre-calculated union of the vcpu
> affinities.
> * The interface to this "soft affinity" is a bitmask of numa nodes, not
> a bitmask of cpus.
>
> If we're OK with that direction, then I think this patch series looks
> pretty good.

Just to outline what the alternative would look like: The hypervisor would focus on the minimum mechanisms required to do something useful for NUMA systems. The domain NUMA affinity would be only used for memory allocation. vcpus would only have "hard" and "soft" affinities. The toolstack (libxl? xl?) would be responsible for stitching these together into a useable interface for NUMA: e.g., it would have the concept of "numa affinity" for vcpus (or indeed, virtual NUMA topologies), and would do things like update the domain NUMA affinity based on vcpu affinities.

This would mean the toolstack either assuming, when someone calls vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping its own copy of numa_affinity for each vcpu around somewhere.

Alternately, we could punt on the NUMA interface altogether for this patch series, and wait until we can implement a full-featured vNUMA interface. That is, for this patch series, make an interface just do NUMA affinity for memory, and "soft" and "hard" affinities for vcpus. Then in another series (perhaps one shortly after that), implement a full vNUMA interface, with vcpus mapped to vNUMA nodes, and vNUMA nodes mapped to pNUMA nodes -- with the toolstack implementing all of this just using the soft affinities.

Thoughts?

 -George
Jan Beulich
2013-Nov-05 17:24 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 17:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 11/05/2013 03:39 PM, George Dunlap wrote:
>> A potential problem with that is the "auto domain numa" thing. In this
>> patch, if the domain numa affinity is not set but vcpu numa affinity is,
>> the domain numa affinity (which will be used to allocate memory for the
>> domain) will be set based on the vcpu numa affinity. That seems like a
>> useful feature (though perhaps it's starting to violate the "policy
>> should be in the tools" principle). If we change this to just "hard
>> affinity" and "soft affinity", we'll lose the natural logical connection
>> there. It might have impacts on how we end up doing vNUMA as well. So
>> I'm a bit torn ATM.
>
> [Coming back after going through the whole series]
>
> This is basically the main architectural question that needs to be
> sorted out with the series: Do we bake in that the "soft affinity" is
> specifically for NUMA-ness, or not?

I think that it would be beneficial if we could keep the soft affinity as a more abstract construct than just representing NUMAness, even if that means that domain node affinity won't be able to be fully "automated" anymore. But I'm saying that without having a specific use case in mind.

Jan
Jan Beulich
2013-Nov-05 17:30 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 18:16, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> This would mean the toolstack either assuming, when someone calls
> vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping
> its own copy of numa_affinity for each vcpu around somewhere.

Question is how reasonable it is for the tool stack to keep such information around and up-to-date, in particular considering node/CPU/memory hotplug. If that's reasonable, then your proposal below seems to be the right one architecturally, i.e. keeping policy entirely at the tools level.

Jan

> Alternately, we could punt on the NUMA interface altogether for this
> patch series, and wait until we can implement a full-featured vNUMA
> interface. That is, for this patch series, make an interface just do
> NUMA affinity for memory, and "soft" and "hard" affinities for vcpus.
> Then in another series (perhaps one shortly after that), implement a
> full vNUMA interface, with vcpus mapped to vNUMA nodes, and vNUMA nodes
> mapped to pNUMA nodes -- with the toolstack implementing all of this
> just using the soft affinities.
>
> Thoughts?
>
> -George
George Dunlap
2013-Nov-05 17:31 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 05:24 PM, Jan Beulich wrote:
>>>> On 05.11.13 at 17:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 11/05/2013 03:39 PM, George Dunlap wrote:
>>> A potential problem with that is the "auto domain numa" thing. In this
>>> patch, if the domain numa affinity is not set but vcpu numa affinity is,
>>> the domain numa affinity (which will be used to allocate memory for the
>>> domain) will be set based on the vcpu numa affinity. That seems like a
>>> useful feature (though perhaps it's starting to violate the "policy
>>> should be in the tools" principle). If we change this to just "hard
>>> affinity" and "soft affinity", we'll lose the natural logical connection
>>> there. It might have impacts on how we end up doing vNUMA as well. So
>>> I'm a bit torn ATM.
>>
>> [Coming back after going through the whole series]
>>
>> This is basically the main architectural question that needs to be
>> sorted out with the series: Do we bake in that the "soft affinity" is
>> specifically for NUMA-ness, or not?
>
> I think that it would be beneficial if we could keep the soft
> affinity as a more abstract construct than just representing
> NUMAness, even if that means that domain node affinity
> won't be able to be fully "automated" anymore. But I'm
> saying that without having a specific use case in mind.

People seem to have found all kinds of fun use cases for the "hard" pinning we have; it's just a handy tool to have around. Allowing an arbitrary "soft" pinning just seems like a similarly handy tool. I'm sure between our users and our downstreams (Suse, Oracle, XenServer, QubesOS, &c &c) someone will find a use for it. :-)

 -George
Dario Faggioli
2013-Nov-05 22:15 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 15:11 +0000, Jan Beulich wrote:
> >>> On 05.11.13 at 16:03, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > On 11/05/2013 02:52 PM, Jan Beulich wrote:
> >>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> >>> @@ -197,6 +199,13 @@ struct vcpu
> >>>      /* Used to restore affinity across S3. */
> >>>      cpumask_var_t cpu_affinity_saved;
> >>>
> >>> +    /*
> >>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
> >>> +     * and auto_node_affinity access is serialized against
> >>> +     * v->domain->node_affinity_lock.
> >>> +     */
> >>> +    cpumask_var_t node_affinity;
> >>
> >> This all looks quite sensible, except for the naming here: We
> >> already have a node_affinity field in struct domain, having a
> >> meaning that one can expect with this name. So you break
> >> both consistency and the rule of least surprise here. How
> >> about just "preferred_cpus"?
> >
> > Actually, would it make more sense to remove node_affinity from the
> > domain struct, and have the tools manually set the node_affinity for the
> > various vcpus if the user attempts to set the "domain numa affinity"?
>
> Quite likely, yes. But that doesn't mean that the name can be kept
> as is.

Not really. Well, it sure is possible, but it's a bit more involved than it sounds, as the node_affinity field in struct domain drives all the actual memory allocations, and it's maintained in a nodemask_t as that's easier to use in that circumstance.

Getting rid of it would mean going through the node_affinity of all the vcpus and building such a node mask every time (or something like that), which I think is both more code and less performance! :-P

(BTW, seeing the rest of the e-mails, I think you ended up realizing that, but still...)

So, as I'll say replying to the other messages, I'm fine with changing names, much less with removing d->node_affinity.
Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-05 22:22 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 15:11 +0000, George Dunlap wrote:
> On 11/05/2013 03:03 PM, George Dunlap wrote:
> > On 11/05/2013 02:52 PM, Jan Beulich wrote:
> >>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com>
> >>>>> wrote:
> >>> @@ -197,6 +199,13 @@ struct vcpu
> >>>      /* Used to restore affinity across S3. */
> >>>      cpumask_var_t cpu_affinity_saved;
> >>>
> >>> +    /*
> >>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
> >>> +     * and auto_node_affinity access is serialized against
> >>> +     * v->domain->node_affinity_lock.
> >>> +     */
> >>> +    cpumask_var_t node_affinity;
> >>
> >> This all looks quite sensible, except for the naming here: We
> >> already have a node_affinity field in struct domain, having a
> >> meaning that one can expect with this name. So you break
> >> both consistency and the rule of least surprise here. How
> >> about just "preferred_cpus"?
> >
> > Actually, would it make more sense to remove node_affinity from the
> > domain struct, and have the tools manually set the node_affinity for the
> > various vcpus if the user attempts to set the "domain numa affinity"?
>
> Sorry, speaking before I had thought it through. Of course we need a
> numa affinity for the domain for allocating memory.

Ah! And I'm also sorry I missed this one before replying! :-P

> How about, "cpu_node_affinity"?
>
> Or, we could internally change the names to "cpu_hard_affinity" and
> "cpu_soft_affinity", since that's effectively what the scheduler will
> do. It's possible someone might want to set soft affinities for some
> other reason besides NUMA performance.

That sounds reasonable to me, as long as we keep in mind that the upper layers would continue to call "hard affinity" either "vcpu affinity" (domctl, libxc and libxl) or "vcpu pinning" (xl, where the command name is 'vcpu-pin').
For the "soft affinity", I think the term "node affinity" or "NUMA node affinity" has already spread a little bit around the codebase, so even if there is no real libxc or libxl interface for that before this patch, we may have to live with that too.

Is this acceptable?

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-05 22:54 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 15:39 +0000, George Dunlap wrote:
> On 11/05/2013 03:23 PM, Jan Beulich wrote:
> >>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> >> Or, we could internally change the names to "cpu_hard_affinity" and
> >> "cpu_soft_affinity", since that's effectively what the scheduler will
> >> do. It's possible someone might want to set soft affinities for some
> >> other reason besides NUMA performance.
> >
> > I like that.

I like that too.

> A potential problem with that is the "auto domain numa" thing. In this
> patch, if the domain numa affinity is not set but vcpu numa affinity is,
> the domain numa affinity (which will be used to allocate memory for the
> domain) will be set based on the vcpu numa affinity.

That would indeed be an issue but, hopefully, even if we want to go the soft!=NUMA way, it just means that, at the toolstack level, two calls instead of one are required to properly place a guest on one/some NUMA node(s): one for the vcpus (setting the soft affinity) and one for the memory (setting the node affinity, which would then affect memory only).

Really, let's see if I succeed in coding it up tomorrow morning, but I don't expect too serious issues. The only thing that puzzles me is what the 'default behaviour' should be. I mean, if in a VM config file I find the following:

    cpus = "2,9"

And nothing else, related to hard, soft or memory affinity, what do I do?

I'd say, even if only for consistency with what we're doing right now (even without this patch), we should go all the way through both the calls above, and set both hard affinity and memory node affinity. If not, people doing this in 4.4 (or 4.5, if we miss the train) would experience different behaviour wrt the current one, which would be bad!

Thoughts?
Similar question: provided I implement something like libxl_set_vcpu_hardaffinity(), libxl_set_vcpu_softaffinity() and libxl_set_mem_nodeaffinity(), what should a call to the existing libxl_set_vcpuaffinity() do? As above, right now it affects both vcpu-pinning and node-affinity, so I'd say it should call both libxl_set_vcpu_hardaffinity() and libxl_set_mem_nodeaffinity(). Is that ok?

BTW, does the interface sketched above look reasonable? Someone else up for suggesting more fancy names? :-)

> That seems like a
> useful feature (though perhaps it's starting to violate the "policy
> should be in the tools" principle).

The current situation being that d->node_affinity is set based on vcpu-pinning, having it retain the same meaning and behaviour, apart from being set based on vcpu node-affinity, looked like the natural extension of it to me. However...

> If we change this to just "hard
> affinity" and "soft affinity", we'll lose the natural logical connection
> there.

...I think I see what you mean, and I agree that it wasn't that much of a violation if we stick to soft==NUMA (as implemented in the series). If we go for your proposal (which, again, I like), it'd indeed become hard to decide whether node affinity should derive from hard or soft affinity. :-)

> It might have impacts on how we end up doing vNUMA as well. So
> I'm a bit torn ATM.
>
> Dario, any thoughts?

As said above, I like the idea of separating soft affinities and NUMA-ness... I'll give it a try and either submit an updated series implementing such behaviour, or follow up on this conversation with any issues I find while doing so.
:-)

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-05 23:01 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 17:16 +0000, George Dunlap wrote:
> Just to outline what the alternative would look like: The hypervisor
> would focus on the minimum mechanisms required to do something useful
> for NUMA systems. The domain NUMA affinity would be only used for
> memory allocation. vcpus would only have "hard" and "soft" affinities.
> The toolstack (libxl? xl?) would be responsible for stitching these
> together into a useable interface for NUMA: e.g., it would have the
> concept of "numa affinity" for vcpus (or indeed, virtual NUMA
> topologies), and would do things like update the domain NUMA affinity
> based on vcpu affinities.

Yep, I confirm that I like it, and that I don't think this should be too difficult to implement. The hardest thing is going to be picking sensible names for the libxl functions and xl config options (and no smiley here, because I'm not joking!)

> This would mean the toolstack either assuming, when someone calls
> vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping
> its own copy of numa_affinity for each vcpu around somewhere.

Right.

> Alternately, we could punt on the NUMA interface altogether for this
> patch series, and wait until we can implement a full-featured vNUMA
> interface.

Well, yes, and let's fall back to that if implementing the above turns out to be too difficult or too long, ok?

> That is, for this patch series, make an interface just do
> NUMA affinity for memory, and "soft" and "hard" affinities for vcpus.
> Then in another series (perhaps one shortly after that), implement a
> full vNUMA interface, with vcpus mapped to vNUMA nodes, and vNUMA nodes
> mapped to pNUMA nodes -- with the toolstack implementing all of this
> just using the soft affinities.

I agree with everything. About the interactions the evolution of this series will have with vNUMA, I already talked with Elena in Edinburgh, and I'll be taking care of those.
Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-05 23:08 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 16:56 +0000, George Dunlap wrote:
> [Coming back after going through the whole series]
>
> This is basically the main architectural question that needs to be
> sorted out with the series: Do we bake in that the "soft affinity" is
> specifically for NUMA-ness, or not?

Damn... I was really hoping to be almost done with this series! Unfortunately, this point you're raising is a very good one and, even more unfortunately, I like the way in which you propose to solve it, so I've got to re-implement this thing, or --I'm sure of that-- I won't be able to get any sleep at night for months! :-P :-P :-P

> The patch the way it is does make this connection, and that has several
> implications:
> * There is no more concept of a separate "domain numa affinity" (Patch
> 06); the domain numa affinity is just a pre-calculated union of the vcpu
> affinities.
> * The interface to this "soft affinity" is a bitmask of numa nodes, not
> a bitmask of cpus.
>
> If we're OK with that direction, then I think this patch series looks
> pretty good.

We're not... As said, now that you put it this way, it really sounds so wrong that I have to fix it! :-)

> Release-wise, I think as long as we're OK with libxl providing a
> "set_vcpu_numa_affinity", then we can always come back and change the
> implementation later if we want to maintain that distinction internally.

That's true too. Let's see how far and how quickly I can get to implementing what you propose, and then decide.

BTW, if I can ask (and sorry if it's my fault for having missed it), what's the updated release schedule?
Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-05 23:12 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 17:30 +0000, Jan Beulich wrote:
> >>> On 05.11.13 at 18:16, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > This would mean the toolstack either assuming, when someone calls
> > vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping
> > its own copy of numa_affinity for each vcpu around somewhere.
>
> Question is how reasonable it is for the tool stack to keep such
> information around and up-to-date, in particular considering
> node/CPU/memory hotplug.

Right. Well, I may be wrong, and I won't know until I actually try coding it (first thing in the morning), but I honestly don't think there will be much to keep around at the toolstack level.

> If that's reasonable, then your
> proposal below seems to be the right one architecturally, i.e.
> keeping policy entirely at the tools level.

Agreed.

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-06 08:48 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
On mar, 2013-11-05 at 14:50 +0000, George Dunlap wrote:
> On 11/05/2013 02:34 PM, Dario Faggioli wrote:
> > Making it possible to use something like the following:
> >  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
> >  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
> >  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
> >    but not pCPU 6;
> >  * ...
> >
> > In both domain config file and `xl vcpu-pin'.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Overall looks like a pretty clean patch; just a few comments.

Ok.

> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> >
> > -To allow all the vcpus of the guest to run on cpus 0,2,3,5.
> > +To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining
> > +this with "all" is possible, meaning "all,^7" results in all the vcpus
> > +of the guest running on all the cpus on the host except cpu 7.
> > +
> > +=item "nodes:0-3,node:^2"
>
> Here you use both "nodes" and "node", while the code seems to only check
> for "nodes".

Right... This was a leftover from a previous version where I was actually checking for both, as well as trying to enforce singular/plural consistency (which, I agree with you, is overkill).

> I was originally going to say we should just check one;
> but on the other hand, it's just an extra string compare -- I feel like
> we might as well accept either "node" or "nodes". (No need to enforce
> plurality: "nodes:2" and "node:1-3" should both be fine with me.)
>

Right, I'll add the support for both then.

> > +static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> >  {
> > -    libxl_bitmap exclude_cpumap;
> > -    uint32_t cpuida, cpuidb;
> > -    char *endptr, *toka, *tokb, *saveptr = NULL;
> > -    int i, rc = 0, rmcpu;
> > +    unsigned long ida, idb;
> > +    libxl_bitmap node_cpumap;
> > +    bool is_not = false, is_nodes = false;
> > +    int rc = 0;
> > +
> > +    libxl_bitmap_init(&node_cpumap);
> >
> > -    if (!strcmp(cpu, "all")) {
> > +    rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0);
> > +    if (rc) {
> > +        fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
> > +        goto out;
> > +    }
> > +
> > +    /* Are we adding or removing cpus/nodes? */
> > +    if (STR_SKIP_PREFIX(str, "^")) {
> > +        is_not = true;
> > +    }
> > +
> > +    /* Are we dealing with cpus or full nodes? */
> > +    if (STR_SKIP_PREFIX(str, "nodes:")) {
> > +        is_nodes = true;
> > +    }
> > +
> > +    if (STR_HAS_PREFIX(str, "all")) {
>
> Is there any reason not to keep this "strcmp"? As it is, this will
> accept any string that *starts* with "all", which isn't exactly what you
> want, I don't think.

Ah, true... I'll revert this to strcmp().

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-06 09:15 UTC
Re: [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
On mar, 2013-11-05 at 16:20 +0000, George Dunlap wrote:
> On 11/05/2013 02:35 PM, Dario Faggioli wrote:
> > @@ -308,11 +279,9 @@ static void csched_set_node_affinity(
> >  static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
> >                                             const cpumask_t *mask)
> >  {
> > -    const struct domain *d = vc->domain;
> > -    const struct csched_dom *sdom = CSCHED_DOM(d);
> > -
> > -    if (cpumask_full(sdom->node_affinity_cpumask)
> > -        || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> > +    if ( vc->auto_node_affinity == 1
> > +         || cpumask_full(vc->node_affinity)
> > +         || !cpumask_intersects(vc->node_affinity, mask) )
>
> Isn't cpumask_full(vc->node_affinity) <=> vc->auto_node_affinity at this
> point?

It is. I'd be killing the cpumask_full() part, if it wasn't for the fact that we decided to re-architect the whole thing, and I'm doing the latter right now.

> Other than that, I think this whole patch looks good -- nice when you
> can add a feature and simplify the code. :-)

Nice indeed! :-)

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-06 09:39 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 17:16 +0000, George Dunlap wrote:
> Just to outline what the alternative would look like: The hypervisor
> would focus on the minimum mechanisms required to do something useful
> for NUMA systems. The domain NUMA affinity would be only used for
> memory allocation. vcpus would only have "hard" and "soft" affinities.
> The toolstack (libxl? xl?) would be responsible for stitching these
> together into a usable interface for NUMA: e.g., it would have the
> concept of "numa affinity" for vcpus (or indeed, virtual NUMA
> topologies), and would do things like update the domain NUMA affinity
> based on vcpu affinities.
>
> This would mean the toolstack either assuming, when someone calls
> vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping
> its own copy of numa_affinity for each vcpu around somewhere.

And to elaborate a bit more on what I said yesterday night, now that I
have the code in front of me, going for the above would actually mean
the following.

In domain.c we have domain_update_node_affinity(). What it does *before*
this series is calculating d->node_affinity based on all the vcpus'
cpu_affinity (i.e., pinning). What it does *after* this series is
calculating d->node_affinity based on the vcpus' node_affinity. (*)

Such a function is currently called, basically, when a new vcpu is
allocated (alloc_vcpu()), when a domain changes cpupool
(sched_move_domain()), and when the cpupool the domain is in changes
(cpupool_assign_cpu_locked() or cpupool_unassign_cpu()). That means that
all the above operations _automatically_ affect d->node_affinity.

Now, we're talking about killing vc->cpu_affinity and not introducing
vc->node_affinity and, instead, introducing vc->cpu_hard_affinity and
vc->cpu_soft_affinity and, more importantly, not linking any of the
above to d->node_affinity.
That means all the above operations _will_NOT_ automatically affect
d->node_affinity any longer, at least from the hypervisor (and, most
likely, libxc) perspective. OTOH, I'm almost sure that I can force libxl
(and xl) to retain the exact same behaviour it is exposing to the user
(just by adding an extra call when needed).

So, although all this won't be an issue for xl and libxl consumers (or,
at least, that's my goal), it will change how the hypervisor used to
behave in all those situations. This means that xl and libxl users will
see no change, while folks issuing hypercalls and/or libxc calls will.

Is that ok? I mean, I know there are no stability concerns for those
APIs, but still, is that an acceptable change?

Regards,
Dario

(*) yes, in both cases (before and after this series), it is possible
already that d->node_affinity is not automatically calculated, and that
it just sticks to something the toolstack provided. That will stay, so
it's pretty much irrelevant to this discussion... Actually, it won't
just "stay", it will become the sole and only case!
Jan Beulich
2013-Nov-06 09:46 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 06.11.13 at 10:39, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> Now, we're talking about killing vc->cpu_affinity and not introducing
> vc->node_affinity and, instead, introduce vc->cpu_hard_affinity and
> vc->cpu_soft_affinity and, more important, not to link any of the above
> to d->node_affinity. That means all the above operations _will_NOT_
> automatically affect d->node_affinity any longer, at least from the
> hypervisor (and, most likely, libxc) perspective. OTOH, I'm almost sure
> that I can force libxl (and xl) to retain the exact same behaviour it is
> exposing to the user (just by adding an extra call when needed).
>
> So, although all this won't be an issue for xl and libxl consumers (or,
> at least, that's my goal), it will change how the hypervisor used to
> behave in all those situations. This means that xl and libxl users will
> see no change, while folks issuing hypercalls and/or libxc calls will.
>
> Is that ok? I mean, I know there are no stability concerns for those
> APIs, but still, is that an acceptable change?

I would think that as long as d->node_affinity is empty, it should still
be set based on all vCPUs' hard affinities (as it is an error - possibly
to be interpreted as "do this for me" - to try to set an empty node
affinity, there's no conflict here). Or alternatively a flag could be
set once it got set, preventing further implicit updates.

Jan
Dario Faggioli
2013-Nov-06 10:00 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 09:46 +0000, Jan Beulich wrote:
> >>> On 06.11.13 at 10:39, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> > Now, we're talking about killing vc->cpu_affinity and not introducing
> > vc->node_affinity and, instead, introduce vc->cpu_hard_affinity and
> > vc->cpu_soft_affinity and, more important, not to link any of the above
> > to d->node_affinity. That means all the above operations _will_NOT_
> > automatically affect d->node_affinity any longer, at least from the
> > hypervisor (and, most likely, libxc) perspective. OTOH, I'm almost sure
> > that I can force libxl (and xl) to retain the exact same behaviour it is
> > exposing to the user (just by adding an extra call when needed).
> >
> > So, although all this won't be an issue for xl and libxl consumers (or,
> > at least, that's my goal), it will change how the hypervisor used to
> > behave in all those situations. This means that xl and libxl users will
> > see no change, while folks issuing hypercalls and/or libxc calls will.
> >
> > Is that ok? I mean, I know there are no stability concerns for those
> > APIs, but still, is that an acceptable change?
>
> I would think that as long as d->node_affinity is empty, it should
> still be set based on all vCPUs' hard affinities

I see, and that sounds sensible to me... It's mostly a matter of
deciding whether or not we want something like that and, if yes, whether
we want it based on hard or soft.

Personally, I think I agree with you on having it based on hard
affinities by default.

Let's see if George gets to say something before I get to that part of
the (re)implementation. :-)

> (as it is an error - possibly to be interpreted as "do this for me" -
> to try to set an empty node affinity, there's no conflict here). Or
> alternatively a flag could be set once it got set, preventing further
> implicit updates.

Sure. It's actually quite similar to that already.
Both in the tree and in the series, affinity to none is just an error,
affinity to all means "do it for me", and I do have the flag there. I
can easily change this into what you're suggesting (i.e., make none =>
"do it for me").

Regards,
Dario
Dario Faggioli
2013-Nov-06 11:41 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 15:11 +0000, George Dunlap wrote:
> How about, "cpu_node_affinity"?
>
> Or, we could internally change the names to "cpu_hard_affinity" and
> "cpu_soft_affinity", since that's effectively what the scheduler will
> do.

And about naming, what about introducing 'cpu_soft_affinity' and
sticking to just 'cpu_affinity' for the hard one? That way we could
leave XEN_DOMCTL_{set,get}vcpuaffinity alone and introduce
XEN_DOMCTL_{set,get}vcpusoftaffinity. It's not super pretty, but it
looks better than anything else I can think of.

In fact, I can of course rename 'cpu_affinity' to 'cpu_hard_affinity'
too, but that would mean having the following:

    case XEN_DOMCTL_setvcpuaffinity:
    case XEN_DOMCTL_getvcpuaffinity:
        ...
        if ( op->cmd == XEN_DOMCTL_setvcpuaffinity )
        {
            xenctl_bitmap_to_cpumask(&new_affinity,
                                     &op->u.vcpuaffinity.cpumap);
            vcpu_set_hard_affinity(v, new_affinity);
        }
        else
            cpumask_to_xenctl_bitmap(&op->u.vcpuaffinity.cpumap,
                                     v->cpu_hard_affinity);
        ...

    case XEN_DOMCTL_setvcpusoftaffinity:
    case XEN_DOMCTL_getvcpusoftaffinity:
        ...
        if ( op->cmd == XEN_DOMCTL_setvcpusoftaffinity )
        {
            xenctl_bitmap_to_cpumask(&new_affinity,
                                     &op->u.vcpuaffinity.cpumap);
            vcpu_set_soft_affinity(v, new_affinity);
        }
        else
            cpumask_to_xenctl_bitmap(&op->u.vcpuaffinity.cpumap,
                                     v->cpu_soft_affinity);
        ...

I.e., *vcpusoftaffinity manipulates cpu_soft_affinity while
*vcpuaffinity manipulates cpu_hard_affinity, which looks asymmetrical,
in addition to not being pretty. And, of course, the same will also
happen at the higher (libxc and libxl) levels.

There's of course the option of renaming XEN_DOMCTL_setvcpuaffinity to
XEN_DOMCTL_setvcpuhardaffinity (and, while there, put some '_' in
it!)... But is that really an option? And even if it is an option for
the hypervisor and, perhaps, libxc, I'm quite sure it's not for
libxl... Am I wrong?
Thanks and Regards,
Dario
George Dunlap
2013-Nov-06 11:44 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 06/11/13 10:00, Dario Faggioli wrote:
> On mer, 2013-11-06 at 09:46 +0000, Jan Beulich wrote:
>>>>> On 06.11.13 at 10:39, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> Now, we're talking about killing vc->cpu_affinity and not introducing
>>> vc->node_affinity and, instead, introduce vc->cpu_hard_affinity and
>>> vc->cpu_soft_affinity and, more important, not to link any of the above
>>> to d->node_affinity. That means all the above operations _will_NOT_
>>> automatically affect d->node_affinity any longer, at least from the
>>> hypervisor (and, most likely, libxc) perspective. OTOH, I'm almost sure
>>> that I can force libxl (and xl) to retain the exact same behaviour it is
>>> exposing to the user (just by adding an extra call when needed).
>>>
>>> So, although all this won't be an issue for xl and libxl consumers (or,
>>> at least, that's my goal), it will change how the hypervisor used to
>>> behave in all those situations. This means that xl and libxl users will
>>> see no change, while folks issuing hypercalls and/or libxc calls will.
>>>
>>> Is that ok? I mean, I know there are no stability concerns for those
>>> APIs, but still, is that an acceptable change?
>> I would think that as long as d->node_affinity is empty, it should
>> still be set based on all vCPUs' hard affinities
>>
> I see, and that sounds sensible to me... It's mostly a matter of
> deciding whether or not we want something like that and, if yes,
> whether we want it based on hard or soft.
>
> Personally, I think I agree with you on having it based on hard
> affinities by default.
>
> Let's see if George gets to say something before I get to that part of
> the (re)implementation. :-)

I would probably have it based on soft affinities, since that's where we
expect to have the domain's vcpus actually running most of the time; but
it's really a bike-shed issue, and something we can change / adjust in
the future.
(Although I suppose ideal behavior would be for the allocator to have
three levels of preference instead of just two: allocate from soft
affinity first; if that's not available, allocate from hard affinity;
and finally allocate wherever you can find ram. But that's probably more
work than it's worth at this point.)

So what's the plan now Dario? You're going to re-spin the patch series
to just do hard and soft affinities at the HV level, plumbing the
results through the toolstack?

I think for now I might advise putting off doing a NUMA interface at the
libxl level, and do a full vNUMA interface in another series (perhaps
for 4.5, depending on the timing).

 -George
Dario Faggioli
2013-Nov-06 14:26 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 11:44 +0000, George Dunlap wrote:
> On 06/11/13 10:00, Dario Faggioli wrote:
> > I see, and that sounds sensible to me... It's mostly a matter of
> > deciding whether or not we want something like that and, if yes,
> > whether we want it based on hard or soft.
> >
> > Personally, I think I agree with you on having it based on hard
> > affinities by default.
> >
> > Let's see if George gets to say something before I get to that part
> > of the (re)implementation. :-)
>
> I would probably have it based on soft affinities, since that's where
> we expect to have the domain's vcpus actually running most of the time;

True. However, doing that would rule out cpupools and vcpu-pinning.
That is to say, if you create a domain in a pool or with its vcpus
pinned (by specifying "pool=" or "cpus=" in the config file), it'll get
its memory striped over all the nodes. In fact, in both these cases,
there really is no soft affinity involved.

This is bad, because people may already be used to creating a domain
with "cpus=", and having the memory allocated from the relevant nodes.
OTOH, there is no similar thing (i.e., a user interface/API) for soft
affinities, and the way I was using it at the xl and libxl level for the
sake of NUMA performance, I can easily implement there on top of soft
affinities.

So, in summary, I'd have liked to have it based on soft affinity too,
but I think that would break more things than having it based on hard
ones.

> but it's really a bike-shed issue, and something we can change /
> adjust in the future.

That is also true, indeed.

> (Although I suppose ideal behavior would be for the allocator to have
> three levels of preference instead of just two: allocate from soft
> affinity first; if that's not available, allocate from hard affinity;
> and finally allocate wherever you can find ram. But that's probably
> more work than it's worth at this point.)
I like this too, but that's definitely something longer term than this
or next week.

> So what's the plan now Dario? You're going to re-spin the patch series
> to just do hard and soft affinities at the HV level, plumbing the
> results through the toolstack?

Yes, that was what I had in mind, and what I already started working on
(see the other mail I sent before lunch about (re)naming). I should be
able to craft and send something by either today or tomorrow.

> I think for now I might advise putting off doing a NUMA interface at
> the libxl level, and do a full vNUMA interface in another series
> (perhaps for 4.5, depending on the timing).

Well, I agree that all this is of very little use without vNUMA but, at
the same time, it's not necessarily only useful for it. Also, whether it
is vcpu-node-affinity or soft-affinity, if it is not wired up properly
to the higher layers, there's very little point in having the HV part
only... So my idea was to redo and resend everything, including libxl
and xl bits.

Of course, that doesn't mean we must necessarily have this for 4.4
(although I think it would still be feasible), just that we either check
in or wait for both the implementation and the interface. Again, how's
the updated release schedule?

Thanks and Regards,
Dario
George Dunlap
2013-Nov-06 14:47 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 06/11/13 11:41, Dario Faggioli wrote:
> There's of course the option of renaming XEN_DOMCTL_setvcpuaffinity to
> XEN_DOMCTL_setvcpuhardaffinity (and, while there, put some '_' in
> it!)... But is that really an option? And even if it is an option for
> the hypervisor and, perhaps, libxc, I'm quite sure it's not for
> libxl... Am I wrong?

I think vcpuaffinity is a domctl, which means it's an interface to
libxc, and thus malleable. So I *think* we could actually just add a
parameter to vcpuaffinity to say "hard" or "soft". (Correct me if I'm
wrong, Jan.)

Where we need to consider backwards compatibility is in the libxl
interface. I'm not sure what the best thing to do there is. I think we
can't break the existing API, so we need to keep
libxl_set_vcpuaffinity() available for old callers. Options include:

 * Just add libxl_set_vcpuaffinity_soft(), and leave the first one as
   it is
 * Add libxl_set_vcpu_affinity_[something]() which takes a soft/hard
   flag, and deprecate the existing one
 * Redefine libxl_set_vcpuaffinity() to include a soft/hard flag,
   bumping the API version and having a work-around for old callers.

IanJ / IanC, any opinions?

 -George
George Dunlap
2013-Nov-06 14:56 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 06/11/13 14:26, Dario Faggioli wrote:
> On mer, 2013-11-06 at 11:44 +0000, George Dunlap wrote:
>> On 06/11/13 10:00, Dario Faggioli wrote:
>>> I see, and that sounds sensible to me... It's mostly a matter of
>>> deciding whether or not we want something like that and, if yes,
>>> whether we want it based on hard or soft.
>>>
>>> Personally, I think I agree with you on having it based on hard
>>> affinities by default.
>>>
>>> Let's see if George gets to say something before I get to that part
>>> of the (re)implementation. :-)
>> I would probably have it based on soft affinities, since that's where
>> we expect to have the domain's vcpus actually running most of the
>> time;
>>
> True. However, doing that would rule out cpupools and vcpu-pinning.

I guess I was assuming that a vcpu's soft affinity would always be
considered a subset of its hard affinity and the cpus in its cpupool.

>> I think for now I might advise putting off doing a NUMA interface at
>> the libxl level, and do a full vNUMA interface in another series
>> (perhaps for 4.5, depending on the timing).
>>
> Well, I agree that all this is of very little use without vNUMA but,
> at the same time, it's not necessarily only useful for it. Also,
> whether it is vcpu-node-affinity or soft-affinity, if it is not wired
> up properly to the higher layers, there's very little point in having
> the HV part only... So my idea was to redo and resend everything,
> including libxl and xl bits.
>
> Of course, that doesn't mean we must necessarily have this for 4.4
> (although I think it would still be feasible), just that we either
> check in or wait for both the implementation and the interface. Again,
> how's the updated release schedule?

Well we definitely should plumb through access to the soft affinity
directly through libxl. The question is whether it would be useful to
have a libxl per-vcpu *numa* affinity that is not yet a full vNUMA
implementation.
If long-term we want to have vcpu -> vnode, and then vnode -> pnode,
then having a direct vcpu -> pnode interface in the interim is probably
not particularly useful, I don't think.

Re the release schedule: we agreed to move the code freeze to November
18, so we've got just under two weeks. I think keeping the libxl side
simple will allow us to at least get the "soft affinity" stuff in for
4.4, whether the vNUMA stuff makes it or not.

 -George
Jan Beulich
2013-Nov-06 15:14 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 06.11.13 at 15:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 06/11/13 14:26, Dario Faggioli wrote:
>> On mer, 2013-11-06 at 11:44 +0000, George Dunlap wrote:
>>> On 06/11/13 10:00, Dario Faggioli wrote:
>>>> I see, and that sounds sensible to me... It's mostly a matter of
>>>> deciding whether or not we want something like that and, if yes,
>>>> whether we want it based on hard or soft.
>>>>
>>>> Personally, I think I agree with you on having it based on hard
>>>> affinities by default.
>>>>
>>>> Let's see if George gets to say something before I get to that part
>>>> of the (re)implementation. :-)
>>> I would probably have it based on soft affinities, since that's where
>>> we expect to have the domain's vcpus actually running most of the
>>> time;
>>>
>> True. However, doing that would rule out cpupools and vcpu-pinning.
>
> I guess I was assuming that a vcpu's soft affinity would always be
> considered a subset of its hard affinity and the cpus in its cpupool.

For CPU pools I agree, but didn't we mean hard affinity to control
memory allocation, and soft affinity scheduling decisions (in which case
there's no strict ordering between the two)? If not, I guess I'd need a
brief but clear definition of what "soft" and "hard" are supposed to
represent...

Jan
George Dunlap
2013-Nov-06 16:12 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 06/11/13 15:14, Jan Beulich wrote:
>>>> On 06.11.13 at 15:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 06/11/13 14:26, Dario Faggioli wrote:
>>> On mer, 2013-11-06 at 11:44 +0000, George Dunlap wrote:
>>>> On 06/11/13 10:00, Dario Faggioli wrote:
>>>>> I see, and that sounds sensible to me... It's mostly a matter of
>>>>> deciding whether or not we want something like that and, if yes,
>>>>> whether we want it based on hard or soft.
>>>>>
>>>>> Personally, I think I agree with you on having it based on hard
>>>>> affinities by default.
>>>>>
>>>>> Let's see if George gets to say something before I get to that
>>>>> part of the (re)implementation. :-)
>>>> I would probably have it based on soft affinities, since that's
>>>> where we expect to have the domain's vcpus actually running most of
>>>> the time;
>>>>
>>> True. However, doing that would rule out cpupools and vcpu-pinning.
>> I guess I was assuming that a vcpu's soft affinity would always be
>> considered a subset of its hard affinity and the cpus in its cpupool.
> For CPU pools I agree, but didn't we mean hard affinity to control
> memory allocation, and soft affinity scheduling decisions (in which
> case there's no strict ordering between the two)? If not, I guess I'd
> need a brief but clear definition of what "soft" and "hard" are
> supposed to represent...

We have two "affinities" at the moment:

* per-vcpu cpu affinity. This is a scheduling construct, and is used to
  restrict vcpus to run on specific pcpus. This is what we would call
  hard affinity -- "hard" because vcpus are *not allowed* to run on cpus
  not in this mask.

* Domain NUMA affinity. As of 4.3 this has two functions: both a memory
  allocation function and a scheduling function. For the scheduling
  function, it acts as a "soft affinity", but it's domain-wide, rather
  than being per-vcpu.
It's "soft" because the scheduler will *try* to run it on pcpus in that
mask, but if it cannot, it *will allow* them to run on pcpus not in that
mask. At the moment you can set this manually, or if it's not set, then
Xen will set it automatically based on the vcpu hard affinities (and, I
think, the cpupool that it's in).

What we're proposing is to remove the domain-wide soft affinity and
replace it with a per-vcpu soft affinity. That way each vcpu has some
pcpus that it prefers (in the soft affinity *and* the hard affinity),
some that it doesn't prefer but is OK with (in the hard affinity but not
the soft affinity), and some that it cannot run on at all (not in the
hard affinity).

The question Dario has is this: given that we now have per-vcpu hard and
soft scheduling affinity, how should we automatically construct the
per-domain memory allocation affinity, if at all? Should we construct it
from the "hard" scheduling affinities, or from the "soft" scheduling
affinities?

I said that I thought we should use the soft affinity; but I really
meant the "effective soft affinity" -- i.e., the union of soft, hard,
and cpupools.

 -George
Dario Faggioli
2013-Nov-06 16:20 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 15:14 +0000, Jan Beulich wrote:
> >>> On 06.11.13 at 15:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > I guess I was assuming that a vcpu's soft affinity would always be
> > considered a subset of its hard affinity and the cpus in its cpupool.

Well, we're doing all this talking with the specific goal of
disentangling soft from hard from memory affinities, aren't we?

However, yes, for both logical and implementation reasons, a domain that
lives in a pool should not have either soft or hard affinity to any pcpu
which is outside of the pool. For soft affinity being a subset of hard
affinity, not so much. Meaning that it is possible (it was possible in
this series, and I think it should remain possible in its new
incarnation) to have completely disjoint sets of pcpus as hard and soft
affinities, at least at the hypervisor level. Then, of course, higher in
the toolstack we can come up with proper policing, but that's a
different issue.

Or are you (George) saying that if domain A has soft affinity to pcpus
0-15, setting its hard affinity to 0-7 should affect the soft affinity
(and constrain it within that range) too? I'm almost sure we had a
similar discussion in another thread, and it was you that argued that we
should not introduce these kinds of dependencies, where the user changes
parameter X and this silently results in Y changing too...

> For CPU pools I agree, but didn't we mean hard affinity to control
> memory allocation, and soft affinity scheduling decisions (in which
> case there's no strict ordering between the two)? If not, I guess I'd
> need a brief but clear definition of what "soft" and "hard" are
> supposed to represent...

Well, let me try.
Both hard and soft affinity are all about scheduling, with the following
meaning:

 - hard: the *allowed* pcpus
 - soft: the *preferred* pcpus

If there are no free *preferred* pcpus, as well as if the soft affinity
mask is empty or has an empty intersection with the hard one, a pcpu
from the *allowed* ones is picked up by the scheduler.

Memory allocation is controlled by (let's call it that) node affinity.
There will be specific (hyper)calls to set it explicitly, but the point
here is, if none of those is invoked, how should it look?

My opinion is that, doing as you said, having it based on hard
affinities makes things easier and quicker to implement, and guarantees
the most consistent behaviour wrt the current one. Thus, although I know
we don't care much about backward compatibility at this level, I'd go
for it.

Thanks and Regards,
Dario
Jan Beulich
2013-Nov-06 16:22 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 06.11.13 at 17:12, George Dunlap <george.dunlap@eu.citrix.com> wrote: > The question Dario has is this: given that we now have per-vcpu hard and > soft scheduling affinity, how should we automatically construct the > per-domain memory allocation affinity, if at all? Should we construct > it from the "hard" scheduling affinities, or from the "soft" scheduling > affinities? > > I said that I thought we should use the soft affinity; but I really > meant the "effective soft affinity" -- i.e., the union of soft, hard, > and cpupools.Actually I think trying on the most narrow set first (soft) and then widening (hard, anywhere) is indeed the most sensible approach then. Jan
Dario Faggioli
2013-Nov-06 16:23 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 14:56 +0000, George Dunlap wrote:
> On 06/11/13 14:26, Dario Faggioli wrote:
> > Well, I agree that all this is of very little use without vNUMA but,
> > at the same time, it's not necessarily only useful for it. Also,
> > whether it is vcpu-node-affinity or soft-affinity, if it is not
> > wired up properly to the higher layers, there's very little point in
> > having the HV part only... So my idea was to redo and resend
> > everything, including libxl and xl bits.
> >
> > Of course, that doesn't mean we must necessarily have this for 4.4
> > (although I think it would still be feasible), just that we either
> > check in or wait for both the implementation and the interface.
> > Again, how's the updated release schedule?
>
> Well we definitely should plumb through access to the soft affinity
> directly through libxl. The question is whether it would be useful to
> have a libxl per-vcpu *numa* affinity that is not yet a full vNUMA
> implementation.

Ah, ok, I didn't read your comment this way, sorry. :-) Now that you
explained it, yes, I think we're on the same page, and I'm more than
happy to leave the libx?_{set,get}vcpu_node_affinity() stuff out of the
new series.

> If long-term we want to have vcpu -> vnode, and then vnode -> pnode,
> then having a direct vcpu -> pnode interface in the interim is
> probably not particularly useful, I don't think.

Indeed.

> Re the release schedule: we agreed to move the code freeze to November
> 18, so we've got just under two weeks.

Ok, that was my recollection from the developer meeting, but I just
wanted to be sure.

> I think keeping the libxl side simple will allow us to at least get
> the "soft affinity" stuff in for 4.4, whether the vNUMA stuff makes it
> or not.

Agreed.
Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-06 16:48 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 16:12 +0000, George Dunlap wrote:
> We have two "affinities" at the moment:
> * per-vcpu cpu affinity. This is a scheduling construct, and is used
> to restrict vcpus to run on specific pcpus. This is what we would call
> hard affinity -- "hard" because vcpus are *not allowed* to run on cpus
> not in this mask.
> * Domain NUMA affinity. As of 4.3 this has two functions: both a
> memory allocation function and a scheduling function. For the
> scheduling function, it acts as a "soft affinity", but it's
> domain-wide, rather than being per-vcpu. It's "soft" because the
> scheduler will *try* to run it on pcpus in that mask, but if it
> cannot, it *will allow* them to run on pcpus not in that mask.
>
> At the moment you can set this manually, or if it's not set, then Xen
> will set it automatically based on the vcpu hard affinities (and I
> think the cpupool that it's in).
>
> What we're proposing is to remove the domain-wide soft affinity and
> replace it with a per-vcpu soft affinity. That way each vcpu has some
> pcpus that it prefers (in the soft affinity *and* the hard affinity),
> some that it doesn't prefer but is OK with (hard affinity but not soft
> affinity), and some that it cannot run on at all (not in the hard
> affinity).

Great explanation... A bit longer but *much* better than mine. :-)

> The question Dario has is this: given that we now have per-vcpu hard
> and soft scheduling affinity, how should we automatically construct
> the per-domain memory allocation affinity, if at all? Should we
> construct it from the "hard" scheduling affinities, or from the "soft"
> scheduling affinities?

Exactly.

> I said that I thought we should use the soft affinity; but I really
> meant the "effective soft affinity" -- i.e., the union of soft, hard,
> and cpupools.

Aha... I see it now... Only one more thing: union or intersection?
Union would be nice, especially because there are very few risks of it being empty, but it may be too broad (if one of the tree is ''all cpus/nodes'', so will the memory affinity). I''d go for the intersection, with some extra care to avoid degeneration into the empty set. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-06 16:53 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 14:47 +0000, George Dunlap wrote:
> On 06/11/13 11:41, Dario Faggioli wrote:
> > There's of course the option of renaming XEN_DOMCTL_setvcpuaffinity to
> > XEN_DOMCTL_setvcpuhardaffinity (and, while there, put some '_' in it!)...
> > But is that really an option? And even if it is an option for the
> > hypervisor, and perhaps, libxc, I'm quite sure it's not for libxl... Am
> > I wrong?
>
> I think vcpuaffinity is a domctl, which means it's an interface to libxc,
> and thus malleable. So I *think* we could actually just add a parameter
> to vcpuaffinity to say "hard" or "soft". (Correct me if I'm wrong, Jan.)
>
Ok, that certainly makes it a lot easier than I thought. I'll take it that this also holds for xc_vcpu_setaffinity(), and will add a flag there too.

> Where we need to consider backwards compatibility is in the libxl
> interface.
>
Sure.

> I'm not sure what the best thing to do there is. I think we
> can't break the existing API, so we need to keep
> libxl_set_vcpuaffinity() available for old callers. Options include:
> * Just add libxl_set_vcpuaffinity_soft(), and leave the first one as it is
>
All three of these look sensible to me. Personally, I think I like this first one the most. Anyway...

> * Add libxl_set_vcpu_affinity_[something]() which takes a soft/hard
> flag, and deprecate the existing one
> * Redefine libxl_set_vcpuaffinity() to include a soft/hard flag, bumping
> the API version and having a work-around for old callers.
>
> IanJ / IanC, any opinions?
>
... Sure, let's hear the maintainers! :-)

Thanks and Regards, Dario
Ian Jackson
2013-Nov-07 18:17 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
Dario Faggioli writes ("[PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning"):
> Making it possible to use something like the following:
>  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
>  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
>  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
>    but not pCPU 6;
>  * ...

Thanks. This parsing is a lot clearer now.

> @@ -59,6 +59,11 @@
>  }                                                                 \
>  })
>
> +#define STR_HAS_PREFIX( a, b ) \
> +    ( strncmp(a, b, strlen(b)) == 0 )
> +#define STR_SKIP_PREFIX( a, b ) \
> +    ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL )

I think it might be worth making the type of STR_SKIP_PREFIX explicitly boolean. Eg,

  + ( STR_HAS_PREFIX(a, b) ? ((a) += strlen(b), 1) : 0 )

Since the returned pointer value isn't very helpful.

> -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
> +{
> +    char *nstr, *endptr;

Missing consts?

> +    if (STR_HAS_PREFIX(str, "all")) {
>          libxl_bitmap_set_any(cpumap);
> -        return 0;
> +        goto out;

I think this does the wrong thing with "^all".

> +    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
> +         ptr = strtok_r(NULL, ",", &saveptr)) {

A minor style complaint: if you are going to split two of these three items onto their own line, please give them all their own line.

Thanks, Ian.
Ian Jackson
2013-Nov-07 18:27 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("[PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> by providing the proper get/set interfaces and wiring them
> to the new domctl-s from the previous commit.
...
> +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);

This looks good to me, but I would like Ian C to check that the use of the hypercall buffer machinery is correct.

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

Ian.
Ian Jackson
2013-Nov-07 18:29 UTC
Re: [PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("[PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> by providing the proper get/set interfaces and wiring them
> to the new libxc calls from the previous commit.
>
> For the 'get' part, the node-affinity of all the vcpus of a
> domain is also reported via libxl_list_vcpu() (exactly as it
> is happening for their vcpu affinity already), adding a
> specific field to libxl_vcpuinfo.

This is a change to the libxl ABI. I think you should bump the MAJOR number in this patch (or a previous one). Otherwise it looks good.

Ian.
Ian Jackson
2013-Nov-07 18:33 UTC
Re: [PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("[PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> by showing it, upon request ('-n' switch), as a part of the
> output of `xl vcpu-list'. Such output looks like this:
...
> # xl vcpu-list -n
>
> Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity / NUMA Affinity
> Domain-0    0     0    8    -b-      4.4  any cpu / any node
> Domain-0    0     1    0    -b-      1.2  any cpu / any node
> vm-test     1     0    8    -b-      2.4  any cpu / 1
> vm-test     1     1   13    -b-      3.3  any cpu / 1
>
> The "vcpu-pinning / vcpu-affinity" part may indeed look
> weird, as it is different from the other fields.
> Unfortunately, it's very hard to properly format it in
> columns, since there is no such thing as a minimal or
> maximal length for the bitmaps being printed.

Can you use the same syntax for printing as would be legal in the config file?

> While at it, the implementation of `xl vcpu-list' is reworked
> a little bit, making it more similar to the one of `xl list'
> and more compliant with the libxl programming paradigm (e.g.,
> regarding allocating at the beginning and freeing on exit).

Can you do this in a pre-patch?

> xl manual page is updated accordingly.

> -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap)

I don't understand how this set of changes relates to this patch.

Thanks, Ian.
Ian Jackson
2013-Nov-07 18:35 UTC
Re: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
Dario Faggioli writes ("[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file"):
> in a similar way to how it is possible to specify vcpu-affinity.
...
>  /*
> - * Check if the domain has any CPU affinity. If not, try to build
> - * up one. In case numa_place_domain() find at least a suitable
> - * candidate, it will affect info->nodemap accordingly; if it
> - * does not, it just leaves it as it is. This means (unless
> - * some weird error manifests) the subsequent call to
> - * libxl_domain_set_nodeaffinity() will do the actual placement,
> + * Check if the domain has any pinning or node-affinity and, if not, try
> + * to build up one.
> + *
> + * In case numa_place_domain() find at least a suitable candidate, it will
> + * affect info->nodemap accordingly; if it does not, it just leaves it as
> + * it is. This means (unless some weird error manifests) the subsequent
> + * call to libxl_domain_set_nodeaffinity() will do the actual placement,
>   * whatever that turns out to be.
>   */
>  if (libxl_defbool_val(info->numa_placement)) {
>
> -    if (!libxl_bitmap_is_full(&info->cpumap)) {
> +    if (!libxl_bitmap_is_full(&info->cpumap) ||
> +        !libxl_bitmap_is_full(&info->nodemap)) {
>          LOG(ERROR, "Can run NUMA placement only if no vcpu "
> -             "affinity is specified");
> +             "pinning or node-affinity is specified");

I appreciate I may be a bit late with this complaint, but surely this approach (setting the cpumap and nodemap when they aren't all-bits-set) means that it's not possible to explicitly set "all cpus". If I say in the config file "all" for cpus and "all" for nodes, this code will override that. I don't think that can be right.

Ian.
Dario Faggioli
2013-Nov-08 09:18 UTC
Re: [PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
On gio, 2013-11-07 at 18:29 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> > by providing the proper get/set interfaces and wiring them
> > to the new libxc calls from the previous commit.
> >
> > For the 'get' part, the node-affinity of all the vcpus of a
> > domain is also reported via libxl_list_vcpu() (exactly as it
> > is happening for their vcpu affinity already), adding a
> > specific field to libxl_vcpuinfo.
>
> This is a change to the libxl ABI. I think you should bump the MAJOR
> number in this patch (or a previous one).
>
Oh, do I? I may be missing something then. Every time I've done something similar to this before, I did provide the proper LIBXL_HAVE_* stuff (which this patch does), for the sake of _API_ compatibility, and that was it. :-(

I've just looked up in libxl.h and found the comment block about ABI compatibility and SONAME bumping but (sorry) I'm not 100% sure I fully understand what it says. Does it (and you) mean that, since this is the first change since 4.3 that breaks ABI compatibility, I should include a patch to tools/libxl/Makefile doing something like:

  -MAJOR=4.3
  +MAJOR=4.4

?

If yes, anything more than that?

Thanks and Regards, Dario
Dario Faggioli
2013-Nov-08 09:24 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
On gio, 2013-11-07 at 18:17 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning"):
> > Making it possible to use something like the following:
> >  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
> >  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
> >  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
> >    but not pCPU 6;
> >  * ...
>
> Thanks. This parsing is a lot clearer now.
>
Good to hear that, again (yep, you said this before, in a previous review of a different series still including this patch :-D).

> > @@ -59,6 +59,11 @@
> >  }                                                                 \
> >  })
> >
> > +#define STR_HAS_PREFIX( a, b ) \
> > +    ( strncmp(a, b, strlen(b)) == 0 )
> > +#define STR_SKIP_PREFIX( a, b ) \
> > +    ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL )
>
> I think it might be worth making the type of STR_SKIP_PREFIX
> explicitly boolean. Eg,
>   + ( STR_HAS_PREFIX(a, b) ? ((a) += strlen(b), 1) : 0 )
>
> Since the returned pointer value isn't very helpful.
>
Good point. Will do.

> > -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> > +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
> > +{
> > +    char *nstr, *endptr;
>
> Missing consts?
>
Is it? For instance, I'm using endptr like this:

  *a = *b = strtoul(str, &endptr, 10);

And the prototype of strtoul is:

  unsigned long int strtoul(const char *nptr, char **endptr, int base);

So it won't work, not for endptr at least. What I can do is the following:

  const char *nstr;
  char *endptr;

And I will, if you think it's better.

> > +    if (STR_HAS_PREFIX(str, "all")) {
> >          libxl_bitmap_set_any(cpumap);
> > -        return 0;
> > +        goto out;
>
> I think this does the wrong thing with "^all".
>
Good point again. I'll fix that.

> > +    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
> > +         ptr = strtok_r(NULL, ",", &saveptr)) {
>
> A minor style complaint: If you are going to split two of these three
> items onto their own line, please give them all their own line.
>
You mean ptr should have its own line? Like this?

  + for (ptr = strtok_r(cpu, ",", &saveptr);
  +      ptr;
  +      ptr = strtok_r(NULL, ",", &saveptr)) {

If yes, I sure can do that, although the result really looks super unpleasant to me (but that's a matter of taste, I guess).

Thanks and Regards, Dario
Dario Faggioli
2013-Nov-08 09:33 UTC
Re: [PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
On gio, 2013-11-07 at 18:33 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> > by showing it, upon request ('-n' switch), as a part of the
> > output of `xl vcpu-list'. Such output looks like this:
> ...
> > # xl vcpu-list -n
> >
> > Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity / NUMA Affinity
> > Domain-0    0     0    8    -b-      4.4  any cpu / any node
> > Domain-0    0     1    0    -b-      1.2  any cpu / any node
> > vm-test     1     0    8    -b-      2.4  any cpu / 1
> > vm-test     1     1   13    -b-      3.3  any cpu / 1
> >
> > The "vcpu-pinning / vcpu-affinity" part may indeed look
> > weird, as it is different from the other fields.
> > Unfortunately, it's very hard to properly format it in
> > columns, since there is no such thing as a minimal or
> > maximal length for the bitmaps being printed.
>
> Can you use the same syntax for printing as would be legal in
> the config file?
>
That is actually what is happening, for single cpu/node bitmaps. The problem arises when I need to print two bitmaps, one next to the other, like the cpumap for vCPU affinity and the nodemap for NUMA affinity above.

The thing is, I don't think there is a maximal length for the strings representing the bitmaps being printed, and thus no easy way to format the last two values (again, "CPU Affinity" and "NUMA Affinity") above in columns... Hence the "/". Ideas?

(BTW, the new version of the series I'm preparing will be very different, considering we decided with George and Jan to change the fundamental architecture of the whole thing. However, there still may be the need to do something like the above, independently of whether it will be for printing hard and soft affinity instead of CPU and NUMA affinity.)

> > While at it, the implementation of `xl vcpu-list' is reworked
> > a little bit, making it more similar to the one of `xl list'
> > and more compliant with the libxl programming paradigm (e.g.,
> > regarding allocating at the beginning and freeing on exit).
>
> Can you do this in a pre-patch?
>
Ok.

> > xl manual page is updated accordingly.
>
> > -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> > +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap)
>
> I don't understand how this set of changes relates to this patch.
>
Well, I need to parse strings that may be very similar (the usual "1,2-7,^4" stuff), but with potentially different meanings for the various substrings (pcpus vs. nodes).

Anyway, in case I'm keeping this, or doing something similar to it for the new series, I'll try to figure out a way to make it simpler, or at minimum more readable, and clarify why it is needed (by breaking the patch down, etc.).

Thanks and Regards, Dario
Dario Faggioli
2013-Nov-08 09:49 UTC
Re: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
On gio, 2013-11-07 at 18:35 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file"):
> > in a similar way to how it is possible to specify vcpu-affinity.
> ...
> >  /*
> > - * Check if the domain has any CPU affinity. If not, try to build
> > - * up one. In case numa_place_domain() find at least a suitable
> > - * candidate, it will affect info->nodemap accordingly; if it
> > - * does not, it just leaves it as it is. This means (unless
> > - * some weird error manifests) the subsequent call to
> > - * libxl_domain_set_nodeaffinity() will do the actual placement,
> > + * Check if the domain has any pinning or node-affinity and, if not, try
> > + * to build up one.
> > + *
> > + * In case numa_place_domain() find at least a suitable candidate, it will
> > + * affect info->nodemap accordingly; if it does not, it just leaves it as
> > + * it is. This means (unless some weird error manifests) the subsequent
> > + * call to libxl_domain_set_nodeaffinity() will do the actual placement,
> >   * whatever that turns out to be.
> >   */
> >  if (libxl_defbool_val(info->numa_placement)) {
> >
> > -    if (!libxl_bitmap_is_full(&info->cpumap)) {
> > +    if (!libxl_bitmap_is_full(&info->cpumap) ||
> > +        !libxl_bitmap_is_full(&info->nodemap)) {
> >          LOG(ERROR, "Can run NUMA placement only if no vcpu "
> > -             "affinity is specified");
> > +             "pinning or node-affinity is specified");
>
> I appreciate I may be a bit late with this complaint, but surely this
> approach (setting the cpumap and nodemap when they aren't
> all-bits-set) means that it's not possible to explicitly set "all
> cpus".
>
The funny part is not that you're late, but to whom you're saying it! :-D :-D

http://lists.xen.org/archives/html/xen-devel/2012-07/msg01077.html

  From: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
  Date: Mon, 23 Jul 2012 12:04:29 +0100

  Dario Faggioli writes ("[PATCH 2 of 4 v6/leftover] xl: inform libxl
  if there was a cpumap in the config file"):
  [...]
  Shouldn't this be
  -    if (libxl_bitmap_is_full(&info->cpumap)) {
  +    if (libxl_defbool_val(info->numa_placement)) {
  +        if (!libxl_bitmap_is_full(&info->cpumap)) {
  +            rc = ERROR_INVAL;
  +            LOG blah blah invalid not supported
  +            goto out;
  +        }

The story behind all this is that, back at that time, someone (and I'm sure it's you again, although I can't find the thread now) asked to introduce info->numa_placement, to have a means for:
 (1) disabling NUMA placement completely;
 (2) helping distinguish the case where cpumap is full because we want "cpus=all" from the one where it's full because we did not touch it, since being full is the default value.

I remember finding it sound, and thus going for it. Perhaps we could have done it differently, but given that info->cpumap is full by default, I really don't see that many alternatives (apart from making cpumap empty by default, which may look even weirder).

So, the idea here is:
 - if it's full, and placement is enabled, full means "do this for me", and that's what happens;
 - if placement is disabled, then nobody touches it, so if it was full it stays that way.

So, for saying "all", the procedure is as follows:

  info->numa_placement = false;
  fill_bitmap(info->cpumap);

> If I say in the config file "all" for cpus and "all" for nodes, this
> code will override that. I don't think that can be right.
>
And in fact xl, when parsing the config file, turns numa_placement to false when finding a "cpus=" item. All the above applies to the new "nodes=" switch as well, of course.

Thanks and Regards, Dario
Ian Jackson
2013-Nov-08 15:07 UTC
Re: [PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("Re: [PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> On gio, 2013-11-07 at 18:29 +0000, Ian Jackson wrote:
> > This is a change to the libxl ABI. I think you should bump the MAJOR
> > number in this patch (or a previous one).
> >
> Oh, do I? I may be missing something then. Every time I've done
> something similar to this before, I did provide the proper LIBXL_HAVE_*
> stuff (which this patch does), for the sake of _API_ compatibility, and
> that was it. :-(

Yes. Probably in the past this bump had already been done for some other reason.

> I've just looked up in libxl.h and found the comment block about ABI
> compatibility and SONAME bumping but (sorry) I'm not 100% sure I fully
> understand what it says. Does it (and you) mean that, since this is the
> first change since 4.3 that breaks ABI compatibility, I should include a
> patch to tools/libxl/Makefile doing something like:
>
>  -MAJOR=4.3
>  +MAJOR=4.4
>
> ?

Exactly.

> If yes, anything more than that?

No, nothing more.

Thanks, Ian.
Ian Jackson
2013-Nov-08 15:18 UTC
Re: [PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("Re: [PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> On gio, 2013-11-07 at 18:33 +0000, Ian Jackson wrote:
> > Dario Faggioli writes ("[PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> > > # xl vcpu-list -n
> > >
> > > Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity / NUMA Affinity
> > > Domain-0    0     0    8    -b-      4.4  any cpu / any node
> > > Domain-0    0     1    0    -b-      1.2  any cpu / any node
> > > vm-test     1     0    8    -b-      2.4  any cpu / 1
> > > vm-test     1     1   13    -b-      3.3  any cpu / 1
...
> > Can you use the same syntax for printing as would be legal in
> > the config file?
> >
> That is actually what is happening, for single cpu/node bitmaps.

Err, I meant always, not just in one special case. Eg, "any cpu" should be "all". I don't know if you want to try to boil cpumap ranges down to node sets during printing - that's probably too hard.

> The problem arises when I need to print two bitmaps, one next to the
> other, like the cpumap for vCPU affinity and the nodemap for NUMA
> affinity above.

Separating them with a / is fine I think.

> (BTW, the new version of the series I'm preparing will be very
> different, considering we decided with George and Jan to change the
> fundamental architecture of the whole thing. However, there still may be
> the need to do something like the above, independently of whether it
> will be for printing hard and soft affinity instead of CPU and NUMA
> affinity.)

Right.

> > > xl manual page is updated accordingly.
> >
> > > -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> > > +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap)
> >
> > I don't understand how this set of changes relates to this patch.
> >
> Well, I need to parse strings that may be very similar (the usual
> "1,2-7,^4" stuff), but with potentially different meanings for the
> various substrings (pcpus vs. nodes).

So perhaps there should be a pre-patch to change the names of these functions.

> Anyway, in case I'm keeping this, or doing something similar to it for
> the new series, I'll try to figure out a way to make it simpler, or
> at minimum more readable, and clarify why it is needed (by breaking the
> patch down, etc.).

Thanks, Ian.
Ian Jackson
2013-Nov-08 15:20 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
Dario Faggioli writes ("Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning"):
> On gio, 2013-11-07 at 18:17 +0000, Ian Jackson wrote:
> > > -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> > > +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
> > > +{
> > > +    char *nstr, *endptr;
> >
> > Missing consts?
> >
> Is it? For instance, I'm using endptr like this:
>
>   *a = *b = strtoul(str, &endptr, 10);
>
> And the prototype of strtoul is:
>   unsigned long int strtoul(const char *nptr, char **endptr, int base);

Oh, yes, sorry, you're right about endptr.

> So it won't work, not for endptr at least. What I can do is the
> following:
>
>   const char *nstr;
>   char *endptr;
>
> And I will, if you think it's better.

Thanks, yes.

> > > +    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
> > > +         ptr = strtok_r(NULL, ",", &saveptr)) {
> >
> > A minor style complaint: If you are going to split two of these three
> > items onto their own line, please give them all their own line.
> >
> You mean ptr should have its own line? Like this?
>
>   + for (ptr = strtok_r(cpu, ",", &saveptr);
>   +      ptr;
>   +      ptr = strtok_r(NULL, ",", &saveptr)) {
>
> If yes, I sure can do that, although the result really looks super
> unpleasant to me (but that's a matter of taste, I guess).

IMO it makes it much easier to see which of these expressions is which of the three items in a for(;;). That's more important than making the lines compact.

Thanks, Ian.
Ian Jackson
2013-Nov-08 15:22 UTC
Re: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
Dario Faggioli writes ("Re: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file"):
> On gio, 2013-11-07 at 18:35 +0000, Ian Jackson wrote:
> > I appreciate I may be a bit late with this complaint, but surely this
> > approach (setting the cpumap and nodemap when they aren't
> > all-bits-set) means that it's not possible to explicitly set "all
> > cpus".
> >
> The funny part is not that you're late, but to whom you're saying
> it! :-D :-D

Sorry about this. Forget about this complaint. I hadn't spotted the line

>  if (libxl_defbool_val(info->numa_placement)) {

above. So now I think it's correct.

Ian.
Konrad Rzeszutek Wilk
2013-Nov-12 16:01 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
> by providing the proper get/set interfaces and wiring them
> to the new domctl-s from the previous commit.

s/previous commit/<title of the commit>/

>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
>  tools/libxc/xc_domain.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++
>  tools/libxc/xenctrl.h   |   19 +++++++++++
>  2 files changed, 101 insertions(+)
>
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> index f8825a3..e44ad2e 100644
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -272,6 +272,88 @@ out:
>      return ret;
>  }
>
> +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;

Ewww.. Could we just use regular -Exx for new xc_* calls?

> +    int nodesize;
> +
> +    nodesize = xc_get_cpumap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get  number of nodes");

Extra space.

> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if ( local == NULL )
> +    {
> +        PERROR("Could not allocate memory for setvcpunodeaffinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_setvcpunodeaffinity;
> +    domctl.domain = (domid_t)domid;
> +    domctl.u.vcpuaffinity.vcpu = vcpu;
> +
> +    memcpy(local, nodemap, nodesize);
> +
> +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> +
> +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
> +
> +    ret = do_domctl(xch, &domctl);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +
> + out:
> +    return ret;
> +}
> +
> +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;
> +    int nodesize;
> +
> +    nodesize = xc_get_nodemap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get number of nodes");
> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if (local == NULL)
> +    {
> +        PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity;
> +    domctl.domain = (domid_t)domid;
> +    domctl.u.vcpuaffinity.vcpu = vcpu;
> +
> +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;

Could the 8 be replaced by a sizeof?

> +
> +    ret = do_domctl(xch, &domctl);
> +
> +    memcpy(nodemap, local, nodesize);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +out:
> +    return ret;
> +}
> +
>  int xc_domain_get_guest_width(xc_interface *xch, uint32_t domid,
>                                unsigned int *guest_width)
>  {
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> index 8cf3f3b..208fa2c 100644
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch,
>                                 uint32_t domind,
>                                 xc_nodemap_t nodemap);
>
> +/**
> + * These functions set and retrieves the NUMA node-affinity
> + * of a specific vcpu.
> + *
> + * @parm xch a handle to an open hypervisor interface.
> + * @parm domid the domain id one is interested in.
> + * @parm vcpu the vcpu one wants to set/get the affinity of.
> + * @parm nodemap the map of the affine nodes.
> + * @return 0 on success, -1 on failure.

and something in errno?

> + */
> +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap);
> +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap);
> +
>  int xc_vcpu_setaffinity(xc_interface *xch,
>                          uint32_t domid,
>                          int vcpu,
George Dunlap
2013-Nov-12 16:43 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On 11/12/2013 04:01 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
> > by providing the proper get/set interfaces and wiring them
> > to the new domctl-s from the previous commit.
>
> s/previous commit/<title of the commit>/
>
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > ---
> >  tools/libxc/xc_domain.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++
> >  tools/libxc/xenctrl.h   |   19 +++++++++++
> >  2 files changed, 101 insertions(+)
> >
> > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> > index f8825a3..e44ad2e 100644
> > --- a/tools/libxc/xc_domain.c
> > +++ b/tools/libxc/xc_domain.c
> > @@ -272,6 +272,88 @@ out:
> >      return ret;
> >  }
> >
> > +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> > +                            uint32_t domid,
> > +                            int vcpu,
> > +                            xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
>
> Ewww.. Could we just use regular -Exx for new xc_* calls?

I think the standard (which Dario is just following here) is to return -1 and set errno. Since the failures are all after making calls to other xc functions, they should be setting errno properly, and we should be able to just leave it.

 -George
Konrad Rzeszutek Wilk
2013-Nov-12 16:55 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
George Dunlap <george.dunlap@eu.citrix.com> wrote:
>On 11/12/2013 04:01 PM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
>>> by providing the proper get/set interfaces and wiring them
>>> to the new domctl-s from the previous commit.
>>
>> s/previous commit/<title of the commit>/
>>
>>>
>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>> ---
>>>  tools/libxc/xc_domain.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  tools/libxc/xenctrl.h   |   19 +++++++++++
>>>  2 files changed, 101 insertions(+)
>>>
>>> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
>>> index f8825a3..e44ad2e 100644
>>> --- a/tools/libxc/xc_domain.c
>>> +++ b/tools/libxc/xc_domain.c
>>> @@ -272,6 +272,88 @@ out:
>>>      return ret;
>>>  }
>>>
>>> +int xc_vcpu_setnodeaffinity(xc_interface *xch,
>>> +                            uint32_t domid,
>>> +                            int vcpu,
>>> +                            xc_nodemap_t nodemap)
>>> +{
>>> +    DECLARE_DOMCTL;
>>> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
>>> +    int ret = -1;
>>
>> Ewww.. Could we just use regular -Exx for new xc_* calls?
>
>I think the standard (which Dario is just following here) is to return
>-1 and set errno. Since the failures are all after making calls to
>other xc functions, they should be setting errno properly, and we should
>be able to just leave it.
>
> -George

I recall some discussion with Ian Campbell where he expressed frustration
with the return -1. I thought we wanted to stamp these out going forward?
Dario Faggioli
2013-Nov-12 18:40 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On mar, 2013-11-12 at 11:01 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
> > by providing the proper get/set interfaces and wiring them
> > to the new domctl-s from the previous commit.
>
> s/previous commit/<title of the commit>/
>
Ok.

> > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> > +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> > +                            uint32_t domid,
> > +                            int vcpu,
> > +                            xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
>
> Ewww.. Could we just use regular -Exx for new xc_* calls?
>
Mmm... As George said already, that is the most common behavior in
libxc... If we want something different, we should really write it down
somewhere (or has that happened already and I'm missing it?).

Also, yes, this does return -1 in cases where I don't have an error code
provided by the call that is actually failing, and yes, I totally rely
on it to set errno properly.

In the main operation implemented here, I just return the output of
do_domctl(), which, again, may be -1, may be some err-code, but it
really looks like it is what every other function there does, and, TBH,
it also seems the very most sane thing to do to me.

> > +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> > +                            uint32_t domid,
> > +                            int vcpu,
> > +                            xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
> > +    int nodesize;
> > +
> > +    nodesize = xc_get_nodemap_size(xch);
> > +    if (!nodesize)
> > +    {
> > +        PERROR("Could not get number of nodes");
> > +        goto out;
> > +    }
> > +
> > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > +    if (local == NULL)
> > +    {
> > +        PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall");
> > +        goto out;
> > +    }
> > +
> > +    domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity;
> > +    domctl.domain = (domid_t)domid;
> > +    domctl.u.vcpuaffinity.vcpu = vcpu;
> > +
> > +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> > +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
>
> Could the 8 be replaced by a sizeof?
>
I guess it could... What was it that you had in mind in particular?

Personally, I consider the names 'bitmap' and 'nr_bits' talking enough
to feel comfortable with the 8... To the point that I think the
following would be even less readable:

  domctl.u.vcpuaffinity.map.nr_bits = nodesize *
                    sizeof(domctl.u.vcpuaffinity.cpumap.bitmap);

But, of course, I'd do it that way if a maintainer asks for that.

> > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> > index 8cf3f3b..208fa2c 100644
> > --- a/tools/libxc/xenctrl.h
> > +++ b/tools/libxc/xenctrl.h
> > @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch,
> >                                 uint32_t domind,
> >                                 xc_nodemap_t nodemap);
> >
> > +/**
> > + * These functions set and retrieves the NUMA node-affinity
> > + * of a specific vcpu.
> > + *
> > + * @parm xch a handle to an open hypervisor interface.
> > + * @parm domid the domain id one is interested in.
> > + * @parm vcpu the vcpu one wants to set/get the affinity of.
> > + * @parm nodemap the map of the affine nodes.
> > + * @return 0 on success, -1 on failure.
>
> and something in errno?
>
Right. I'll add a mention to that in the comment.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Konrad Rzeszutek Wilk
2013-Nov-12 19:13 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On Tue, Nov 12, 2013 at 07:40:24PM +0100, Dario Faggioli wrote:
> On mar, 2013-11-12 at 11:01 -0500, Konrad Rzeszutek Wilk wrote:
> > On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
> > > by providing the proper get/set interfaces and wiring them
> > > to the new domctl-s from the previous commit.
> >
> > s/previous commit/<title of the commit>/
> >
> Ok.
>
> > > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> > > +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> > > +                            uint32_t domid,
> > > +                            int vcpu,
> > > +                            xc_nodemap_t nodemap)
> > > +{
> > > +    DECLARE_DOMCTL;
> > > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > > +    int ret = -1;
> >
> > Ewww.. Could we just use regular -Exx for new xc_* calls?
> >
> Mmm... As George said already, that is the most common behavior in
> libxc... If we want something different, we should really write it down
> somewhere (or has that happened already and I'm missing it?).

I was under the impression that was it - but I am probably misremembering
Ian's comments and now sowing disinformation. So ignore my code comment
please.

>
> Also, yes, this does return -1 in cases where I don't have an error code
> provided by the call that is actually failing, and yes, I totally rely
> on it to set errno properly.
>
> In the main operation implemented here, I just return the output of
> do_domctl(), which, again, may be -1, may be some err-code, but it
> really looks like it is what every other function there does, and, TBH,
> it also seems the very most sane thing to do to me.
>
> > > > +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> > > > +                            uint32_t domid,
> > > > +                            int vcpu,
> > > > +                            xc_nodemap_t nodemap)
> > > > +{
> > > > +    DECLARE_DOMCTL;
> > > > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > > > +    int ret = -1;
> > > > +    int nodesize;
> > > > +
> > > > +    nodesize = xc_get_nodemap_size(xch);
> > > > +    if (!nodesize)
> > > > +    {
> > > > +        PERROR("Could not get number of nodes");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > > > +    if (local == NULL)
> > > > +    {
> > > > +        PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity;
> > > > +    domctl.domain = (domid_t)domid;
> > > > +    domctl.u.vcpuaffinity.vcpu = vcpu;
> > > > +
> > > > +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> > > > +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
> > >
> > > Could the 8 be replaced by a sizeof?
> > >
> > I guess it could... What was it that you had in mind in particular?
> >
> > Personally, I consider the names 'bitmap' and 'nr_bits' talking enough
> > to feel comfortable with the 8... To the point that I think the
> > following would be even less readable:
> >
> >   domctl.u.vcpuaffinity.map.nr_bits = nodesize *
> >                     sizeof(domctl.u.vcpuaffinity.cpumap.bitmap);

Eww. That is worst. Somehow I assumed you could just do 'sizeof(unsigned
long long)' or such. or some #define for this magic number.

>
> But, of course, I'd do it that way if a maintainer asks for that.
>
> > > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> > > index 8cf3f3b..208fa2c 100644
> > > --- a/tools/libxc/xenctrl.h
> > > +++ b/tools/libxc/xenctrl.h
> > > @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch,
> > >                                 uint32_t domind,
> > >                                 xc_nodemap_t nodemap);
> > >
> > > +/**
> > > + * These functions set and retrieves the NUMA node-affinity
> > > + * of a specific vcpu.
> > > + *
> > > + * @parm xch a handle to an open hypervisor interface.
> > > + * @parm domid the domain id one is interested in.
> > > + * @parm vcpu the vcpu one wants to set/get the affinity of.
> > > + * @parm nodemap the map of the affine nodes.
> > > + * @return 0 on success, -1 on failure.
> >
> > and something in errno?
> >
> Right. I'll add a mention to that in the comment.
>
> Thanks and Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
Dario Faggioli
2013-Nov-12 21:36 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On mar, 2013-11-12 at 14:13 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 12, 2013 at 07:40:24PM +0100, Dario Faggioli wrote:
> > Mmm... As George said already, that is the most common behavior in
> > libxc... If we want something different, we should really write it down
> > somewhere (or has that happened already and I'm missing it?).
>
> I was under the impression that was it - but I am probably misremembering
> Ian's comments and now sowing disinformation.
>
Well, it may well have been me having missed that, so, really, no big
deal at all. :-)

> > > > +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> > > > +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
> > >
> > > Could the 8 be replaced by a sizeof?
> > >
> > I guess it could... What was it that you had in mind in particular?
> >
> > Personally, I consider the names 'bitmap' and 'nr_bits' talking enough
> > to feel comfortable with the 8... To the point that I think the
> > following would be even less readable:
> >
> >   domctl.u.vcpuaffinity.map.nr_bits = nodesize *
> >                     sizeof(domctl.u.vcpuaffinity.cpumap.bitmap);
>
> Eww. That is worst. Somehow I assumed you could just do 'sizeof(unsigned long long)'
> or such. or some #define for this magic number.
>
Maybe I can... I'll look into that (although no promises, ok? :-P). This
is all about playing games with uint8_t and array of them with weird
sizes... Again, I'll see if I can come up with something...

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Nov-13 10:57 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On mar, 2013-11-12 at 14:13 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 12, 2013 at 07:40:24PM +0100, Dario Faggioli wrote:
> > > > +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> > > > +                            uint32_t domid,
> > > > +                            int vcpu,
> > > > +                            xc_nodemap_t nodemap)
> > > > +{
> > > > +    DECLARE_DOMCTL;
> > > > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > > > +    int ret = -1;
> > > > +    int nodesize;
> > > > +
> > > > +    nodesize = xc_get_nodemap_size(xch);
> > > > +    if (!nodesize)
> > > > +    {
> > > > +        PERROR("Could not get number of nodes");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > > > +    if (local == NULL)
> > > > +    {
> > > > +        PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity;
> > > > +    domctl.domain = (domid_t)domid;
> > > > +    domctl.u.vcpuaffinity.vcpu = vcpu;
> > > > +
> > > > +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> > > > +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
> > >
> > > Could the 8 be replaced by a sizeof?
> > >
> > I guess it could... What was it that you had in mind in particular?
> >
> > Personally, I consider the names 'bitmap' and 'nr_bits' talking enough
> > to feel comfortable with the 8... To the point that I think the
> > following would be even less readable:
> >
> >   domctl.u.vcpuaffinity.map.nr_bits = nodesize *
> >                     sizeof(domctl.u.vcpuaffinity.cpumap.bitmap);
>
> Eww. That is worst. Somehow I assumed you could just do 'sizeof(unsigned long long)'
> or such. or some #define for this magic number.
>
I did think more about this and (re)did the math. After that, I don't
think there is much we can do to make it more clear than it already is
with the 8, assuming that we all are cool with the fact that there are
8 bits in a byte, because that's what the 8 means there.

In fact, both the passed xc_nodemap_t (nodemap, in this case) and the
hypercall buffer (local) are allocated to be nodesize (i.e., the result
of xc_get_nodemap_size()) _bytes_. Since what we want to store in
nr_bits is the number of actual bits available in the resulting bitmap,
given how large it is in bytes, there is nothing much else I can think
of to get there than multiplying that number of bytes by the number of
bits in a byte.

Therefore, either we find a suitable sizeof() for "nr_of_bits_in_a_byte",
or I "#define BITS_PER_BYTE 8" (:-P), or we stick with the 8, which,
again, I think is the best option.

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)